CN112349289A - Voice recognition method, device, equipment and storage medium
Publication number: CN112349289A (application CN202011054844.6A, China). Legal status: Granted.
Classifications
All classifications fall under section G (Physics), class G10 (Musical instruments; acoustics), subclass G10L (Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding), group G10L15/00 (Speech recognition):
- G10L15/02: Feature extraction for speech recognition; selection of recognition unit
- G10L15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/08: Speech classification or search
- G10L2015/0635: Training: updating or merging of old and new templates; mean values; weighting
- G10L2015/081: Search algorithms, e.g. Baum-Welch or Viterbi
Abstract
The application provides a speech recognition method, apparatus, device, and storage medium, relating to the technical field of speech recognition. The method can adaptively adjust the weight coefficients of the acoustic model and the language model during speech decoding according to the category of the speech audio, obtaining the decoding mode best suited to the current speech audio, decoding it accordingly, and thereby improving the accuracy of speech recognition. The method comprises: inputting the acoustic features of the speech audio to be recognized into a decoder; acquiring the candidate text generated by the decoder, and calculating the recognition score of the candidate text; generating a feature matrix according to the acoustic features, the candidate text, the recognition score, and the category features of the speech audio to be recognized; inputting the feature matrix into a weight adjustment model; returning the optimal weight output by the weight adjustment model to the decoder; updating a first combining weight of the acoustic model and the language model in the decoder according to the optimal weight; and acquiring the translation text output by the decoder after the first combining weight is updated.
Description
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, device, and storage medium.
Background
With the development of artificial intelligence, Automatic Speech Recognition (ASR) technology has been widely applied in business fields such as conference transcription, real-time translation, speech quality inspection, and intelligent customer service.
Speech recognition is a multidisciplinary field closely linked to acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science, and other disciplines. In brief, speech recognition technology recognizes and understands a speech signal, by machine or by neural network, and converts it into the corresponding text.
Acoustic models, which predict the pronunciation sequence of a speech audio from its acoustic features, and language models, which predict and generate the words or characters of the speech audio, are core components of speech recognition technology.
In the prior art, the speech decoding stage generally combines the output results of the acoustic model and the language model according to preset fixed weights. In practice, however, preset fixed weights are not suitable for all occasions, contexts, or tasks. For example, suppose a broadcaster and an ordinary person with a heavy accent each read the same text aloud: for the audio collected from the broadcaster, the content can be determined accurately from pronunciation alone, whereas for the audio collected from the heavily accented speaker, the content can be determined accurately only by relying more on context and considering a wider range of candidate words or characters.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method, apparatus, device, and storage medium, which adaptively adjust the weight coefficients of an acoustic model and a language model during speech decoding according to the category of the speech audio, so as to obtain the decoding mode best suited to the current speech audio, decode it accordingly, and thereby improve the accuracy of speech recognition.
A first aspect of an embodiment of the present application provides a speech recognition method, where the method includes:
acquiring acoustic characteristics of a voice audio to be recognized, and inputting the acoustic characteristics into a decoder; wherein the decoder comprises an acoustic model and a language model;
acquiring a candidate text generated by the decoder, and calculating the recognition score of the candidate text;
generating a feature matrix according to the acoustic features, the candidate texts, the recognition scores and the category features of the voice audio to be recognized;
inputting the feature matrix into the weight adjustment model;
returning the optimal weight output by the weight adjustment model to the decoder;
updating a first combining weight of the acoustic model and the language model in the decoder according to the optimal weight;
and acquiring the translation text output by the decoder after updating the first combining weight.
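The first aspect amounts to a two-pass flow: a pre-recognition pass under the decoder's current weight, prediction of the optimal weight, then a second decoding pass. A minimal sketch of that flow follows; the methods on `decoder` and `weight_model` and the two helper callables are illustrative assumptions, not APIs defined by the patent.

```python
def recognize(audio, decoder, weight_model, category_features,
              extract_features, build_feature_matrix):
    """Two-pass adaptive decoding sketch under assumed interfaces."""
    # Pre-recognition pass under the decoder's current (default) weight.
    acoustic_features = extract_features(audio)
    candidates, scores = decoder.decode(acoustic_features)

    # Assemble the feature matrix from the pre-recognition result.
    feature_matrix = build_feature_matrix(
        acoustic_features, candidates, scores, category_features)

    # The weight adjustment model outputs the optimal weight for this audio.
    optimal_lambda = weight_model.predict(feature_matrix)

    # Update the first combining weight (LM weight = lambda, AM weight = 1 - lambda)
    # and decode the same acoustic features again.
    decoder.set_combining_weight(optimal_lambda)
    translation_text, _ = decoder.decode(acoustic_features)
    return translation_text
```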
Optionally, the method further comprises: acquiring a speech audio sample and the text data corresponding to the speech audio sample; extracting an acoustic feature sample of the speech audio sample, and inputting the acoustic feature sample into a preset decoder, the preset decoder comprising a preset acoustic model and a preset language model; acquiring candidate text samples generated by the preset decoder, and calculating recognition score samples of the candidate text samples; generating a feature matrix sample according to the acoustic feature sample, the candidate text samples, the recognition score samples, and the category feature sample of the speech audio sample; inputting the feature matrix sample into a preset model; returning the optimal weight interval output by the preset model to the preset decoder; updating a second combining weight of the preset acoustic model and the preset language model in the preset decoder according to the optimal weight interval, and acquiring the predicted text output by the preset decoder based on the updated second combining weight; and updating the parameters of the preset model at least once according to the degree of difference between the predicted text and the text data, to obtain the weight adjustment model.
Optionally, before acquiring the translation text output by the decoder after updating the first combining weight, the method further includes: setting a text screening score calculation formula according to the acoustic model, the language model, and the optimal weight. Acquiring the translation text output by the decoder after updating the first combining weight then includes: inputting the acoustic features of the speech audio to be recognized into the decoder after the first combining weight is updated; acquiring a plurality of recognition texts generated by the decoder; calculating a score for each of the plurality of recognition texts using the text screening score calculation formula; and determining the recognition text with the highest score as the translation text.
Optionally, updating a second combining weight of the preset acoustic model and the preset language model in the preset decoder according to the optimal weight interval includes: updating the second combining weight by sequentially utilizing each weight value of the optimal weight interval; acquiring a predicted text output by the preset decoder based on the updated second combining weight, wherein the acquiring comprises: after updating the second combining weight by using the weight value of the optimal weight interval each time, acquiring a predicted text output by the preset decoder; updating the parameters of the preset model at least once according to the degree of the difference between the predicted text and the text data to obtain the weight adjustment model, wherein the method comprises the following steps: sequentially calculating the error rate of the predicted text obtained each time compared with the text data; determining the predicted text corresponding to the minimum error rate as a target predicted text; determining the weight value used by the preset decoder when the target prediction text is output as an optimal weight reference value; and updating the parameters of the preset model at least once according to the optimal weight reference value to obtain the weight adjustment model.
Optionally, before returning the optimal weight interval output by the preset model to the preset decoder, the method further includes: searching for the optimal weight interval according to the feature matrix sample using Newton's iteration method. Updating the parameters of the preset model at least once according to the optimal weight reference value, to obtain the weight adjustment model, then comprises: calculating a loss value between the optimal weight reference value and the optimal weight interval; and updating the parameters used for executing Newton's iteration method in the preset model according to the loss value.
A second aspect of the embodiments of the present application provides a speech recognition apparatus, including:
the decoding module is used for acquiring the acoustic features of the speech audio to be recognized and inputting the acoustic features into the decoder, wherein the decoder comprises an acoustic model and a language model; the candidate text generation module is used for acquiring the candidate text generated by the decoder and calculating the recognition score of the candidate text; the feature matrix generation module is used for generating a feature matrix according to the acoustic features, the candidate text, the recognition score, and the category features of the speech audio to be recognized; the first input module is configured to input the feature matrix into the weight adjustment model; the optimal weight returning module is configured to return the optimal weight output by the weight adjustment model to the decoder; the first combining weight updating module is configured to update a first combining weight of the acoustic model and the language model in the decoder according to the optimal weight; and the translation text output module is used for acquiring the translation text output by the decoder after the first combining weight is updated.
Optionally, the apparatus further comprises: a sample acquisition module, used for acquiring a speech audio sample and the text data corresponding to the speech audio sample; an extraction module, used for extracting an acoustic feature sample of the speech audio sample and inputting the acoustic feature sample into a preset decoder, the preset decoder comprising a preset acoustic model and a preset language model; a calculation module, used for acquiring candidate text samples generated by the preset decoder and calculating recognition score samples of the candidate text samples; a feature matrix sample generation module, used for generating a feature matrix sample according to the acoustic feature sample, the candidate text samples, the recognition score samples, and the category feature sample of the speech audio sample; a second input module, used for inputting the feature matrix sample into the preset model; an optimal weight interval returning module, used for returning the optimal weight interval output by the preset model to the preset decoder; a second combining weight updating module, used for updating a second combining weight of the preset acoustic model and the preset language model in the preset decoder according to the optimal weight interval, and acquiring the predicted text output by the preset decoder based on the updated second combining weight; and a parameter updating module, used for updating the parameters of the preset model at least once according to the degree of difference between the predicted text and the text data, to obtain the weight adjustment model.
Optionally, the apparatus further comprises: the setting module is used for setting a text screening score calculation formula according to the acoustic model, the language model and the optimal weight; the translation text output module includes: the decoding submodule is used for inputting the acoustic characteristics of the voice audio to be recognized into the decoder after the first combination weight is updated; the recognition text generation submodule is used for acquiring a plurality of recognition texts generated by the decoder; the screening submodule is used for respectively calculating scores of the recognition texts by utilizing the text screening score calculation formula; and the translation text obtaining submodule is used for determining the recognition text with the highest score as the translation text.
Optionally, the second combining weight updating module includes: a second combining weight updating submodule, configured to update the second combining weight by sequentially using each weight value of the optimal weight interval; the predicted text acquisition submodule is used for acquiring the predicted text output by the preset decoder after updating the second combination weight by utilizing the weight value of the optimal weight interval each time; the parameter updating module comprises: the calculation submodule is used for calculating the error rate of the predicted text obtained each time compared with the text data in sequence; the target predicted text determining submodule is used for determining the predicted text corresponding to the minimum error rate as a target predicted text; an optimal weight reference value determining submodule, configured to determine a weight value used by the preset decoder when the target prediction text is output as an optimal weight reference value; and the parameter updating submodule is used for updating the parameters of the preset model at least once according to the optimal weight reference value to obtain the weight adjusting model.
Optionally, the apparatus further comprises: the searching module is used for searching and obtaining the optimal weight interval according to the characteristic matrix sample by utilizing a Newton iteration method; the parameter updating submodule comprises: a calculating subunit, configured to calculate a loss value between the optimal weight reference value and the optimal weight interval; and the parameter updating subunit is used for updating the parameters used for executing the Newton iteration method in the preset model according to the loss value.
A third aspect of embodiments of the present application provides a readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps in the method according to the first aspect of the present application.
A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method according to the first aspect of the present application.
The speech recognition method provided by the embodiments of the present application is divided into a pre-recognition stage and a speech recognition stage. Specifically, when a user inputs speech, the category features of that speech are obtained, and the speech is pre-recognized according to the original combining weight of the language model and the acoustic model in the decoder; the result of the pre-recognition, namely the text translated from the user's speech under the original weight coefficients of the language model and the acoustic model, together with the probability score of that text, is thereby obtained. The weight adjustment model then uses the pre-recognition result, the acoustic features of the user's speech, and the category features of that speech to calculate the optimal combining weight of the language model and the acoustic model matching the objective scenario in which the user inputs speech, and the speech is recognized again according to this optimal combining weight, obtaining a translation text adapted to that objective scenario.
In the present application, the optimal weight value is selected from the optimal weight interval output by the preset model according to the standard text data corresponding to the speech audio sample, and is then returned to the preset model as a supervision condition for training the preset model to search for the optimal weight interval. By adjusting the parameters in each round of training, the optimal weight interval found by the preset model using Newton's iteration method is made to approach ever closer the optimal combining weight coefficient of the acoustic model and the language model.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram illustrating a combination of an acoustic model and a language model in speech recognition according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating steps of a speech recognition method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating the steps of training a weight adjustment model according to an embodiment of the present application;
FIG. 4 is a flowchart of training a weight adjustment model according to an embodiment of the present application;
fig. 5 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic diagram of a combination of an acoustic model and a language model in speech recognition according to an embodiment of the present application. As shown in fig. 1:
In the first step, acoustic features, including but not limited to Mel-frequency cepstral coefficients (MFCC) and fundamental frequency F0, are extracted from the audio signal of the speech audio using a feature extraction model. In the second step, the acoustic features are input into an acoustic model and a language model respectively: the acoustic model, for example a Hidden Markov Model (HMM), predicts pronunciation features, such as the phonetic posteriorgrams (PPGs) or pinyin corresponding to the speech audio, and gives probability scores for the predicted pronunciation features; the language model predicts a number of words or characters and gives scores for them. In the third step, the pronunciation features predicted by the acoustic model with their probability scores, and the language features predicted by the language model with their probability scores, are input into the decoding search model, where the language features may be a number of words or characters. In the fourth step, the decoding search model predicts the text content expressed by the speech audio from the output of the acoustic model, the output of the language model, and the pronunciation dictionary.
In existing speech recognition systems, the weight coefficient of the acoustic model and the weight coefficient of the language model are preset at the decoding and searching stage. In general, the weight coefficient of the acoustic model refers to the weight coefficient of the acoustic model (H) in the HCLG weighted finite-state transducer, and the weight coefficient of the language model refers to the weight coefficient of the language model (G) in the HCLG.
The HCLG is formed by composing the acoustic model (H), the context-dependency model (C), the pronunciation dictionary (L), and the language model (G).
However, designers of speech recognition systems have not accounted for the fact that the usage scenario is not single: the system may be used in a hall full of noisy voices or in a silent room, and the user inputting speech may be a broadcaster with standard pronunciation or may speak a dialect. Weight coefficients assigned to the acoustic model and the language model in advance, that is, a preset ratio of their influence on the decoding search, therefore cannot fit every occasion.
In view of the above problems, an embodiment of the present application provides a speech recognition method that adjusts the weight coefficients of the acoustic model and the language model during the decoding search by combining the category features of the speech audio, the acoustic features of the speech audio, and the text features obtained by pre-recognizing the speech audio, so that the decoding stage of speech recognition better fits the actual conditions of the speech audio, thereby improving the accuracy of speech recognition.
The voice recognition method provided by the application can be executed by the terminal equipment or the server.
Fig. 2 is a flowchart illustrating steps of a speech recognition method according to an embodiment of the present application. As shown in fig. 2, the voice recognition method includes:
step S21: acquiring acoustic characteristics of a voice audio to be recognized, and inputting the acoustic characteristics into a decoder; wherein the decoder comprises an acoustic model and a language model;
the audio to be recognized can be voice input by customer service personnel or a user in a customer service system; voice received by the navigation system; audio signals collected by the public opinion monitoring system in the office hall, and the like.
The decoder includes an acoustic model, a language model, a decoding search model (e.g., the HCLG weighted finite-state transducer), a pronunciation dictionary, and so on. Already-trained acoustic and language models can be used directly in the embodiments of the present application.
The acoustic model can be constructed from a Gaussian Mixture Model combined with a Hidden Markov Model (GMM-HMM), or from a Deep Neural Network combined with a Hidden Markov Model (DNN-HMM), and is used to analyze and compute on the acoustic features of the speech audio to obtain the pronunciation features corresponding to the speech audio.
The acoustic features of the speech audio may be obtained by, among other methods, cepstral analysis or Mel-frequency analysis of the spectrogram of the speech audio; the acoustic features obtained include, but are not limited to, Mel-frequency cepstral coefficients (MFCC), fundamental frequency F0, and the like.
The acoustic features extracted from the spectrogram of the speech audio to be recognized can be processed further so that they are better suited to machine learning or neural network computation. For example, the MFCC features may be normalized to 60 frames and then, in combination with the sampling rate, original format, bit rate, etc. at which the MFCCs were extracted, converted into a feature vector of at least (60+3) dimensions.
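As a concrete illustration of the (60+3)-dimensional vector just described, the sketch below extracts MFCCs with librosa (a toolkit choice assumed here; the patent names no library), fixes the frame axis to 60, and appends three metadata values. The collapse of the coefficient axis to one scalar per frame and the numeric encoding of format and bit rate are simplifying assumptions.

```python
import numpy as np
import librosa  # assumed toolkit; the patent does not name a library

def acoustic_feature_vector(wav_path, n_frames=60):
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # shape (13, T)
    per_frame = mfcc.mean(axis=0)       # one scalar per frame (simplification)
    per_frame = (per_frame - per_frame.mean()) / (per_frame.std() + 1e-8)
    fixed = librosa.util.fix_length(per_frame, size=n_frames)  # normalize to 60 frames
    meta = np.array([sr, 1.0, 16.0])    # sampling rate, format id, bit rate (illustrative)
    return np.concatenate([fixed, meta])                       # shape (63,)
```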
The present application can train a neural probabilistic language model (N-gram); the decoder is then formed from the trained language model, the acoustic model, the decoding search model, and the pronunciation dictionary. An N-gram is a language model based on the Markov assumption: the probability of any word depends only on the limited number n of words preceding it.
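The Markov assumption can be made concrete with a toy counting model. The sketch below is the bigram (n = 2) case and only illustrates the assumption; it is not the neural N-gram the application trains, and the toy corpus is invented for the example.

```python
from collections import Counter

def train_bigram(corpus):
    """Toy bigram LM: P(w_i | w_1..w_{i-1}) approximated as P(w_i | w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    def prob(word, prev):
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

p = train_bigram(["turn left ahead", "turn right ahead"])
print(p("left", "turn"))   # 0.5 on this toy corpus
```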
Step S22: acquiring the candidate text generated by the decoder, and calculating the recognition score of the candidate text;
for example, it is assumed that the text content corresponding to the speech audio 1 to be recognized received by the navigation system is "to promote meta way".
After the acoustic features extracted from speech audio 1 are input into the decoder, the acoustic model computes pronunciation features from them, which can be expressed as candidate pronunciations per syllable with probability scores: [dao-82, de-10, tao-30; jin-60, ting-56, ji-13; yuan-80, yan-60, xuan-50; lu-60, lv-45, nu-5].
The language features computed by the language model for the acoustic features can likewise be expressed as candidate characters per syllable with scores, for example: [到-72, 淘-20, …; 今-20, 金-60, 京-20; 元-24, 源-60, 园-16; 路-60, 鹿-25, 怒-15].
And further inputting the pronunciation characteristics and the language characteristics into a decoding search model, and decoding the pronunciation characteristics and the language characteristics by combining a pronunciation dictionary to obtain a candidate text. The method for decoding the search model to obtain the candidate text by further decoding on the basis of the pronunciation characteristics and the language characteristics can be carried out in a way of searching a path in the HCLG weighted finite-state converter, and the specific implementation process of the path is not limited in the application.
Multiple candidate texts may be obtained. Continuing the above example, the candidates may be homophone variants such as: [到金源路] ("to Jinyuan Road"); [到金元路]; [淘金源怒] (a nonsensical string).
Because the pronunciation features and the language features each carry probability scores, the decoder can output, along with each candidate text, the recognition score calculated for it.
Since the HCLG obtains the candidate texts by path search according to the weight coefficients of the language model and the acoustic model, the decoder also combines these weight coefficients when calculating the recognition scores.
Taking the recognition score of the candidate text [到金源路] as an example, with λ as the weight coefficient of the language model, the recognition score is: λ × (probability scores of the language model) + (1 - λ) × (probability scores of the acoustic model) = λ × [72, 20, 24, 60] + (1 - λ) × [82, 60, 60, 60].
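Numerically, with an illustrative λ = 0.7 (a value assumed here purely for demonstration), the combination above works out as:

```python
import numpy as np

lam = 0.7                                 # illustrative LM weight lambda
lm_scores = np.array([72, 20, 24, 60])    # language-model scores from the example
am_scores = np.array([82, 60, 60, 60])    # acoustic-model scores from the example
recognition_score = lam * lm_scores + (1 - lam) * am_scores
print(recognition_score)                  # [75.  32.  34.8 60. ]
```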
Step S23: generating a feature matrix according to the acoustic features, the candidate texts, the recognition scores and the category features of the voice audio to be recognized;
continuing with the example of a navigation system, when a user inputs speech, he may select that he is currently driving in the vehicle and choose mandarin, and the system determines the category characteristics with the information input by the user.
Step S24: inputting the feature matrix into the weight adjustment model;
step S25: returning the optimal weight output by the weight adjustment model to the decoder;
the weight adjustment model calculates to obtain the optimal weight according to the category characteristics of the voice to be recognized, the acoustic characteristics, the candidate texts and the recognition scores corresponding to the candidate texts;
step S26: updating a first combining weight of the acoustic model and the language model in the decoder according to the optimal weight;
step S27: and acquiring the translation text output by the decoder after updating the first combining weight.
The first combining weight is the weight coefficient λ of the language model used when the HCLG combines the outputs of the language model and the acoustic model to calculate the candidate texts. With the weight coefficient of the language model being λ, the weight coefficient of the acoustic model can further be determined as 1 - λ.
Assuming the optimal weight determined by the weight adjustment model from the feature matrix is λ_1, then λ_1 replaces λ: the decoder updates the weight coefficient of the language model to λ_1 and the weight coefficient of the acoustic model to 1 - λ_1, and then recomputes over the input acoustic features to obtain the final translation text.
The translated text is a text which is obtained by translating a speech after machine recognition and can accurately represent the content of the speech.
The speech recognition method provided by the embodiments of the present application is divided into a pre-recognition stage and a speech recognition stage. Specifically, when a user inputs speech, the category features of that speech are obtained, and the speech is pre-recognized according to the original combining weight of the language model and the acoustic model in the decoder; the result of the pre-recognition, namely the text translated from the user's speech under the original weight coefficients of the language model and the acoustic model, together with the probability score of that text, is thereby obtained. The weight adjustment model then uses the pre-recognition result, the acoustic features of the user's speech, and the category features of that speech to calculate the optimal combining weight of the language model and the acoustic model matching the objective scenario in which the user inputs speech, and the speech is recognized again according to this optimal combining weight, obtaining a translation text adapted to that objective scenario.
The effect of this embodiment can be illustrated with the navigation system example. Suppose the driver inputs speech in Mandarin while the front passenger can only input speech in a dialect. When the vehicle is stationary, the driver inputs the speech to be recognized in Mandarin, and the system determines the combining weight of the acoustic model and the language model accordingly; when a lane change is needed while driving, only the passenger can input the speech to be recognized, and the system adaptively re-determines the combining weight to suit the dialect.
When acquiring the translation text output by the decoder after the first combining weight is updated, a plurality of recognition texts may first be output, and the one with the highest score selected as the final translation text.
Setting a text screening score calculation formula according to the acoustic model, the language model and the optimal weight;
the language model may be: plm(wi|w1,w2,…,wi-1) (ii) a Wherein, PlmRepresenting a language model, wiAnd (4) representing a word vector, wherein the value of i is calculated by a language model according to acoustic characteristics. Taking the voice audio 1 with the content "promote meta way" as an example, the value of i determined by the language model is 4.
The acoustic model may be written as: P_am(v_j | v_1, v_2, …, v_{j-1}); where P_am denotes the acoustic model, v_j denotes a phoneme vector, and the value of j is computed by the acoustic model from the acoustic features. Taking speech audio 1 with content "到金源路" as an example, the value of j determined by the acoustic model may be 4 or 9.
The probability score of the language model can further be expressed as: S_lm = Σ_i ln{P_lm[w_i]}, where S_lm denotes the probability score of the language model. The probability score of the acoustic model can be expressed as: S_am = Σ_j ln{P_am[v_j(w_1, w_2, …, w_n)]}, where S_am denotes the probability score of the acoustic model and n takes the same values as i. This probability score formula of the acoustic model is obtained by substituting the pronunciation dictionary V(W) into the acoustic model.
The pronunciation dictionary V(W) is the mapping between Chinese pronunciations, generally phonemes or pinyin, and Chinese characters.
Therefore, according to the acoustic model, the language model, and the optimal weight, the text screening score calculation formula is set as: P = λ Σ_i ln{P_lm[w_i]} + (1 - λ) Σ_j ln{P_am[v_j(w_1, w_2, …, w_n)]}, where P is the recognition score and λ is, as above, the weight coefficient of the language model.
The text screening score calculation formula is used to calculate the recognition scores of the candidate texts; the candidate texts can then be screened further according to these recognition scores, and the highest-scoring candidate text taken as the translation text.
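A direct transcription of this formula into code follows, keeping the convention above that λ weights the language model; the probability inputs and all names are illustrative assumptions.

```python
import math

def screening_score(lam, lm_probs, am_probs):
    """P = lambda * sum_i ln P_lm[w_i] + (1 - lambda) * sum_j ln P_am[v_j];
    inputs are raw per-token probabilities in (0, 1]."""
    s_lm = sum(math.log(p) for p in lm_probs)
    s_am = sum(math.log(p) for p in am_probs)
    return lam * s_lm + (1 - lam) * s_am
```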
Therefore, in step S22 the candidate texts generated by the decoder may be a plurality of texts; before the decoder outputs them, the recognition score of each text is calculated using the text screening score calculation formula, and the texts are output together with their corresponding recognition scores.
After the first combining weight of the acoustic model and the language model in the decoder is updated with the optimal weight in step S26, λ in the text screening score calculation formula is also updated accordingly.
The specific steps for outputting a plurality of recognition texts after the combining weight is updated, and then selecting the highest-scoring one as the final translation text, are as follows:
Step S27-1: inputting the acoustic features of the speech audio to be recognized into the decoder after the first combining weight is updated;
Step S27-2: acquiring a plurality of recognition texts generated by the decoder;
Step S27-3: calculating a score for each of the plurality of recognition texts using the text screening score calculation formula;
Step S27-4: determining the recognition text with the highest score as the translation text.
In this embodiment of the application, the decoder sets the text screening score calculation formula according to the combining weight of the acoustic model and the language model, generates a plurality of texts in the speech pre-recognition stage and in the speech recognition stage respectively, and calculates the score of each generated text with that formula. In the pre-recognition stage, the candidate texts generated by the decoder may comprise a plurality of texts; the recognition score, i.e. the probability score, of each text is calculated by the text screening score calculation formula, a feature matrix is then generated from each text, and the weight adjustment model takes the feature matrix as input and outputs the optimal weight. Through this process, the combining weight of the acoustic model and the language model is adaptively adjusted to an optimal value.
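The selection in steps S27-2 to S27-4 above then reduces to scoring each hypothesis and keeping the best one. Continuing the sketch above, with `hypotheses` assumed to be (text, lm_probs, am_probs) tuples:

```python
def pick_translation(hypotheses, lam):
    """Score every recognition text with screening_score and keep the best."""
    best_text, _ = max(
        ((text, screening_score(lam, lm, am)) for text, lm, am in hypotheses),
        key=lambda pair: pair[1])
    return best_text
```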
Another embodiment of the present application provides a method of training a weight adjustment model. Fig. 3 is a flowchart illustrating steps of training a weight adjustment model according to an embodiment of the present application, and fig. 4 is a flowchart illustrating steps of training a weight adjustment model according to an embodiment of the present application, where as shown in fig. 3 and 4, a method for training a weight adjustment model includes:
the conditions of the voice audio sample include scenes (quality control, conference, navigation) and fields (bank, insurance, map) of the input voice audio sample.
Step S31: acquiring a voice audio sample and text data corresponding to the voice audio sample;
the text data is the textual content of the speech audio sample. Generally, the audio of the text data read by the person may be collected as a voice audio sample, or the voice audio sample may also be listened to manually, and the text data may be obtained by labeling, or the voice audio sample and the text data corresponding to the voice audio sample may be obtained by other methods, which is not limited in the present application.
Step S32: extracting an acoustic feature sample of the voice audio sample, and inputting the acoustic feature sample into a preset decoder; the preset decoder comprises a preset acoustic model and a preset language model;
the method for extracting the acoustic feature samples from the voice audio samples can refer to the method for extracting the acoustic features from the voice audio to be recognized.
At the same time, a category feature sample of the speech audio sample is input. The category feature sample may be derived from the speech content of the sample or from the channel through which the sample was obtained. For example, when the weight adjustment model is trained with speech audio samples of conference content, the category feature sample "conference" may be input; when it is trained with speech audio samples collected from a bank's sound source library, the category feature sample "bank" may be input.
The preset decoder is the decoder used while training the weight adjustment model. The preset decoder used during training and the decoder used when applying the weight adjustment model to adaptively adjust the combining weight of the acoustic model and the language model in speech recognition may be the same.
The preset acoustic model refers to an acoustic model in a decoder when a weight adjustment model is trained. The preset language model refers to a language model in a decoder when the weight adjustment model is trained. Similarly, the preset acoustic model and the acoustic model may be the same, and the preset language model and the language model may also be the same.
Step S33: acquiring the candidate text samples generated by the preset decoder, and calculating recognition score samples of the candidate text samples;
the candidate text samples may include a plurality of text samples, each text sample corresponding to a respective recognition score sample.
The recognition score samples of the candidate text samples are likewise calculated using the set text screening score calculation formula: P = λ Σ_i ln{P_lm[w_i]} + (1 - λ) Σ_j ln{P_am[v_j(w_1, w_2, …, w_n)]}.
Step S34: generating a feature matrix sample according to the acoustic feature sample, the candidate text sample, the recognition score sample and the class feature sample of the voice audio sample;
and performing embedded splicing on the category feature samples to convert the category feature samples into at least 3-dimensional category feature vectors, multiplying the category feature vectors by a query matrix, reducing the dimensionality of the category feature vectors, and splicing the category feature vectors with reduced dimensionality to obtain a category matrix.
And obtaining the number of text samples in the candidate text samples, multiplying the number of the text samples by the number of the identification fraction samples, and splicing the fraction samples to obtain a fraction matrix because the identification fraction samples are in the form of probability vectors.
And converting the candidate text by word vectors or word vectors to obtain language feature vectors at least in (10 x (30+30+1) +1) dimensions, compressing and splicing the language feature vectors, namely multiplying the language feature vectors by a compression matrix to reduce the dimensions of the language feature vectors, and splicing the language feature vectors with reduced dimensions to obtain a language matrix.
And performing compression splicing on the acoustic features, including normalization and vector conversion, to obtain an acoustic feature vector with at least (60+3) dimensions.
And combining the acoustic feature vector, the language feature vector, the score matrix and the category feature vector input feature matrix to obtain a feature matrix sample.
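The assembly just described can be sketched as follows. The query and compression matrices are assumed learned parameters, and the padding of the heterogeneous parts to a common width is an assumption about how they are combined into one matrix; all names are illustrative.

```python
import numpy as np

def build_feature_matrix(acoustic_vec, text_vecs, score_vecs, category_vecs,
                         w_query, w_compress):
    category = np.concatenate([v @ w_query for v in category_vecs])    # category matrix
    scores   = np.concatenate(score_vecs)                              # score matrix
    language = np.concatenate([v @ w_compress for v in text_vecs])     # language matrix
    parts = [acoustic_vec, language, scores, category]
    width = max(p.shape[0] for p in parts)
    rows = [np.pad(p, (0, width - p.shape[0])) for p in parts]         # pad to equal length
    return np.stack(rows)              # one row per information source
```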
Step S35: inputting the feature matrix sample into the preset model;
the preset model is a pre-built multilayer fully-connected network structure and can execute a Newton iterative algorithm.
Step S36: returning the optimal weight interval output by the preset model to the preset decoder;
the preset model is obtained by searching the optimal weight interval according to the characteristic matrix sample by using a Newton iteration method;
step S37: updating a second combination weight of the preset acoustic model and the preset language model in the preset decoder according to the optimal weight interval, and acquiring a predicted text output by the preset decoder based on the updated second combination weight;
the second combining weight refers to a combining weight of a preset acoustic model and a preset language model in the training weight adjustment model.
The predictive text may also include multiple texts.
Step S38: updating the parameters of the preset model at least once according to the degree of difference between the predicted text and the text data, to obtain the weight adjustment model.
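Steps S31 to S38 can be summarized as the training loop below. Every method name on `preset_decoder` and `preset_model` is an illustrative assumption; `extract_features` and `build_feature_matrix` follow the earlier sketches, and `grid_search_wer` is sketched after the next few paragraphs.

```python
def train_weight_model(samples, preset_decoder, preset_model,
                       extract_features, build_feature_matrix,
                       grid_search_wer, epochs=10):
    """samples: iterable of (audio, reference_text, category) triples."""
    for _ in range(epochs):
        for audio, reference_text, category in samples:
            feats = extract_features(audio)                                 # S32
            candidates, scores = preset_decoder.decode(feats)               # S33
            fm = build_feature_matrix(feats, candidates, scores, category)  # S34
            interval = preset_model.search_interval(fm)                     # S35-S36
            best_lambda = grid_search_wer(interval, preset_decoder,
                                          feats, reference_text)            # S37
            preset_model.update(fm, interval, best_lambda)                  # S38
```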
Another embodiment of the present application provides a method for updating weight coefficients of a preset acoustic model and a preset language model in a preset decoder according to an optimal weight interval.
Updating a second combining weight of the preset acoustic model and the preset language model in the preset decoder according to the optimal weight interval, wherein the second combining weight comprises: updating the second combining weight by sequentially utilizing each weight value of the optimal weight interval;
the weight section is a discrete number extracted in the optimal weight section according to the set calculation accuracy.
First, the weight coefficients of the acoustic model and the language model are updated with each weight value in the optimal weight interval. Assuming the optimal weight interval is [1.5, 2.6] and the calculation accuracy is one digit after the decimal point, the values 1.5, 1.6, …, 2.6 are used in turn as the combining weight of the acoustic model and the language model.
Acquiring a predicted text output by the preset decoder based on the updated second combining weight, wherein the acquiring comprises: after updating the second combining weight by using the weight value of the optimal weight interval each time, acquiring a predicted text output by the preset decoder;
continuing with the example, assuming that the optimal weight interval is [ 1.5-2.6 ], the calculation accuracy is 1 digit after the decimal point, 10 weight values are obtained, each weight value is substituted into the HCLG weighted finite state converter, so that the preset acoustic model and the preset language model are sequentially subjected to value taking by the 11 weights, and the corresponding texts are obtained by decoding, and 11 predicted texts are obtained.
Updating the parameters of the preset model at least once according to the degree of the difference between the predicted text and the text data to obtain the weight adjustment model, wherein the method comprises the following steps: sequentially calculating the error rate of the predicted text obtained each time compared with the text data; determining the predicted text corresponding to the minimum error rate as a target predicted text; determining the weight value used by the preset decoder when the target prediction text is output as an optimal weight reference value; and updating the parameters of the preset model at least once according to the optimal weight reference value to obtain the weight adjustment model.
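The inner search just described is sketched below, under the assumption that the error rate is a character error rate computed from the Levenshtein distance; the decoder interface is the same assumed one as in the earlier sketches.

```python
def edit_distance(a, b):
    """Levenshtein distance, used here to compute the character error rate."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def grid_search_wer(interval, decoder, feats, reference, step=0.1):
    """Try each weight value in the optimal interval, decode, and keep the
    value whose predicted text has the lowest error rate (steps above)."""
    lo, hi = interval
    best, lam = None, lo
    while lam <= hi + 1e-9:
        decoder.set_combining_weight(lam)
        predicted, _ = decoder.decode(feats)
        err = edit_distance(predicted, reference) / max(len(reference), 1)
        if best is None or err < best[1]:
            best = (lam, err)
        lam = round(lam + step, 10)
    return best[0]   # the optimal weight reference value
```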
In this embodiment of the application, the predicted texts are compared with the text data to obtain the optimal weight value within the optimal weight interval, which is returned to the preset model to train, in reverse, the preset model's ability to search for the optimal weight interval.
Further, returning the optimal weight value to the preset model and training the preset model's interval-search capability in reverse proceeds as follows:
calculating a loss value between the optimal weight reference value and the optimal weight interval;
the optimal weight interval for obtaining the output of the preset model in step S36 is (λ)a,λb) Calculating an optimal weight reference value and an optimal weight interval (lambda)a,λb) The loss value of (a). Wherein,
at λcRepresenting the optimal weight reference value, the penalty function for calculating the penalty value is: min [ lambda ]c-(λa-λb)/2]2。
And updating parameters used for executing the Newton iteration method in the preset model according to the loss value.
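For concreteness, a minimal sketch of this loss computation, using the midpoint form reconstructed above; how the loss drives the parameter update inside the Newton iteration is left abstract, since the patent does not specify it.

```python
def interval_loss(lambda_ref, interval):
    """[lambda_c - (lambda_a + lambda_b) / 2] ** 2: squared distance between
    the optimal weight reference value and the interval midpoint."""
    lam_a, lam_b = interval
    return (lambda_ref - (lam_a + lam_b) / 2) ** 2
```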
The preset model's search for the optimal weight interval can be based on the formula min WER(λ), which ensures that, within the obtained optimal weight interval, the word error rate of the text output by the decoder after the weight update is minimal relative to the text data.
In the present application, the optimal weight value is selected from the optimal weight interval output by the preset model according to the standard text data corresponding to the speech audio sample, and is then returned to the preset model as a supervision condition for training the preset model to search for the optimal weight interval. The parameters are adjusted in each round of training so that the optimal weight interval found by the preset model using Newton's iteration method approaches ever closer the optimal combining weight coefficient of the acoustic model and the language model. After repeated training, the preset model has the computational ability to output the optimal weight for the current category. The values of the searched optimal weight interval are returned to the decoder, and the combining weight of the acoustic model and the language model is updated with weights from that interval to obtain the predicted text with the minimum error rate relative to the standard text.
Based on the same inventive concept, the embodiment of the application provides a voice recognition device. Fig. 5 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus may include:
the decoding module 51 is configured to obtain an acoustic feature of a speech audio to be recognized, and input the acoustic feature to a decoder; wherein the decoder comprises an acoustic model and a language model;
a candidate text generation module 52, configured to obtain a candidate text generated by the decoder, and calculate a recognition score of the candidate text;
a feature matrix generating module 53, configured to generate a feature matrix according to the acoustic feature, the candidate text, the recognition score, and the category feature of the speech audio to be recognized;
a first input module 54, configured to input the feature matrix into the weight adjustment model;
an optimal weight returning module 55, configured to return the optimal weight output by the weight adjustment model to the decoder;
a first combining weight updating module 56, configured to update a first combining weight of the acoustic model and the language model in the decoder according to the optimal weight;
and a translated text output module 57, configured to obtain the translated text output by the decoder after updating the first combining weight.
Optionally, the apparatus further comprises: a sample acquisition module, used for acquiring a speech audio sample and the text data corresponding to the speech audio sample; an extraction module, used for extracting an acoustic feature sample of the speech audio sample and inputting the acoustic feature sample into a preset decoder, the preset decoder comprising a preset acoustic model and a preset language model; a calculation module, used for acquiring candidate text samples generated by the preset decoder and calculating recognition score samples of the candidate text samples; a feature matrix sample generation module, used for generating a feature matrix sample according to the acoustic feature sample, the candidate text samples, the recognition score samples, and the category feature sample of the speech audio sample; a second input module, used for inputting the feature matrix sample into the preset model; an optimal weight interval returning module, used for returning the optimal weight interval output by the preset model to the preset decoder; a second combining weight updating module, used for updating a second combining weight of the preset acoustic model and the preset language model in the preset decoder according to the optimal weight interval, and acquiring the predicted text output by the preset decoder based on the updated second combining weight; and a parameter updating module, used for updating the parameters of the preset model at least once according to the degree of difference between the predicted text and the text data, to obtain the weight adjustment model.
Optionally, the apparatus further comprises:
the setting module is used for setting a text screening score calculation formula according to the acoustic model, the language model and the optimal weight; the translation text output module includes: the decoding submodule is used for inputting the acoustic characteristics of the voice audio to be recognized into the decoder after the first combination weight is updated; the recognition text generation submodule is used for acquiring a plurality of recognition texts generated by the decoder; the screening submodule is used for respectively calculating scores of the recognition texts by utilizing the text screening score calculation formula; and the translation text obtaining submodule is used for determining the recognition text with the highest score as the translation text.
Optionally, the second combining weight updating module comprises: a second combining weight updating submodule, configured to update the second combining weight using each weight value of the optimal weight interval in turn; and a predicted text acquisition submodule, configured to acquire the predicted text output by the preset decoder after each update of the second combining weight with a weight value from the optimal weight interval. The parameter updating module comprises: a calculation submodule, configured to calculate, in turn, the error rate of each acquired predicted text relative to the text data; a target predicted text determining submodule, configured to determine the predicted text corresponding to the minimum error rate as the target predicted text; an optimal weight reference value determining submodule, configured to determine, as the optimal weight reference value, the weight value used by the preset decoder when outputting the target predicted text; and a parameter updating submodule, configured to update the parameters of the preset model at least once according to the optimal weight reference value, so as to obtain the weight adjustment model. A sketch of this weight sweep follows.
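A minimal sketch of the sweep just described, assuming the interval is discretized with a fixed step count and that the "error rate" is a word error rate (both assumptions; the patent specifies neither):

```python
import numpy as np

def word_error_rate(hyp, ref):
    """Word-level Levenshtein distance divided by the reference length."""
    h, r = hyp.split(), ref.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i, j] = min(d[i - 1, j] + 1,                           # deletion
                          d[i, j - 1] + 1,                           # insertion
                          d[i - 1, j - 1] + (r[i - 1] != h[j - 1]))  # substitution
    return d[len(r), len(h)] / max(len(r), 1)

def sweep_weight_interval(preset_decoder, acoustic_sample, text_data,
                          w_low, w_high, steps=20):
    """Update the second combining weight with each value in the interval,
    decode, and keep the weight whose predicted text has the lowest error
    rate against the reference text data (the optimal weight reference value)."""
    best_weight, best_err = None, float("inf")
    for w in np.linspace(w_low, w_high, steps):
        preset_decoder.lm_weight = w                             # assumed API
        predicted = preset_decoder.decode_best(acoustic_sample)  # assumed API
        err = word_error_rate(predicted, text_data)
        if err < best_err:
            best_weight, best_err = w, err
    return best_weight
```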
Optionally, the apparatus further comprises: a searching module, configured to search for the optimal weight interval based on the feature matrix sample using Newton's iteration method. The parameter updating submodule comprises: a calculating subunit, configured to calculate a loss value between the optimal weight reference value and the optimal weight interval; and a parameter updating subunit, configured to update, according to the loss value, the parameters used to execute Newton's iteration method in the preset model. One assumed form of this search and loss is sketched below.
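The objective driving Newton's iteration and the exact loss are not specified in this section; the sketch below assumes a smooth proxy objective f(w), numeric derivatives, a fixed half-width for the returned interval, and a squared distance to the interval midpoint as the loss.

```python
def newton_search(f, w0, iters=10, eps=1e-4, radius=0.1):
    """Newton's iteration toward a stationary point of an assumed smooth
    objective f(w); a small interval around the final iterate is returned
    as the optimal weight interval."""
    w = w0
    for _ in range(iters):
        g = (f(w + eps) - f(w - eps)) / (2 * eps)                # numeric f'(w)
        h = (f(w + eps) - 2 * f(w) + f(w - eps)) / (eps * eps)   # numeric f''(w)
        if abs(h) < 1e-8:  # flat curvature: stop rather than divide by ~0
            break
        w -= g / h
    return (w - radius, w + radius)

def interval_loss(optimal_weight_reference, interval):
    """Assumed loss between the optimal weight reference value and the
    interval: squared distance to the interval midpoint."""
    mid = 0.5 * (interval[0] + interval[1])
    return (optimal_weight_reference - mid) ** 2
```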
Based on the same inventive concept, another embodiment of the present application provides a readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the speech recognition method according to any of the above embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the steps of the speech recognition method according to any of the above embodiments of the present application.
Since the device embodiments are substantially similar to the method embodiments, their description is kept brief; for relevant details, refer to the corresponding parts of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
The speech recognition method, apparatus, device, and storage medium provided by the present application have been described in detail above. The description of the embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, for those skilled in the art, the specific implementation and the scope of application may vary in accordance with the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.
Claims (10)
1. A method of speech recognition, the method comprising:
acquiring acoustic features of speech audio to be recognized, and inputting the acoustic features into a decoder; wherein the decoder comprises an acoustic model and a language model;
acquiring a candidate text generated by the decoder, and calculating a recognition score of the candidate text;
generating a feature matrix according to the acoustic features, the candidate text, the recognition score, and the category features of the speech audio to be recognized;
inputting the feature matrix into a weight adjustment model;
returning the optimal weight output by the weight adjustment model to the decoder;
updating a first combining weight of the acoustic model and the language model in the decoder according to the optimal weight;
and acquiring the translation text output by the decoder after updating the first combining weight.
2. The method of claim 1, further comprising:
acquiring a speech audio sample and text data corresponding to the speech audio sample;
extracting an acoustic feature sample of the speech audio sample, and inputting the acoustic feature sample into a preset decoder; wherein the preset decoder comprises a preset acoustic model and a preset language model;
acquiring candidate text samples generated by the preset decoder, and calculating recognition score samples of the candidate text samples;
generating a feature matrix sample according to the acoustic feature sample, the candidate text samples, the recognition score samples, and the category feature sample of the speech audio sample;
inputting the feature matrix sample into a preset model;
returning the optimal weight interval output by the preset model to the preset decoder;
updating a second combining weight of the preset acoustic model and the preset language model in the preset decoder according to the optimal weight interval, and acquiring a predicted text output by the preset decoder based on the updated second combining weight;
and updating the parameters of the preset model at least once according to the degree of the difference between the predicted text and the text data to obtain the weight adjustment model.
3. The method of claim 1, wherein before acquiring the translation text output by the decoder after updating the first combining weight, the method further comprises:
setting a text screening score calculation formula according to the acoustic model, the language model and the optimal weight;
wherein acquiring the translation text output by the decoder after updating the first combining weight comprises:
inputting the acoustic features of the speech audio to be recognized into the decoder after updating the first combining weight;
acquiring a plurality of recognition texts generated by the decoder;
calculating a score for each of the plurality of recognition texts using the text screening score calculation formula;
and determining the recognition text with the highest score as the translation text.
4. The method of claim 2, wherein updating the second combining weight of the preset acoustic model and the preset language model in the preset decoder according to the optimal weight interval comprises:
sequentially updating the second combining weight with each weight value of the optimal weight interval;
wherein acquiring the predicted text output by the preset decoder based on the updated second combining weight comprises:
after each update of the second combining weight with a weight value from the optimal weight interval, acquiring the predicted text output by the preset decoder;
and wherein updating the parameters of the preset model at least once according to the degree of difference between the predicted text and the text data to obtain the weight adjustment model comprises:
sequentially calculating the error rate of each acquired predicted text relative to the text data;
determining the predicted text corresponding to the minimum error rate as a target predicted text;
determining, as an optimal weight reference value, the weight value used by the preset decoder when outputting the target predicted text;
and updating the parameters of the preset model at least once according to the optimal weight reference value to obtain the weight adjustment model.
5. The method of claim 4, wherein before returning the optimal weight interval output by the preset model to the preset decoder, the method further comprises:
searching for the optimal weight interval based on the feature matrix sample using Newton's iteration method;
and wherein updating the parameters of the preset model at least once according to the optimal weight reference value to obtain the weight adjustment model comprises:
calculating a loss value between the optimal weight reference value and the optimal weight interval;
and updating, according to the loss value, the parameters used to execute Newton's iteration method in the preset model.
6. A speech recognition apparatus, characterized in that the apparatus comprises:
a decoding module, configured to acquire acoustic features of speech audio to be recognized and input the acoustic features into a decoder; wherein the decoder comprises an acoustic model and a language model;
a candidate text generation module, configured to acquire a candidate text generated by the decoder and calculate a recognition score of the candidate text;
a feature matrix generation module, configured to generate a feature matrix according to the acoustic features, the candidate text, the recognition score, and the category features of the speech audio to be recognized;
a first input module, configured to input the feature matrix into a weight adjustment model;
an optimal weight returning module, configured to return the optimal weight output by the weight adjustment model to the decoder;
a first combining weight updating module, configured to update a first combining weight of the acoustic model and the language model in the decoder according to the optimal weight;
and a translation text output module, configured to acquire the translation text output by the decoder after updating the first combining weight.
7. The apparatus of claim 6, further comprising:
a sample acquisition module, configured to acquire a speech audio sample and text data corresponding to the speech audio sample;
an extraction module, configured to extract an acoustic feature sample of the speech audio sample and input the acoustic feature sample into a preset decoder; wherein the preset decoder comprises a preset acoustic model and a preset language model;
a calculation module, configured to acquire candidate text samples generated by the preset decoder and calculate recognition score samples of the candidate text samples;
a feature matrix sample generation module, configured to generate a feature matrix sample according to the acoustic feature sample, the candidate text samples, the recognition score samples, and the category feature sample of the speech audio sample;
a second input module, configured to input the feature matrix sample into a preset model;
an optimal weight interval returning module, configured to return the optimal weight interval output by the preset model to the preset decoder;
a second combining weight updating module, configured to update a second combining weight of the preset acoustic model and the preset language model in the preset decoder according to the optimal weight interval, and to acquire a predicted text output by the preset decoder based on the updated second combining weight;
and a parameter updating module, configured to update the parameters of the preset model at least once according to the degree of difference between the predicted text and the text data to obtain the weight adjustment model.
8. The apparatus of claim 7, further comprising:
a setting module, configured to set a text screening score calculation formula according to the acoustic model, the language model, and the optimal weight;
wherein the translation text output module comprises:
a decoding submodule, configured to input the acoustic features of the speech audio to be recognized into the decoder after the first combining weight is updated;
a recognition text generation submodule, configured to acquire a plurality of recognition texts generated by the decoder;
a screening submodule, configured to calculate a score for each of the recognition texts using the text screening score calculation formula;
and a translation text obtaining submodule, configured to determine the recognition text with the highest score as the translation text.
9. A readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
10. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011054844.6A CN112349289B (en) | 2020-09-28 | 2020-09-28 | Voice recognition method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011054844.6A CN112349289B (en) | 2020-09-28 | 2020-09-28 | Voice recognition method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112349289A true CN112349289A (en) | 2021-02-09 |
CN112349289B CN112349289B (en) | 2023-12-29 |
Family
ID=74361363
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011054844.6A Active CN112349289B (en) | 2020-09-28 | 2020-09-28 | Voice recognition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112349289B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113011198A (en) * | 2021-03-05 | 2021-06-22 | 北京嘀嘀无限科技发展有限公司 | Information interaction method and device and electronic equipment |
CN113223500A (en) * | 2021-04-12 | 2021-08-06 | 北京百度网讯科技有限公司 | Speech recognition method, method for training speech recognition model and corresponding device |
WO2022068233A1 (en) * | 2020-09-29 | 2022-04-07 | 北京捷通华声科技股份有限公司 | Speech recognition method and apparatus, and computer-readable storage medium |
CN114360514A (en) * | 2021-07-23 | 2022-04-15 | 上海喜马拉雅科技有限公司 | Speech recognition method, apparatus, device, medium, and product |
CN114360500A (en) * | 2021-09-14 | 2022-04-15 | 腾讯科技(深圳)有限公司 | Speech recognition method and device, electronic equipment and storage medium |
WO2022267451A1 (en) * | 2021-06-24 | 2022-12-29 | 平安科技(深圳)有限公司 | Automatic speech recognition method based on neural network, device, and readable storage medium |
WO2024185283A1 (en) * | 2023-03-08 | 2024-09-12 | 日本電気株式会社 | Information processing device, information processing method, and recording medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140358539A1 (en) * | 2013-05-29 | 2014-12-04 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for building a language model |
CN106328147A (en) * | 2016-08-31 | 2017-01-11 | 中国科学技术大学 | Speech recognition method and device |
KR20170109178A (en) * | 2016-03-18 | 2017-09-28 | 한국전자통신연구원 | Method of detecting a misperception section in speech recognition on natural language |
CN107257996A (en) * | 2015-03-26 | 2017-10-17 | 英特尔公司 | The method and system of environment sensitive automatic speech recognition |
KR20180038707A (en) * | 2016-10-07 | 2018-04-17 | 한국전자통신연구원 | Method for recogniting speech using dynamic weight and topic information |
CN108364651A (en) * | 2017-01-26 | 2018-08-03 | 三星电子株式会社 | Audio recognition method and equipment |
KR20190059185A (en) * | 2017-11-22 | 2019-05-30 | 한양대학교 산학협력단 | Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students |
CN110634474A (en) * | 2019-09-24 | 2019-12-31 | 腾讯科技(深圳)有限公司 | Speech recognition method and device based on artificial intelligence |
US20200160838A1 (en) * | 2018-11-21 | 2020-05-21 | Samsung Electronics Co., Ltd. | Speech recognition method and apparatus |
CN111402895A (en) * | 2020-06-08 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Voice processing method, voice evaluating method, voice processing device, voice evaluating device, computer equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
LI Guanyu: "Research and Implementation of Large-Vocabulary Continuous Speech Recognition of Lhasa Tibetan Based on HTK", Journal of Northwest Minzu University (Natural Science Edition), no. 03 *
HUANG Hao; LI Binghu; Wushour Silamu: "Decision-Tree-Based Acoustic Context Modeling in Discriminative Model Combination", Acta Automatica Sinica, no. 09 *
Also Published As
Publication number | Publication date |
---|---|
CN112349289B (en) | 2023-12-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111933129B (en) | Audio processing method, language model training method and device and computer equipment | |
CN107195296B (en) | Voice recognition method, device, terminal and system | |
CN112349289B (en) | Voice recognition method, device, equipment and storage medium | |
US11373633B2 (en) | Text-to-speech processing using input voice characteristic data | |
KR102410914B1 (en) | Modeling apparatus for voice recognition and method and apparatus for voice recognition | |
JP3933750B2 (en) | Speech recognition method and apparatus using continuous density Hidden Markov model | |
US8019602B2 (en) | Automatic speech recognition learning using user corrections | |
EP4018437B1 (en) | Optimizing a keyword spotting system | |
CN109979432B (en) | Dialect translation method and device | |
TW201907388A (en) | Robust language identification method and system | |
Gaurav et al. | Development of application specific continuous speech recognition system in Hindi | |
WO2002101719A1 (en) | Voice recognition apparatus and voice recognition method | |
CN113327575B (en) | Speech synthesis method, device, computer equipment and storage medium | |
Shaikh Naziya et al. | Speech recognition system—a review | |
CN113327574B (en) | Speech synthesis method, device, computer equipment and storage medium | |
KR101014086B1 (en) | Voice processing device and method, and recording medium | |
CN112562676B (en) | Voice decoding method, device, equipment and storage medium | |
Liu et al. | AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning | |
Razavi et al. | Acoustic data-driven grapheme-to-phoneme conversion in the probabilistic lexical modeling framework | |
CN114974218A (en) | Voice conversion model training method and device and voice conversion method and device | |
KR20210081166A (en) | Spoken language identification apparatus and method in multilingual environment | |
Viacheslav et al. | System of methods of automated cognitive linguistic analysis of speech signals with noise | |
Kadyan et al. | Prosody features based low resource Punjabi children ASR and T-NT classifier using data augmentation | |
Biswas et al. | Speech recognition using weighted finite-state transducers | |
US20040006469A1 (en) | Apparatus and method for updating lexicon |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |