WO2021051628A1 - Method, apparatus and device for constructing speech recognition model, and storage medium - Google Patents

Method, apparatus and device for constructing speech recognition model, and storage medium

Info

Publication number
WO2021051628A1
Authority
WO
WIPO (PCT)
Prior art keywords: recognition model, speech recognition, training, voice information, output
Application number
PCT/CN2019/119128
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
贾雪丽
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date: 2019-09-19
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051628A1


Classifications

    • G10L 15/063: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/16: Speech recognition; speech classification or search using artificial neural networks
    • G10L 15/26: Speech recognition; speech-to-text systems
    • G10L 25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00; extracted parameters being spectral information of each sub-band

Definitions

  • This application relates to the field of intelligent decision-making, and in particular to a method, apparatus, device and storage medium for constructing a speech recognition model.
  • Speech recognition converts speech into text. With the continuous development of deep learning technology, the range of applications for speech recognition has grown ever wider.
  • Deep neural networks (DNNs) have become a research hotspot in automatic speech recognition. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have both achieved good results in building speech recognition models, and deep learning has become the mainstream approach to speech recognition.
  • In a deep neural network, the depth of the network is closely related to recognition accuracy, because a deep network can extract multi-level (low/mid/high-level) features: the more layers the network has, the richer the extracted features.
  • However, as the network level deepens, the "degradation phenomenon" of deep neural networks appears, causing the accuracy of speech recognition to saturate quickly; beyond that point, the deeper the network, the higher the error rate.
  • In addition, existing speech recognition models need to align the speech training samples before training, matching each frame of speech data with its corresponding label, to ensure that the loss function used during training can accurately estimate the model's training error.
  • The inventor realized that this alignment process is cumbersome and complicated, and consumes a great deal of time and cost.
  • In the examples of this application, features are extracted from unlabeled data and introduced into supervised learning, which expands the usable sample data, improves the utilization of unlabeled data, and raises the accuracy of model prediction.
  • In the first aspect, this application provides a method for constructing a speech recognition model, including:
  • acquiring a plurality of training speech samples, the training speech samples including speech information and text labels corresponding to the speech information;
  • constructing the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer contains a plurality of sequentially connected residual modules, and each residual module contains a plurality of sequentially connected hidden layers together with a bypass channel that bypasses those sequentially connected weight layers;
  • inputting the speech samples into the speech recognition model in turn, using the speech information and its corresponding text label as the model's input and output respectively, and continuously training the neuron weights of the speech recognition model through those inputs and outputs until all speech samples have been input, at which point training ends;
  • after training, taking the speech recognition model with the trained neuron weights as the target model;
  • evaluating the error of the target model by L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)), where L(S) is the error, x the speech information, z the text label, p(z|h(x)) the similarity between the predicted text and the text label, and S the set of training speech samples; and
  • adjusting the neuron weights of the target model until the error is below a threshold, setting those weights as the ideal weights, and deploying the target model and the ideal weights to the client.
  • In some possible designs, before inputting the speech samples into the speech recognition model, the method further includes:
  • processing the training speech information in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information, where the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames;
  • transforming the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information.
  • In some possible designs, the processing of the training speech information further includes: performing a discrete Fourier transform on the two-dimensional speech information to obtain its linear frequency spectrum X(k); and filtering the linear spectrum with a preset band-pass filter bank to obtain the target linear spectrum. When the center frequency of the m-th filter is f(m), the transfer function of the band-pass filter takes the triangular form
  H_m(k) = 0, for k < f(m−1);
  H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m);
  H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) ≤ k ≤ f(m+1);
  H_m(k) = 0, for k > f(m+1);
  with center frequencies f(m) = (N/f_s) · F_mel^{−1}(F_mel(f_l) + m·(F_mel(f_h) − F_mel(f_l))/(M+1)).
  • The band-pass filter bank includes multiple band-pass filters with triangular filtering characteristics; f_l is the lowest frequency and f_h the highest frequency of the filter bank's range, N is the DFT length, and f_s is the sampling frequency of the band-pass filter. The mel mapping is F_mel(f) = 1125·ln(1 + f/700), with inverse F_mel^{−1}(b) = 700·(e^{b/1125} − 1), b an integer.
  • The logarithmic energy of the target linear spectrum, S(m) = ln(Σ_{k=0}^{N−1} |X(k)|² · H_m(k)), 0 ≤ m ≤ M, is then computed to obtain the spectrogram.
  • In some possible designs, the fully connected layer includes a classification function δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, where j is a natural number. The classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, so that each element lies in (0, 1) and all elements sum to 1.
  • In some possible designs, if the input of the residual module is x and its output is y, the mathematical expression of the residual module is y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  • In some possible designs, adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method.
  • In the second aspect, this application provides an apparatus for constructing a speech recognition model, which has the function of implementing the method for constructing a speech recognition model provided in the first aspect. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
  • The apparatus for constructing a speech recognition model includes:
  • an acquisition module for acquiring a plurality of training speech samples, the training speech samples including speech information and text labels corresponding to the speech information;
  • a processing module for constructing the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer contains a plurality of sequentially connected residual modules, and each residual module contains a plurality of sequentially connected hidden layers together with a bypass channel that bypasses those weight layers;
  • the speech samples are input into the speech recognition model in turn through an input/output module, the speech information and its corresponding text label serving respectively as the model's input and output; the neuron weights are trained continuously through those inputs and outputs until all speech samples have been input, training ends, and the model with the trained neuron weights is taken as the target model;
  • the error of the target model is evaluated by L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)), where the predicted text is the text information calculated and output by the target model according to the neuron weights;
  • the neuron weights of the target model are adjusted until the error is below the threshold, the weights with error below the threshold are set as the ideal weights, and the target model and the ideal weights are deployed to the client.
  • In some possible designs, the processing module is further configured to:
  • process the training speech information in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information, where the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames; and
  • transform the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information.
  • In some possible designs, the processing module is further configured to: perform a discrete Fourier transform on the two-dimensional speech information to obtain its linear frequency spectrum X(k); filter the linear spectrum with the preset band-pass filter bank of triangular filters, using the transfer function H_m(k), center frequencies f(m), and mel mapping F_mel given above, to obtain the target linear spectrum; and compute the logarithmic energy of the target linear spectrum to obtain the spectrogram.
  • In some possible designs, the fully connected layer includes the classification function δ(z)_j defined above, which compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector whose elements each lie in (0, 1) and sum to 1.
  • In some possible designs, the processing module is further configured such that, with residual-module input x and output y, y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  • In some possible designs, adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method.
  • Another aspect of the present application provides a device for constructing a speech recognition model, which includes at least one processor, a memory, and an input/output unit connected to one another, where the memory stores program code and the processor calls the program code in the memory to execute the methods described in the above aspects.
  • Another aspect of the present application provides a computer storage medium storing computer instructions; when the computer instructions are run on a computer, the computer executes the steps of the methods described in the above aspects.
  • This application routes the input information x directly to the output of the hidden layers through a bypass channel. The bypass channel has no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper; the network as a whole only needs to learn the difference between input and output. That is, once the input information x is passed through, each residual module only learns the residual F(x), which simplifies the training objective and difficulty and makes the network stable and easy to train. As network depth increases, the performance of the speech recognition model gradually improves.
  • The CTC loss function is used to evaluate the model's predicted text, so there is no need to model a precise mapping between the pronunciation phonemes in the text labels and the sequence of the training speech information: the input sequence and the output sequence alone suffice to train the speech recognition model, saving the production cost of the training speech sample set.
  • In addition, a triangular band-pass filter is used to smooth the spectrum of the training speech information, eliminating harmonics and highlighting the formants of the original sound; this prevents the pitch of the speech from influencing the model's predicted text and reduces the amount of computation on speech information during recognition.
  • FIG. 1 is a schematic flowchart of a method for constructing a speech recognition model in an embodiment of this application;
  • FIG. 2 is a schematic structural diagram of an apparatus for constructing a speech recognition model in an embodiment of this application;
  • FIG. 3 is a schematic structural diagram of a device for constructing a speech recognition model in an embodiment of this application.
  • The terms "including" and "having" and their variants are intended to cover non-exclusive inclusion: a process, method, system, product, or device comprising a series of steps or modules is not necessarily limited to those expressly listed, and may include other steps or modules that are not expressly listed or that are inherent to it.
  • The division of modules in this application is only a logical division; in actual implementations there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be ignored or not implemented.
  • This application mainly provides the following technical solutions:
  • Referring to FIG. 1, the following is an example of a method for constructing a speech recognition model provided by this application. The method includes:
  • Acquire multiple training speech samples, where each training speech sample includes speech information and a text label corresponding to the speech information.
  • The text label is used to mark the pronunciation phonemes of the training speech information. For example, the recorded content is transcribed into text from the pre-recorded speech; the words in the text are numbered in order of appearance, and each word is annotated with its pronunciation phonemes to obtain the text label. Each pronunciation phoneme in the text label corresponds to one or more frames of data in the recording.
  • Construct the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer. The convolutional residual layer includes a plurality of sequentially connected residual stacked layers; each residual stacked layer contains multiple sequentially connected residual modules; and each residual module includes multiple sequentially connected hidden layers plus a bypass channel that bypasses those sequentially connected weight layers.
  • The independent convolutional layer extracts acoustic features from the speech information, suppresses non-maximum values in the acoustic features, and reduces their complexity. Acoustic features include the pronunciation of specific syllables, the user's liaison habits in continuous speech, and the speech spectrum.
  • The convolutional residual layer maps the acoustic features to the hidden-layer feature space. The fully connected layer integrates the features mapped into that space to obtain the meaning of the acoustic features and outputs the probability of each candidate text class accordingly. The output layer outputs the text corresponding to the speech information according to those class probabilities.
  • Bypass channels are added across several sequentially connected hidden layers to counter the problem that the training accuracy of traditional neural networks drops as the number of layers increases. The convolutional residual layer of the speech recognition model therefore has many bypass channels. A bypass channel serves as a branch of the hidden layers and realizes cross-layer connections between them: the input of a hidden layer is connected directly to a later layer, so that the later layer can learn the residual directly. A cross-layer connection generally spans only 2 to 3 hidden layers, though spanning more is not excluded; spanning a single hidden layer is of little significance, and experiments show the effect is not ideal. A sketch of such a network follows.
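  • A minimal sketch of how such a network could be assembled (assuming PyTorch; the layer sizes, kernel widths, and the use of 1-D convolutions are illustrative choices, not values taken from the patent; the identity bypass follows the summary's statement that the bypass channel carries no weights):

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """Sequentially connected hidden (weight) layers with a bypass channel."""
    def __init__(self, channels: int):
        super().__init__()
        self.hidden = nn.Sequential(                 # F(x, w_i): the stacked weight layers
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(self.hidden(x) + x)        # bypass channel carries x across the layers

class SpeechRecognitionModel(nn.Module):
    def __init__(self, n_features=40, channels=128, n_classes=100,
                 n_stacks=3, modules_per_stack=2):
        super().__init__()
        self.independent_conv = nn.Conv1d(n_features, channels, kernel_size=5, padding=2)
        self.conv_residual = nn.Sequential(          # sequentially connected residual stacked layers
            *[ResidualModule(channels) for _ in range(n_stacks * modules_per_stack)]
        )
        self.fully_connected = nn.Linear(channels, n_classes)

    def forward(self, x):                            # x: (batch, n_features, time)
        h = torch.relu(self.independent_conv(x))
        h = self.conv_residual(h)
        logits = self.fully_connected(h.transpose(1, 2))   # (batch, time, n_classes)
        return logits.log_softmax(dim=-1)            # output layer: per-frame log-probabilities
```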
  • Input the speech samples into the speech recognition model in turn, using the speech information and its corresponding text labels as the model's input and expected output respectively, and continuously train the neuron weights of the model through these inputs and outputs until all speech samples have been input, at which point training ends. After training, the speech recognition model with the trained neuron weights is taken as the target model.
  • Specifically, the neuron weights of the speech recognition model are randomly initialized; the training speech information is then used as the model's input and the corresponding text label as its output reference. The training speech information propagates forward through the model: using the initialized neurons of each layer, the model classifies the training speech information and finally produces the predicted text corresponding to it. The neuron weights are then updated according to the gap between the predicted text and the text label, and the next iteration proceeds, until the weights approach the required values.
  • Predicted text refers to the text information that the target model calculates and outputs according to its neuron weights once speech information is input.
  • The CTC loss function is used to estimate the degree of inconsistency between the predicted text output by the speech recognition model and the true text label. Its advantage is that it requires no forced alignment of input and output data. Unlike the cross-entropy criterion, which needs frame-level alignment between input features and target labels, the CTC loss function automatically learns the alignment between the speech data and the label sequence (for example, phonemes or characters), removing the need for forced alignment; the input data and the label also need not have the same length.
  • Because the CTC loss function evaluates the predicted text without a precise mapping between the pronunciation phonemes in the text label and the sequence of the training speech information, the input and output sequences alone suffice to train the model, saving the production cost of the training speech sample set.
  • The error over the training speech sample set is then calculated, and the gradient descent algorithm back-propagates the error through the speech recognition model to update target parameters such as the weights and thresholds, continuously improving the model's recognition accuracy until convergence. A sketch of this training step follows.
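  • A hedged illustration of this training step (assuming PyTorch and its built-in nn.CTCLoss; the tensor shapes and label sizes are invented for the example):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)                    # index 0 is reserved for the CTC blank symbol

batch, time_steps, n_classes = 4, 200, 60
# per-frame log-probabilities from the model, shaped (T, N, C) as nn.CTCLoss expects
log_probs = torch.randn(time_steps, batch, n_classes, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, n_classes, (batch, 30))           # phoneme/character label ids
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
target_lengths = torch.full((batch,), 30, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)  # L(S) = -sum ln p(z|h(x))
loss.backward()        # back-propagate the error so gradient descent can update the weights
```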
  • In some embodiments, before inputting the speech samples into the speech recognition model, the method further includes the following preprocessing. The training speech information is processed in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information; the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames. The sentences are then transformed according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information. A sketch of the framing step follows.
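  • A small sketch of the framing step (plain NumPy; the 25 ms frame length, 10 ms hop, and Hamming window are common defaults assumed here, not values specified by the patent):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D waveform into overlapping frames; adjacent frames share samples."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame (frame duration)
    hop_len = int(sample_rate * hop_ms / 1000)       # adjacent frames overlap by frame_len - hop_len
    assert len(signal) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)            # window each frame before the DFT
```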
  • The filter bank feature extraction proceeds as follows. A discrete Fourier transform is applied to the two-dimensional speech information to obtain its linear frequency spectrum X(k), and the linear spectrum is filtered by a preset band-pass filter bank to obtain the target linear spectrum. The band-pass filter bank includes multiple band-pass filters with triangular filtering characteristics; when the center frequency of the m-th filter is f(m), its transfer function is the triangular form
  H_m(k) = 0, for k < f(m−1);
  H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m);
  H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) ≤ k ≤ f(m+1);
  H_m(k) = 0, for k > f(m+1).
  • Here f_l is the lowest frequency and f_h the highest frequency of the filter bank's range, N is the length of the DFT, and f_s is the sampling frequency of the band-pass filter. The center frequencies are f(m) = (N/f_s) · F_mel^{−1}(F_mel(f_l) + m·(F_mel(f_h) − F_mel(f_l))/(M+1)), where F_mel(f) = 1125·ln(1 + f/700) and its inverse is F_mel^{−1}(b) = 700·(e^{b/1125} − 1), b an integer.
  • The logarithmic energy of the target linear spectrum, S(m) = ln(Σ_{k=0}^{N−1} |X(k)|² · H_m(k)), 0 ≤ m ≤ M, is then computed to obtain the spectrogram.
  • The human response to sound pressure is logarithmic, and humans are less sensitive to subtle changes at high sound pressure than at low sound pressure. Using logarithms reduces the sensitivity of the extracted features to variations in input sound energy: as the distance between the sound source and the microphone changes, the energy collected by the microphone changes too.
  • The spectrogram is a visual representation of the time-frequency distribution of sound energy, which effectively exploits the correlation between the time and frequency domains. The feature vector sequence obtained by analyzing the spectrogram captures acoustic features more effectively, so feeding it into the speech recognition model makes subsequent calculations more accurate.
  • A triangular band-pass filter is used to smooth the spectrum of the training speech information, eliminate its harmonics, and highlight the formants of the original sound. The pitch of a sound in the training speech information is therefore not reflected in the acoustic features; that is, the speech recognition model is not affected by pitch differences in the speech information when predicting text, and the amount of computation on speech information during recognition is reduced. A sketch implementing these formulas follows.
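  • A sketch of the filter bank computation implementing the formulas above (NumPy; the 40 filters, 512-point DFT, and 16 kHz sampling rate are illustrative assumptions):

```python
import numpy as np

def f_mel(f):
    """F_mel(f) = 1125 * ln(1 + f / 700)."""
    return 1125.0 * np.log(1.0 + f / 700.0)

def f_mel_inv(b):
    """Inverse mapping: F_mel^{-1}(b) = 700 * (e^(b/1125) - 1)."""
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def filterbank_spectrogram(frames, n_fft=512, f_s=16000, f_l=0.0, f_h=8000.0, M=40):
    """Log filter-bank energies S(m) = ln(sum_k |X(k)|^2 H_m(k)) per windowed frame."""
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2          # |X(k)|^2
    # M+2 equally mel-spaced points give the edges/centers f(0)..f(M+1), in DFT bins
    mel_points = np.linspace(f_mel(f_l), f_mel(f_h), M + 2)
    f = np.floor((n_fft / f_s) * f_mel_inv(mel_points)).astype(int)
    H = np.zeros((M, n_fft // 2 + 1))
    for m in range(1, M + 1):                                  # triangular transfer functions H_m(k)
        rise = np.arange(f[m - 1], f[m])
        fall = np.arange(f[m], f[m + 1])
        H[m - 1, rise] = (rise - f[m - 1]) / max(f[m] - f[m - 1], 1)
        H[m - 1, fall] = (f[m + 1] - fall) / max(f[m + 1] - f[m], 1)
    return np.log(power @ H.T + 1e-10)                         # S(m), one row per frame
```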
  • In this embodiment, the fully connected layer includes a classification function δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, where j is a natural number. The classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, so that each element lies in (0, 1) and all elements sum to 1.
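  • For instance, a direct NumPy rendering of this classification function:

```python
import numpy as np

def classify(z: np.ndarray) -> np.ndarray:
    """delta(z)_j = exp(z_j) / sum_k exp(z_k): elements in (0, 1), summing to 1."""
    e = np.exp(z - z.max())        # subtracting the max improves numerical stability
    return e / e.sum()
```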
  • If the input of the residual module is x and its output is y, the mathematical expression of the residual module is y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  • The speech recognition model in this embodiment adds bypass channels across several sequentially connected hidden layers to solve the problem that the training accuracy of traditional neural networks decreases as the number of network layers increases.
  • The convolutional residual layer of the speech recognition model has many bypass channels. A bypass channel serves as a branch of the hidden layers and realizes cross-layer connections between them: the input of a hidden layer is connected directly to a later layer, so that the later layer can learn the residual directly. A cross-layer connection generally spans only 2 to 3 hidden layers, though spanning more is not excluded; spanning a single hidden layer is of little significance, and experiments show the effect is not ideal. The neural network can be trained through the above formula.
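  • A sketch of this residual module with a weighted bypass, following y = F(x, w_i) + w_s·x (PyTorch; modeling w_s as a learnable scalar is one reading of the formula, while many residual networks instead use a 1x1 projection; ReLU(x) = max(0, x) serves as the activation inside F, as the summary states):

```python
import torch
import torch.nn as nn

class WeightedResidualModule(nn.Module):
    """y = F(x, w_i) + w_s * x, with ReLU activations inside F."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(                           # F(x, w_i): the stacked weight layers
            nn.Conv1d(channels, channels, 3, padding=1),
            nn.ReLU(),                                    # ReLU(x) = max(0, x)
            nn.Conv1d(channels, channels, 3, padding=1),
        )
        self.w_s = nn.Parameter(torch.ones(1))            # bypass weight w_s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x) + self.w_s * x                   # bypass channel adds w_s * x
```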
  • In this embodiment, adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method. The stochastic gradient descent algorithm effectively avoids redundant computation and consumes relatively little time; those skilled in the art may also use other algorithms.
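  • A minimal sketch of one such update step (assuming PyTorch's built-in SGD optimizer; the stand-in linear model, cross-entropy loss, and learning rate are placeholders chosen only to show the update mechanics, not the patent's CTC setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(40, 10)                        # stand-in for the speech model's weights
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 40)                           # a random mini-batch
target = torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(x), target)

optimizer.zero_grad()                            # clear gradients from the previous step
loss.backward()                                  # back-propagate the error
optimizer.step()                                 # w <- w - lr * grad
```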
  • FIG. 2 is a schematic structural diagram of an apparatus 20 for constructing a speech recognition model, which can be applied to constructing such a model. The apparatus in this embodiment of the present application can implement the steps of the method for constructing a speech recognition model executed in the embodiment corresponding to FIG. 1. The functions implemented by the apparatus 20 can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The apparatus may include an input/output module 201 and a processing module 202; for their functional implementation, refer to the operations performed in the embodiment corresponding to FIG. 1, which are not repeated here. The input/output module 201 can be used to control the apparatus's input, output, and acquisition operations.
  • The input/output module 201 may be used to acquire multiple training speech samples, where each training speech sample includes speech information and a text label corresponding to the speech information.
  • The processing module 202 may be used to construct the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer includes a plurality of sequentially connected residual modules, and each residual module includes a plurality of sequentially connected hidden layers together with a bypass channel bypassing those weight layers.
  • The speech samples are input into the speech recognition model in turn through the input/output module, the speech information and its corresponding text labels serving respectively as the model's input and output; the neuron weights are trained continuously through those inputs and outputs until all speech samples have been input, and the training of the speech recognition model ends. The trained model is taken as the target model, which calculates the output text information according to the neuron weights; the neuron weights of the target model are adjusted until the error is less than the threshold, the weights with error below the threshold are set as the ideal weights, and the target model and the ideal weights are deployed to the client.
  • In some embodiments, the processing module 202 is further configured to: process the training speech information in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information, where the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames; and transform the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information.
  • In some embodiments, the processing module 202 is further configured to: perform a discrete Fourier transform on the two-dimensional speech information to obtain its linear frequency spectrum X(k); filter the linear spectrum with the preset band-pass filter bank of triangular filters, using the transfer function H_m(k), center frequencies f(m), and mel mapping F_mel given above, to obtain the target linear spectrum; and compute its logarithmic energy to obtain the spectrogram.
  • In some embodiments, the fully connected layer includes the classification function δ(z)_j defined above, which compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector whose elements each lie in (0, 1) and sum to 1.
  • With residual-module input x and output y, y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  • Adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method.
  • The above describes the apparatus in the embodiments of the present application from the perspective of modular functional entities. From a hardware perspective, a device for constructing a speech recognition model includes: a processor, a memory, an input/output unit (which may also be a transceiver, not separately identified in FIG. 3), and a computer program stored in the memory and runnable on the processor. The computer program may be the program corresponding to the method for constructing a speech recognition model in the embodiment corresponding to FIG. 1. When the processor executes the computer program, each step of that method as executed by the apparatus 20 is implemented; alternatively, executing the computer program realizes the function of each module in the apparatus 20 for constructing a speech recognition model of the embodiment corresponding to FIG. 2.
  • The so-called processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), an off-the-shelf field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or any conventional processor. The processor is the control center of the computer device, connecting the parts of the entire device through various interfaces and lines.
  • The memory may be used to store the computer program and/or modules; the processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), while the data storage area may store data created according to the use of the device (such as audio data or video data). The memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • The input/output unit may also be replaced by a receiver and a transmitter, which may be the same or different physical entities; when they are the same physical entity, they may collectively be referred to as the input/output unit, and the input/output unit may be a transceiver. The memory may be integrated in the processor or provided separately from it.
  • The application also provides a computer storage medium, which may be a non-volatile or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions, and when the computer instructions are executed on a computer, the computer executes the following steps: acquiring multiple training speech samples, each including speech information and a text label corresponding to the speech information; constructing the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer includes a plurality of sequentially connected residual modules, and each residual module includes a plurality of sequentially connected hidden layers together with a bypass channel bypassing those weight layers; inputting the speech samples into the model in turn, using the speech information and its text labels as the model's input and output respectively, and continuously training the neuron weights until all speech samples have been input and training ends; and taking the speech recognition model with the trained neuron weights as the target model.


Abstract

A method, apparatus and device for constructing a speech recognition model, and a storage medium, relating to the field of artificial intelligence. The method comprises: obtaining a plurality of training speech samples (101); constructing a speech recognition model by means of an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer (102); inputting training speech information into the speech recognition model, updating the weights of the neurons in the speech recognition model with the speech information and the text label corresponding to the speech information by means of natural language processing (NLP), and thereby obtaining a target model (103); evaluating the error of the target model by means of L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)) (104); adjusting the weights of the neurons in the target model until the error is less than a threshold, and setting the weights of the neurons with error less than the threshold as the ideal weights (105); and deploying the target model and the ideal weights on a client (106). The method reduces the influence of the tone in the speech information on the predicted text, as well as the computational burden of the recognition process in the speech recognition model.

Description

Method, apparatus, device and storage medium for constructing a speech recognition model
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 19, 2019, with application number 201910884620.9 and the invention title "Method, apparatus, device and storage medium for constructing a speech recognition model", the entire contents of which are incorporated in this application by reference.
Technical field
This application relates to the field of intelligent decision-making, and in particular to a method, apparatus, device and storage medium for constructing a speech recognition model.
Background
Speech recognition converts speech into text. With the continuous development of deep learning technology, the range of applications for speech recognition has grown ever wider.

At present, deep neural networks (DNNs) have become a research hotspot in automatic speech recognition. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have both achieved good results in building speech recognition models, and deep learning has become the mainstream approach to speech recognition.

In a deep neural network, the depth of the network is closely related to recognition accuracy, because a deep network can extract multi-level (low/mid/high-level) features: the more layers the network has, the richer the extracted features. However, as the network level deepens, the "degradation phenomenon" of deep neural networks appears, causing the accuracy of speech recognition to saturate quickly; beyond that point, the deeper the network, the higher the error rate. In addition, existing speech recognition models need to align the speech training samples before training, matching each frame of speech data with its corresponding label, to ensure that the loss function used in training can accurately estimate the model's training error. However, the inventor realized that this alignment process is cumbersome and complicated, and consumes a great deal of time and cost.
Summary of the invention
In the examples of this application, features are extracted from unlabeled data and introduced into supervised learning, which expands the usable sample data, improves the utilization of unlabeled data, and raises the accuracy of model prediction.
In the first aspect, this application provides a method for constructing a speech recognition model, including:

acquiring a plurality of training speech samples, the training speech samples including speech information and text labels corresponding to the speech information;

constructing the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer contains a plurality of sequentially connected residual modules, and each residual module contains a plurality of sequentially connected hidden layers together with a bypass channel bypassing those sequentially connected weight layers;

inputting the speech samples into the speech recognition model in turn, using the speech information and its corresponding text labels as the model's input and output respectively, and continuously training the neuron weights of the model through those inputs and outputs until all speech samples have been input, at which point training ends and the model with the trained neuron weights is taken as the target model;

evaluating the error of the target model by L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)), where L(S) is the error, x the speech information, z the text label, p(z|h(x)) the similarity between the predicted text and the text label, and S the set of training speech samples; the predicted text is the text information calculated and output by the target model according to the neuron weights after the speech information is input;

adjusting the neuron weights of the target model until the error is less than a threshold, and setting the weights with error below the threshold as the ideal weights;

deploying the target model and the ideal weights to the client.
In some possible designs, before inputting the speech samples into the speech recognition model, the method further includes:

processing the training speech information in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information, where the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames;

transforming the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information.
In some possible designs, the processing of the training speech information in frames according to the preset framing parameters further includes:

performing a discrete Fourier transform on the two-dimensional speech information to obtain its linear frequency spectrum X(k);

filtering the linear spectrum with a preset band-pass filter bank to obtain the target linear spectrum. When the center frequency of the m-th filter is f(m), the transfer function of the band-pass filter is the triangular form

H_m(k) = 0, for k < f(m−1);
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0, for k > f(m+1);

with center frequencies

f(m) = (N / f_s) · F_mel^{−1}( F_mel(f_l) + m · (F_mel(f_h) − F_mel(f_l)) / (M+1) ).

The band-pass filter bank includes multiple band-pass filters with triangular filtering characteristics; f_l is the lowest frequency and f_h the highest frequency of the filter bank's range, N is the length of the DFT, and f_s is the sampling frequency of the band-pass filter. The mel mapping is F_mel(f) = 1125·ln(1 + f/700), and its inverse is F_mel^{−1}(b) = 700·(e^{b/1125} − 1), b an integer;

computing the logarithmic energy of the target linear spectrum, S(m) = ln( Σ_{k=0}^{N−1} |X(k)|² · H_m(k) ), 0 ≤ m ≤ M, to obtain the spectrogram, where X(k) is the linear spectrum.
In some possible designs, the fully connected layer includes a classification function δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, where j is a natural number. The classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, so that each element lies in (0, 1) and all elements sum to 1.
In some possible designs, if the input of the residual module is x and its output is y, the mathematical expression of the residual module is y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
In some possible designs, F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, where ReLU(x) = max(0, x).
In some possible designs, adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method.
In the second aspect, this application provides an apparatus for constructing a speech recognition model, which has the function of implementing the method for constructing a speech recognition model provided in the first aspect. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
The apparatus for constructing a speech recognition model includes:

an acquisition module for acquiring a plurality of training speech samples, the training speech samples including speech information and text labels corresponding to the speech information;

a processing module for constructing the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer contains a plurality of sequentially connected residual modules, and each residual module contains a plurality of sequentially connected hidden layers together with a bypass channel bypassing those weight layers. Through an input/output module, the speech samples are input into the model in turn, the speech information and its corresponding text labels serving respectively as the model's input and output; the neuron weights are trained continuously through those inputs and outputs until all speech samples have been input and training ends, and the model with the trained neuron weights is taken as the target model. The error of the target model is evaluated by L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)), where L(S) is the error, x the speech information, z the text label, p(z|h(x)) the similarity between the predicted text and the text label, and S the set of training speech samples; the predicted text is the text information calculated and output by the target model according to the neuron weights;

the neuron weights of the target model are adjusted until the error is less than the threshold, the weights with error below the threshold are set as the ideal weights, and the target model and the ideal weights are deployed to the client.
In some possible designs, the processing module is further configured to: process the training speech information in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information, where the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames; and transform the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information.
In some possible designs, the processing module is further configured to: perform a discrete Fourier transform on the two-dimensional speech information to obtain its linear frequency spectrum X(k); filter the linear spectrum with the preset band-pass filter bank of triangular filters, using the transfer function H_m(k), center frequencies f(m), and mel mapping F_mel defined in the first aspect, to obtain the target linear spectrum; and compute the logarithmic energy S(m), 0 ≤ m ≤ M, of the target linear spectrum to obtain the spectrogram.
In some possible designs, the fully connected layer includes the classification function δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j a natural number, which compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j whose elements each lie in (0, 1) and sum to 1.
In some possible designs, the processing module is further configured such that, with residual-module input x and output y, y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
In some possible designs, F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, where ReLU(x) = max(0, x).
In some possible designs, adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method.
Another aspect of the present application provides a device for constructing a speech recognition model, which includes at least one processor, a memory, and an input/output unit connected to one another, where the memory stores program code and the processor calls the program code in the memory to execute the methods described in the above aspects.
Another aspect of this application provides a computer storage medium storing computer instructions which, when run on a computer, cause the computer to execute the steps of the above method for constructing a speech recognition model.
In this application, the input information x is routed directly to the output of the hidden layers through a bypass channel. The bypass channel has no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper: the whole network only needs to learn the part where input and output differ, that is, after the input information x is passed through, each residual module learns only the residual F(x). This simplifies the training objective and its difficulty, and the network is stable and easy to train; as the depth of the network increases, the performance of the speech recognition model steadily improves. The predicted text of the speech recognition model is evaluated with the CTC loss function, so there is no need to consider a precise mapping between the pronunciation phonemes in the text labels and the sequence of the training voice information; only an input sequence and an output sequence are required to train the model, which saves the production cost of the training speech sample set. In addition, triangular band-pass filters are used to smooth the spectrum of the training voice information, eliminating its harmonics, highlighting the formants of the original sound, preventing the pitch of the voice information from affecting the text predicted by the speech recognition model, and reducing the amount of computation on the voice information during recognition.
Description of the Drawings
FIG. 1 is a schematic flowchart of a method for constructing a speech recognition model in an embodiment of this application;

FIG. 2 is a schematic structural diagram of an apparatus for constructing a speech recognition model in an embodiment of this application;

FIG. 3 is a schematic structural diagram of a device for constructing a speech recognition model in an embodiment of this application.
Detailed Description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it. The terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments described here can be implemented in an order other than that illustrated or described here. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those clearly listed, but may include other steps or modules that are not clearly listed or that are inherent to the process, method, product, or device. The division of modules in this application is only a logical division; there may be other divisions in actual implementation, for example, multiple modules may be combined or integrated into another system, or some features may be ignored or not implemented.
To solve the above technical problems, this application mainly provides the following technical solutions:

The input information x is routed directly to the output of the hidden layers through a bypass channel. The bypass channel has no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper: the whole network only needs to learn the part where input and output differ, that is, after the input information x is passed through, each residual module learns only the residual F(x). This simplifies the training objective and its difficulty, and the network is stable and easy to train; as the depth of the network increases, the performance of the speech recognition model steadily improves. The predicted text of the speech recognition model is evaluated with the CTC loss function, so there is no need to consider a precise mapping between the pronunciation phonemes in the text labels and the sequence of the training voice information; only an input sequence and an output sequence are required to train the model, which saves the production cost of the training speech sample set. In addition, triangular band-pass filters are used to smooth the spectrum of the training voice information, eliminating its harmonics, highlighting the formants of the original sound, preventing the pitch of the voice information from affecting the text predicted by the speech recognition model, and reducing the amount of computation on the voice information during recognition.
Referring to FIG. 1, a method for constructing a speech recognition model provided by this application is illustrated below by way of example. The method includes:

101. Obtain a plurality of training speech samples.

The training speech samples include voice information and text labels corresponding to the voice information.

The text labels are used to annotate the pronunciation phonemes of the training voice information.

The voice information is based on a pre-recorded voice: the recorded content is transcribed into text; the words in the text are numbered according to their order of occurrence, and each word is annotated according to its pronunciation phonemes to obtain a text label. Each pronunciation phoneme in the text label corresponds to one or more frames of data in the recording. A purely illustrative sketch of such a sample follows.
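The application does not prescribe a data layout, so every field name in the sketch below is a hypothetical choice; it only illustrates the labeling scheme just described.

```python
# A hedged sketch of one training speech sample; all field names are
# hypothetical and only illustrate the word/phoneme labeling above.
training_sample = {
    "recording": "speech_001.wav",        # the pre-recorded voice
    "words": ["hello", "world"],          # transcribed words, numbered in order
    "text_label": [
        {"word_id": 0, "phonemes": ["HH", "AH", "L", "OW"]},
        {"word_id": 1, "phonemes": ["W", "ER", "L", "D"]},
    ],  # each phoneme corresponds to one or more frames of the recording
}
```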
102. Construct a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer.

The convolutional residual layer includes a plurality of sequentially connected residual stack layers. Each residual stack layer contains a plurality of sequentially connected residual modules. Each residual module contains a plurality of sequentially connected hidden layers and a bypass channel that bypasses the plurality of sequentially connected weight layers.

The independent convolutional layer is used to extract acoustic features from the voice information, suppress non-maximum values in the acoustic features, and reduce the complexity of the acoustic features. The acoustic features include the pronunciation of specific syllables, the user's habits of linking words, the speech spectrum, and the like.

The convolutional residual layer is used to map the acoustic features to a hidden-layer feature space.

The fully connected layer is used to integrate the acoustic features mapped to the hidden-layer feature space so as to obtain their meaning, and to output the probabilities corresponding to the various text types according to that meaning.

The output layer is used to output the text corresponding to the voice information according to the probabilities corresponding to the various text types.

The speech recognition model in this embodiment adds bypass channels to several sequentially connected hidden layers, to solve the problem that the training accuracy of a traditional neural network drops as the number of layers increases. The convolutional residual layer of the speech recognition model has many bypass channels; each bypass channel acts as a branch of the hidden layers and implements a cross-layer connection between them, that is, the input of a hidden layer is connected directly to a later layer, so that the later layer can learn the residual directly.
Specifically, as shown in FIG. 2, within one residual module the cross-layer connection generally spans only 2 to 3 hidden layers, although spanning more hidden layers is not excluded. Spanning only 1 hidden layer is of little significance, and the experimental effect is not ideal.

Assume the input of the residual module is x and the expected output is H(x), that is, H(x) is the desired complex latent mapping, which is usually very difficult to learn. If the input x is passed directly to the output as the initial result, then the target that the residual module needs to learn becomes F(x) = H(x) - x. Thus, compared with a traditional neural network, the speech recognition model in this embodiment effectively changes the learning target: instead of learning a complete output, it learns the difference between the optimal solution H(x) and the identity mapping x, namely the residual F(x) = H(x) - x.

From the point of view of overall function, if {w_i} denotes all the weights of the residual module, the output actually computed by the residual module is:

y = F(x, {w_i}) + x

Taking a span of 2 hidden layers as an example and ignoring the bias, F(x, {w_i}) = w_2·δ(w_1·x) = w_2·ReLU(w_1·x), where the ReLU function is the activation function of the residual module.

It is understood that F(x, {w_i}) and x need to have the same dimension. If their dimensions differ, an extra weight matrix w_s can be introduced to linearly project x so that F(x, {w_i}) and x have the same dimension; accordingly, the result computed by the residual module is: y = F(x, {w_i}) + w_s·x
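The residual computation above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed layer types and sizes (1-D convolutions with kernel size 3), not the exact network of this application; the projection w_s is applied only when the dimensions of F(x, {w_i}) and x differ.

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """Residual module spanning two weight layers: y = F(x, {w_i}) + w_s * x.

    A sketch with assumed 1-D convolutions; kernel sizes and channel
    counts are illustrative, not specified by this application.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.w1 = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)
        self.w2 = nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1)
        # Extra weight matrix w_s: linear projection of x when the dimensions
        # of F(x, {w_i}) and x differ; identity (weight-free bypass) otherwise.
        self.w_s = (nn.Conv1d(in_ch, out_ch, kernel_size=1)
                    if in_ch != out_ch else nn.Identity())

    def forward(self, x):
        # F(x, {w_i}) = w2 * ReLU(w1 * x); the text ignores the bias term,
        # while nn.Conv1d keeps one by default.
        f = self.w2(torch.relu(self.w1(x)))
        return f + self.w_s(x)   # the bypass channel carries x across the layers
```

Because the bypass channel carries x through without weights, each module only has to fit the residual F(x) = H(x) - x rather than the full mapping H(x).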
Input the plurality of speech samples into the speech recognition model in turn, with the voice information and the text label corresponding to the voice information serving respectively as the input and the output of the speech recognition model, and continuously train the neuron weights of the speech recognition model through these inputs and outputs until all speech samples have been input into the speech recognition model, at which point the training of the speech recognition model ends. After training, the speech recognition model carrying the trained neuron weights is taken as the target model.

During training, the weights of the neurons inside the speech recognition model are randomly initialized; the training voice information is then used as the input of the speech recognition model, and the text label of the training voice information as its output reference. The training voice information is propagated forward through the speech recognition model, which uses the initialized neurons of each layer to classify the training voice information (at first essentially at random), finally producing the predicted text corresponding to the training voice information. The neuron weights are then updated according to the gap between the predicted text output by the speech recognition model and the text label, and the next iteration continues until the neuron weights approach the required values.
103. Evaluate the error of the target model by L(S) = -ln ∏_{(h(x),z)∈S} p(z|h(x)) = -∑_{(h(x),z)∈S} ln p(z|h(x)).

Here L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, and S is the set of training speech samples. The predicted text is the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model.

The CTC loss function is used to estimate the degree of inconsistency between the predicted text output by the speech recognition model and the true text label; its advantage is that it does not require the input data and the output data to be forcibly aligned. Unlike the cross-entropy criterion, which requires frame-level alignment between the input features and the target labels, the CTC loss function can automatically learn the alignment between the speech data and the label sequence (for example, phonemes or characters), which removes the need to force-align the data, and the input data and the labels need not have the same length. Evaluating the predicted text of the speech recognition model with the CTC loss function means there is no need to consider a precise mapping between the pronunciation phonemes in the text labels and the sequence of the training voice information; only an input sequence and an output sequence are required to train the model, saving the production cost of the training speech sample set.
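As a sketch, the CTC criterion is available off the shelf, for example as torch.nn.CTCLoss; the snippet below only illustrates that training needs the input sequence, the label sequence, and their lengths, with no frame-level alignment. The shapes and the vocabulary size are assumed for illustration.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)                    # index 0 reserved for the CTC blank

T, B, K = 100, 4, 30                         # assumed: frames, batch, vocab size
log_probs = torch.randn(T, B, K).log_softmax(2)   # stand-in for model output h(x)
targets = torch.randint(1, K, (B, 12))       # phoneme/character label sequences z
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

# L(S) = -sum over samples of ln p(z | h(x)); no forced alignment is needed,
# and the input length T need not equal the label length.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                              # gradients for the weight updates
```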
104. Adjust the neuron weights of the target model until the error is less than a threshold, and take the neuron weights whose error is less than the threshold as the ideal weights.

The error over the training speech sample set is computed from the CTC loss function and back-propagated through the speech recognition model by a gradient descent algorithm, thereby updating target parameters such as the weights and thresholds in the speech recognition model and continuously improving the accuracy of its speech recognition until the convergence requirement is reached.

105. Deploy the target model and the ideal weights to the client.

Compared with the prior art, this application routes the input information x directly to the output of the hidden layers through a bypass channel. The bypass channel has no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper: the whole network only needs to learn the part where input and output differ, that is, after the input information x is passed through, each residual module learns only the residual F(x). This simplifies the training objective and its difficulty, and the network is stable and easy to train; as the depth of the network increases, the performance of the speech recognition model steadily improves. The predicted text of the speech recognition model is evaluated with the CTC loss function, so there is no need to consider a precise mapping between the pronunciation phonemes in the text labels and the sequence of the training voice information; only an input sequence and an output sequence are required to train the model, which saves the production cost of the training speech sample set. In addition, triangular band-pass filters are used to smooth the spectrum of the training voice information, eliminating its harmonics, highlighting the formants of the original sound, preventing the pitch of the voice information from affecting the text predicted by the speech recognition model, and reducing the amount of computation on the voice information during recognition.
In some embodiments, before the plurality of speech samples are input into the speech recognition model, the method further includes:

processing the training voice information into frames according to preset framing parameters to obtain the sentences corresponding to the training voice information, the preset framing parameters including the frame duration, the number of frames, and the repeat duration of adjacent frames;

converting the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional voice information.
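A minimal sketch of the framing step follows, assuming a 25 ms frame duration and a 15 ms repeat (overlap) between adjacent frames; the concrete parameter values are illustrative only, as the text does not fix them.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, overlap_ms=15):
    """Split a 1-D waveform into overlapping frames.

    frame_ms and overlap_ms stand in for the preset framing parameters
    (frame duration and repeat duration of adjacent frames); the values
    here are assumptions, not mandated by the text.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = frame_len - int(sample_rate * overlap_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    return frames   # shape (n_frames, frame_len): a two-dimensional representation
```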
In some embodiments, processing the training voice information into frames according to the preset framing parameters includes:
performing a discrete Fourier transform on the two-dimensional voice information to obtain the linear spectrum X(k) corresponding to the two-dimensional voice information;

filtering the linear spectrum with a preset band-pass filter to obtain a target linear spectrum; when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\le f(m+1)\\ 0, & k>f(m+1)\end{cases}$$

and the expression of f(m) is:

$$f(m)=\frac{N}{f_s}\,F_{mel}^{-1}\!\left(F_{mel}(f_l)+m\cdot\frac{F_{mel}(f_h)-F_{mel}(f_l)}{M+1}\right)$$

The band-pass filter comprises a plurality of band-pass filters with triangular filtering characteristics, where f_l is the lowest frequency of the band-pass filter's frequency range, f_h is the highest frequency of that range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, M is the number of triangular band-pass filters, and the F_mel function is F_mel = 1125 ln(1 + f/700), whose inverse is:

$$F_{mel}^{-1}(b)=700\left(e^{b/1125}-1\right)$$

where b is an integer;

computing the logarithmic energy corresponding to the target linear spectrum according to

$$S(m)=\ln\!\left(\sum_{k=0}^{N-1}\lvert X(k)\rvert^{2}\,H_m(k)\right),\qquad 0\le m\le M$$

to obtain the spectrogram, where X(k) is the linear spectrum.
In the above embodiments, the human response to sound pressure is logarithmic, and humans are less sensitive to subtle changes at high sound pressure than at low sound pressure. In addition, using the logarithm reduces the sensitivity of the extracted features to variations in the energy of the input sound: since the distance between the sound source and the microphone varies, the sound energy collected by the microphone varies as well. The spectrogram is a visual representation of the time-frequency distribution of sound energy that effectively exploits the correlation between the time and frequency domains; the feature vector sequence obtained from spectrogram analysis works better for acoustic feature extraction, and feeding it into the speech recognition model makes the subsequent computation more accurate. Moreover, triangular band-pass filters are used to smooth the spectrum of the training voice information, eliminating its harmonics and highlighting the formants of the original sound. Therefore the tone or pitch of a passage in the training voice information is not reflected in the acoustic features; in other words, the text predicted by the speech recognition model is not affected by pitch differences in the voice information, and the amount of computation on the voice information during recognition is reduced.
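Putting the formulas of this embodiment together, a minimal NumPy sketch of the filter-bank feature extraction might look as follows; the DFT length, the number of filters M, and the frequency range are assumed values, not prescribed by the text.

```python
import numpy as np

def mel(f):                     # F_mel(f) = 1125 * ln(1 + f / 700)
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(b):                 # inverse: 700 * (exp(b / 1125) - 1)
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def log_mel_spectrogram(frames, fs=16000, n_fft=512, M=26, f_l=0.0, f_h=8000.0):
    # Linear spectrum X(k) of each frame via the DFT
    X = np.fft.rfft(frames, n=n_fft)                 # (n_frames, n_fft//2 + 1)
    power = np.abs(X) ** 2

    # Center frequencies f(m), m = 0..M+1, equally spaced on the mel scale
    mels = np.linspace(mel(f_l), mel(f_h), M + 2)
    f_m = np.floor((n_fft / fs) * mel_inv(mels)).astype(int)

    # Triangular transfer functions H_m(k), then log energies S(m)
    H = np.zeros((M, n_fft // 2 + 1))
    for m in range(1, M + 1):
        H[m - 1, f_m[m - 1]:f_m[m]] = (
            (np.arange(f_m[m - 1], f_m[m]) - f_m[m - 1]) / (f_m[m] - f_m[m - 1]))
        H[m - 1, f_m[m]:f_m[m + 1]] = (
            (f_m[m + 1] - np.arange(f_m[m], f_m[m + 1])) / (f_m[m + 1] - f_m[m]))
    return np.log(power @ H.T + 1e-10)               # spectrogram features
```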
In some embodiments, the fully connected layer includes a classification function, namely

$$\delta(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}},\qquad j=1,\dots,K,$$

where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, such that every element lies in (0, 1) and all elements sum to 1.
In some embodiments, with x denoting the input of the residual module and y its output, the mathematical expression of the residual module is:

y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.

In the above embodiments, the speech recognition model adds bypass channels to several sequentially connected hidden layers, to solve the problem that the training accuracy of a traditional neural network drops as the number of layers increases. The convolutional residual layer of the speech recognition model has many bypass channels; each bypass channel acts as a branch of the hidden layers and implements a cross-layer connection between them, that is, the input of a hidden layer is connected directly to a later layer, so that the later layer can learn the residual directly.
Specifically, within one residual module the cross-layer connection generally spans only 2 to 3 hidden layers, although spanning more hidden layers is not excluded. Spanning only 1 hidden layer is of little significance, and the experimental effect is not ideal.

Assume the input of the residual module is x and the expected output is H(x), that is, H(x) is the desired complex latent mapping, which is usually very difficult to learn. If the input x is passed directly to the output as the initial result, then the target that the residual module needs to learn becomes F(x) = H(x) - x. Thus, compared with a traditional neural network, the speech recognition model in this embodiment effectively changes the learning target: instead of learning a complete output, it learns the difference between the optimal solution H(x) and the identity mapping x, namely the residual F(x) = H(x) - x. From the point of view of overall function, if {w_i} denotes all the weights of the residual module, the output actually computed by the residual module is y = F(x, {w_i}) + x; taking a span of 2 hidden layers as an example and ignoring the bias, F(x, {w_i}) = w_2·δ(w_1·x) = w_2·ReLU(w_1·x), where ReLU() is the activation function of the residual module.

It is understood that F(x, {w_i}) and x need to have the same dimension. If their dimensions differ, an extra weight matrix w_s can be introduced to linearly project x so that F(x, {w_i}) and x have the same dimension; accordingly, the result computed by the residual module is: y = F(x, {w_i}) + w_s·x
In some embodiments, F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, and the mathematical expression of the ReLU function is ReLU(x) = max(0, x).

In the above embodiments, the neural network can be trained through the above formulas.
In some embodiments, adjusting the neuron weights of the target model includes:

adjusting the neuron weights by stochastic gradient descent.

In the above embodiments, the stochastic gradient descent algorithm effectively avoids redundant computation and takes less time; of course, those skilled in the art may also use other algorithms.
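A minimal sketch of one stochastic gradient descent update; the learning rate and parameter shapes are assumed for illustration.

```python
import numpy as np

learning_rate = 0.01                        # assumed hyperparameter
weights = [np.random.randn(4, 4)]
gradients = [np.random.randn(4, 4)]         # stand-in for dL/dw on one mini-batch

# Stochastic gradient descent: w <- w - lr * dL/dw, per randomly drawn mini-batch
for w, g in zip(weights, gradients):
    w -= learning_rate * g
```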
FIG. 2 is a schematic structural diagram of an apparatus 20 for constructing a speech recognition model, which can be applied to constructing a speech recognition model. The apparatus for constructing a speech recognition model in this embodiment of the application can implement the steps of the method for constructing a speech recognition model executed in the embodiment corresponding to FIG. 1 above. The functions implemented by the apparatus 20 may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The apparatus for constructing a speech recognition model may include an input/output module 201 and a processing module 202; for the functional implementation of the processing module 202 and the input/output module 201, reference may be made to the operations executed in the embodiment corresponding to FIG. 1, which are not repeated here. The input/output module 201 may be used to control the input, output, and acquisition operations of the input/output module 201.

In some embodiments, the input/output module 201 may be used to obtain a plurality of training speech samples, the training speech samples including voice information and text labels corresponding to the voice information.

The processing module 202 may be used to construct a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, the convolutional residual layer including a plurality of sequentially connected residual stack layers, each residual stack layer containing a plurality of sequentially connected residual modules, and each residual module containing a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers; to input the plurality of speech samples into the speech recognition model in turn through the input/output module, with the voice information and the text label corresponding to the voice information serving respectively as the input and the output of the speech recognition model, and to continuously train the neuron weights of the speech recognition model through these inputs and outputs until all speech samples have been input into the speech recognition model, ending the training of the speech recognition model, after which the speech recognition model carrying the trained neuron weights is taken as the target model; to evaluate the error of the target model by L(S) = -ln ∏_{(h(x),z)∈S} p(z|h(x)) = -∑_{(h(x),z)∈S} ln p(z|h(x)), where L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the set of training speech samples, and the predicted text is the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model; to adjust the neuron weights of the target model until the error is less than a threshold, taking the neuron weights whose error is less than the threshold as the ideal weights; and to deploy the target model and the ideal weights to the client.
一些实施方式中,所述处理模块202还用于:In some implementation manners, the processing module 202 is further configured to:
根据预设的分帧参数分帧处理所述训练语音信息,得到所述训练语音信息对应的语句,所述预设分帧参数包括帧时长、帧数和前后帧重复时长;Processing the training voice information in frames according to preset framing parameters to obtain sentences corresponding to the training voice information, and the preset framing parameters include frame duration, number of frames, and repetition duration of the preceding and following frames;
根据预设的二维参数和滤波器组特征提取算法转化所述语句,得到二维语音信息。The sentence is transformed according to the preset two-dimensional parameters and the filter bank feature extraction algorithm to obtain two-dimensional voice information.
一些实施方式中,所述处理模块202还用于:In some implementation manners, the processing module 202 is further configured to:
对所述二维语音信息进行离散傅里叶变换,以得到所述二维语音信息对应的线性频谱X(k);Performing discrete Fourier transform on the two-dimensional voice information to obtain a linear frequency spectrum X(k) corresponding to the two-dimensional voice information;
通过预设的带通滤波器对所述线性频谱滤波,以得到目标线性频谱,当所述带通滤波器的中心频率为f(m)时,则所述带通滤波器的传递函数为:The linear frequency spectrum is filtered by a preset band-pass filter to obtain the target linear frequency spectrum. When the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:
Figure PCTCN2019119128-appb-000016
所述f(m)的表达式为:
Figure PCTCN2019119128-appb-000016
The expression of f(m) is:
Figure PCTCN2019119128-appb-000017
Figure PCTCN2019119128-appb-000017
所述带通滤波器包括多个具有三角形滤波特性的带通滤波器,所述f l为所述带通滤波器频率范围的最低频率,所述f h为所述带通滤波器频率范围的最高频率,所述N为DFT时的长度,所述f s为所述带通滤波器的采样频率,所述F mel函数为F mel=1125ln(1+f/700),所述Fmel的逆函数为:
Figure PCTCN2019119128-appb-000018
b为整数;
The band-pass filter includes a plurality of band-pass filters with triangular filtering characteristics, the f l is the lowest frequency in the frequency range of the band-pass filter, and the f h is the frequency range of the band-pass filter. The highest frequency, the N is the length of DFT, the f s is the sampling frequency of the band-pass filter, the F mel function is F mel =1125ln(1+f/700), the inverse of Fmel The function is:
Figure PCTCN2019119128-appb-000018
b is an integer;
根据
Figure PCTCN2019119128-appb-000019
0≤m≤M计算所述目标线性频谱对应的对数能量,得到语谱图,所述X(k)为所述线性频谱;
according to
Figure PCTCN2019119128-appb-000019
0≤m≤M calculate the logarithmic energy corresponding to the target linear frequency spectrum to obtain a spectrogram, and the X(k) is the linear frequency spectrum;
In some embodiments, the fully connected layer includes a classification function, namely

$$\delta(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}},\qquad j=1,\dots,K,$$

where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, such that every element lies in (0, 1) and all elements sum to 1.
In some embodiments, with x denoting the input of the residual module and y its output, the mathematical expression of the residual module is: y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.

In some embodiments, F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, and the mathematical expression of the ReLU function is ReLU(x) = max(0, x).

In some embodiments, adjusting the neuron weights of the target model includes:

adjusting the neuron weights by stochastic gradient descent.
The above describes the apparatus in this embodiment of the application from the perspective of modular functional entities. The following describes a device for constructing a speech recognition model from the hardware perspective. As shown in FIG. 3, the device includes: a processor, a memory, an input/output unit (which may also be a transceiver, not marked in FIG. 3), and a computer program stored in the memory and runnable on the processor. For example, the computer program may be the program corresponding to the method for constructing a speech recognition model in the embodiment corresponding to FIG. 1. When the device for constructing a speech recognition model implements the functions of the apparatus 20 shown in FIG. 2, the processor, when executing the computer program, implements the steps of the method for constructing a speech recognition model executed by the apparatus 20 in the embodiment corresponding to FIG. 2; alternatively, the processor, when executing the computer program, implements the functions of the modules of the apparatus 20 of the embodiment corresponding to FIG. 2.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor. The processor is the control center of the computer apparatus and connects all parts of the whole computer apparatus using various interfaces and lines.

The memory may be used to store the computer program and/or modules; the processor implements the various functions of the computer apparatus by running or executing the computer program and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the handset (such as audio data or video data). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

The input/output unit may also be replaced by a receiver and a transmitter, which may be the same or different physical entities; when they are the same physical entity, they may be collectively referred to as the input/output unit. The input/output unit may be a transceiver.

The memory may be integrated in the processor, or may be provided separately from the processor.
This application further provides a computer storage medium. The computer storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to execute the following steps:

obtaining a plurality of training speech samples, the training speech samples including voice information and text labels corresponding to the voice information;

constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, the convolutional residual layer including a plurality of sequentially connected residual stack layers, each residual stack layer containing a plurality of sequentially connected residual modules, and each residual module containing a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers;

inputting the plurality of speech samples into the speech recognition model in turn, with the voice information and the text label corresponding to the voice information serving respectively as the input and the output of the speech recognition model, and continuously training the neuron weights of the speech recognition model through these inputs and outputs until all speech samples have been input into the speech recognition model, ending the training of the speech recognition model, after which the speech recognition model carrying the trained neuron weights is taken as the target model;

evaluating the error of the target model by L(S) = -ln ∏_{(h(x),z)∈S} p(z|h(x)) = -∑_{(h(x),z)∈S} ln p(z|h(x)), where L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the set of training speech samples, and the predicted text is the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model;

adjusting the neuron weights of the target model until the error is less than a threshold, and taking the neuron weights whose error is less than the threshold as the ideal weights;

deploying the target model and the ideal weights to the client.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM) and including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.

The embodiments of this application have been described above with reference to the accompanying drawings, but this application is not limited to the above specific implementations, which are merely illustrative rather than restrictive. Under the inspiration of this application, those of ordinary skill in the art may devise many further forms without departing from the purpose of this application and the scope protected by the claims; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, whether used directly or indirectly in other related technical fields, falls within the protection of this application.

Claims (20)

1. A method for constructing a speech recognition model, the method comprising:

obtaining a plurality of training speech samples, the training speech samples including voice information and text labels corresponding to the voice information;

constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, the convolutional residual layer including a plurality of sequentially connected residual stack layers, each residual stack layer containing a plurality of sequentially connected residual modules, and each residual module containing a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers;

inputting the plurality of speech samples into the speech recognition model in turn, with the voice information and the text label corresponding to the voice information serving respectively as the input and the output of the speech recognition model, and continuously training the neuron weights of the speech recognition model through these inputs and outputs until all speech samples have been input into the speech recognition model, ending the training of the speech recognition model, after which the speech recognition model carrying the trained neuron weights is taken as a target model;

evaluating the error of the target model by L(S) = -ln ∏_{(h(x),z)∈S} p(z|h(x)) = -∑_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training speech samples, and the predicted text is the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model;

adjusting the neuron weights of the target model until the error is less than a threshold, and taking the neuron weights whose error is less than the threshold as ideal weights;

deploying the target model and the ideal weights to a client.
2. The method according to claim 1, wherein before inputting the plurality of speech samples into the speech recognition model, the method further comprises:

processing the training voice information into frames according to preset framing parameters to obtain sentences corresponding to the training voice information, the preset framing parameters including a frame duration, a number of frames, and a repeat duration of adjacent frames;

converting the sentences according to preset two-dimensional parameters and a filter-bank feature extraction to obtain two-dimensional voice information.
3. The method according to claim 2, wherein processing the training voice information into frames according to the preset framing parameters comprises:

performing a discrete Fourier transform on the two-dimensional voice information to obtain the linear spectrum X(k) corresponding to the two-dimensional voice information;

filtering the linear spectrum with a preset band-pass filter to obtain a target linear spectrum, wherein when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\le f(m+1)\\ 0, & k>f(m+1)\end{cases}$$

and the expression of f(m) is:

$$f(m)=\frac{N}{f_s}\,F_{mel}^{-1}\!\left(F_{mel}(f_l)+m\cdot\frac{F_{mel}(f_h)-F_{mel}(f_l)}{M+1}\right)$$

wherein the band-pass filter comprises a plurality of band-pass filters with triangular filtering characteristics, f_l is the lowest frequency of the band-pass filter's frequency range, f_h is the highest frequency of that range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, M is the number of triangular band-pass filters, the F_mel function is F_mel = 1125 ln(1 + f/700), and the inverse of F_mel is:

$$F_{mel}^{-1}(b)=700\left(e^{b/1125}-1\right)$$

b being an integer;

computing the logarithmic energy corresponding to the target linear spectrum according to

$$S(m)=\ln\!\left(\sum_{k=0}^{N-1}\lvert X(k)\rvert^{2}\,H_m(k)\right),\qquad 0\le m\le M$$

to obtain a spectrogram, wherein X(k) is the linear spectrum.
4. The method according to claim 1, wherein the fully connected layer includes a classification function, namely

$$\delta(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}},\qquad j=1,\dots,K,$$

wherein j is a natural number, and the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, such that every element lies in (0, 1) and all elements sum to 1.
5. The method according to claim 1, wherein, with x denoting the input of the residual module and y its output, the mathematical expression of the residual module is:

y = F(x, w_i) + w_s·x, wherein F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
6. The method according to claim 5, wherein F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, and the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
7. The method according to claim 1, wherein adjusting the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by stochastic gradient descent.
8. An apparatus for constructing a speech recognition model, the apparatus comprising:

an input/output module, configured to obtain a plurality of training speech samples, the training speech samples including voice information and text labels corresponding to the voice information;

a processing module, configured to construct a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, the convolutional residual layer including a plurality of sequentially connected residual stack layers, each residual stack layer containing a plurality of sequentially connected residual modules, and each residual module containing a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers; to input the plurality of speech samples into the speech recognition model in turn through the input/output module, with the voice information and the text label corresponding to the voice information serving respectively as the input and the output of the speech recognition model, and to continuously train the neuron weights of the speech recognition model through these inputs and outputs until all speech samples have been input into the speech recognition model, ending the training of the speech recognition model, after which the speech recognition model carrying the trained neuron weights is taken as a target model; and to evaluate the error of the target model by L(S) = -ln ∏_{(h(x),z)∈S} p(z|h(x)) = -∑_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training speech samples, and the predicted text is the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model;

the processing module being further configured to adjust the neuron weights of the target model until the error is less than a threshold, taking the neuron weights whose error is less than the threshold as ideal weights, and to deploy the target model and the ideal weights to a client.
9. The apparatus for constructing a speech recognition model according to claim 8, wherein the processing module is further configured to:

process the training voice information into frames according to preset framing parameters to obtain sentences corresponding to the training voice information, the preset framing parameters including a frame duration, a number of frames, and a repeat duration of adjacent frames;

convert the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional voice information.
  10. 根据权利要求9所述构建语音识别模型的装置,所述处理模块还用于:According to the apparatus for constructing a speech recognition model according to claim 9, the processing module is further configured to:
    对所述二维语音信息进行离散傅里叶变换,以得到所述二维语音信息对应的线性频谱X(k);Performing discrete Fourier transform on the two-dimensional voice information to obtain a linear frequency spectrum X(k) corresponding to the two-dimensional voice information;
    filter the linear frequency spectrum through preset band-pass filters to obtain a target linear frequency spectrum, wherein, when the center frequency of a band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$$

    and the expression for f(m) is:

$$f(m) = \left(\frac{N}{f_s}\right) F_{mel}^{-1}\!\left(F_{mel}(f_l) + m\,\frac{F_{mel}(f_h) - F_{mel}(f_l)}{M+1}\right);$$
    the band-pass filters comprise a plurality of band-pass filters with triangular filtering characteristics, $f_l$ is the lowest frequency of the band-pass filter frequency range, $f_h$ is the highest frequency of the band-pass filter frequency range, N is the DFT length, $f_s$ is the sampling frequency of the band-pass filter, the $F_{mel}$ function is $F_{mel} = 1125\ln(1 + f/700)$, and the inverse of $F_{mel}$ is:

$$F_{mel}^{-1}(b) = 700\left(e^{b/1125} - 1\right),$$

    where b is an integer;
    compute the logarithmic energy corresponding to the target linear frequency spectrum according to

$$S(m) = \ln\!\left(\sum_{k=0}^{N-1} |X(k)|^2\, H_m(k)\right), \quad 0 \le m \le M,$$

    to obtain a spectrogram, where X(k) is the linear frequency spectrum.
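    Purely as an illustrative sketch of claims 9 and 10, the pipeline from a single frame to its log filter-bank energies S(m) can be written as below. The sampling rate, FFT size, filter count, and frequency range are hypothetical parameter choices, not values fixed by the claims.

```python
import numpy as np

def mel(f):
    # F_mel = 1125 ln(1 + f/700)
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(b):
    # Inverse of F_mel: 700 (e^(b/1125) - 1)
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def log_filterbank_energies(frame, sample_rate=16000, n_fft=512,
                            n_filters=26, f_low=0.0, f_high=8000.0):
    """Compute S(m) = ln(sum_k |X(k)|^2 H_m(k)) for one frame."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2   # |X(k)|^2 from the DFT

    # Center frequencies f(m), m = 0..M+1, equally spaced on the mel
    # scale between f_l and f_h, mapped back to DFT bin indices.
    mels = np.linspace(mel(f_low), mel(f_high), n_filters + 2)
    bins = np.floor((n_fft / sample_rate) * mel_inv(mels)).astype(int)

    energies = np.empty(n_filters)
    for m in range(1, n_filters + 1):
        h = np.zeros(len(power))                     # triangular H_m(k)
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        h[left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        h[center:right] = (right - np.arange(center, right)) / max(right - center, 1)
        energies[m - 1] = np.log(np.dot(power, h) + 1e-12)
    return energies
```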
  11. The apparatus for constructing a speech recognition model according to claim 8, wherein the fully connected layer includes a classification function defined as

$$\delta(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}},$$

    where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, such that each element lies in the range (0, 1) and all elements sum to 1.
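    The classification function of claim 11 matches the standard softmax. A brief sketch follows; the max-subtraction is added purely for numerical stability and does not change the result.

```python
import numpy as np

def softmax(z):
    """Map a K-dimensional vector z to a K-dimensional real vector
    whose elements lie in (0, 1) and sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659 0.242 0.099]
```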
  12. The apparatus for constructing a speech recognition model according to claim 8, wherein the processing module is further configured such that, when the input of the residual module is x and the output of the residual module is y, the residual module is expressed mathematically as:
    y = F(x, w_i) + w_s x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  13. The apparatus for constructing a speech recognition model according to claim 12, wherein F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, the ReLU function being expressed mathematically as ReLU(x) = max(0, x).
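    As an illustrative sketch of claims 12 and 13 only: modeling F(x, w_i) as two stacked weight layers with ReLU activations is a hypothetical minimal choice (the claims say only that the module stacks several hidden layers), and w_s here projects x on the bypass channel so that the shapes match.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # ReLU(x) = max(0, x)

def residual_module(x, w1, w2, w_s):
    """y = F(x, w_i) + w_s x, with F modeled as two weight layers."""
    f = relu(w2 @ relu(w1 @ x))        # F(x, w_i): the residual branch
    return f + w_s @ x                 # bypass channel added back

# Example: random weights mapping a 4-dim input to a 3-dim output.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
y = residual_module(x, rng.normal(size=(4, 4)),
                    rng.normal(size=(3, 4)), rng.normal(size=(3, 4)))
```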
  14. The apparatus for constructing a speech recognition model according to claim 8, wherein adjusting the neuron weights of the target model comprises:
    adjusting the neuron weights by stochastic gradient descent.
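    A minimal sketch of the weight adjustment of claim 14: repeat stochastic gradient descent updates until the evaluated error falls below the threshold. The callback names, learning rate, and step cap are hypothetical.

```python
def sgd_step(weights, gradients, learning_rate=0.01):
    """One stochastic gradient descent update: each weight moves
    against its gradient, estimated from a randomly drawn sample."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

def train_until_threshold(weights, grad_fn, error_fn, threshold,
                          learning_rate=0.01, max_steps=100_000):
    """Adjust the weights until error_fn(weights) < threshold."""
    for _ in range(max_steps):
        if error_fn(weights) < threshold:
            break
        weights = sgd_step(weights, grad_fn(weights), learning_rate)
    return weights
```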
  15. A device for constructing a speech recognition model, the device comprising: at least one processor, a memory, and an input/output unit;
    wherein the memory is configured to store program code, and the processor is configured to call the program code stored in the memory to perform the following steps:
    acquiring a plurality of training voice samples, the training voice samples including voice information and text labels corresponding to the voice information;
    constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, wherein the convolutional residual layer comprises a plurality of sequentially connected residual stacking layers, each residual stacking layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers;
    inputting the plurality of voice samples into the speech recognition model in sequence, using the voice information and the text label corresponding to the voice information as the input and the output of the speech recognition model respectively, and continuously training the neuron weights of the speech recognition model through the input and the output until all voice samples have been input into the speech recognition model, whereupon training of the speech recognition model ends; after training ends, taking the speech recognition model with the trained neuron weights as a target model;
    evaluating the error of the target model by

$$L(S) = -\ln \prod_{(h(x),\,z)\in S} p(z \mid h(x)) = -\sum_{(h(x),\,z)\in S} \ln p(z \mid h(x)),$$

    where L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, and S is the plurality of training voice samples; the predicted text refers to the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model;
    adjusting the neuron weights of the target model until the error is less than a threshold, and setting the neuron weights for which the error is less than the threshold as ideal weights;
    deploying the target model and the ideal weights to a client.
  16. The device for constructing a speech recognition model according to claim 15, wherein the processor is configured to call the program code stored in the memory to perform, before inputting the plurality of voice samples into the speech recognition model, the following steps:
    framing the training voice information according to preset framing parameters to obtain sentences corresponding to the training voice information, the preset framing parameters including a frame duration, a number of frames, and a repetition duration shared by adjacent frames;
    transforming the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional voice information.
  17. The device for constructing a speech recognition model according to claim 16, wherein the processor is configured to call the program code stored in the memory to perform, when framing the training voice information according to the preset framing parameters, the following steps:
    performing a discrete Fourier transform on the two-dimensional voice information to obtain a linear frequency spectrum X(k) corresponding to the two-dimensional voice information;
    filtering the linear frequency spectrum through preset band-pass filters to obtain a target linear frequency spectrum, wherein, when the center frequency of a band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$$

    and the expression for f(m) is:

$$f(m) = \left(\frac{N}{f_s}\right) F_{mel}^{-1}\!\left(F_{mel}(f_l) + m\,\frac{F_{mel}(f_h) - F_{mel}(f_l)}{M+1}\right);$$

    the band-pass filters comprise a plurality of band-pass filters with triangular filtering characteristics, $f_l$ is the lowest frequency of the band-pass filter frequency range, $f_h$ is the highest frequency of the band-pass filter frequency range, N is the DFT length, $f_s$ is the sampling frequency of the band-pass filter, the $F_{mel}$ function is $F_{mel} = 1125\ln(1 + f/700)$, and the inverse of $F_{mel}$ is:

$$F_{mel}^{-1}(b) = 700\left(e^{b/1125} - 1\right),$$

    where b is an integer;
    computing the logarithmic energy corresponding to the target linear frequency spectrum according to

$$S(m) = \ln\!\left(\sum_{k=0}^{N-1} |X(k)|^2\, H_m(k)\right), \quad 0 \le m \le M,$$

    to obtain a spectrogram, where X(k) is the linear frequency spectrum.
  18. The device for constructing a speech recognition model according to claim 15, wherein the fully connected layer includes a classification function defined as

$$\delta(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}},$$

    where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, such that each element lies in the range (0, 1) and all elements sum to 1.
  19. The device for constructing a speech recognition model according to claim 15, wherein the input of the residual module is x, the output of the residual module is y, and the residual module is expressed mathematically as:
    y = F(x, w_i) + w_s x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  20. A computer storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    acquiring a plurality of training voice samples, the training voice samples including voice information and text labels corresponding to the voice information;
    constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, wherein the convolutional residual layer comprises a plurality of sequentially connected residual stacking layers, each residual stacking layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers;
    inputting the plurality of voice samples into the speech recognition model in sequence, using the voice information and the text label corresponding to the voice information as the input and the output of the speech recognition model respectively, and continuously training the neuron weights of the speech recognition model through the input and the output until all voice samples have been input into the speech recognition model, whereupon training of the speech recognition model ends; after training ends, taking the speech recognition model with the trained neuron weights as a target model;
    evaluating the error of the target model by

$$L(S) = -\ln \prod_{(h(x),\,z)\in S} p(z \mid h(x)) = -\sum_{(h(x),\,z)\in S} \ln p(z \mid h(x)),$$

    where L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, and S is the plurality of training voice samples; the predicted text refers to the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model;
    adjusting the neuron weights of the target model until the error is less than a threshold, and setting the neuron weights for which the error is less than the threshold as ideal weights;
    deploying the target model and the ideal weights to a client.
PCT/CN2019/119128 2019-09-19 2019-11-18 Method, apparatus and device for constructing speech recognition model, and storage medium WO2021051628A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910884620.9 2019-09-19
CN201910884620.9A CN110751944A (en) 2019-09-19 2019-09-19 Method, device, equipment and storage medium for constructing voice recognition model

Publications (1)

Publication Number Publication Date
WO2021051628A1 true WO2021051628A1 (en) 2021-03-25

Family

ID=69276643

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/119128 WO2021051628A1 (en) 2019-09-19 2019-11-18 Method, apparatus and device for constructing speech recognition model, and storage medium

Country Status (2)

Country Link
CN (1) CN110751944A (en)
WO (1) WO2021051628A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111354345B (en) * 2020-03-11 2021-08-31 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating speech model and speech recognition
CN111862942B (en) * 2020-07-28 2022-05-06 思必驰科技股份有限公司 Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN112597764B (en) * 2020-12-23 2023-07-25 青岛海尔科技有限公司 Text classification method and device, storage medium and electronic device
CN113012706B (en) * 2021-02-18 2023-06-27 联想(北京)有限公司 Data processing method and device and electronic equipment
CN113053361B (en) * 2021-03-18 2023-07-04 北京金山云网络技术有限公司 Speech recognition method, model training method, device, equipment and medium
CN113744729A (en) * 2021-09-17 2021-12-03 北京达佳互联信息技术有限公司 Speech recognition model generation method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108550375A (en) * 2018-03-14 2018-09-18 鲁东大学 A kind of emotion identification method, device and computer equipment based on voice signal
CN108694951A (en) * 2018-05-22 2018-10-23 华南理工大学 A kind of speaker's discrimination method based on multithread hierarchical fusion transform characteristics and long memory network in short-term
CN108847223A (en) * 2018-06-20 2018-11-20 陕西科技大学 A kind of audio recognition method based on depth residual error neural network
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106910497B (en) * 2015-12-22 2021-04-16 阿里巴巴集团控股有限公司 Chinese word pronunciation prediction method and device
KR102526103B1 (en) * 2017-10-16 2023-04-26 일루미나, 인코포레이티드 Deep learning-based splice site classification
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
US10573295B2 (en) * 2017-10-27 2020-02-25 Salesforce.Com, Inc. End-to-end speech recognition with policy learning
CN109346061B (en) * 2018-09-28 2021-04-20 腾讯音乐娱乐科技(深圳)有限公司 Audio detection method, device and storage medium
CN109919005A (en) * 2019-01-23 2019-06-21 平安科技(深圳)有限公司 Livestock personal identification method, electronic device and readable storage medium storing program for executing
CN110010133A (en) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 Vocal print detection method, device, equipment and storage medium based on short text

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108550375A (en) * 2018-03-14 2018-09-18 鲁东大学 A kind of emotion identification method, device and computer equipment based on voice signal
CN108694951A (en) * 2018-05-22 2018-10-23 华南理工大学 A kind of speaker's discrimination method based on multithread hierarchical fusion transform characteristics and long memory network in short-term
CN108847223A (en) * 2018-06-20 2018-11-20 陕西科技大学 A kind of audio recognition method based on depth residual error neural network
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE KAIMING; ZHANG XIANGYU; REN SHAOQING; SUN JIAN: "Deep Residual Learning for Image Recognition", 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 27 June 2016 (2016-06-27), pages 770 - 778, XP033021254, DOI: 10.1109/CVPR.2016.90 *
SAINATH TARA N.; VINYALS ORIOL; SENIOR ANDREW; SAK HASIM: "Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks", 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 19 April 2015 (2015-04-19), pages 4580 - 4584, XP033187628, DOI: 10.1109/ICASSP.2015.7178838 *

Also Published As

Publication number Publication date
CN110751944A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
WO2021051628A1 (en) Method, apparatus and device for constructing speech recognition model, and storage medium
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
WO2021139294A1 (en) Method and apparatus for training speech separation model, storage medium, and computer device
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
US20200402497A1 (en) Systems and Methods for Speech Generation
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
WO2018227780A1 (en) Speech recognition method and device, computer device and storage medium
WO2019227586A1 (en) Voice model training method, speaker recognition method, apparatus, device and medium
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN113488058B (en) Voiceprint recognition method based on short voice
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
Hasannezhad et al. PACDNN: A phase-aware composite deep neural network for speech enhancement
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN114203184A (en) Multi-state voiceprint feature identification method and device
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN116434758A (en) Voiceprint recognition model training method and device, electronic equipment and storage medium
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
CN112951270B (en) Voice fluency detection method and device and electronic equipment
Jaleel et al. Gender identification from speech recognition using machine learning techniques and convolutional neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945725

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945725

Country of ref document: EP

Kind code of ref document: A1