CN110751944A - Method, device, equipment and storage medium for constructing voice recognition model - Google Patents


Info

Publication number
CN110751944A
CN110751944A (Application CN201910884620.9A)
Authority
CN
China
Prior art keywords: recognition model, voice, training, residual, weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910884620.9A
Other languages
Chinese (zh)
Inventor
王健宗
贾雪丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910884620.9A (CN110751944A)
Priority to PCT/CN2019/119128 (WO2021051628A1)
Publication of CN110751944A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the field of artificial intelligence and provides a method, an apparatus, a device, and a storage medium for constructing a speech recognition model, wherein the method comprises: obtaining a plurality of training speech samples; constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer; inputting the training speech information into the speech recognition model, and updating the neuron weights of the speech recognition model through natural language processing (NLP) technology, the speech information, and the text labels corresponding to the speech information, to obtain a target model; evaluating the error of the target model by L(S) = -ln Π_{(h(x),z)∈S} p(z|h(x)) = -Σ_{(h(x),z)∈S} ln p(z|h(x)); adjusting the neuron weights of the target model until the error is smaller than a threshold, and setting the neuron weights for which the error is smaller than the threshold as the ideal weights; and deploying the target model and the ideal weights to a client. The method reduces both the influence of tone in the speech information on the predicted text and the amount of computation during recognition by the speech recognition model.

Description

Method, device, equipment and storage medium for constructing voice recognition model
Technical Field
The present application relates to the field of intelligent decision making, and in particular, to a method, an apparatus, a device, and a storage medium for constructing a speech recognition model.
Background
Speech recognition is used to convert speech into text. With the continuous development of deep learning technology, the application range of speech recognition is wider and wider.
At present, Deep Neural Networks (DNNs) have become a hot spot of research in the field of automatic speech recognition. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have achieved relatively good results in speech recognition model creation, and deep learning has become the mainstream scheme of speech recognition.
In a deep neural network, the depth of the network is often closely related to recognition accuracy, because a conventional deep neural network can extract multi-level low-, mid-, and high-level features, and the more network layers there are, the richer the extracted features. However, as the network hierarchy keeps deepening, a "degradation" phenomenon of the deep neural network appears: the accuracy of speech recognition quickly saturates, and the deeper the network hierarchy, the higher the error rate. In addition, existing speech recognition models require the speech training samples to be aligned before training, matching each frame of speech data with its corresponding label, so that the loss function used in training can accurately estimate the training error of the speech recognition model. This alignment process is tedious, complicated, and time-consuming.
Disclosure of Invention
In the embodiments of the invention, features of unlabeled data are obtained and introduced into supervised learning, which expands the usable sample data, improves the utilization efficiency of the unlabeled data, and improves the accuracy of model prediction.
In a first aspect, the present application provides a method for constructing a speech recognition model, including:
acquiring a plurality of training voice samples, wherein the training voice samples comprise voice information and text labels corresponding to the voice information;
constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, wherein the convolutional residual layer comprises a plurality of sequentially connected residual stack layers, each residual stack layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel that bypasses the plurality of sequentially connected weight layers;

inputting the plurality of speech samples into the speech recognition model in sequence, taking the speech information and the text labels corresponding to the speech information as the input and the output of the speech recognition model respectively, and continuously training the neuron weights of the speech recognition model with this input and output until all the speech samples have been input into the speech recognition model, at which point the training of the speech recognition model ends; after the training ends, taking the speech recognition model with trained neuron weights as the target model;
evaluating the error of the target model by L(S) = -ln Π_{(h(x),z)∈S} p(z|h(x)) = -Σ_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the speech information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training speech samples, and the predicted text is the text information that the target model computes from the neuron weights and outputs after the speech information is input to the target model;
adjusting the weight of the neuron of the target model until the error is smaller than a threshold value, and setting the weight of the neuron with the error smaller than the threshold value as an ideal weight;
and deploying the target model and the ideal weight to a client.
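For concreteness, the following is a minimal sketch of the layer layout named in the first aspect. PyTorch, the layer sizes, and the class and parameter names are assumptions made only for illustration; the application does not prescribe a framework or specific dimensions:

```python
import torch
import torch.nn as nn

class SpeechRecognitionModel(nn.Module):
    """Independent convolutional layer -> convolutional residual layer
    (stacked residual modules) -> fully connected layer -> output layer."""

    def __init__(self, n_mels: int = 26, n_classes: int = 30,
                 n_residual_modules: int = 4, channels: int = 32):
        super().__init__()
        # Independent convolutional layer: extracts acoustic features
        self.independent_conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU())
        # Convolutional residual layer: sequentially connected residual
        # modules, each summed with a weight-free bypass channel in forward()
        self.residual_modules = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1))
            for _ in range(n_residual_modules)])
        # Fully connected layer feeding the output layer
        self.fc = nn.Linear(channels * n_mels, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: two-dimensional speech information, shape (batch, 1, time, n_mels)
        h = self.independent_conv(x)
        for f in self.residual_modules:
            h = f(h) + h                       # bypass channel: y = F(x) + x
        h = h.permute(0, 2, 1, 3).flatten(2)   # (batch, time, channels*n_mels)
        return self.fc(h).log_softmax(dim=-1)  # per-frame class log-probs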
In some possible designs, before the plurality of speech samples are input into the speech recognition model, the method further comprises:

framing the training speech information according to preset framing parameters to obtain sentences corresponding to the training speech information, wherein the preset framing parameters comprise a frame duration, a frame count, and the overlap duration shared by adjacent frames;

and converting the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional speech information.
In some possible designs, the framing the training speech information according to preset framing parameters includes:
performing a discrete Fourier transform on the two-dimensional speech information to obtain the linear spectrum X(k) corresponding to the two-dimensional speech information;

filtering the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein, when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

the expression of f(m) is:

f(m) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + m · (F_mel(f_h) - F_mel(f_l)) / (M + 1) )

where the band-pass filter comprises a plurality of band-pass filters with triangular filter characteristics, f_l is the lowest frequency of the band-pass filter frequency range, f_h is the highest frequency of the band-pass filter frequency range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, the F_mel function is F_mel(f) = 1125 ln(1 + f/700), the inverse function of F_mel is F_mel^{-1}(b) = 700 (e^{b/1125} - 1), and b is an integer;

calculating, according to S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m ≤ M, the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum.
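For illustration, the f(m) computation above can be written down directly from the stated formulas; a short Python sketch (the function and parameter names are my own, not from the application):

```python
import math

def f_mel(f_hz: float) -> float:
    """F_mel(f) = 1125 ln(1 + f/700), as defined above."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def f_mel_inv(b: float) -> float:
    """Inverse mel mapping: F_mel^{-1}(b) = 700 (e^{b/1125} - 1)."""
    return 700.0 * (math.exp(b / 1125.0) - 1.0)

def center_frequency(m: int, n_dft: int, f_s: float, f_l: float,
                     f_h: float, n_filters: int) -> float:
    """f(m): center of the m-th triangular filter, in DFT bins; the
    centers are spaced uniformly on the mel scale between f_l and f_h."""
    step = (f_mel(f_h) - f_mel(f_l)) / (n_filters + 1)
    return (n_dft / f_s) * f_mel_inv(f_mel(f_l) + m * step)
```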
In some possible designs, the fully connected layer comprises a classification function, the classification function being

δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j = 1, ..., K

where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z), so that each element lies in the range (0, 1) and the sum of all elements is 1.
In some possible designs, if the input of the residual module is x and the output of the residual module is y, the mathematical expression of the residual module is:

y = F(x, w_i) + w_s·x

where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.

In some possible designs, F(x, w_i) uses the ReLU function as the activation function of the independent convolutional layer; the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
In some possible designs, the adjusting of the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by a stochastic gradient descent method.
In a second aspect, the present application provides an apparatus for constructing a speech recognition model, having the functions of implementing the method for constructing a speech recognition model provided in the first aspect. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software comprises one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
The device for constructing the speech recognition model comprises:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a plurality of training voice samples, and the training voice samples comprise voice information and text labels corresponding to the voice information;
a processing module, configured to: construct a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, wherein the convolutional residual layer comprises a plurality of sequentially connected residual stack layers, each residual stack layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel that bypasses the plurality of sequentially connected weight layers; input the plurality of speech samples into the speech recognition model in sequence through an input/output module, taking the speech information and the text labels corresponding to the speech information as the input and the output of the speech recognition model respectively, and continuously train the neuron weights of the speech recognition model with this input and output until all the speech samples have been input into the speech recognition model, at which point the training ends, the speech recognition model with trained neuron weights being taken as the target model; and evaluate the error of the target model by L(S) = -ln Π_{(h(x),z)∈S} p(z|h(x)) = -Σ_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the speech information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training speech samples, and the predicted text is the text information that the target model computes from the neuron weights and outputs after the speech information is input to the target model;
the processing module being further configured to adjust the neuron weights of the target model until the error is smaller than a threshold, set the neuron weights for which the error is smaller than the threshold as the ideal weights, and deploy the target model and the ideal weights to a client.
In some possible designs, the processing module is further configured to:

frame the training speech information according to preset framing parameters to obtain sentences corresponding to the training speech information, wherein the preset framing parameters comprise a frame duration, a frame count, and the overlap duration shared by adjacent frames;

and convert the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional speech information.
In some possible designs, the processing module is further configured to:
perform a discrete Fourier transform on the two-dimensional speech information to obtain the linear spectrum X(k) corresponding to the two-dimensional speech information;

filter the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein, when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

the expression of f(m) is:

f(m) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + m · (F_mel(f_h) - F_mel(f_l)) / (M + 1) )

where the band-pass filter comprises a plurality of band-pass filters with triangular filter characteristics, f_l is the lowest frequency of the band-pass filter frequency range, f_h is the highest frequency of the band-pass filter frequency range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, the F_mel function is F_mel(f) = 1125 ln(1 + f/700), the inverse function of F_mel is F_mel^{-1}(b) = 700 (e^{b/1125} - 1), and b is an integer;

and calculate, according to S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m ≤ M, the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum.
In some possible designs, the fully connected layer comprises a classification function, the classification function being

δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j = 1, ..., K

where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z), so that each element lies in the range (0, 1) and the sum of all elements is 1.
In some possible designs, the processing module is further configured such that, if the input of the residual module is x and the output of the residual module is y, the mathematical expression of the residual module is:

y = F(x, w_i) + w_s·x

where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.

In some possible designs, F(x, w_i) uses the ReLU function as the activation function of the independent convolutional layer; the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
In some possible designs, the adjusting of the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by a stochastic gradient descent method.
In yet another aspect, the present application provides an apparatus for constructing a speech recognition model, which includes at least one connected processor, a memory, and an input/output unit, wherein the memory is used for storing program codes, and the processor is used for calling the program codes in the memory to execute the method of the above aspects.
Yet another aspect of the present application provides a computer storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of the above-described aspects.
According to the method, the input information x is routed directly to the output of the hidden layers through the bypass channel. The bypass channel carries no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper: the whole network only needs to train the part in which input and output differ, i.e., after the input information x is passed through, each residual module learns only the residual F(x). This simplifies the training target and its difficulty, so the neural network is stable and easy to train, and the performance of the speech recognition model gradually improves as the network depth increases. The predicted text of the speech recognition model is evaluated with a CTC loss function, so the exact mapping between the pronunciation phonemes in the text labels and the sequence of the training speech information need not be considered; the speech recognition model can be trained with only input sequences and output sequences, which saves the cost of producing the training speech sample set. In addition, a triangular band-pass filter is used to smooth the spectrum of the training speech information, eliminate its harmonics, and highlight the formants of the original sound, which prevents tone in the speech information from influencing the predicted text of the speech recognition model and reduces the amount of computation on the speech information during recognition.
Drawings
FIG. 1 is a schematic flow chart illustrating a method for constructing a speech recognition model according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an apparatus for constructing a speech recognition model according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an apparatus for constructing a speech recognition model in an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus, and such that a division of modules presented in this application is merely a logical division that may be implemented in an actual application in a different manner, such that multiple modules may be combined or integrated into another system, or some features may be omitted, or may not be implemented.
In order to solve the technical problems, the application mainly provides the following technical scheme:
according to the method, the input information x is directly detoured to the output of the hidden layer through the bypass channel, the bypass channel has no weight, the integrity of the input information x is protected, the neural network training is deeper, only the part with difference between input and output needs to be trained by the whole neural network, namely after the input information x is transmitted, each residual error module only learns the residual error F (x), the training target and difficulty are simplified, the neural network is stable and easy to train, along with the increase of the neural network depth, the performance of the voice recognition model is gradually improved, the prediction text of the voice recognition model is evaluated by a CTC loss function, the accurate mapping relation between pronunciation phonemes in a text label and sequences of training voice information does not need to be considered, the voice recognition model can be trained only by inputting the sequences and outputting the sequences, and the manufacturing cost of a training voice sample set is saved. In addition, the triangular band-pass filter is adopted to smooth the frequency spectrum of the training voice information, eliminate the harmonic waves in the training voice information, highlight the formants of the original sound, avoid the influence of the tones in the voice information on the predicted text of the voice recognition model, and reduce the computation amount of the voice information in the recognition process of the voice recognition model.
Referring to fig. 1, a method for constructing a speech recognition model according to the present application is illustrated, and the method includes:
101. A plurality of training speech samples are obtained.
The training speech samples include speech information and text labels corresponding to the speech information.
The text labels are used for marking pronunciation phonemes of the training voice information.
The speech information is obtained from a pre-made recording: the recorded content is transcribed into a text, the words in the text are numbered in order, and each word is labeled with its pronunciation phonemes to obtain the text label. Each pronunciation phoneme in the text label corresponds to one or more frames of data in the recording.
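By way of illustration only, one training speech sample as described here could be represented as follows; this is a sketch, and the field names and phoneme inventory are assumptions, not part of the application:

```python
# One training speech sample: the recording plus a text label in which every
# word is numbered in order and annotated with its pronunciation phonemes.
sample = {
    "audio": "recordings/0001.wav",   # pre-recorded speech information
    "text_label": [
        {"index": 0, "word": "hello", "phonemes": ["HH", "AH", "L", "OW"]},
        {"index": 1, "word": "world", "phonemes": ["W", "ER", "L", "D"]},
    ],
    # Each phoneme corresponds to one or more frames of the recording.
}
```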
102. A speech recognition model is constructed from the independent convolutional layer, the convolutional residual layer, the fully connected layer, and the output layer.
The convolutional residual layer includes a plurality of sequentially connected residual stack layers. The residual stack layer comprises a plurality of sequentially connected residual modules. The residual module comprises a plurality of hidden layers which are connected in sequence and a bypass channel which bypasses the weight layers which are connected in sequence.
The independent convolution layer is used for extracting acoustic features from the voice information, eliminating non-maximum values in the acoustic features and reducing the complexity of the acoustic features. The acoustic features include pronunciation of specific syllables, user read-through habits, and speech spectrum, among others.
The convolution residual layer is used to map the acoustic features to the hidden layer feature space.
The full connection layer is used for integrating the acoustic features mapped to the hidden layer feature space so as to obtain the meanings of the acoustic features, and the probabilities corresponding to various text types are output according to the meanings.
The output layer is used for outputting the text corresponding to the voice information according to the probability corresponding to each text type.
The speech recognition model in this embodiment adds a bypass channel to the plurality of sequentially connected hidden layers, which solves the problem that the training accuracy of a conventional neural network decreases as the number of network layers grows. The convolutional residual layer of the speech recognition model has a plurality of bypass channels; each bypass channel serves as a branch line of the hidden layers, realizing cross-layer connections between hidden layers: the input of a hidden layer is connected directly to the next layer, so that the next layer can learn the residual directly.
In particular, as shown in fig. 2, in one residual module the cross-layer connection typically spans only 2 to 3 hidden layers, although spanning more hidden layers is not excluded; spanning only a single hidden layer brings little benefit, and its experimental effect is not ideal.

Assume the input of the residual module is x and the expected output is h(x), i.e., h(x) is the desired underlying mapping, which is usually difficult to learn directly. If the input x is passed directly to the output as the initial result, the target the residual module needs to learn becomes F(x) = h(x) - x. Compared with a conventional neural network, the speech recognition model in this embodiment thus changes the learning target: instead of learning a complete output, it learns the difference between the optimal solution h(x) and the identity mapping x, i.e., the residual F(x) = h(x) - x.
In terms of the overall function, if {w_i} denotes all the weights of the residual module, the output actually computed by the residual module is:

y = F(x, {w_i}) + x

Taking a span of two hidden layers as an example and ignoring the bias, F(x, {w_i}) = w_2·δ(w_1·x) = w_2·ReLU(w_1·x), where the ReLU function is the activation function of the residual module.
It should be understood that F(x, {w_i}) must have the same dimension as x. If their dimensions differ, an additional weight matrix w_s can be introduced to project x linearly so that F(x, {w_i}) and w_s·x have the same dimension; accordingly, the output of the residual module becomes: y = F(x, {w_i}) + w_s·x.
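A minimal PyTorch sketch of this residual computation (the framework and the convolutional layer types are assumptions made for illustration), spanning two hidden layers and introducing the projection w_s only when the dimensions differ:

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """y = F(x, {w_i}) + w_s·x, spanning two hidden layers (biases omitted)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.w1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                            padding=1, bias=False)
        self.w2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                            padding=1, bias=False)
        self.relu = nn.ReLU()
        # Bypass channel: identity when dimensions match; otherwise a 1x1
        # linear projection w_s so that F(x) and w_s·x can be added.
        if in_channels == out_channels:
            self.ws = nn.Identity()
        else:
            self.ws = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fx = self.w2(self.relu(self.w1(x)))   # F(x, {w_i}) = w2·ReLU(w1·x)
        return fx + self.ws(x)                # add the bypass channel
```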
The plurality of speech samples are input into the speech recognition model in sequence, with the speech information and the text labels corresponding to the speech information taken as the input and the output of the speech recognition model respectively, and the neuron weights of the speech recognition model are trained continuously with this input and output until all the speech samples have been input, at which point the training ends. After the training ends, the speech recognition model with trained neuron weights is taken as the target model.
During training, the neuron weights of the speech recognition model are initialized randomly; the training speech information is then used as the input of the speech recognition model, and the text label corresponding to the training speech information is used as its output reference. The training speech information propagates forward through the speech recognition model, each layer of randomly initialized neurons classifies it, and a predicted text corresponding to the training speech information is finally obtained. The neuron weights are then updated according to the difference between the predicted text output by the speech recognition model and the text label, and the next iteration proceeds, until the neuron weights approach the required values.
103. The error of the target model is evaluated by L(S) = -ln Π_{(h(x),z)∈S} p(z|h(x)) = -Σ_{(h(x),z)∈S} ln p(z|h(x)).

Here L(S) is the error, x is the speech information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, and S is the plurality of training speech samples. The predicted text refers to the text information that the target model computes from the neuron weights and outputs after the speech information is input to the target model.
The CTC loss function measures the degree of difference between the predicted text output by the speech recognition model and the actual text labels, and has the advantage of not requiring forced alignment of the input data with the output data. Unlike the cross-entropy criterion, which requires frame-level alignment between input features and target labels, the CTC loss function automatically learns the alignment between the speech data and the label sequence (e.g., phonemes or characters), so forced alignment of the data is unnecessary and the input data and labels need not be the same length. Because the predicted text of the speech recognition model is evaluated with the CTC loss function, the exact mapping between the pronunciation phonemes in the text labels and the sequence of the training speech information need not be considered; the model can be trained with only input sequences and output sequences, which saves the cost of producing the training speech sample set.
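For reference, this loss can be evaluated with an off-the-shelf CTC implementation; below is a sketch using PyTorch's nn.CTCLoss, where the tensor shapes and sizes are illustrative assumptions rather than values from the application:

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0)  # index 0 reserved for the CTC blank symbol

# h(x): model outputs as log-probabilities, shape (T, batch, num_classes)
T, batch, num_classes = 120, 4, 30
log_probs = torch.randn(T, batch, num_classes).log_softmax(dim=2)

# z: label sequences (phoneme/character ids), one row per sample
targets = torch.randint(low=1, high=num_classes, size=(batch, 20))
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 20, dtype=torch.long)

# No frame-level alignment is required: CTC sums over all alignments,
# giving L(S) = -sum ln p(z | h(x)) over the batch.
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```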
104. The neuron weights of the target model are adjusted until the error is smaller than the threshold, and the neuron weights for which the error is smaller than the threshold are set as the ideal weights.
The error over the corresponding training speech sample set is computed with the CTC loss function, and the error is back-propagated through the speech recognition model by a gradient descent algorithm to update target parameters such as weights and thresholds, continuously improving the accuracy of the speech recognition model until the convergence requirement is met.
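Putting steps 103 and 104 together, one training iteration might look like the following sketch, reusing the model and ctc_loss objects assumed in the earlier examples:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(features, targets, input_lengths, target_lengths):
    """One iteration: forward pass, CTC error, back-propagation, SGD update."""
    log_probs = model(features).permute(1, 0, 2)  # CTCLoss wants (T, batch, C)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()     # back-propagate the CTC error through the model
    optimizer.step()    # gradient-descent update of the neuron weights
    return loss.item()  # compare against the threshold to decide convergence
```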
105. The target model and the ideal weights are deployed to the client.
Compared with the prior art, in this application the input information x is routed directly to the output of the hidden layers through the bypass channel. The bypass channel carries no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper: the whole network only needs to train the part in which input and output differ, i.e., after the input information x is passed through, each residual module learns only the residual F(x). This simplifies the training target and its difficulty, so the neural network is stable and easy to train, and the performance of the speech recognition model gradually improves as the network depth increases. The predicted text of the speech recognition model is evaluated with the CTC loss function, so the exact mapping between the pronunciation phonemes in the text labels and the sequence of the training speech information need not be considered; the speech recognition model can be trained with only input sequences and output sequences, which saves the cost of producing the training speech sample set. In addition, a triangular band-pass filter is used to smooth the spectrum of the training speech information, eliminate its harmonics, and highlight the formants of the original sound, which prevents tone in the speech information from influencing the predicted text of the speech recognition model and reduces the amount of computation on the speech information during recognition.
In some embodiments, before inputting the plurality of speech samples to the speech recognition model, the method further comprises:
framing the training speech information according to preset framing parameters to obtain sentences corresponding to the training speech information, wherein the preset framing parameters comprise a frame duration, a frame count, and the overlap duration shared by adjacent frames;

and converting the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional speech information.
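A numpy sketch of the framing step is shown below; the parameter values are illustrative assumptions, since the application only states that the frame duration, frame count, and the overlap shared by adjacent frames are preset:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, overlap_ms: float = 10.0) -> np.ndarray:
    """Split a waveform into overlapping frames; the frame duration and the
    duration shared by adjacent frames are the preset framing parameters."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = frame_len - int(sample_rate * overlap_ms / 1000)
    assert len(signal) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(signal) - frame_len) // step
    frames = np.stack([signal[i * step: i * step + frame_len]
                       for i in range(n_frames)])
    return frames  # shape (n_frames, frame_len): two-dimensional speech data
```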
In some embodiments, framing the training speech information according to the preset framing parameters includes:
performing a discrete Fourier transform on the two-dimensional speech information to obtain the linear spectrum X(k) corresponding to the two-dimensional speech information;

filtering the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein, when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

the expression of f(m) is:

f(m) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + m · (F_mel(f_h) - F_mel(f_l)) / (M + 1) )

where the band-pass filter comprises a plurality of band-pass filters with triangular filter characteristics, f_l is the lowest frequency of the band-pass filter frequency range, f_h is the highest frequency of the band-pass filter frequency range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, the F_mel function is F_mel(f) = 1125 ln(1 + f/700), the inverse function of F_mel is F_mel^{-1}(b) = 700 (e^{b/1125} - 1), and b is an integer;

calculating, according to S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m ≤ M, the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum.
In the above embodiment, the human response to sound pressure is logarithmic: people are less sensitive to fine variations at high sound pressure than at low sound pressure. Using the logarithm also reduces the sensitivity of the extracted features to variations in the energy of the input sound, since the distance between the sound source and the microphone varies and with it the energy of the sound picked up by the microphone. The spectrogram is a visual representation of the time-frequency distribution of sound energy that effectively exploits the correlation between the time and frequency domains; a feature vector sequence obtained from spectrogram analysis extracts acoustic features well, so inputting it into the speech recognition model yields higher accuracy in subsequent operations. A triangular band-pass filter is used to smooth the spectrum of the training speech information, eliminate its harmonics, and highlight the formants of the original sound. The tone or pitch of a passage of sound in the training speech information is therefore not reflected in the acoustic features, i.e., the predicted text of the speech recognition model is not influenced by tonal differences in the speech information, and the amount of computation on the speech information during recognition is reduced.
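As a concrete illustration of the triangular filter bank and log-energy formulas above, here is a numpy sketch; the filter count and frequency range are assumptions for illustration, and the mel-scale helpers restate the F_mel definitions given earlier:

```python
import numpy as np

def log_mel_energies(x_k: np.ndarray, f_s: float, n_dft: int,
                     n_filters: int = 26) -> np.ndarray:
    """S(m) = ln(sum_k |X(k)|^2 H_m(k)) for M triangular band-pass filters
    whose center frequencies f(m) are equally spaced on the mel scale."""
    f_mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)
    f_mel_inv = lambda b: 700.0 * (np.exp(b / 1125.0) - 1.0)

    f_l, f_h = 0.0, f_s / 2.0          # lowest/highest filter frequencies
    mels = np.linspace(f_mel(f_l), f_mel(f_h), n_filters + 2)
    bins = np.floor((n_dft / f_s) * f_mel_inv(mels)).astype(int)  # f(m)

    power = np.abs(x_k[: n_dft // 2 + 1]) ** 2
    energies = np.empty(n_filters)
    for m in range(1, n_filters + 1):
        h_m = np.zeros_like(power)     # triangular transfer function H_m(k)
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        h_m[lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)   # rising edge
        h_m[c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)   # falling edge
        energies[m - 1] = np.log(np.dot(power, h_m) + 1e-10)   # log energy
    return energies
```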
In some embodiments, the fully connected layer includes a classification function. The classification function is δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j = 1, ..., K, where j is a natural number; it compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z), so that each element lies in the range (0, 1) and the sum of all elements is 1.
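This classification function is the standard softmax; a short numpy sketch for reference:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """delta(z)_j = exp(z_j) / sum_k exp(z_k): compresses a K-dimensional
    vector so each element lies in (0, 1) and all elements sum to 1."""
    e = np.exp(z - np.max(z))   # shift by max(z) for numerical stability
    return e / e.sum()
```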
In some embodiments, if the input of the residual module is x and the output of the residual module is y, the mathematical expression of the residual module is y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
In the foregoing embodiment, the speech recognition model adds a bypass channel to the plurality of sequentially connected hidden layers, which solves the problem that the training accuracy of a conventional neural network decreases as the number of network layers grows. The convolutional residual layer of the speech recognition model has a plurality of bypass channels; each bypass channel serves as a branch line of the hidden layers, realizing cross-layer connections between hidden layers: the input of a hidden layer is connected directly to the next layer, so that the next layer can learn the residual directly.

In particular, in one residual module the cross-layer connection typically spans only 2 to 3 hidden layers, although spanning more hidden layers is not excluded; spanning only a single hidden layer brings little benefit, and its experimental effect is not ideal.
Assume the input of the residual module is x and the expected output is h(x), i.e., h(x) is the desired underlying mapping, which is usually difficult to learn directly. If the input x is passed directly to the output as the initial result, the target the residual module needs to learn becomes F(x) = h(x) - x. Compared with a conventional neural network, the speech recognition model in this embodiment thus changes the learning target: instead of learning a complete output, it learns the difference between the optimal solution h(x) and the identity mapping x, i.e., the residual F(x) = h(x) - x. In terms of the overall function, if {w_i} denotes all the weights of the residual module, the output actually computed by the residual module is y = F(x, {w_i}) + x. Taking a span of two hidden layers as an example and ignoring the bias, F(x, {w_i}) = w_2·δ(w_1·x) = w_2·ReLU(w_1·x), where ReLU(·) is the activation function of the residual module.

It should be understood that F(x, {w_i}) must have the same dimension as x. If their dimensions differ, an additional weight matrix w_s can be introduced to project x linearly so that F(x, {w_i}) and w_s·x have the same dimension; accordingly, the output of the residual module is: y = F(x, {w_i}) + w_s·x.
In some embodiments, F(x, w_i) uses the ReLU function as the activation function of the independent convolutional layer; the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
In the above embodiment, the neural network can be trained by the above formula.
In some embodiments, adjusting the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by a stochastic gradient descent method.

In the above embodiment, the stochastic gradient descent algorithm effectively avoids redundant computation and takes less time; those skilled in the art may of course use other algorithms.
Fig. 2 is a schematic structural diagram of an apparatus 20 for constructing a speech recognition model, which can be applied to constructing a speech recognition model. The apparatus for constructing a speech recognition model in this embodiment of the present application can implement the steps of the method for constructing a speech recognition model performed in the embodiment corresponding to fig. 1. The functions performed by the apparatus 20 may be implemented by hardware, or by hardware executing corresponding software; the hardware or software comprises one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The apparatus for constructing a speech recognition model may include an input/output module 201 and a processing module 202; for the functions of the processing module 202 and the input/output module 201, refer to the operations performed in the embodiment corresponding to fig. 1, which are not repeated here. The input/output module 201 may be used to control the input, output, and acquisition operations of the apparatus.
In some embodiments, the input-output module 201 is operable to obtain a plurality of training speech samples, where the training speech samples include speech information and text labels corresponding to the speech information;
the processing module 202 may be configured to construct a speech recognition model by an independent convolutional layer, a convolutional residual layer, a fully-connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially-connected residual stacked layers, the residual stacked layers include a plurality of sequentially-connected residual modules, and the residual modules include a plurality of sequentially-connected hidden layers and bypass channels that bypass the plurality of sequentially-connected weight layers; sequentially inputting a plurality of voice samples to the voice recognition model through the input/output module, respectively using the voice information and text labels corresponding to the voice information as input and output of the voice recognition model, continuously training neuron weights of the voice recognition model through the input and the output until the voice samples are input to the voice recognition model, finishing the training of the voice recognition model, and after the training is finished, using the voice recognition model with trained neuron weights as a target model; by L (S) ═ ln |(h(x),z)∈Sp(z|h(x))=-∑(h(x),z)∈Sln p (z | h (x)) evaluating the error of the target model; wherein L (S) is the error, x is the speech information, z is the text label, p (z | h (x)) is the predicted text and the textThe similarity of the label is S, the plurality of training voice samples are obtained, and the predicted text refers to the text information which is calculated and output by the target model according to the weight of the neuron after the voice information is input to the target model; and adjusting the weight of the neuron of the target model until the error is smaller than a threshold value, and setting the weight of the neuron with the error smaller than the threshold value as an ideal weight. And deploying the target model and the ideal weight to a client.
In some embodiments, the processing module 202 is further configured to:
frame the training speech information according to preset framing parameters to obtain sentences corresponding to the training speech information, wherein the preset framing parameters comprise a frame duration, a frame count, and the overlap duration shared by adjacent frames;

and convert the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional speech information.
In some embodiments, the processing module 202 is further configured to:
perform a discrete Fourier transform on the two-dimensional speech information to obtain the linear spectrum X(k) corresponding to the two-dimensional speech information;

filter the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein, when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

the expression of f(m) is:

f(m) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + m · (F_mel(f_h) - F_mel(f_l)) / (M + 1) )

where the band-pass filter comprises a plurality of band-pass filters with triangular filter characteristics, f_l is the lowest frequency of the band-pass filter frequency range, f_h is the highest frequency of the band-pass filter frequency range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, the F_mel function is F_mel(f) = 1125 ln(1 + f/700), the inverse function of F_mel is F_mel^{-1}(b) = 700 (e^{b/1125} - 1), and b is an integer;

and calculate, according to S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m ≤ M, the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum;
In some embodiments, the fully connected layer comprises a classification function, the classification function being

δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j = 1, ..., K

where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z), so that each element lies in the range (0, 1) and the sum of all elements is 1.
In some embodiments, the input of the residual module is x, the output of the residual module is y, and the mathematical expression of the residual module is y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.

In some embodiments, F(x, w_i) uses the ReLU function as the activation function of the independent convolutional layer; the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
In some embodiments, the adjusting of the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by a stochastic gradient descent method.
The apparatus in the embodiments of the present application is described above from the perspective of modular functional entities; the following describes an apparatus for constructing a speech recognition model from the perspective of hardware. As shown in fig. 3, it comprises: a processor, a memory, an input/output unit (which may also be a transceiver, not identified in fig. 3), and a computer program stored in the memory and executable on the processor. For example, the computer program may be a program corresponding to the method for constructing a speech recognition model in the embodiment corresponding to fig. 1. When the apparatus for constructing a speech recognition model implements the functions of the apparatus 20 shown in fig. 2, the processor executes the computer program to implement the steps of the method performed by the apparatus 20 in the embodiment corresponding to fig. 2; alternatively, when executing the computer program, the processor implements the functions of the modules in the apparatus 20 of the embodiment corresponding to fig. 2.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor; the processor is the control center of the computer device and connects the various parts of the whole computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the device (such as audio data or video data). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input/output unit may also be replaced by a receiver and a transmitter, which may be the same or different physical entities; when they are the same physical entity, they may be collectively referred to as an input/output unit. The input/output unit may be a transceiver.
The memory may be integrated in the processor or may be provided separately from the processor.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes several instructions for enabling a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the drawings, but the present application is not limited to the above-mentioned embodiments, which are only illustrative and not restrictive, and those skilled in the art can make many changes and modifications without departing from the spirit and scope of the present application and the protection scope of the claims, and all changes and modifications that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (10)

1. A method of constructing a speech recognition model, the method comprising:
acquiring a plurality of training voice samples, wherein the training voice samples comprise voice information and text labels corresponding to the voice information;
constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, wherein the convolutional residual layer comprises a plurality of sequentially connected residual stack layers, each residual stack layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel that bypasses the plurality of sequentially connected weight layers;

inputting the plurality of speech samples into the speech recognition model in sequence, taking the speech information and the text labels corresponding to the speech information as the input and the output of the speech recognition model respectively, continuously training the neuron weights of the speech recognition model with the input and the output until all the speech samples have been input into the speech recognition model, whereupon the training of the speech recognition model ends, and, after the training ends, taking the speech recognition model with trained neuron weights as the target model;
evaluating an error of the target model by L(S) = -ln Π_{(h(x),z)∈S} p(z|h(x)) = -Σ_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the speech information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training speech samples, and the predicted text is the text information that the target model computes from the neuron weights and outputs after the speech information is input to the target model;
adjusting the weight of the neuron of the target model until the error is smaller than a threshold value, and setting the weight of the neuron with the error smaller than the threshold value as an ideal weight;
and deploying the target model and the ideal weight to a client.
2. The method of claim 1, wherein prior to inputting the plurality of speech samples to the speech recognition model, the method further comprises:
framing the training speech information according to preset framing parameters to obtain sentences corresponding to the training speech information, wherein the preset framing parameters comprise a frame duration, a frame count, and the overlap duration shared by adjacent frames;

and converting the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional speech information.
3. The method of claim 2, wherein the framing the training speech information according to preset framing parameters comprises:
performing a discrete Fourier transform on the two-dimensional speech information to obtain the linear spectrum X(k) corresponding to the two-dimensional speech information;

filtering the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein, when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

the expression of f(m) is:

f(m) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + m · (F_mel(f_h) - F_mel(f_l)) / (M + 1) )

wherein the band-pass filter comprises a plurality of band-pass filters with triangular filter characteristics, f_l is the lowest frequency of the band-pass filter frequency range, f_h is the highest frequency of the band-pass filter frequency range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, the F_mel function is F_mel(f) = 1125 ln(1 + f/700), the inverse function of F_mel is F_mel^{-1}(b) = 700 (e^{b/1125} - 1), and b is an integer; and

calculating, according to S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m ≤ M, the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum.
4. The method of claim 1, wherein the fully connected layer comprises a classification function, the classification function being

δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j = 1, ..., K

wherein j is a natural number, and the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z), so that each element lies in the range (0, 1) and the sum of all elements is 1.
5. The method of claim 1, wherein the input of the residual module is x, the output of the output residual module is y, and the mathematical expression of the residual module is:
y=F(x,wi)+wsx, the F (x, w)i) For the output of the independent convolutional layer, the wsAnd the weight value of the residual error module.
6. The method of claim 5, wherein F(x, w_i) takes the ReLU function as the activation function of the independent convolutional layer, and the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
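For illustration only: a minimal PyTorch sketch of a residual module of the form y = F(x, w_i) + w_s·x with ReLU activations (claims 5 and 6); the two 3×3 hidden convolutions and the 1×1 convolution standing in for the shortcut weight w_s are assumed design choices, not structures fixed by the claims.

    import torch
    import torch.nn as nn

    class ResidualModule(nn.Module):
        """y = F(x, w_i) + w_s * x, with ReLU(x) = max(0, x) as activation."""

        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.shortcut = nn.Conv2d(channels, channels, 1)  # plays the role of w_s
            self.relu = nn.ReLU()

        def forward(self, x):
            f = self.conv2(self.relu(self.conv1(x)))  # F(x, w_i)
            return self.relu(f + self.shortcut(x))    # y = F(x, w_i) + w_s * x

    y = ResidualModule(8)(torch.randn(1, 8, 40, 100))  # (batch, channels, freq, time)
    print(y.shape)  # torch.Size([1, 8, 40, 100])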
7. The method of claim 1, wherein the adjusting of the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by stochastic gradient descent.
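For illustration only: a minimal Python sketch of the stochastic-gradient-descent adjustment of claim 7, using a toy quadratic error in place of L(S); the learning rate, threshold, and toy gradient are hypothetical.

    import numpy as np

    def sgd_step(w, grad, lr=0.01):
        """One gradient-descent update of the neuron weights."""
        return w - lr * grad

    # Toy stand-in for L(S): error(w) = ||w||^2, whose gradient is 2w.
    w, threshold = np.array([0.5, -0.3]), 1e-6
    while (w ** 2).sum() >= threshold:  # adjust until the error is below the threshold
        w = sgd_step(w, 2 * w)
    ideal_weight = w                    # the weights kept as the 'ideal weight'
    print(ideal_weight)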
8. An apparatus for constructing a voice recognition model, the apparatus comprising:

an input and output module, used for acquiring a plurality of training voice samples, wherein the training voice samples comprise voice information and text labels corresponding to the voice information;
a processing module, used for constructing a voice recognition model from an independent convolutional layer, a convolution residual layer, a fully-connected layer and an output layer, wherein the convolution residual layer comprises a plurality of sequentially connected residual stacking layers, each residual stacking layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel which bypasses the sequentially connected hidden layers; sequentially inputting the plurality of voice samples into the voice recognition model through the input and output module, respectively taking the voice information and the text label corresponding to the voice information as the input and the output of the voice recognition model, and continuously training the neuron weights of the voice recognition model through these input-output pairs until all of the voice samples have been input into the voice recognition model, at which point the training of the voice recognition model is finished; after the training is finished, taking the voice recognition model with the trained neuron weights as a target model; evaluating the error of the target model by L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, and S is the set of training voice samples; the predicted text is the text information that the target model calculates from the neuron weights and outputs after the voice information is input into the target model;
and further used for adjusting the weights of the neurons of the target model until the error is smaller than a threshold, taking the neuron weights for which the error is smaller than the threshold as the ideal weights, and deploying the target model and the ideal weights to a client.
9. An apparatus for constructing a voice recognition model, comprising:
at least one processor, a memory, and an input-output unit;
wherein the memory is configured to store program code, and the processor is configured to invoke the program code stored in the memory to perform the method of any one of claims 1-7.
10. A computer storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-7.
CN201910884620.9A 2019-09-19 2019-09-19 Method, device, equipment and storage medium for constructing voice recognition model Pending CN110751944A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910884620.9A CN110751944A (en) 2019-09-19 2019-09-19 Method, device, equipment and storage medium for constructing voice recognition model
PCT/CN2019/119128 WO2021051628A1 (en) 2019-09-19 2019-11-18 Method, apparatus and device for constructing speech recognition model, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910884620.9A CN110751944A (en) 2019-09-19 2019-09-19 Method, device, equipment and storage medium for constructing voice recognition model

Publications (1)

Publication Number Publication Date
CN110751944A true CN110751944A (en) 2020-02-04

Family

ID=69276643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910884620.9A Pending CN110751944A (en) 2019-09-19 2019-09-19 Method, device, equipment and storage medium for constructing voice recognition model

Country Status (2)

Country Link
CN (1) CN110751944A (en)
WO (1) WO2021051628A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108550375A (en) * 2018-03-14 2018-09-18 鲁东大学 A kind of emotion identification method, device and computer equipment based on voice signal

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106910497A (en) * 2015-12-22 2017-06-30 阿里巴巴集团控股有限公司 A kind of Chinese word pronunciation Forecasting Methodology and device
US20190114547A1 (en) * 2017-10-16 2019-04-18 Illumina, Inc. Deep Learning-Based Splice Site Classification
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
US20190130897A1 (en) * 2017-10-27 2019-05-02 Salesforce.Com, Inc. End-to-end speech recognition with policy learning
CN108694951A (en) * 2018-05-22 2018-10-23 华南理工大学 A kind of speaker's discrimination method based on multithread hierarchical fusion transform characteristics and long memory network in short-term
CN108847223A (en) * 2018-06-20 2018-11-20 陕西科技大学 A kind of audio recognition method based on depth residual error neural network
CN109346061A (en) * 2018-09-28 2019-02-15 腾讯音乐娱乐科技(深圳)有限公司 Audio-frequency detection, device and storage medium
CN109919005A (en) * 2019-01-23 2019-06-21 平安科技(深圳)有限公司 Livestock personal identification method, electronic device and readable storage medium storing program for executing
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure
CN110010133A (en) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 Vocal print detection method, device, equipment and storage medium based on short text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG XIANGYU ET AL: "Deep Residual Learning for Image Recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111354345A (en) * 2020-03-11 2020-06-30 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating speech model and speech recognition
CN111862942A (en) * 2020-07-28 2020-10-30 苏州思必驰信息科技有限公司 Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN112597764A (en) * 2020-12-23 2021-04-02 青岛海尔科技有限公司 Text classification method and device, storage medium and electronic device
CN113012706A (en) * 2021-02-18 2021-06-22 联想(北京)有限公司 Data processing method and device and electronic equipment
CN113053361A (en) * 2021-03-18 2021-06-29 北京金山云网络技术有限公司 Speech recognition method, model training method, device, equipment and medium
CN113744729A (en) * 2021-09-17 2021-12-03 北京达佳互联信息技术有限公司 Speech recognition model generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2021051628A1 (en) 2021-03-25

Similar Documents

Publication Publication Date Title
CN110751944A (en) Method, device, equipment and storage medium for constructing voice recognition model
CN111276131B (en) Multi-class acoustic feature integration method and system based on deep neural network
CN111243620B (en) Voice separation model training method and device, storage medium and computer equipment
CN105118498B (en) Training method and device for a speech synthesis model
Wöllmer et al. Bidirectional LSTM networks for context-sensitive keyword detection in a cognitive virtual agent framework
Hossain et al. Implementation of back-propagation neural network for isolated Bangla speech recognition
CN107680582A (en) Acoustic model training method, speech recognition method, device, equipment and medium
JP2654917B2 (en) Speaker independent isolated word speech recognition system using neural network
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN109147774B (en) Improved time-delay neural network acoustic model
CN110910283A (en) Method, device, equipment and storage medium for generating legal document
Bhosale et al. End-to-End Spoken Language Understanding: Bootstrapping in Low Resource Scenarios.
Guo et al. Deep neural network based i-vector mapping for speaker verification using short utterances
Kreyssig et al. Improved TDNNs using deep kernels and frequency dependent Grid-RNNs
CN112183107A (en) Audio processing method and device
CN114678030A (en) Voiceprint identification method and device based on depth residual error network and attention mechanism
CN113822017A (en) Audio generation method, device, equipment and storage medium based on artificial intelligence
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
Ozerov et al. GMM-based classification from noisy features
Kadyan et al. Training augmentation with TANDEM acoustic modelling in Punjabi adult speech recognition system
CA2203649A1 (en) Decision tree classifier designed using hidden markov models
CN112951270B (en) Voice fluency detection method and device and electronic equipment
CN114937454A (en) Method, device and storage medium for preventing voice synthesis attack by voiceprint recognition
Daneshvar et al. Persian phoneme recognition using long short-term memory neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination