CN110751944A - Method, device, equipment and storage medium for constructing voice recognition model - Google Patents


Info

Publication number
CN110751944A
CN110751944A (Application CN201910884620.9A)
Authority
CN
China
Prior art keywords: recognition model, voice, training, residual, weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910884620.9A
Other languages
Chinese (zh)
Inventor
王健宗
贾雪丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910884620.9A (CN110751944A)
Priority to PCT/CN2019/119128 (WO2021051628A1)
Publication of CN110751944A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the field of artificial intelligence and provides a method, an apparatus, a device, and a storage medium for constructing a speech recognition model, wherein the method comprises: obtaining a plurality of training speech samples; constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer; inputting the training speech information into the speech recognition model, and updating the neuron weights of the speech recognition model through natural language processing (NLP) technology, the speech information, and the text labels corresponding to the speech information, to obtain a target model; evaluating the error of the target model by L(S) = -ln Π_{(h(x),z)∈S} p(z|h(x)) = -Σ_{(h(x),z)∈S} ln p(z|h(x)); adjusting the neuron weights of the target model until the error is smaller than a threshold, and setting the neuron weights for which the error is smaller than the threshold as the ideal weights; and deploying the target model and the ideal weights to a client. The method reduces both the influence of tone in the speech information on the predicted text and the amount of computation during recognition by the speech recognition model.

Description

Method, device, equipment and storage medium for constructing voice recognition model
Technical Field
The present application relates to the field of intelligent decision making, and in particular, to a method, an apparatus, a device, and a storage medium for constructing a speech recognition model.
Background
Speech recognition is used to convert speech into text. With the continuous development of deep learning technology, the application range of speech recognition is wider and wider.
At present, Deep Neural Networks (DNNs) have become a hot spot of research in the field of automatic speech recognition. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have achieved relatively good results in speech recognition model creation, and deep learning has become the mainstream scheme of speech recognition.
In a deep neural network, the depth of the network is often closely related to recognition accuracy, because a conventional deep neural network can extract multi-level low-, mid-, and high-level features, and the more network layers there are, the richer the extracted features. However, as the network hierarchy keeps deepening, a "degradation" phenomenon of the deep neural network appears: the accuracy of speech recognition quickly saturates, and the deeper the network hierarchy, the higher the error rate. In addition, existing speech recognition models require the speech training samples to be aligned before training, matching each frame of speech data with its corresponding label, so that the loss function used in training can accurately estimate the training error of the speech recognition model. This alignment process is tedious, complicated, and time-consuming.
Disclosure of Invention
In the embodiments of the invention, features of unlabeled data are obtained and introduced into supervised learning, which expands the usable sample data, improves the utilization efficiency of the unlabeled data, and improves the accuracy of model prediction.
In a first aspect, the present application provides a method for constructing a speech recognition model, including:
acquiring a plurality of training voice samples, wherein the training voice samples comprise voice information and text labels corresponding to the voice information;
constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, wherein the convolutional residual layer comprises a plurality of sequentially connected residual stack layers, each residual stack layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel that bypasses the plurality of sequentially connected weight layers;

inputting the plurality of speech samples into the speech recognition model in sequence, taking the speech information and the text labels corresponding to the speech information as the input and the output of the speech recognition model respectively, and continuously training the neuron weights of the speech recognition model with this input and output until all the speech samples have been input into the speech recognition model, at which point the training of the speech recognition model ends; after the training ends, taking the speech recognition model with trained neuron weights as the target model;
evaluating the error of the target model by L(S) = -ln Π_{(h(x),z)∈S} p(z|h(x)) = -Σ_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the speech information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training speech samples, and the predicted text is the text information that the target model computes from the neuron weights and outputs after the speech information is input to the target model;
adjusting the weight of the neuron of the target model until the error is smaller than a threshold value, and setting the weight of the neuron with the error smaller than the threshold value as an ideal weight;
and deploying the target model and the ideal weight to a client.
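For concreteness, the following is a minimal sketch of the layer layout named in the first aspect. PyTorch, the layer sizes, and the class and parameter names are assumptions made only for illustration; the application does not prescribe a framework or specific dimensions:

```python
import torch
import torch.nn as nn

class SpeechRecognitionModel(nn.Module):
    """Independent convolutional layer -> convolutional residual layer
    (stacked residual modules) -> fully connected layer -> output layer."""

    def __init__(self, n_mels: int = 26, n_classes: int = 30,
                 n_residual_modules: int = 4, channels: int = 32):
        super().__init__()
        # Independent convolutional layer: extracts acoustic features
        self.independent_conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU())
        # Convolutional residual layer: sequentially connected residual
        # modules, each summed with a weight-free bypass channel in forward()
        self.residual_modules = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1))
            for _ in range(n_residual_modules)])
        # Fully connected layer feeding the output layer
        self.fc = nn.Linear(channels * n_mels, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: two-dimensional speech information, shape (batch, 1, time, n_mels)
        h = self.independent_conv(x)
        for f in self.residual_modules:
            h = f(h) + h                       # bypass channel: y = F(x) + x
        h = h.permute(0, 2, 1, 3).flatten(2)   # (batch, time, channels*n_mels)
        return self.fc(h).log_softmax(dim=-1)  # per-frame class log-probs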
In some possible designs, before the plurality of speech samples are input into the speech recognition model, the method further comprises:

framing the training speech information according to preset framing parameters to obtain sentences corresponding to the training speech information, wherein the preset framing parameters comprise a frame duration, a frame count, and the overlap duration shared by adjacent frames;

and converting the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional speech information.
In some possible designs, the framing the training speech information according to preset framing parameters includes:
performing a discrete Fourier transform on the two-dimensional speech information to obtain the linear spectrum X(k) corresponding to the two-dimensional speech information;

filtering the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein, when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

the expression of f(m) is:

f(m) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + m · (F_mel(f_h) - F_mel(f_l)) / (M + 1) )

where the band-pass filter comprises a plurality of band-pass filters with triangular filter characteristics, f_l is the lowest frequency of the band-pass filter frequency range, f_h is the highest frequency of the band-pass filter frequency range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, the F_mel function is F_mel(f) = 1125 ln(1 + f/700), the inverse function of F_mel is F_mel^{-1}(b) = 700 (e^{b/1125} - 1), and b is an integer;

calculating, according to S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m ≤ M, the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum.
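For illustration, the f(m) computation above can be written down directly from the stated formulas; a short Python sketch (the function and parameter names are my own, not from the application):

```python
import math

def f_mel(f_hz: float) -> float:
    """F_mel(f) = 1125 ln(1 + f/700), as defined above."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def f_mel_inv(b: float) -> float:
    """Inverse mel mapping: F_mel^{-1}(b) = 700 (e^{b/1125} - 1)."""
    return 700.0 * (math.exp(b / 1125.0) - 1.0)

def center_frequency(m: int, n_dft: int, f_s: float, f_l: float,
                     f_h: float, n_filters: int) -> float:
    """f(m): center of the m-th triangular filter, in DFT bins; the
    centers are spaced uniformly on the mel scale between f_l and f_h."""
    step = (f_mel(f_h) - f_mel(f_l)) / (n_filters + 1)
    return (n_dft / f_s) * f_mel_inv(f_mel(f_l) + m * step)
```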
In some possible designs, the fully connected layer comprises a classification function, the classification function being

δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j = 1, ..., K

where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z), so that each element lies in the range (0, 1) and the sum of all elements is 1.
In some possible designs, if the input of the residual module is x and the output of the residual module is y, the mathematical expression of the residual module is:

y = F(x, w_i) + w_s·x

where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.

In some possible designs, F(x, w_i) uses the ReLU function as the activation function of the independent convolutional layer; the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
In some possible designs, the adjusting of the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by a stochastic gradient descent method.
In a second aspect, the present application provides an apparatus for constructing a speech recognition model, having the functions of implementing the method for constructing a speech recognition model provided in the first aspect. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software comprises one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
The device for constructing the speech recognition model comprises:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a plurality of training voice samples, and the training voice samples comprise voice information and text labels corresponding to the voice information;
a processing module, configured to: construct a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, wherein the convolutional residual layer comprises a plurality of sequentially connected residual stack layers, each residual stack layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel that bypasses the plurality of sequentially connected weight layers; input the plurality of speech samples into the speech recognition model in sequence through an input/output module, taking the speech information and the text labels corresponding to the speech information as the input and the output of the speech recognition model respectively, and continuously train the neuron weights of the speech recognition model with this input and output until all the speech samples have been input into the speech recognition model, at which point the training ends, the speech recognition model with trained neuron weights being taken as the target model; and evaluate the error of the target model by L(S) = -ln Π_{(h(x),z)∈S} p(z|h(x)) = -Σ_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the speech information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training speech samples, and the predicted text is the text information that the target model computes from the neuron weights and outputs after the speech information is input to the target model;
the processing module being further configured to adjust the neuron weights of the target model until the error is smaller than a threshold, set the neuron weights for which the error is smaller than the threshold as the ideal weights, and deploy the target model and the ideal weights to a client.
In some possible designs, the processing module is further configured to:

frame the training speech information according to preset framing parameters to obtain sentences corresponding to the training speech information, wherein the preset framing parameters comprise a frame duration, a frame count, and the overlap duration shared by adjacent frames;

and convert the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional speech information.
In some possible designs, the processing module is further configured to:
perform a discrete Fourier transform on the two-dimensional speech information to obtain the linear spectrum X(k) corresponding to the two-dimensional speech information;

filter the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein, when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

the expression of f(m) is:

f(m) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + m · (F_mel(f_h) - F_mel(f_l)) / (M + 1) )

where the band-pass filter comprises a plurality of band-pass filters with triangular filter characteristics, f_l is the lowest frequency of the band-pass filter frequency range, f_h is the highest frequency of the band-pass filter frequency range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, the F_mel function is F_mel(f) = 1125 ln(1 + f/700), the inverse function of F_mel is F_mel^{-1}(b) = 700 (e^{b/1125} - 1), and b is an integer;

and calculate, according to S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m ≤ M, the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum.
In some possible designs, the fully connected layer comprises a classification function, the classification function being

δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j = 1, ..., K

where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z), so that each element lies in the range (0, 1) and the sum of all elements is 1.
In some possible designs, the processing module is further configured such that, if the input of the residual module is x and the output of the residual module is y, the mathematical expression of the residual module is:

y = F(x, w_i) + w_s·x

where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.

In some possible designs, F(x, w_i) uses the ReLU function as the activation function of the independent convolutional layer; the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
In some possible designs, the adjusting of the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by a stochastic gradient descent method.
In yet another aspect, the present application provides an apparatus for constructing a speech recognition model, which includes at least one connected processor, a memory, and an input/output unit, wherein the memory is used for storing program codes, and the processor is used for calling the program codes in the memory to execute the method of the above aspects.
Yet another aspect of the present application provides a computer storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of the above-described aspects.
According to the method, the input information x is routed directly to the output of the hidden layers through the bypass channel. The bypass channel carries no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper: the whole network only needs to train the part in which input and output differ, i.e., after the input information x is passed through, each residual module learns only the residual F(x). This simplifies the training target and its difficulty, so the neural network is stable and easy to train, and the performance of the speech recognition model gradually improves as the network depth increases. The predicted text of the speech recognition model is evaluated with a CTC loss function, so the exact mapping between the pronunciation phonemes in the text labels and the sequence of the training speech information need not be considered; the speech recognition model can be trained with only input sequences and output sequences, which saves the cost of producing the training speech sample set. In addition, a triangular band-pass filter is used to smooth the spectrum of the training speech information, eliminate its harmonics, and highlight the formants of the original sound, which prevents tone in the speech information from influencing the predicted text of the speech recognition model and reduces the amount of computation on the speech information during recognition.
Drawings
FIG. 1 is a schematic flow chart illustrating a method for constructing a speech recognition model according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an apparatus for constructing a speech recognition model according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an apparatus for constructing a speech recognition model in an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus, and such that a division of modules presented in this application is merely a logical division that may be implemented in an actual application in a different manner, such that multiple modules may be combined or integrated into another system, or some features may be omitted, or may not be implemented.
In order to solve the technical problems, the application mainly provides the following technical scheme:
according to the method, the input information x is directly detoured to the output of the hidden layer through the bypass channel, the bypass channel has no weight, the integrity of the input information x is protected, the neural network training is deeper, only the part with difference between input and output needs to be trained by the whole neural network, namely after the input information x is transmitted, each residual error module only learns the residual error F (x), the training target and difficulty are simplified, the neural network is stable and easy to train, along with the increase of the neural network depth, the performance of the voice recognition model is gradually improved, the prediction text of the voice recognition model is evaluated by a CTC loss function, the accurate mapping relation between pronunciation phonemes in a text label and sequences of training voice information does not need to be considered, the voice recognition model can be trained only by inputting the sequences and outputting the sequences, and the manufacturing cost of a training voice sample set is saved. In addition, the triangular band-pass filter is adopted to smooth the frequency spectrum of the training voice information, eliminate the harmonic waves in the training voice information, highlight the formants of the original sound, avoid the influence of the tones in the voice information on the predicted text of the voice recognition model, and reduce the computation amount of the voice information in the recognition process of the voice recognition model.
Referring to fig. 1, a method for constructing a speech recognition model according to the present application is illustrated, and the method includes:
101. A plurality of training speech samples are obtained.
The training speech samples include speech information and text labels corresponding to the speech information.
The text labels are used for marking pronunciation phonemes of the training voice information.
The speech information is obtained from a pre-made recording: the recorded content is transcribed into a text, the words in the text are numbered in order, and each word is labeled with its pronunciation phonemes to obtain the text label. Each pronunciation phoneme in the text label corresponds to one or more frames of data in the recording.
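By way of illustration only, one training speech sample as described here could be represented as follows; this is a sketch, and the field names and phoneme inventory are assumptions, not part of the application:

```python
# One training speech sample: the recording plus a text label in which every
# word is numbered in order and annotated with its pronunciation phonemes.
sample = {
    "audio": "recordings/0001.wav",   # pre-recorded speech information
    "text_label": [
        {"index": 0, "word": "hello", "phonemes": ["HH", "AH", "L", "OW"]},
        {"index": 1, "word": "world", "phonemes": ["W", "ER", "L", "D"]},
    ],
    # Each phoneme corresponds to one or more frames of the recording.
}
```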
102. A speech recognition model is constructed from the independent convolutional layer, the convolutional residual layer, the fully connected layer, and the output layer.
The convolutional residual layer includes a plurality of sequentially connected residual stack layers. The residual stack layer comprises a plurality of sequentially connected residual modules. The residual module comprises a plurality of hidden layers which are connected in sequence and a bypass channel which bypasses the weight layers which are connected in sequence.
The independent convolution layer is used for extracting acoustic features from the voice information, eliminating non-maximum values in the acoustic features and reducing the complexity of the acoustic features. The acoustic features include pronunciation of specific syllables, user read-through habits, and speech spectrum, among others.
The convolution residual layer is used to map the acoustic features to the hidden layer feature space.
The full connection layer is used for integrating the acoustic features mapped to the hidden layer feature space so as to obtain the meanings of the acoustic features, and the probabilities corresponding to various text types are output according to the meanings.
The output layer is used for outputting the text corresponding to the voice information according to the probability corresponding to each text type.
The speech recognition model in this embodiment adds a bypass channel to the plurality of sequentially connected hidden layers, which solves the problem that the training accuracy of a conventional neural network decreases as the number of network layers grows. The convolutional residual layer of the speech recognition model has a plurality of bypass channels; each bypass channel serves as a branch line of the hidden layers, realizing cross-layer connections between hidden layers: the input of a hidden layer is connected directly to the next layer, so that the next layer can learn the residual directly.
In particular, as shown in fig. 2, in one residual module the cross-layer connection typically spans only 2 to 3 hidden layers, although spanning more hidden layers is not excluded; spanning only a single hidden layer brings little benefit, and its experimental effect is not ideal.

Assume the input of the residual module is x and the expected output is h(x), i.e., h(x) is the desired underlying mapping, which is usually difficult to learn directly. If the input x is passed directly to the output as the initial result, the target the residual module needs to learn becomes F(x) = h(x) - x. Compared with a conventional neural network, the speech recognition model in this embodiment thus changes the learning target: instead of learning a complete output, it learns the difference between the optimal solution h(x) and the identity mapping x, i.e., the residual F(x) = h(x) - x.
In terms of the overall function, if {w_i} denotes all the weights of the residual module, the output actually computed by the residual module is:

y = F(x, {w_i}) + x

Taking a span of two hidden layers as an example and ignoring the bias, F(x, {w_i}) = w_2·δ(w_1·x) = w_2·ReLU(w_1·x), where the ReLU function is the activation function of the residual module.
It should be understood that F(x, {w_i}) must have the same dimension as x. If their dimensions differ, an additional weight matrix w_s can be introduced to project x linearly so that F(x, {w_i}) and w_s·x have the same dimension; accordingly, the output of the residual module becomes: y = F(x, {w_i}) + w_s·x.
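A minimal PyTorch sketch of this residual computation (the framework and the convolutional layer types are assumptions made for illustration), spanning two hidden layers and introducing the projection w_s only when the dimensions differ:

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """y = F(x, {w_i}) + w_s·x, spanning two hidden layers (biases omitted)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.w1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                            padding=1, bias=False)
        self.w2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                            padding=1, bias=False)
        self.relu = nn.ReLU()
        # Bypass channel: identity when dimensions match; otherwise a 1x1
        # linear projection w_s so that F(x) and w_s·x can be added.
        if in_channels == out_channels:
            self.ws = nn.Identity()
        else:
            self.ws = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fx = self.w2(self.relu(self.w1(x)))   # F(x, {w_i}) = w2·ReLU(w1·x)
        return fx + self.ws(x)                # add the bypass channel
```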
The plurality of speech samples are input into the speech recognition model in sequence, with the speech information and the text labels corresponding to the speech information taken as the input and the output of the speech recognition model respectively, and the neuron weights of the speech recognition model are trained continuously with this input and output until all the speech samples have been input, at which point the training ends. After the training ends, the speech recognition model with trained neuron weights is taken as the target model.
During training, the neuron weights of the speech recognition model are initialized randomly; the training speech information is then used as the input of the speech recognition model, and the text label corresponding to the training speech information is used as its output reference. The training speech information propagates forward through the speech recognition model, each layer of randomly initialized neurons classifies it, and a predicted text corresponding to the training speech information is finally obtained. The neuron weights are then updated according to the difference between the predicted text output by the speech recognition model and the text label, and the next iteration proceeds, until the neuron weights approach the required values.
103. The error of the target model is evaluated by L(S) = -ln Π_{(h(x),z)∈S} p(z|h(x)) = -Σ_{(h(x),z)∈S} ln p(z|h(x)).

Here L(S) is the error, x is the speech information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, and S is the plurality of training speech samples. The predicted text refers to the text information that the target model computes from the neuron weights and outputs after the speech information is input to the target model.
The CTC loss function measures the degree of difference between the predicted text output by the speech recognition model and the actual text labels, and has the advantage of not requiring forced alignment of the input data with the output data. Unlike the cross-entropy criterion, which requires frame-level alignment between input features and target labels, the CTC loss function automatically learns the alignment between the speech data and the label sequence (e.g., phonemes or characters), so forced alignment of the data is unnecessary and the input data and labels need not be the same length. Because the predicted text of the speech recognition model is evaluated with the CTC loss function, the exact mapping between the pronunciation phonemes in the text labels and the sequence of the training speech information need not be considered; the model can be trained with only input sequences and output sequences, which saves the cost of producing the training speech sample set.
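For reference, this loss can be evaluated with an off-the-shelf CTC implementation; below is a sketch using PyTorch's nn.CTCLoss, where the tensor shapes and sizes are illustrative assumptions rather than values from the application:

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0)  # index 0 reserved for the CTC blank symbol

# h(x): model outputs as log-probabilities, shape (T, batch, num_classes)
T, batch, num_classes = 120, 4, 30
log_probs = torch.randn(T, batch, num_classes).log_softmax(dim=2)

# z: label sequences (phoneme/character ids), one row per sample
targets = torch.randint(low=1, high=num_classes, size=(batch, 20))
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 20, dtype=torch.long)

# No frame-level alignment is required: CTC sums over all alignments,
# giving L(S) = -sum ln p(z | h(x)) over the batch.
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```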
104. The neuron weights of the target model are adjusted until the error is smaller than the threshold, and the neuron weights for which the error is smaller than the threshold are set as the ideal weights.
The error over the corresponding training speech sample set is computed with the CTC loss function, and the error is back-propagated through the speech recognition model by a gradient descent algorithm to update target parameters such as weights and thresholds, continuously improving the accuracy of the speech recognition model until the convergence requirement is met.
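Putting steps 103 and 104 together, one training iteration might look like the following sketch, reusing the model and ctc_loss objects assumed in the earlier examples:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(features, targets, input_lengths, target_lengths):
    """One iteration: forward pass, CTC error, back-propagation, SGD update."""
    log_probs = model(features).permute(1, 0, 2)  # CTCLoss wants (T, batch, C)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()     # back-propagate the CTC error through the model
    optimizer.step()    # gradient-descent update of the neuron weights
    return loss.item()  # compare against the threshold to decide convergence
```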
105. The target model and the ideal weights are deployed to the client.
Compared with the prior art, in this application the input information x is routed directly to the output of the hidden layers through the bypass channel. The bypass channel carries no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper: the whole network only needs to train the part in which input and output differ, i.e., after the input information x is passed through, each residual module learns only the residual F(x). This simplifies the training target and its difficulty, so the neural network is stable and easy to train, and the performance of the speech recognition model gradually improves as the network depth increases. The predicted text of the speech recognition model is evaluated with the CTC loss function, so the exact mapping between the pronunciation phonemes in the text labels and the sequence of the training speech information need not be considered; the speech recognition model can be trained with only input sequences and output sequences, which saves the cost of producing the training speech sample set. In addition, a triangular band-pass filter is used to smooth the spectrum of the training speech information, eliminate its harmonics, and highlight the formants of the original sound, which prevents tone in the speech information from influencing the predicted text of the speech recognition model and reduces the amount of computation on the speech information during recognition.
In some embodiments, before inputting the plurality of speech samples to the speech recognition model, the method further comprises:
framing the training speech information according to preset framing parameters to obtain sentences corresponding to the training speech information, wherein the preset framing parameters comprise a frame duration, a frame count, and the overlap duration shared by adjacent frames;

and converting the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional speech information.
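A numpy sketch of the framing step is shown below; the parameter values are illustrative assumptions, since the application only states that the frame duration, frame count, and the overlap shared by adjacent frames are preset:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, overlap_ms: float = 10.0) -> np.ndarray:
    """Split a waveform into overlapping frames; the frame duration and the
    duration shared by adjacent frames are the preset framing parameters."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = frame_len - int(sample_rate * overlap_ms / 1000)
    assert len(signal) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(signal) - frame_len) // step
    frames = np.stack([signal[i * step: i * step + frame_len]
                       for i in range(n_frames)])
    return frames  # shape (n_frames, frame_len): two-dimensional speech data
```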
In some embodiments, framing the training speech information according to the preset framing parameters includes:
performing a discrete Fourier transform on the two-dimensional speech information to obtain the linear spectrum X(k) corresponding to the two-dimensional speech information;

filtering the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein, when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

the expression of f(m) is:

f(m) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + m · (F_mel(f_h) - F_mel(f_l)) / (M + 1) )

where the band-pass filter comprises a plurality of band-pass filters with triangular filter characteristics, f_l is the lowest frequency of the band-pass filter frequency range, f_h is the highest frequency of the band-pass filter frequency range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, the F_mel function is F_mel(f) = 1125 ln(1 + f/700), the inverse function of F_mel is F_mel^{-1}(b) = 700 (e^{b/1125} - 1), and b is an integer;

calculating, according to S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m ≤ M, the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum.
In the above embodiment, the human response to sound pressure is logarithmic: people are less sensitive to fine variations at high sound pressure than at low sound pressure. Using the logarithm also reduces the sensitivity of the extracted features to variations in the energy of the input sound, since the distance between the sound source and the microphone varies and with it the energy of the sound picked up by the microphone. The spectrogram is a visual representation of the time-frequency distribution of sound energy that effectively exploits the correlation between the time and frequency domains; a feature vector sequence obtained from spectrogram analysis extracts acoustic features well, so inputting it into the speech recognition model yields higher accuracy in subsequent operations. A triangular band-pass filter is used to smooth the spectrum of the training speech information, eliminate its harmonics, and highlight the formants of the original sound. The tone or pitch of a passage of sound in the training speech information is therefore not reflected in the acoustic features, i.e., the predicted text of the speech recognition model is not influenced by tonal differences in the speech information, and the amount of computation on the speech information during recognition is reduced.
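As a concrete illustration of the triangular filter bank and log-energy formulas above, here is a numpy sketch; the filter count and frequency range are assumptions for illustration, and the mel-scale helpers restate the F_mel definitions given earlier:

```python
import numpy as np

def log_mel_energies(x_k: np.ndarray, f_s: float, n_dft: int,
                     n_filters: int = 26) -> np.ndarray:
    """S(m) = ln(sum_k |X(k)|^2 H_m(k)) for M triangular band-pass filters
    whose center frequencies f(m) are equally spaced on the mel scale."""
    f_mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)
    f_mel_inv = lambda b: 700.0 * (np.exp(b / 1125.0) - 1.0)

    f_l, f_h = 0.0, f_s / 2.0          # lowest/highest filter frequencies
    mels = np.linspace(f_mel(f_l), f_mel(f_h), n_filters + 2)
    bins = np.floor((n_dft / f_s) * f_mel_inv(mels)).astype(int)  # f(m)

    power = np.abs(x_k[: n_dft // 2 + 1]) ** 2
    energies = np.empty(n_filters)
    for m in range(1, n_filters + 1):
        h_m = np.zeros_like(power)     # triangular transfer function H_m(k)
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        h_m[lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)   # rising edge
        h_m[c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)   # falling edge
        energies[m - 1] = np.log(np.dot(power, h_m) + 1e-10)   # log energy
    return energies
```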
In some embodiments, the fully connected layer includes a classification function. The classification function is δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j = 1, ..., K, where j is a natural number; it compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z), so that each element lies in the range (0, 1) and the sum of all elements is 1.
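This classification function is the standard softmax; a short numpy sketch for reference:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """delta(z)_j = exp(z_j) / sum_k exp(z_k): compresses a K-dimensional
    vector so each element lies in (0, 1) and all elements sum to 1."""
    e = np.exp(z - np.max(z))   # shift by max(z) for numerical stability
    return e / e.sum()
```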
In some embodiments, if the input of the residual module is x and the output of the residual module is y, the mathematical expression of the residual module is y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
In the foregoing embodiment, the speech recognition model adds a bypass channel to the plurality of sequentially connected hidden layers, which solves the problem that the training accuracy of a conventional neural network decreases as the number of network layers grows. The convolutional residual layer of the speech recognition model has a plurality of bypass channels; each bypass channel serves as a branch line of the hidden layers, realizing cross-layer connections between hidden layers: the input of a hidden layer is connected directly to the next layer, so that the next layer can learn the residual directly.

In particular, in one residual module the cross-layer connection typically spans only 2 to 3 hidden layers, although spanning more hidden layers is not excluded; spanning only a single hidden layer brings little benefit, and its experimental effect is not ideal.
Assume the input of the residual module is x and the expected output is h(x), i.e., h(x) is the desired underlying mapping, which is usually difficult to learn directly. If the input x is passed directly to the output as the initial result, the target the residual module needs to learn becomes F(x) = h(x) - x. Compared with a conventional neural network, the speech recognition model in this embodiment thus changes the learning target: instead of learning a complete output, it learns the difference between the optimal solution h(x) and the identity mapping x, i.e., the residual F(x) = h(x) - x. In terms of the overall function, if {w_i} denotes all the weights of the residual module, the output actually computed by the residual module is y = F(x, {w_i}) + x. Taking a span of two hidden layers as an example and ignoring the bias, F(x, {w_i}) = w_2·δ(w_1·x) = w_2·ReLU(w_1·x), where ReLU(·) is the activation function of the residual module.

It should be understood that F(x, {w_i}) must have the same dimension as x. If their dimensions differ, an additional weight matrix w_s can be introduced to project x linearly so that F(x, {w_i}) and w_s·x have the same dimension; accordingly, the output of the residual module is: y = F(x, {w_i}) + w_s·x.
In some embodiments, F(x, w_i) uses the ReLU function as the activation function of the independent convolutional layer; the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
In the above embodiment, the neural network can be trained by the above formula.
In some embodiments, adjusting the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by a stochastic gradient descent method.

In the above embodiment, the stochastic gradient descent algorithm effectively avoids redundant computation and takes less time; those skilled in the art may of course use other algorithms.
Fig. 2 is a schematic structural diagram of an apparatus 20 for constructing a speech recognition model, which can be applied to constructing a speech recognition model. The apparatus for constructing a speech recognition model in this embodiment of the present application can implement the steps of the method for constructing a speech recognition model performed in the embodiment corresponding to fig. 1. The functions performed by the apparatus 20 may be implemented by hardware, or by hardware executing corresponding software; the hardware or software comprises one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The apparatus for constructing a speech recognition model may include an input/output module 201 and a processing module 202; for the functions of the processing module 202 and the input/output module 201, refer to the operations performed in the embodiment corresponding to fig. 1, which are not repeated here. The input/output module 201 may be used to control the input, output, and acquisition operations of the apparatus.
In some embodiments, the input-output module 201 is operable to obtain a plurality of training speech samples, where the training speech samples include speech information and text labels corresponding to the speech information;
the processing module 202 may be configured to construct a speech recognition model by an independent convolutional layer, a convolutional residual layer, a fully-connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially-connected residual stacked layers, the residual stacked layers include a plurality of sequentially-connected residual modules, and the residual modules include a plurality of sequentially-connected hidden layers and bypass channels that bypass the plurality of sequentially-connected weight layers; sequentially inputting a plurality of voice samples to the voice recognition model through the input/output module, respectively using the voice information and text labels corresponding to the voice information as input and output of the voice recognition model, continuously training neuron weights of the voice recognition model through the input and the output until the voice samples are input to the voice recognition model, finishing the training of the voice recognition model, and after the training is finished, using the voice recognition model with trained neuron weights as a target model; by L (S) ═ ln |(h(x),z)∈Sp(z|h(x))=-∑(h(x),z)∈Sln p (z | h (x)) evaluating the error of the target model; wherein L (S) is the error, x is the speech information, z is the text label, p (z | h (x)) is the predicted text and the textThe similarity of the label is S, the plurality of training voice samples are obtained, and the predicted text refers to the text information which is calculated and output by the target model according to the weight of the neuron after the voice information is input to the target model; and adjusting the weight of the neuron of the target model until the error is smaller than a threshold value, and setting the weight of the neuron with the error smaller than the threshold value as an ideal weight. And deploying the target model and the ideal weight to a client.
In some embodiments, the processing module 202 is further configured to:
frame the training speech information according to preset framing parameters to obtain sentences corresponding to the training speech information, wherein the preset framing parameters comprise a frame duration, a frame count, and the overlap duration shared by adjacent frames;

and convert the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional speech information.
In some embodiments, the processing module 202 is further configured to:
perform a discrete Fourier transform on the two-dimensional speech information to obtain the linear spectrum X(k) corresponding to the two-dimensional speech information;

filter the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein, when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

the expression of f(m) is:

f(m) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + m · (F_mel(f_h) - F_mel(f_l)) / (M + 1) )

where the band-pass filter comprises a plurality of band-pass filters with triangular filter characteristics, f_l is the lowest frequency of the band-pass filter frequency range, f_h is the highest frequency of the band-pass filter frequency range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, the F_mel function is F_mel(f) = 1125 ln(1 + f/700), the inverse function of F_mel is F_mel^{-1}(b) = 700 (e^{b/1125} - 1), and b is an integer;

and calculate, according to S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m ≤ M, the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum;
In some embodiments, the fully connected layer comprises a classification function, the classification function being

δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j = 1, ..., K

where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z), so that each element lies in the range (0, 1) and the sum of all elements is 1.
In some embodiments, the input of the residual module is x, the output of the residual module is y, and the mathematical expression of the residual module is y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.

In some embodiments, F(x, w_i) uses the ReLU function as the activation function of the independent convolutional layer; the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
In some embodiments, the adjusting of the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by a stochastic gradient descent method.
The apparatus in the embodiments of the present application is described above from the perspective of modular functional entities; the following describes an apparatus for constructing a speech recognition model from the perspective of hardware. As shown in fig. 3, it comprises: a processor, a memory, an input/output unit (which may also be a transceiver, not identified in fig. 3), and a computer program stored in the memory and executable on the processor. For example, the computer program may be a program corresponding to the method for constructing a speech recognition model in the embodiment corresponding to fig. 1. When the apparatus for constructing a speech recognition model implements the functions of the apparatus 20 shown in fig. 2, the processor executes the computer program to implement the steps of the method performed by the apparatus 20 in the embodiment corresponding to fig. 2; alternatively, when executing the computer program, the processor implements the functions of the modules in the apparatus 20 of the embodiment corresponding to fig. 2.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor; the processor is the control center of the computer device and connects the various parts of the whole computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the device (such as audio data or video data). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input/output unit may also be replaced by a receiver and a transmitter, which may be the same or different physical entities; when they are the same physical entity, they may be collectively referred to as an input/output unit. The input/output unit may be a transceiver.
The memory may be integrated in the processor or may be provided separately from the processor.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes several instructions for enabling a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the drawings, but the present application is not limited to the above-mentioned embodiments, which are only illustrative and not restrictive, and those skilled in the art can make many changes and modifications without departing from the spirit and scope of the present application and the protection scope of the claims, and all changes and modifications that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (10)

1. A method of constructing a speech recognition model, the method comprising:
acquiring a plurality of training voice samples, wherein the training voice samples comprise voice information and text labels corresponding to the voice information;
constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, wherein the convolutional residual layer comprises a plurality of sequentially connected residual stack layers, each residual stack layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel that bypasses the plurality of sequentially connected weight layers;

inputting the plurality of speech samples into the speech recognition model in sequence, taking the speech information and the text labels corresponding to the speech information as the input and the output of the speech recognition model respectively, continuously training the neuron weights of the speech recognition model with the input and the output until all the speech samples have been input into the speech recognition model, whereupon the training of the speech recognition model ends, and, after the training ends, taking the speech recognition model with trained neuron weights as the target model;
evaluating an error of the target model by L(S) = -ln Π_{(h(x),z)∈S} p(z|h(x)) = -Σ_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the speech information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training speech samples, and the predicted text is the text information that the target model computes from the neuron weights and outputs after the speech information is input to the target model;
adjusting the weight of the neuron of the target model until the error is smaller than a threshold value, and setting the weight of the neuron with the error smaller than the threshold value as an ideal weight;
and deploying the target model and the ideal weight to a client.
2. The method of claim 1, wherein prior to inputting the plurality of speech samples to the speech recognition model, the method further comprises:
framing the training speech information according to preset framing parameters to obtain sentences corresponding to the training speech information, wherein the preset framing parameters comprise a frame duration, a frame count, and the overlap duration shared by adjacent frames;

and converting the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional speech information.
3. The method of claim 2, wherein the framing the training speech information according to preset framing parameters comprises:
performing a discrete Fourier transform on the two-dimensional speech information to obtain the linear spectrum X(k) corresponding to the two-dimensional speech information;

filtering the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein, when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

the expression of f(m) is:

f(m) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + m · (F_mel(f_h) - F_mel(f_l)) / (M + 1) )

wherein the band-pass filter comprises a plurality of band-pass filters with triangular filter characteristics, f_l is the lowest frequency of the band-pass filter frequency range, f_h is the highest frequency of the band-pass filter frequency range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, the F_mel function is F_mel(f) = 1125 ln(1 + f/700), the inverse function of F_mel is F_mel^{-1}(b) = 700 (e^{b/1125} - 1), and b is an integer; and

calculating, according to S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m ≤ M, the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum.
4. The method of claim 1, wherein the fully connected layer comprises a classification function, the classification function being

δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j = 1, ..., K

wherein j is a natural number, and the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z), so that each element lies in the range (0, 1) and the sum of all elements is 1.
5. The method of claim 1, wherein the input of the residual module is x, the output of the output residual module is y, and the mathematical expression of the residual module is:
y=F(x,wi)+wsx, the F (x, w)i) For the output of the independent convolutional layer, the wsAnd the weight value of the residual error module.
6. The method of claim 5, wherein F(x, w_i) takes the ReLU function as the activation function of the independent convolutional layer, and the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
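For illustration only: a minimal PyTorch sketch of a residual module of the form y = F(x, w_i) + w_s·x with ReLU activations (claims 5 and 6); the two 3×3 hidden convolutions and the 1×1 convolution standing in for the shortcut weight w_s are assumed design choices, not structures fixed by the claims.

    import torch
    import torch.nn as nn

    class ResidualModule(nn.Module):
        """y = F(x, w_i) + w_s * x, with ReLU(x) = max(0, x) as activation."""

        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.shortcut = nn.Conv2d(channels, channels, 1)  # plays the role of w_s
            self.relu = nn.ReLU()

        def forward(self, x):
            f = self.conv2(self.relu(self.conv1(x)))  # F(x, w_i)
            return self.relu(f + self.shortcut(x))    # y = F(x, w_i) + w_s * x

    y = ResidualModule(8)(torch.randn(1, 8, 40, 100))  # (batch, channels, freq, time)
    print(y.shape)  # torch.Size([1, 8, 40, 100])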
7. The method of claim 1, wherein the adjusting of the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by stochastic gradient descent.
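For illustration only: a minimal Python sketch of the stochastic-gradient-descent adjustment of claim 7, using a toy quadratic error in place of L(S); the learning rate, threshold, and toy gradient are hypothetical.

    import numpy as np

    def sgd_step(w, grad, lr=0.01):
        """One gradient-descent update of the neuron weights."""
        return w - lr * grad

    # Toy stand-in for L(S): error(w) = ||w||^2, whose gradient is 2w.
    w, threshold = np.array([0.5, -0.3]), 1e-6
    while (w ** 2).sum() >= threshold:  # adjust until the error is below the threshold
        w = sgd_step(w, 2 * w)
    ideal_weight = w                    # the weights kept as the 'ideal weight'
    print(ideal_weight)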
8. An apparatus for constructing a voice recognition model, the apparatus comprising:

an input and output module, used for acquiring a plurality of training voice samples, wherein the training voice samples comprise voice information and text labels corresponding to the voice information;
a processing module, used for constructing a voice recognition model from an independent convolutional layer, a convolution residual layer, a fully-connected layer and an output layer, wherein the convolution residual layer comprises a plurality of sequentially connected residual stacking layers, each residual stacking layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel which bypasses the sequentially connected hidden layers; sequentially inputting the plurality of voice samples into the voice recognition model through the input and output module, respectively taking the voice information and the text label corresponding to the voice information as the input and the output of the voice recognition model, and continuously training the neuron weights of the voice recognition model through these input-output pairs until all of the voice samples have been input into the voice recognition model, at which point the training of the voice recognition model is finished; after the training is finished, taking the voice recognition model with the trained neuron weights as a target model; evaluating the error of the target model by L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, and S is the set of training voice samples; the predicted text is the text information that the target model calculates from the neuron weights and outputs after the voice information is input into the target model;
and further used for adjusting the weights of the neurons of the target model until the error is smaller than a threshold, taking the neuron weights for which the error is smaller than the threshold as the ideal weights, and deploying the target model and the ideal weights to a client.
9. An apparatus for constructing a voice recognition model, comprising:
at least one processor, a memory, and an input-output unit;
wherein the memory is configured to store program code, and the processor is configured to invoke the program code stored in the memory to perform the method of any one of claims 1-7.
10. A computer storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-7.
CN201910884620.9A 2019-09-19 2019-09-19 Method, device, equipment and storage medium for constructing voice recognition model Pending CN110751944A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910884620.9A CN110751944A (en) 2019-09-19 2019-09-19 Method, device, equipment and storage medium for constructing voice recognition model
PCT/CN2019/119128 WO2021051628A1 (en) 2019-09-19 2019-11-18 Method, apparatus and device for constructing speech recognition model, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910884620.9A CN110751944A (en) 2019-09-19 2019-09-19 Method, device, equipment and storage medium for constructing voice recognition model

Publications (1)

Publication Number Publication Date
CN110751944A true CN110751944A (en) 2020-02-04

Family

ID=69276643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910884620.9A Pending CN110751944A (en) 2019-09-19 2019-09-19 Method, device, equipment and storage medium for constructing voice recognition model

Country Status (2)

Country Link
CN (1) CN110751944A (en)
WO (1) WO2021051628A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108550375A (en) * 2018-03-14 2018-09-18 鲁东大学 A kind of emotion identification method, device and computer equipment based on voice signal

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106910497A (en) * 2015-12-22 2017-06-30 阿里巴巴集团控股有限公司 A kind of Chinese word pronunciation Forecasting Methodology and device
US20190114547A1 (en) * 2017-10-16 2019-04-18 Illumina, Inc. Deep Learning-Based Splice Site Classification
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
US20190130897A1 (en) * 2017-10-27 2019-05-02 Salesforce.Com, Inc. End-to-end speech recognition with policy learning
CN108694951A (en) * 2018-05-22 2018-10-23 华南理工大学 A kind of speaker's discrimination method based on multithread hierarchical fusion transform characteristics and long memory network in short-term
CN108847223A (en) * 2018-06-20 2018-11-20 陕西科技大学 A kind of audio recognition method based on depth residual error neural network
CN109346061A (en) * 2018-09-28 2019-02-15 腾讯音乐娱乐科技(深圳)有限公司 Audio-frequency detection, device and storage medium
CN109919005A (en) * 2019-01-23 2019-06-21 平安科技(深圳)有限公司 Livestock personal identification method, electronic device and readable storage medium storing program for executing
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure
CN110010133A (en) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 Vocal print detection method, device, equipment and storage medium based on short text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG XIANGYU ET AL: "Deep Residual Learning for Image Recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111354345A (en) * 2020-03-11 2020-06-30 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating speech model and speech recognition
CN111862942A (en) * 2020-07-28 2020-10-30 苏州思必驰信息科技有限公司 Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN112597764A (en) * 2020-12-23 2021-04-02 青岛海尔科技有限公司 Text classification method and device, storage medium and electronic device
CN113012706A (en) * 2021-02-18 2021-06-22 联想(北京)有限公司 Data processing method and device and electronic equipment
CN113053361A (en) * 2021-03-18 2021-06-29 北京金山云网络技术有限公司 Speech recognition method, model training method, device, equipment and medium
CN113744729A (en) * 2021-09-17 2021-12-03 北京达佳互联信息技术有限公司 Speech recognition model generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2021051628A1 (en) 2021-03-25

Similar Documents

Publication Publication Date Title
CN110751944A (en) Method, device, equipment and storage medium for constructing voice recognition model
CN111276131B (en) Multi-class acoustic feature integration method and system based on deep neural network
CN111243620B (en) Voice separation model training method and device, storage medium and computer equipment
CN105118498B (en) Training method and device for a speech synthesis model
Wöllmer et al. Bidirectional LSTM networks for context-sensitive keyword detection in a cognitive virtual agent framework
Hossain et al. Implementation of back-propagation neural network for isolated Bangla speech recognition
CN107680582A (en) Acoustic model training method, speech recognition method, device, equipment and medium
JP2654917B2 (en) Speaker independent isolated word speech recognition system using neural network
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN109147774B (en) Improved time-delay neural network acoustic model
CN110910283A (en) Method, device, equipment and storage medium for generating legal document
Bhosale et al. End-to-End Spoken Language Understanding: Bootstrapping in Low Resource Scenarios.
Guo et al. Deep neural network based i-vector mapping for speaker verification using short utterances
Kreyssig et al. Improved TDNNs using deep kernels and frequency dependent Grid-RNNs
CN112183107A (en) Audio processing method and device
CN114678030A (en) Voiceprint identification method and device based on depth residual error network and attention mechanism
CN113822017A (en) Audio generation method, device, equipment and storage medium based on artificial intelligence
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
Ozerov et al. GMM-based classification from noisy features
Kadyan et al. Training augmentation with TANDEM acoustic modelling in Punjabi adult speech recognition system
CA2203649A1 (en) Decision tree classifier designed using hidden markov models
CN112951270B (en) Voice fluency detection method and device and electronic equipment
CN114937454A (en) Method, device and storage medium for preventing voice synthesis attack by voiceprint recognition
Daneshvar et al. Persian phoneme recognition using long short-term memory neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination