WO2021051628A1 - Method, apparatus and device for constructing speech recognition model, and storage medium - Google Patents

Method, apparatus and device for constructing speech recognition model, and storage medium

Info

Publication number
WO2021051628A1
Authority
WO
WIPO (PCT)
Prior art keywords: recognition model, speech recognition, training, voice information, output
Application number
PCT/CN2019/119128
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
贾雪丽
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date: 2019-09-19
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051628A1


Classifications

    • G10L 15/063: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/16: Speech recognition; speech classification or search using artificial neural networks
    • G10L 15/26: Speech recognition; speech-to-text systems
    • G10L 25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00; extracted parameters being spectral information of each sub-band

Definitions

  • This application relates to the field of intelligent decision-making, and in particular to a method, apparatus, device and storage medium for constructing a speech recognition model.
  • Speech recognition converts speech into text. With the continuous development of deep learning technology, the range of applications for speech recognition has grown ever wider.
  • Deep neural networks (DNNs) have become a research hotspot in automatic speech recognition. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have both achieved good results in building speech recognition models, and deep learning has become the mainstream approach to speech recognition.
  • In a deep neural network, the depth of the network is closely related to recognition accuracy, because a deep network can extract multi-level (low/mid/high-level) features: the more layers the network has, the richer the extracted features.
  • However, as the network level deepens, the "degradation phenomenon" of deep neural networks appears, causing the accuracy of speech recognition to saturate quickly; beyond that point, the deeper the network, the higher the error rate.
  • In addition, existing speech recognition models need to align the speech training samples before training, matching each frame of speech data with its corresponding label, to ensure that the loss function used during training can accurately estimate the model's training error.
  • The inventor realized that this alignment process is cumbersome and complicated, and consumes a great deal of time and cost.
  • In the examples of this application, features are extracted from unlabeled data and introduced into supervised learning, which expands the usable sample data, improves the utilization of unlabeled data, and raises the accuracy of model prediction.
  • In the first aspect, this application provides a method for constructing a speech recognition model, including:
  • acquiring a plurality of training speech samples, the training speech samples including speech information and text labels corresponding to the speech information;
  • constructing the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer contains a plurality of sequentially connected residual modules, and each residual module contains a plurality of sequentially connected hidden layers together with a bypass channel that bypasses those sequentially connected weight layers;
  • inputting the speech samples into the speech recognition model in turn, using the speech information and its corresponding text label as the model's input and output respectively, and continuously training the neuron weights of the speech recognition model through those inputs and outputs until all speech samples have been input, at which point training ends;
  • after training, taking the speech recognition model with the trained neuron weights as the target model;
  • evaluating the error of the target model by L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)), where L(S) is the error, x the speech information, z the text label, p(z|h(x)) the similarity between the predicted text and the text label, and S the set of training speech samples; and
  • adjusting the neuron weights of the target model until the error is below a threshold, setting those weights as the ideal weights, and deploying the target model and the ideal weights to the client.
  • In some possible designs, before inputting the speech samples into the speech recognition model, the method further includes:
  • processing the training speech information in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information, where the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames;
  • transforming the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information.
  • In some possible designs, the processing of the training speech information further includes: performing a discrete Fourier transform on the two-dimensional speech information to obtain its linear frequency spectrum X(k); and filtering the linear spectrum with a preset band-pass filter bank to obtain the target linear spectrum. When the center frequency of the m-th filter is f(m), the transfer function of the band-pass filter takes the triangular form
  H_m(k) = 0, for k < f(m−1);
  H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m);
  H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) ≤ k ≤ f(m+1);
  H_m(k) = 0, for k > f(m+1);
  with center frequencies f(m) = (N/f_s) · F_mel^{−1}(F_mel(f_l) + m·(F_mel(f_h) − F_mel(f_l))/(M+1)).
  • The band-pass filter bank includes multiple band-pass filters with triangular filtering characteristics; f_l is the lowest frequency and f_h the highest frequency of the filter bank's range, N is the DFT length, and f_s is the sampling frequency of the band-pass filter. The mel mapping is F_mel(f) = 1125·ln(1 + f/700), with inverse F_mel^{−1}(b) = 700·(e^{b/1125} − 1), b an integer.
  • The logarithmic energy of the target linear spectrum, S(m) = ln(Σ_{k=0}^{N−1} |X(k)|² · H_m(k)), 0 ≤ m ≤ M, is then computed to obtain the spectrogram.
  • In some possible designs, the fully connected layer includes a classification function δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, where j is a natural number. The classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, so that each element lies in (0, 1) and all elements sum to 1.
  • In some possible designs, if the input of the residual module is x and its output is y, the mathematical expression of the residual module is y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  • In some possible designs, adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method.
  • In the second aspect, this application provides an apparatus for constructing a speech recognition model, which has the function of implementing the method for constructing a speech recognition model provided in the first aspect. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
  • The apparatus for constructing a speech recognition model includes:
  • an acquisition module for acquiring a plurality of training speech samples, the training speech samples including speech information and text labels corresponding to the speech information;
  • a processing module for constructing the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer contains a plurality of sequentially connected residual modules, and each residual module contains a plurality of sequentially connected hidden layers together with a bypass channel that bypasses those weight layers;
  • the speech samples are input into the speech recognition model in turn through an input/output module, the speech information and its corresponding text label serving respectively as the model's input and output; the neuron weights are trained continuously through those inputs and outputs until all speech samples have been input, training ends, and the model with the trained neuron weights is taken as the target model;
  • the error of the target model is evaluated by L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)), where the predicted text is the text information calculated and output by the target model according to the neuron weights;
  • the neuron weights of the target model are adjusted until the error is below the threshold, the weights with error below the threshold are set as the ideal weights, and the target model and the ideal weights are deployed to the client.
  • In some possible designs, the processing module is further configured to:
  • process the training speech information in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information, where the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames; and
  • transform the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information.
  • In some possible designs, the processing module is further configured to: perform a discrete Fourier transform on the two-dimensional speech information to obtain its linear frequency spectrum X(k); filter the linear spectrum with the preset band-pass filter bank of triangular filters, using the transfer function H_m(k), center frequencies f(m), and mel mapping F_mel given above, to obtain the target linear spectrum; and compute the logarithmic energy of the target linear spectrum to obtain the spectrogram.
  • In some possible designs, the fully connected layer includes the classification function δ(z)_j defined above, which compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector whose elements each lie in (0, 1) and sum to 1.
  • In some possible designs, the processing module is further configured such that, with residual-module input x and output y, y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  • In some possible designs, adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method.
  • Another aspect of the present application provides a device for constructing a speech recognition model, which includes at least one processor, a memory, and an input/output unit connected to one another, where the memory stores program code and the processor calls the program code in the memory to execute the methods described in the above aspects.
  • Another aspect of the present application provides a computer storage medium storing computer instructions; when the computer instructions are run on a computer, the computer executes the steps of the methods described in the above aspects.
  • This application routes the input information x directly to the output of the hidden layers through a bypass channel. The bypass channel has no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper; the network as a whole only needs to learn the difference between input and output. That is, once the input information x is passed through, each residual module only learns the residual F(x), which simplifies the training objective and difficulty and makes the network stable and easy to train. As network depth increases, the performance of the speech recognition model gradually improves.
  • The CTC loss function is used to evaluate the model's predicted text, so there is no need to model a precise mapping between the pronunciation phonemes in the text labels and the sequence of the training speech information: the input sequence and the output sequence alone suffice to train the speech recognition model, saving the production cost of the training speech sample set.
  • In addition, a triangular band-pass filter is used to smooth the spectrum of the training speech information, eliminating harmonics and highlighting the formants of the original sound; this prevents the pitch of the speech from influencing the model's predicted text and reduces the amount of computation on speech information during recognition.
  • FIG. 1 is a schematic flowchart of a method for constructing a speech recognition model in an embodiment of this application;
  • FIG. 2 is a schematic structural diagram of an apparatus for constructing a speech recognition model in an embodiment of this application;
  • FIG. 3 is a schematic structural diagram of a device for constructing a speech recognition model in an embodiment of this application.
  • The terms "including" and "having" and their variants are intended to cover non-exclusive inclusion: a process, method, system, product, or device comprising a series of steps or modules is not necessarily limited to those expressly listed, and may include other steps or modules that are not expressly listed or that are inherent to it.
  • The division of modules in this application is only a logical division; in actual implementations there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be ignored or not implemented.
  • This application mainly provides the following technical solutions:
  • Referring to FIG. 1, the following is an example of a method for constructing a speech recognition model provided by this application. The method includes:
  • Acquire multiple training speech samples, where each training speech sample includes speech information and a text label corresponding to the speech information.
  • The text label is used to mark the pronunciation phonemes of the training speech information. For example, the recorded content is transcribed into text from the pre-recorded speech; the words in the text are numbered in order of appearance, and each word is annotated with its pronunciation phonemes to obtain the text label. Each pronunciation phoneme in the text label corresponds to one or more frames of data in the recording.
  • Construct the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer. The convolutional residual layer includes a plurality of sequentially connected residual stacked layers; each residual stacked layer contains multiple sequentially connected residual modules; and each residual module includes multiple sequentially connected hidden layers plus a bypass channel that bypasses those sequentially connected weight layers.
  • The independent convolutional layer extracts acoustic features from the speech information, suppresses non-maximum values in the acoustic features, and reduces their complexity. Acoustic features include the pronunciation of specific syllables, the user's liaison habits in continuous speech, and the speech spectrum.
  • The convolutional residual layer maps the acoustic features to the hidden-layer feature space. The fully connected layer integrates the features mapped into that space to obtain the meaning of the acoustic features and outputs the probability of each candidate text class accordingly. The output layer outputs the text corresponding to the speech information according to those class probabilities.
  • Bypass channels are added across several sequentially connected hidden layers to counter the problem that the training accuracy of traditional neural networks drops as the number of layers increases. The convolutional residual layer of the speech recognition model therefore has many bypass channels. A bypass channel serves as a branch of the hidden layers and realizes cross-layer connections between them: the input of a hidden layer is connected directly to a later layer, so that the later layer can learn the residual directly. A cross-layer connection generally spans only 2 to 3 hidden layers, though spanning more is not excluded; spanning a single hidden layer is of little significance, and experiments show the effect is not ideal. A sketch of such a network follows.
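  • A minimal sketch of how such a network could be assembled (assuming PyTorch; the layer sizes, kernel widths, and the use of 1-D convolutions are illustrative choices, not values taken from the patent; the identity bypass follows the summary's statement that the bypass channel carries no weights):

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """Sequentially connected hidden (weight) layers with a bypass channel."""
    def __init__(self, channels: int):
        super().__init__()
        self.hidden = nn.Sequential(                 # F(x, w_i): the stacked weight layers
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(self.hidden(x) + x)        # bypass channel carries x across the layers

class SpeechRecognitionModel(nn.Module):
    def __init__(self, n_features=40, channels=128, n_classes=100,
                 n_stacks=3, modules_per_stack=2):
        super().__init__()
        self.independent_conv = nn.Conv1d(n_features, channels, kernel_size=5, padding=2)
        self.conv_residual = nn.Sequential(          # sequentially connected residual stacked layers
            *[ResidualModule(channels) for _ in range(n_stacks * modules_per_stack)]
        )
        self.fully_connected = nn.Linear(channels, n_classes)

    def forward(self, x):                            # x: (batch, n_features, time)
        h = torch.relu(self.independent_conv(x))
        h = self.conv_residual(h)
        logits = self.fully_connected(h.transpose(1, 2))   # (batch, time, n_classes)
        return logits.log_softmax(dim=-1)            # output layer: per-frame log-probabilities
```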
  • Input the speech samples into the speech recognition model in turn, using the speech information and its corresponding text labels as the model's input and expected output respectively, and continuously train the neuron weights of the model through these inputs and outputs until all speech samples have been input, at which point training ends. After training, the speech recognition model with the trained neuron weights is taken as the target model.
  • Specifically, the neuron weights of the speech recognition model are randomly initialized; the training speech information is then used as the model's input and the corresponding text label as its output reference. The training speech information propagates forward through the model: using the initialized neurons of each layer, the model classifies the training speech information and finally produces the predicted text corresponding to it. The neuron weights are then updated according to the gap between the predicted text and the text label, and the next iteration proceeds, until the weights approach the required values.
  • Predicted text refers to the text information that the target model calculates and outputs according to its neuron weights once speech information is input.
  • The CTC loss function is used to estimate the degree of inconsistency between the predicted text output by the speech recognition model and the true text label. Its advantage is that it requires no forced alignment of input and output data. Unlike the cross-entropy criterion, which needs frame-level alignment between input features and target labels, the CTC loss function automatically learns the alignment between the speech data and the label sequence (for example, phonemes or characters), removing the need for forced alignment; the input data and the label also need not have the same length.
  • Because the CTC loss function evaluates the predicted text without a precise mapping between the pronunciation phonemes in the text label and the sequence of the training speech information, the input and output sequences alone suffice to train the model, saving the production cost of the training speech sample set.
  • The error over the training speech sample set is then calculated, and the gradient descent algorithm back-propagates the error through the speech recognition model to update target parameters such as the weights and thresholds, continuously improving the model's recognition accuracy until convergence. A sketch of this training step follows.
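  • A hedged illustration of this training step (assuming PyTorch and its built-in nn.CTCLoss; the tensor shapes and label sizes are invented for the example):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)                    # index 0 is reserved for the CTC blank symbol

batch, time_steps, n_classes = 4, 200, 60
# per-frame log-probabilities from the model, shaped (T, N, C) as nn.CTCLoss expects
log_probs = torch.randn(time_steps, batch, n_classes, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, n_classes, (batch, 30))           # phoneme/character label ids
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
target_lengths = torch.full((batch,), 30, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)  # L(S) = -sum ln p(z|h(x))
loss.backward()        # back-propagate the error so gradient descent can update the weights
```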
  • In some embodiments, before inputting the speech samples into the speech recognition model, the method further includes the following preprocessing. The training speech information is processed in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information; the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames. The sentences are then transformed according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information. A sketch of the framing step follows.
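  • A small sketch of the framing step (plain NumPy; the 25 ms frame length, 10 ms hop, and Hamming window are common defaults assumed here, not values specified by the patent):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D waveform into overlapping frames; adjacent frames share samples."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame (frame duration)
    hop_len = int(sample_rate * hop_ms / 1000)       # adjacent frames overlap by frame_len - hop_len
    assert len(signal) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)            # window each frame before the DFT
```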
  • The filter bank feature extraction proceeds as follows. A discrete Fourier transform is applied to the two-dimensional speech information to obtain its linear frequency spectrum X(k), and the linear spectrum is filtered by a preset band-pass filter bank to obtain the target linear spectrum. The band-pass filter bank includes multiple band-pass filters with triangular filtering characteristics; when the center frequency of the m-th filter is f(m), its transfer function is the triangular form
  H_m(k) = 0, for k < f(m−1);
  H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m);
  H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) ≤ k ≤ f(m+1);
  H_m(k) = 0, for k > f(m+1).
  • Here f_l is the lowest frequency and f_h the highest frequency of the filter bank's range, N is the length of the DFT, and f_s is the sampling frequency of the band-pass filter. The center frequencies are f(m) = (N/f_s) · F_mel^{−1}(F_mel(f_l) + m·(F_mel(f_h) − F_mel(f_l))/(M+1)), where F_mel(f) = 1125·ln(1 + f/700) and its inverse is F_mel^{−1}(b) = 700·(e^{b/1125} − 1), b an integer.
  • The logarithmic energy of the target linear spectrum, S(m) = ln(Σ_{k=0}^{N−1} |X(k)|² · H_m(k)), 0 ≤ m ≤ M, is then computed to obtain the spectrogram.
  • The human response to sound pressure is logarithmic, and humans are less sensitive to subtle changes at high sound pressure than at low sound pressure. Using logarithms reduces the sensitivity of the extracted features to variations in input sound energy: as the distance between the sound source and the microphone changes, the energy collected by the microphone changes too.
  • The spectrogram is a visual representation of the time-frequency distribution of sound energy, which effectively exploits the correlation between the time and frequency domains. The feature vector sequence obtained by analyzing the spectrogram captures acoustic features more effectively, so feeding it into the speech recognition model makes subsequent calculations more accurate.
  • A triangular band-pass filter is used to smooth the spectrum of the training speech information, eliminate its harmonics, and highlight the formants of the original sound. The pitch of a sound in the training speech information is therefore not reflected in the acoustic features; that is, the speech recognition model is not affected by pitch differences in the speech information when predicting text, and the amount of computation on speech information during recognition is reduced. A sketch implementing these formulas follows.
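  • A sketch of the filter bank computation implementing the formulas above (NumPy; the 40 filters, 512-point DFT, and 16 kHz sampling rate are illustrative assumptions):

```python
import numpy as np

def f_mel(f):
    """F_mel(f) = 1125 * ln(1 + f / 700)."""
    return 1125.0 * np.log(1.0 + f / 700.0)

def f_mel_inv(b):
    """Inverse mapping: F_mel^{-1}(b) = 700 * (e^(b/1125) - 1)."""
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def filterbank_spectrogram(frames, n_fft=512, f_s=16000, f_l=0.0, f_h=8000.0, M=40):
    """Log filter-bank energies S(m) = ln(sum_k |X(k)|^2 H_m(k)) per windowed frame."""
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2          # |X(k)|^2
    # M+2 equally mel-spaced points give the edges/centers f(0)..f(M+1), in DFT bins
    mel_points = np.linspace(f_mel(f_l), f_mel(f_h), M + 2)
    f = np.floor((n_fft / f_s) * f_mel_inv(mel_points)).astype(int)
    H = np.zeros((M, n_fft // 2 + 1))
    for m in range(1, M + 1):                                  # triangular transfer functions H_m(k)
        rise = np.arange(f[m - 1], f[m])
        fall = np.arange(f[m], f[m + 1])
        H[m - 1, rise] = (rise - f[m - 1]) / max(f[m] - f[m - 1], 1)
        H[m - 1, fall] = (f[m + 1] - fall) / max(f[m + 1] - f[m], 1)
    return np.log(power @ H.T + 1e-10)                         # S(m), one row per frame
```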
  • In this embodiment, the fully connected layer includes a classification function δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, where j is a natural number. The classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, so that each element lies in (0, 1) and all elements sum to 1.
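  • For instance, a direct NumPy rendering of this classification function:

```python
import numpy as np

def classify(z: np.ndarray) -> np.ndarray:
    """delta(z)_j = exp(z_j) / sum_k exp(z_k): elements in (0, 1), summing to 1."""
    e = np.exp(z - z.max())        # subtracting the max improves numerical stability
    return e / e.sum()
```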
  • If the input of the residual module is x and its output is y, the mathematical expression of the residual module is y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  • The speech recognition model in this embodiment adds bypass channels across several sequentially connected hidden layers to solve the problem that the training accuracy of traditional neural networks decreases as the number of network layers increases.
  • The convolutional residual layer of the speech recognition model has many bypass channels. A bypass channel serves as a branch of the hidden layers and realizes cross-layer connections between them: the input of a hidden layer is connected directly to a later layer, so that the later layer can learn the residual directly. A cross-layer connection generally spans only 2 to 3 hidden layers, though spanning more is not excluded; spanning a single hidden layer is of little significance, and experiments show the effect is not ideal. The neural network can be trained through the above formula.
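  • A sketch of this residual module with a weighted bypass, following y = F(x, w_i) + w_s·x (PyTorch; modeling w_s as a learnable scalar is one reading of the formula, while many residual networks instead use a 1x1 projection; ReLU(x) = max(0, x) serves as the activation inside F, as the summary states):

```python
import torch
import torch.nn as nn

class WeightedResidualModule(nn.Module):
    """y = F(x, w_i) + w_s * x, with ReLU activations inside F."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(                           # F(x, w_i): the stacked weight layers
            nn.Conv1d(channels, channels, 3, padding=1),
            nn.ReLU(),                                    # ReLU(x) = max(0, x)
            nn.Conv1d(channels, channels, 3, padding=1),
        )
        self.w_s = nn.Parameter(torch.ones(1))            # bypass weight w_s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x) + self.w_s * x                   # bypass channel adds w_s * x
```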
  • In this embodiment, adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method. The stochastic gradient descent algorithm effectively avoids redundant computation and consumes relatively little time; those skilled in the art may also use other algorithms.
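  • A minimal sketch of one such update step (assuming PyTorch's built-in SGD optimizer; the stand-in linear model, cross-entropy loss, and learning rate are placeholders chosen only to show the update mechanics, not the patent's CTC setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(40, 10)                        # stand-in for the speech model's weights
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 40)                           # a random mini-batch
target = torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(x), target)

optimizer.zero_grad()                            # clear gradients from the previous step
loss.backward()                                  # back-propagate the error
optimizer.step()                                 # w <- w - lr * grad
```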
  • FIG. 2 is a schematic structural diagram of an apparatus 20 for constructing a speech recognition model, which can be applied to constructing such a model. The apparatus in this embodiment of the present application can implement the steps of the method for constructing a speech recognition model executed in the embodiment corresponding to FIG. 1. The functions implemented by the apparatus 20 can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The apparatus may include an input/output module 201 and a processing module 202; for their functional implementation, refer to the operations performed in the embodiment corresponding to FIG. 1, which are not repeated here. The input/output module 201 can be used to control the apparatus's input, output, and acquisition operations.
  • The input/output module 201 may be used to acquire multiple training speech samples, where each training speech sample includes speech information and a text label corresponding to the speech information.
  • The processing module 202 may be used to construct the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer includes a plurality of sequentially connected residual modules, and each residual module includes a plurality of sequentially connected hidden layers together with a bypass channel bypassing those weight layers.
  • The speech samples are input into the speech recognition model in turn through the input/output module, the speech information and its corresponding text labels serving respectively as the model's input and output; the neuron weights are trained continuously through those inputs and outputs until all speech samples have been input, and the training of the speech recognition model ends. The trained model is taken as the target model, which calculates the output text information according to the neuron weights; the neuron weights of the target model are adjusted until the error is less than the threshold, the weights with error below the threshold are set as the ideal weights, and the target model and the ideal weights are deployed to the client.
  • In some embodiments, the processing module 202 is further configured to: process the training speech information in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information, where the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames; and transform the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information.
  • In some embodiments, the processing module 202 is further configured to: perform a discrete Fourier transform on the two-dimensional speech information to obtain its linear frequency spectrum X(k); filter the linear spectrum with the preset band-pass filter bank of triangular filters, using the transfer function H_m(k), center frequencies f(m), and mel mapping F_mel given above, to obtain the target linear spectrum; and compute its logarithmic energy to obtain the spectrogram.
  • In some embodiments, the fully connected layer includes the classification function δ(z)_j defined above, which compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector whose elements each lie in (0, 1) and sum to 1.
  • With residual-module input x and output y, y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  • Adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method.
  • The above describes the apparatus in the embodiments of the present application from the perspective of modular functional entities. From a hardware perspective, a device for constructing a speech recognition model includes: a processor, a memory, an input/output unit (which may also be a transceiver, not separately identified in FIG. 3), and a computer program stored in the memory and runnable on the processor. The computer program may be the program corresponding to the method for constructing a speech recognition model in the embodiment corresponding to FIG. 1. When the processor executes the computer program, each step of that method as executed by the apparatus 20 is implemented; alternatively, executing the computer program realizes the function of each module in the apparatus 20 for constructing a speech recognition model of the embodiment corresponding to FIG. 2.
  • The so-called processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), an off-the-shelf field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or any conventional processor. The processor is the control center of the computer device, connecting the parts of the entire device through various interfaces and lines.
  • The memory may be used to store the computer program and/or modules; the processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), while the data storage area may store data created according to the use of the device (such as audio data or video data). The memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • The input/output unit may also be replaced by a receiver and a transmitter, which may be the same or different physical entities; when they are the same physical entity, they may collectively be referred to as the input/output unit, and the input/output unit may be a transceiver. The memory may be integrated in the processor or provided separately from it.
  • The application also provides a computer storage medium, which may be a non-volatile or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions, and when the computer instructions are executed on a computer, the computer executes the following steps: acquiring multiple training speech samples, each including speech information and a text label corresponding to the speech information; constructing the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer includes a plurality of sequentially connected residual modules, and each residual module includes a plurality of sequentially connected hidden layers together with a bypass channel bypassing those weight layers; inputting the speech samples into the model in turn, using the speech information and its text labels as the model's input and output respectively, and continuously training the neuron weights until all speech samples have been input and training ends; and taking the speech recognition model with the trained neuron weights as the target model.


Abstract

A method, apparatus and device for constructing a speech recognition model, and a storage medium, relating to the field of artificial intelligence. The method comprises: obtaining a plurality of training speech samples (101); constructing a speech recognition model by means of an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer (102); inputting training speech information into the speech recognition model, updating the weights of the neurons in the speech recognition model with the speech information and the text label corresponding to the speech information by means of natural language processing (NLP), and thereby obtaining a target model (103); evaluating the error of the target model by means of L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)) (104); adjusting the weights of the neurons in the target model until the error is less than a threshold, and setting the weights of the neurons with error less than the threshold as the ideal weights (105); and deploying the target model and the ideal weights on a client (106). The method reduces the influence of the tone in the speech information on the predicted text, as well as the computational burden of the recognition process in the speech recognition model.

Description

Method, apparatus, device and storage medium for constructing a speech recognition model
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 19, 2019, with application number 201910884620.9 and the invention title "Method, apparatus, device and storage medium for constructing a speech recognition model", the entire contents of which are incorporated in this application by reference.
Technical field
This application relates to the field of intelligent decision-making, and in particular to a method, apparatus, device and storage medium for constructing a speech recognition model.
Background
Speech recognition converts speech into text. With the continuous development of deep learning technology, the range of applications for speech recognition has grown ever wider.

At present, deep neural networks (DNNs) have become a research hotspot in automatic speech recognition. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have both achieved good results in building speech recognition models, and deep learning has become the mainstream approach to speech recognition.

In a deep neural network, the depth of the network is closely related to recognition accuracy, because a deep network can extract multi-level (low/mid/high-level) features: the more layers the network has, the richer the extracted features. However, as the network level deepens, the "degradation phenomenon" of deep neural networks appears, causing the accuracy of speech recognition to saturate quickly; beyond that point, the deeper the network, the higher the error rate. In addition, existing speech recognition models need to align the speech training samples before training, matching each frame of speech data with its corresponding label, to ensure that the loss function used in training can accurately estimate the model's training error. However, the inventor realized that this alignment process is cumbersome and complicated, and consumes a great deal of time and cost.
Summary of the invention
In the examples of this application, features are extracted from unlabeled data and introduced into supervised learning, which expands the usable sample data, improves the utilization of unlabeled data, and raises the accuracy of model prediction.
In the first aspect, this application provides a method for constructing a speech recognition model, including:

acquiring a plurality of training speech samples, the training speech samples including speech information and text labels corresponding to the speech information;

constructing the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer contains a plurality of sequentially connected residual modules, and each residual module contains a plurality of sequentially connected hidden layers together with a bypass channel bypassing those sequentially connected weight layers;

inputting the speech samples into the speech recognition model in turn, using the speech information and its corresponding text labels as the model's input and output respectively, and continuously training the neuron weights of the model through those inputs and outputs until all speech samples have been input, at which point training ends and the model with the trained neuron weights is taken as the target model;

evaluating the error of the target model by L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)), where L(S) is the error, x the speech information, z the text label, p(z|h(x)) the similarity between the predicted text and the text label, and S the set of training speech samples; the predicted text is the text information calculated and output by the target model according to the neuron weights after the speech information is input;

adjusting the neuron weights of the target model until the error is less than a threshold, and setting the weights with error below the threshold as the ideal weights;

deploying the target model and the ideal weights to the client.
In some possible designs, before inputting the speech samples into the speech recognition model, the method further includes:

processing the training speech information in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information, where the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames;

transforming the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information.
In some possible designs, the processing of the training speech information in frames according to the preset framing parameters further includes:

performing a discrete Fourier transform on the two-dimensional speech information to obtain its linear frequency spectrum X(k);

filtering the linear spectrum with a preset band-pass filter bank to obtain the target linear spectrum. When the center frequency of the m-th filter is f(m), the transfer function of the band-pass filter is the triangular form

H_m(k) = 0, for k < f(m−1);
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0, for k > f(m+1);

with center frequencies

f(m) = (N / f_s) · F_mel^{−1}( F_mel(f_l) + m · (F_mel(f_h) − F_mel(f_l)) / (M+1) ).

The band-pass filter bank includes multiple band-pass filters with triangular filtering characteristics; f_l is the lowest frequency and f_h the highest frequency of the filter bank's range, N is the length of the DFT, and f_s is the sampling frequency of the band-pass filter. The mel mapping is F_mel(f) = 1125·ln(1 + f/700), and its inverse is F_mel^{−1}(b) = 700·(e^{b/1125} − 1), b an integer;

computing the logarithmic energy of the target linear spectrum, S(m) = ln( Σ_{k=0}^{N−1} |X(k)|² · H_m(k) ), 0 ≤ m ≤ M, to obtain the spectrogram, where X(k) is the linear spectrum.
In some possible designs, the fully connected layer includes a classification function δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, where j is a natural number. The classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, so that each element lies in (0, 1) and all elements sum to 1.
In some possible designs, if the input of the residual module is x and its output is y, the mathematical expression of the residual module is y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
In some possible designs, F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, where ReLU(x) = max(0, x).
In some possible designs, adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method.
In the second aspect, this application provides an apparatus for constructing a speech recognition model, which has the function of implementing the method for constructing a speech recognition model provided in the first aspect. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
The apparatus for constructing a speech recognition model includes:

an acquisition module for acquiring a plurality of training speech samples, the training speech samples including speech information and text labels corresponding to the speech information;

a processing module for constructing the speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, where the convolutional residual layer includes a plurality of sequentially connected residual stacked layers, each residual stacked layer contains a plurality of sequentially connected residual modules, and each residual module contains a plurality of sequentially connected hidden layers together with a bypass channel bypassing those weight layers. Through an input/output module, the speech samples are input into the model in turn, the speech information and its corresponding text labels serving respectively as the model's input and output; the neuron weights are trained continuously through those inputs and outputs until all speech samples have been input and training ends, and the model with the trained neuron weights is taken as the target model. The error of the target model is evaluated by L(S) = −ln Π_{(h(x),z)∈S} p(z|h(x)) = −Σ_{(h(x),z)∈S} ln p(z|h(x)), where L(S) is the error, x the speech information, z the text label, p(z|h(x)) the similarity between the predicted text and the text label, and S the set of training speech samples; the predicted text is the text information calculated and output by the target model according to the neuron weights;

the neuron weights of the target model are adjusted until the error is less than the threshold, the weights with error below the threshold are set as the ideal weights, and the target model and the ideal weights are deployed to the client.
In some possible designs, the processing module is further configured to: process the training speech information in frames according to preset framing parameters to obtain the sentences corresponding to the training speech information, where the preset framing parameters include frame duration, number of frames, and the overlap duration between adjacent frames; and transform the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional speech information.
In some possible designs, the processing module is further configured to: perform a discrete Fourier transform on the two-dimensional speech information to obtain its linear frequency spectrum X(k); filter the linear spectrum with the preset band-pass filter bank of triangular filters, using the transfer function H_m(k), center frequencies f(m), and mel mapping F_mel defined in the first aspect, to obtain the target linear spectrum; and compute the logarithmic energy S(m), 0 ≤ m ≤ M, of the target linear spectrum to obtain the spectrogram.
In some possible designs, the fully connected layer includes the classification function δ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, j a natural number, which compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j whose elements each lie in (0, 1) and sum to 1.
In some possible designs, the processing module is further configured such that, with residual-module input x and output y, y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
In some possible designs, F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, where ReLU(x) = max(0, x).
In some possible designs, adjusting the neuron weights of the target model includes adjusting them by the stochastic gradient descent method.
Another aspect of the present application provides a device for constructing a speech recognition model, which includes at least one processor, a memory, and an input/output unit connected to one another, where the memory stores program code and the processor calls the program code in the memory to execute the methods described in the above aspects.
Another aspect of this application provides a computer storage medium storing computer instructions which, when run on a computer, cause the computer to execute the steps of the above method for constructing a speech recognition model.
In this application, the input information x is routed directly to the output of the hidden layers through a bypass channel. The bypass channel has no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper: the whole network only needs to learn the part where input and output differ, that is, after the input information x is passed through, each residual module learns only the residual F(x). This simplifies the training objective and its difficulty, and the network is stable and easy to train; as the depth of the network increases, the performance of the speech recognition model steadily improves. The predicted text of the speech recognition model is evaluated with the CTC loss function, so there is no need to consider a precise mapping between the pronunciation phonemes in the text labels and the sequence of the training voice information; only an input sequence and an output sequence are required to train the model, which saves the production cost of the training speech sample set. In addition, triangular band-pass filters are used to smooth the spectrum of the training voice information, eliminating its harmonics, highlighting the formants of the original sound, preventing the pitch of the voice information from affecting the text predicted by the speech recognition model, and reducing the amount of computation on the voice information during recognition.
Description of the Drawings
FIG. 1 is a schematic flowchart of a method for constructing a speech recognition model in an embodiment of this application;

FIG. 2 is a schematic structural diagram of an apparatus for constructing a speech recognition model in an embodiment of this application;

FIG. 3 is a schematic structural diagram of a device for constructing a speech recognition model in an embodiment of this application.
Detailed Description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it. The terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments described here can be implemented in an order other than that illustrated or described here. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those clearly listed, but may include other steps or modules that are not clearly listed or that are inherent to the process, method, product, or device. The division of modules in this application is only a logical division; there may be other divisions in actual implementation, for example, multiple modules may be combined or integrated into another system, or some features may be ignored or not implemented.
To solve the above technical problems, this application mainly provides the following technical solutions:

The input information x is routed directly to the output of the hidden layers through a bypass channel. The bypass channel has no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper: the whole network only needs to learn the part where input and output differ, that is, after the input information x is passed through, each residual module learns only the residual F(x). This simplifies the training objective and its difficulty, and the network is stable and easy to train; as the depth of the network increases, the performance of the speech recognition model steadily improves. The predicted text of the speech recognition model is evaluated with the CTC loss function, so there is no need to consider a precise mapping between the pronunciation phonemes in the text labels and the sequence of the training voice information; only an input sequence and an output sequence are required to train the model, which saves the production cost of the training speech sample set. In addition, triangular band-pass filters are used to smooth the spectrum of the training voice information, eliminating its harmonics, highlighting the formants of the original sound, preventing the pitch of the voice information from affecting the text predicted by the speech recognition model, and reducing the amount of computation on the voice information during recognition.
Referring to FIG. 1, a method for constructing a speech recognition model provided by this application is illustrated below by way of example. The method includes:

101. Obtain a plurality of training speech samples.

The training speech samples include voice information and text labels corresponding to the voice information.

The text labels are used to annotate the pronunciation phonemes of the training voice information.

The voice information is based on a pre-recorded voice: the recorded content is transcribed into text; the words in the text are numbered according to their order of occurrence, and each word is annotated according to its pronunciation phonemes to obtain a text label. Each pronunciation phoneme in the text label corresponds to one or more frames of data in the recording. A purely illustrative sketch of such a sample follows.
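The application does not prescribe a data layout, so every field name in the sketch below is a hypothetical choice; it only illustrates the labeling scheme just described.

```python
# A hedged sketch of one training speech sample; all field names are
# hypothetical and only illustrate the word/phoneme labeling above.
training_sample = {
    "recording": "speech_001.wav",        # the pre-recorded voice
    "words": ["hello", "world"],          # transcribed words, numbered in order
    "text_label": [
        {"word_id": 0, "phonemes": ["HH", "AH", "L", "OW"]},
        {"word_id": 1, "phonemes": ["W", "ER", "L", "D"]},
    ],  # each phoneme corresponds to one or more frames of the recording
}
```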
102. Construct a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer.

The convolutional residual layer includes a plurality of sequentially connected residual stack layers. Each residual stack layer contains a plurality of sequentially connected residual modules. Each residual module contains a plurality of sequentially connected hidden layers and a bypass channel that bypasses the plurality of sequentially connected weight layers.

The independent convolutional layer is used to extract acoustic features from the voice information, suppress non-maximum values in the acoustic features, and reduce the complexity of the acoustic features. The acoustic features include the pronunciation of specific syllables, the user's habits of linking words, the speech spectrum, and the like.

The convolutional residual layer is used to map the acoustic features to a hidden-layer feature space.

The fully connected layer is used to integrate the acoustic features mapped to the hidden-layer feature space so as to obtain their meaning, and to output the probabilities corresponding to the various text types according to that meaning.

The output layer is used to output the text corresponding to the voice information according to the probabilities corresponding to the various text types.

The speech recognition model in this embodiment adds bypass channels to several sequentially connected hidden layers, to solve the problem that the training accuracy of a traditional neural network drops as the number of layers increases. The convolutional residual layer of the speech recognition model has many bypass channels; each bypass channel acts as a branch of the hidden layers and implements a cross-layer connection between them, that is, the input of a hidden layer is connected directly to a later layer, so that the later layer can learn the residual directly.
Specifically, as shown in FIG. 2, within one residual module the cross-layer connection generally spans only 2 to 3 hidden layers, although spanning more hidden layers is not excluded. Spanning only 1 hidden layer is of little significance, and the experimental effect is not ideal.

Assume the input of the residual module is x and the expected output is H(x), that is, H(x) is the desired complex latent mapping, which is usually very difficult to learn. If the input x is passed directly to the output as the initial result, then the target that the residual module needs to learn becomes F(x) = H(x) - x. Thus, compared with a traditional neural network, the speech recognition model in this embodiment effectively changes the learning target: instead of learning a complete output, it learns the difference between the optimal solution H(x) and the identity mapping x, namely the residual F(x) = H(x) - x.

From the point of view of overall function, if {w_i} denotes all the weights of the residual module, the output actually computed by the residual module is:

y = F(x, {w_i}) + x

Taking a span of 2 hidden layers as an example and ignoring the bias, F(x, {w_i}) = w_2·δ(w_1·x) = w_2·ReLU(w_1·x), where the ReLU function is the activation function of the residual module.

It is understood that F(x, {w_i}) and x need to have the same dimension. If their dimensions differ, an extra weight matrix w_s can be introduced to linearly project x so that F(x, {w_i}) and x have the same dimension; accordingly, the result computed by the residual module is: y = F(x, {w_i}) + w_s·x
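The residual computation above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed layer types and sizes (1-D convolutions with kernel size 3), not the exact network of this application; the projection w_s is applied only when the dimensions of F(x, {w_i}) and x differ.

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """Residual module spanning two weight layers: y = F(x, {w_i}) + w_s * x.

    A sketch with assumed 1-D convolutions; kernel sizes and channel
    counts are illustrative, not specified by this application.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.w1 = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)
        self.w2 = nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1)
        # Extra weight matrix w_s: linear projection of x when the dimensions
        # of F(x, {w_i}) and x differ; identity (weight-free bypass) otherwise.
        self.w_s = (nn.Conv1d(in_ch, out_ch, kernel_size=1)
                    if in_ch != out_ch else nn.Identity())

    def forward(self, x):
        # F(x, {w_i}) = w2 * ReLU(w1 * x); the text ignores the bias term,
        # while nn.Conv1d keeps one by default.
        f = self.w2(torch.relu(self.w1(x)))
        return f + self.w_s(x)   # the bypass channel carries x across the layers
```

Because the bypass channel carries x through without weights, each module only has to fit the residual F(x) = H(x) - x rather than the full mapping H(x).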
Input the plurality of speech samples into the speech recognition model in turn, with the voice information and the text label corresponding to the voice information serving respectively as the input and the output of the speech recognition model, and continuously train the neuron weights of the speech recognition model through these inputs and outputs until all speech samples have been input into the speech recognition model, at which point the training of the speech recognition model ends. After training, the speech recognition model carrying the trained neuron weights is taken as the target model.

During training, the weights of the neurons inside the speech recognition model are randomly initialized; the training voice information is then used as the input of the speech recognition model, and the text label of the training voice information as its output reference. The training voice information is propagated forward through the speech recognition model, which uses the initialized neurons of each layer to classify the training voice information (at first essentially at random), finally producing the predicted text corresponding to the training voice information. The neuron weights are then updated according to the gap between the predicted text output by the speech recognition model and the text label, and the next iteration continues until the neuron weights approach the required values.
103. Evaluate the error of the target model by L(S) = -ln ∏_{(h(x),z)∈S} p(z|h(x)) = -∑_{(h(x),z)∈S} ln p(z|h(x)).

Here L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, and S is the set of training speech samples. The predicted text is the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model.

The CTC loss function is used to estimate the degree of inconsistency between the predicted text output by the speech recognition model and the true text label; its advantage is that it does not require the input data and the output data to be forcibly aligned. Unlike the cross-entropy criterion, which requires frame-level alignment between the input features and the target labels, the CTC loss function can automatically learn the alignment between the speech data and the label sequence (for example, phonemes or characters), which removes the need to force-align the data, and the input data and the labels need not have the same length. Evaluating the predicted text of the speech recognition model with the CTC loss function means there is no need to consider a precise mapping between the pronunciation phonemes in the text labels and the sequence of the training voice information; only an input sequence and an output sequence are required to train the model, saving the production cost of the training speech sample set.
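As a sketch, the CTC criterion is available off the shelf, for example as torch.nn.CTCLoss; the snippet below only illustrates that training needs the input sequence, the label sequence, and their lengths, with no frame-level alignment. The shapes and the vocabulary size are assumed for illustration.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)                    # index 0 reserved for the CTC blank

T, B, K = 100, 4, 30                         # assumed: frames, batch, vocab size
log_probs = torch.randn(T, B, K).log_softmax(2)   # stand-in for model output h(x)
targets = torch.randint(1, K, (B, 12))       # phoneme/character label sequences z
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

# L(S) = -sum over samples of ln p(z | h(x)); no forced alignment is needed,
# and the input length T need not equal the label length.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                              # gradients for the weight updates
```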
104. Adjust the neuron weights of the target model until the error is less than a threshold, and take the neuron weights whose error is less than the threshold as the ideal weights.

The error over the training speech sample set is computed from the CTC loss function and back-propagated through the speech recognition model by a gradient descent algorithm, thereby updating target parameters such as the weights and thresholds in the speech recognition model and continuously improving the accuracy of its speech recognition until the convergence requirement is reached.

105. Deploy the target model and the ideal weights to the client.

Compared with the prior art, this application routes the input information x directly to the output of the hidden layers through a bypass channel. The bypass channel has no weights, which preserves the integrity of the input information x and allows the neural network to be trained deeper: the whole network only needs to learn the part where input and output differ, that is, after the input information x is passed through, each residual module learns only the residual F(x). This simplifies the training objective and its difficulty, and the network is stable and easy to train; as the depth of the network increases, the performance of the speech recognition model steadily improves. The predicted text of the speech recognition model is evaluated with the CTC loss function, so there is no need to consider a precise mapping between the pronunciation phonemes in the text labels and the sequence of the training voice information; only an input sequence and an output sequence are required to train the model, which saves the production cost of the training speech sample set. In addition, triangular band-pass filters are used to smooth the spectrum of the training voice information, eliminating its harmonics, highlighting the formants of the original sound, preventing the pitch of the voice information from affecting the text predicted by the speech recognition model, and reducing the amount of computation on the voice information during recognition.
In some embodiments, before the plurality of speech samples are input into the speech recognition model, the method further includes:

processing the training voice information into frames according to preset framing parameters to obtain the sentences corresponding to the training voice information, the preset framing parameters including the frame duration, the number of frames, and the repeat duration of adjacent frames;

converting the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional voice information.
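A minimal sketch of the framing step follows, assuming a 25 ms frame duration and a 15 ms repeat (overlap) between adjacent frames; the concrete parameter values are illustrative only, as the text does not fix them.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, overlap_ms=15):
    """Split a 1-D waveform into overlapping frames.

    frame_ms and overlap_ms stand in for the preset framing parameters
    (frame duration and repeat duration of adjacent frames); the values
    here are assumptions, not mandated by the text.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = frame_len - int(sample_rate * overlap_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    return frames   # shape (n_frames, frame_len): a two-dimensional representation
```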
In some embodiments, processing the training voice information into frames according to the preset framing parameters includes:
performing a discrete Fourier transform on the two-dimensional voice information to obtain the linear spectrum X(k) corresponding to the two-dimensional voice information;

filtering the linear spectrum with a preset band-pass filter to obtain a target linear spectrum; when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\le f(m+1)\\ 0, & k>f(m+1)\end{cases}$$

and the expression of f(m) is:

$$f(m)=\frac{N}{f_s}\,F_{mel}^{-1}\!\left(F_{mel}(f_l)+m\cdot\frac{F_{mel}(f_h)-F_{mel}(f_l)}{M+1}\right)$$

The band-pass filter comprises a plurality of band-pass filters with triangular filtering characteristics, where f_l is the lowest frequency of the band-pass filter's frequency range, f_h is the highest frequency of that range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, M is the number of triangular band-pass filters, and the F_mel function is F_mel = 1125 ln(1 + f/700), whose inverse is:

$$F_{mel}^{-1}(b)=700\left(e^{b/1125}-1\right)$$

where b is an integer;

computing the logarithmic energy corresponding to the target linear spectrum according to

$$S(m)=\ln\!\left(\sum_{k=0}^{N-1}\lvert X(k)\rvert^{2}\,H_m(k)\right),\qquad 0\le m\le M$$

to obtain the spectrogram, where X(k) is the linear spectrum.
In the above embodiments, the human response to sound pressure is logarithmic, and humans are less sensitive to subtle changes at high sound pressure than at low sound pressure. In addition, using the logarithm reduces the sensitivity of the extracted features to variations in the energy of the input sound: since the distance between the sound source and the microphone varies, the sound energy collected by the microphone varies as well. The spectrogram is a visual representation of the time-frequency distribution of sound energy that effectively exploits the correlation between the time and frequency domains; the feature vector sequence obtained from spectrogram analysis works better for acoustic feature extraction, and feeding it into the speech recognition model makes the subsequent computation more accurate. Moreover, triangular band-pass filters are used to smooth the spectrum of the training voice information, eliminating its harmonics and highlighting the formants of the original sound. Therefore the tone or pitch of a passage in the training voice information is not reflected in the acoustic features; in other words, the text predicted by the speech recognition model is not affected by pitch differences in the voice information, and the amount of computation on the voice information during recognition is reduced.
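Putting the formulas of this embodiment together, a minimal NumPy sketch of the filter-bank feature extraction might look as follows; the DFT length, the number of filters M, and the frequency range are assumed values, not prescribed by the text.

```python
import numpy as np

def mel(f):                     # F_mel(f) = 1125 * ln(1 + f / 700)
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(b):                 # inverse: 700 * (exp(b / 1125) - 1)
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def log_mel_spectrogram(frames, fs=16000, n_fft=512, M=26, f_l=0.0, f_h=8000.0):
    # Linear spectrum X(k) of each frame via the DFT
    X = np.fft.rfft(frames, n=n_fft)                 # (n_frames, n_fft//2 + 1)
    power = np.abs(X) ** 2

    # Center frequencies f(m), m = 0..M+1, equally spaced on the mel scale
    mels = np.linspace(mel(f_l), mel(f_h), M + 2)
    f_m = np.floor((n_fft / fs) * mel_inv(mels)).astype(int)

    # Triangular transfer functions H_m(k), then log energies S(m)
    H = np.zeros((M, n_fft // 2 + 1))
    for m in range(1, M + 1):
        H[m - 1, f_m[m - 1]:f_m[m]] = (
            (np.arange(f_m[m - 1], f_m[m]) - f_m[m - 1]) / (f_m[m] - f_m[m - 1]))
        H[m - 1, f_m[m]:f_m[m + 1]] = (
            (f_m[m + 1] - np.arange(f_m[m], f_m[m + 1])) / (f_m[m + 1] - f_m[m]))
    return np.log(power @ H.T + 1e-10)               # spectrogram features
```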
In some embodiments, the fully connected layer includes a classification function, namely

$$\delta(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}},\qquad j=1,\dots,K,$$

where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, such that every element lies in (0, 1) and all elements sum to 1.
In some embodiments, with x denoting the input of the residual module and y its output, the mathematical expression of the residual module is:

y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.

In the above embodiments, the speech recognition model adds bypass channels to several sequentially connected hidden layers, to solve the problem that the training accuracy of a traditional neural network drops as the number of layers increases. The convolutional residual layer of the speech recognition model has many bypass channels; each bypass channel acts as a branch of the hidden layers and implements a cross-layer connection between them, that is, the input of a hidden layer is connected directly to a later layer, so that the later layer can learn the residual directly.
Specifically, within one residual module the cross-layer connection generally spans only 2 to 3 hidden layers, although spanning more hidden layers is not excluded. Spanning only 1 hidden layer is of little significance, and the experimental effect is not ideal.

Assume the input of the residual module is x and the expected output is H(x), that is, H(x) is the desired complex latent mapping, which is usually very difficult to learn. If the input x is passed directly to the output as the initial result, then the target that the residual module needs to learn becomes F(x) = H(x) - x. Thus, compared with a traditional neural network, the speech recognition model in this embodiment effectively changes the learning target: instead of learning a complete output, it learns the difference between the optimal solution H(x) and the identity mapping x, namely the residual F(x) = H(x) - x. From the point of view of overall function, if {w_i} denotes all the weights of the residual module, the output actually computed by the residual module is y = F(x, {w_i}) + x; taking a span of 2 hidden layers as an example and ignoring the bias, F(x, {w_i}) = w_2·δ(w_1·x) = w_2·ReLU(w_1·x), where ReLU() is the activation function of the residual module.

It is understood that F(x, {w_i}) and x need to have the same dimension. If their dimensions differ, an extra weight matrix w_s can be introduced to linearly project x so that F(x, {w_i}) and x have the same dimension; accordingly, the result computed by the residual module is: y = F(x, {w_i}) + w_s·x
In some embodiments, F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, and the mathematical expression of the ReLU function is ReLU(x) = max(0, x).

In the above embodiments, the neural network can be trained through the above formulas.
In some embodiments, adjusting the neuron weights of the target model includes:

adjusting the neuron weights by stochastic gradient descent.

In the above embodiments, the stochastic gradient descent algorithm effectively avoids redundant computation and takes less time; of course, those skilled in the art may also use other algorithms.
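A minimal sketch of one stochastic gradient descent update; the learning rate and parameter shapes are assumed for illustration.

```python
import numpy as np

learning_rate = 0.01                        # assumed hyperparameter
weights = [np.random.randn(4, 4)]
gradients = [np.random.randn(4, 4)]         # stand-in for dL/dw on one mini-batch

# Stochastic gradient descent: w <- w - lr * dL/dw, per randomly drawn mini-batch
for w, g in zip(weights, gradients):
    w -= learning_rate * g
```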
FIG. 2 is a schematic structural diagram of an apparatus 20 for constructing a speech recognition model, which can be applied to constructing a speech recognition model. The apparatus for constructing a speech recognition model in this embodiment of the application can implement the steps of the method for constructing a speech recognition model executed in the embodiment corresponding to FIG. 1 above. The functions implemented by the apparatus 20 may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The apparatus for constructing a speech recognition model may include an input/output module 201 and a processing module 202; for the functional implementation of the processing module 202 and the input/output module 201, reference may be made to the operations executed in the embodiment corresponding to FIG. 1, which are not repeated here. The input/output module 201 may be used to control the input, output, and acquisition operations of the input/output module 201.

In some embodiments, the input/output module 201 may be used to obtain a plurality of training speech samples, the training speech samples including voice information and text labels corresponding to the voice information.

The processing module 202 may be used to construct a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, the convolutional residual layer including a plurality of sequentially connected residual stack layers, each residual stack layer containing a plurality of sequentially connected residual modules, and each residual module containing a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers; to input the plurality of speech samples into the speech recognition model in turn through the input/output module, with the voice information and the text label corresponding to the voice information serving respectively as the input and the output of the speech recognition model, and to continuously train the neuron weights of the speech recognition model through these inputs and outputs until all speech samples have been input into the speech recognition model, ending the training of the speech recognition model, after which the speech recognition model carrying the trained neuron weights is taken as the target model; to evaluate the error of the target model by L(S) = -ln ∏_{(h(x),z)∈S} p(z|h(x)) = -∑_{(h(x),z)∈S} ln p(z|h(x)), where L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the set of training speech samples, and the predicted text is the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model; to adjust the neuron weights of the target model until the error is less than a threshold, taking the neuron weights whose error is less than the threshold as the ideal weights; and to deploy the target model and the ideal weights to the client.
一些实施方式中,所述处理模块202还用于:In some implementation manners, the processing module 202 is further configured to:
根据预设的分帧参数分帧处理所述训练语音信息,得到所述训练语音信息对应的语句,所述预设分帧参数包括帧时长、帧数和前后帧重复时长;Processing the training voice information in frames according to preset framing parameters to obtain sentences corresponding to the training voice information, and the preset framing parameters include frame duration, number of frames, and repetition duration of the preceding and following frames;
根据预设的二维参数和滤波器组特征提取算法转化所述语句,得到二维语音信息。The sentence is transformed according to the preset two-dimensional parameters and the filter bank feature extraction algorithm to obtain two-dimensional voice information.
一些实施方式中,所述处理模块202还用于:In some implementation manners, the processing module 202 is further configured to:
对所述二维语音信息进行离散傅里叶变换,以得到所述二维语音信息对应的线性频谱X(k);Performing discrete Fourier transform on the two-dimensional voice information to obtain a linear frequency spectrum X(k) corresponding to the two-dimensional voice information;
通过预设的带通滤波器对所述线性频谱滤波,以得到目标线性频谱,当所述带通滤波器的中心频率为f(m)时,则所述带通滤波器的传递函数为:The linear frequency spectrum is filtered by a preset band-pass filter to obtain the target linear frequency spectrum. When the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:
Figure PCTCN2019119128-appb-000016
所述f(m)的表达式为:
Figure PCTCN2019119128-appb-000016
The expression of f(m) is:
Figure PCTCN2019119128-appb-000017
Figure PCTCN2019119128-appb-000017
所述带通滤波器包括多个具有三角形滤波特性的带通滤波器,所述f l为所述带通滤波器频率范围的最低频率,所述f h为所述带通滤波器频率范围的最高频率,所述N为DFT时的长度,所述f s为所述带通滤波器的采样频率,所述F mel函数为F mel=1125ln(1+f/700),所述Fmel的逆函数为:
Figure PCTCN2019119128-appb-000018
b为整数;
The band-pass filter includes a plurality of band-pass filters with triangular filtering characteristics, the f l is the lowest frequency in the frequency range of the band-pass filter, and the f h is the frequency range of the band-pass filter. The highest frequency, the N is the length of DFT, the f s is the sampling frequency of the band-pass filter, the F mel function is F mel =1125ln(1+f/700), the inverse of Fmel The function is:
Figure PCTCN2019119128-appb-000018
b is an integer;
根据
Figure PCTCN2019119128-appb-000019
0≤m≤M计算所述目标线性频谱对应的对数能量,得到语谱图,所述X(k)为所述线性频谱;
according to
Figure PCTCN2019119128-appb-000019
0≤m≤M calculate the logarithmic energy corresponding to the target linear frequency spectrum to obtain a spectrogram, and the X(k) is the linear frequency spectrum;
In some embodiments, the fully connected layer includes a classification function, namely

$$\delta(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}},\qquad j=1,\dots,K,$$

where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, such that every element lies in (0, 1) and all elements sum to 1.
In some embodiments, with x denoting the input of the residual module and y its output, the mathematical expression of the residual module is: y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.

In some embodiments, F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, and the mathematical expression of the ReLU function is ReLU(x) = max(0, x).

In some embodiments, adjusting the neuron weights of the target model includes:

adjusting the neuron weights by stochastic gradient descent.
The above describes the apparatus in this embodiment of the application from the perspective of modular functional entities. The following describes a device for constructing a speech recognition model from the hardware perspective. As shown in FIG. 3, the device includes: a processor, a memory, an input/output unit (which may also be a transceiver, not marked in FIG. 3), and a computer program stored in the memory and runnable on the processor. For example, the computer program may be the program corresponding to the method for constructing a speech recognition model in the embodiment corresponding to FIG. 1. When the device for constructing a speech recognition model implements the functions of the apparatus 20 shown in FIG. 2, the processor, when executing the computer program, implements the steps of the method for constructing a speech recognition model executed by the apparatus 20 in the embodiment corresponding to FIG. 2; alternatively, the processor, when executing the computer program, implements the functions of the modules of the apparatus 20 of the embodiment corresponding to FIG. 2.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor. The processor is the control center of the computer apparatus and connects all parts of the whole computer apparatus using various interfaces and lines.

The memory may be used to store the computer program and/or modules; the processor implements the various functions of the computer apparatus by running or executing the computer program and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the handset (such as audio data or video data). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

The input/output unit may also be replaced by a receiver and a transmitter, which may be the same or different physical entities; when they are the same physical entity, they may be collectively referred to as the input/output unit. The input/output unit may be a transceiver.

The memory may be integrated in the processor, or may be provided separately from the processor.
This application further provides a computer storage medium. The computer storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to execute the following steps:

obtaining a plurality of training speech samples, the training speech samples including voice information and text labels corresponding to the voice information;

constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, the convolutional residual layer including a plurality of sequentially connected residual stack layers, each residual stack layer containing a plurality of sequentially connected residual modules, and each residual module containing a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers;

inputting the plurality of speech samples into the speech recognition model in turn, with the voice information and the text label corresponding to the voice information serving respectively as the input and the output of the speech recognition model, and continuously training the neuron weights of the speech recognition model through these inputs and outputs until all speech samples have been input into the speech recognition model, ending the training of the speech recognition model, after which the speech recognition model carrying the trained neuron weights is taken as the target model;

evaluating the error of the target model by L(S) = -ln ∏_{(h(x),z)∈S} p(z|h(x)) = -∑_{(h(x),z)∈S} ln p(z|h(x)), where L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the set of training speech samples, and the predicted text is the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model;

adjusting the neuron weights of the target model until the error is less than a threshold, and taking the neuron weights whose error is less than the threshold as the ideal weights;

deploying the target model and the ideal weights to the client.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM) and including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.

The embodiments of this application have been described above with reference to the accompanying drawings, but this application is not limited to the above specific implementations, which are merely illustrative rather than restrictive. Under the inspiration of this application, those of ordinary skill in the art may devise many further forms without departing from the purpose of this application and the scope protected by the claims; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, whether used directly or indirectly in other related technical fields, falls within the protection of this application.

Claims (20)

1. A method for constructing a speech recognition model, the method comprising:

obtaining a plurality of training speech samples, the training speech samples including voice information and text labels corresponding to the voice information;

constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, the convolutional residual layer including a plurality of sequentially connected residual stack layers, each residual stack layer containing a plurality of sequentially connected residual modules, and each residual module containing a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers;

inputting the plurality of speech samples into the speech recognition model in turn, with the voice information and the text label corresponding to the voice information serving respectively as the input and the output of the speech recognition model, and continuously training the neuron weights of the speech recognition model through these inputs and outputs until all speech samples have been input into the speech recognition model, ending the training of the speech recognition model, after which the speech recognition model carrying the trained neuron weights is taken as a target model;

evaluating the error of the target model by L(S) = -ln ∏_{(h(x),z)∈S} p(z|h(x)) = -∑_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training speech samples, and the predicted text is the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model;

adjusting the neuron weights of the target model until the error is less than a threshold, and taking the neuron weights whose error is less than the threshold as ideal weights;

deploying the target model and the ideal weights to a client.
2. The method according to claim 1, wherein before inputting the plurality of speech samples into the speech recognition model, the method further comprises:

processing the training voice information into frames according to preset framing parameters to obtain sentences corresponding to the training voice information, the preset framing parameters including a frame duration, a number of frames, and a repeat duration of adjacent frames;

converting the sentences according to preset two-dimensional parameters and a filter-bank feature extraction to obtain two-dimensional voice information.
3. The method according to claim 2, wherein processing the training voice information into frames according to the preset framing parameters comprises:

performing a discrete Fourier transform on the two-dimensional voice information to obtain the linear spectrum X(k) corresponding to the two-dimensional voice information;

filtering the linear spectrum with a preset band-pass filter to obtain a target linear spectrum, wherein when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\le f(m+1)\\ 0, & k>f(m+1)\end{cases}$$

and the expression of f(m) is:

$$f(m)=\frac{N}{f_s}\,F_{mel}^{-1}\!\left(F_{mel}(f_l)+m\cdot\frac{F_{mel}(f_h)-F_{mel}(f_l)}{M+1}\right)$$

wherein the band-pass filter comprises a plurality of band-pass filters with triangular filtering characteristics, f_l is the lowest frequency of the band-pass filter's frequency range, f_h is the highest frequency of that range, N is the DFT length, f_s is the sampling frequency of the band-pass filter, M is the number of triangular band-pass filters, the F_mel function is F_mel = 1125 ln(1 + f/700), and the inverse of F_mel is:

$$F_{mel}^{-1}(b)=700\left(e^{b/1125}-1\right)$$

b being an integer;

computing the logarithmic energy corresponding to the target linear spectrum according to

$$S(m)=\ln\!\left(\sum_{k=0}^{N-1}\lvert X(k)\rvert^{2}\,H_m(k)\right),\qquad 0\le m\le M$$

to obtain a spectrogram, wherein X(k) is the linear spectrum.
4. The method according to claim 1, wherein the fully connected layer includes a classification function, namely

$$\delta(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}},\qquad j=1,\dots,K,$$

wherein j is a natural number, and the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, such that every element lies in (0, 1) and all elements sum to 1.
5. The method according to claim 1, wherein, with x denoting the input of the residual module and y its output, the mathematical expression of the residual module is:

y = F(x, w_i) + w_s·x, wherein F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
6. The method according to claim 5, wherein F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, and the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
7. The method according to claim 1, wherein adjusting the weights of the neurons of the target model comprises:

adjusting the weights of the neurons by stochastic gradient descent.
8. An apparatus for constructing a speech recognition model, the apparatus comprising:

an input/output module, configured to obtain a plurality of training speech samples, the training speech samples including voice information and text labels corresponding to the voice information;

a processing module, configured to construct a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, the convolutional residual layer including a plurality of sequentially connected residual stack layers, each residual stack layer containing a plurality of sequentially connected residual modules, and each residual module containing a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers; to input the plurality of speech samples into the speech recognition model in turn through the input/output module, with the voice information and the text label corresponding to the voice information serving respectively as the input and the output of the speech recognition model, and to continuously train the neuron weights of the speech recognition model through these inputs and outputs until all speech samples have been input into the speech recognition model, ending the training of the speech recognition model, after which the speech recognition model carrying the trained neuron weights is taken as a target model; and to evaluate the error of the target model by L(S) = -ln ∏_{(h(x),z)∈S} p(z|h(x)) = -∑_{(h(x),z)∈S} ln p(z|h(x)), wherein L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training speech samples, and the predicted text is the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model;

the processing module being further configured to adjust the neuron weights of the target model until the error is less than a threshold, taking the neuron weights whose error is less than the threshold as ideal weights, and to deploy the target model and the ideal weights to a client.
9. The apparatus for constructing a speech recognition model according to claim 8, wherein the processing module is further configured to:

process the training voice information into frames according to preset framing parameters to obtain sentences corresponding to the training voice information, the preset framing parameters including a frame duration, a number of frames, and a repeat duration of adjacent frames;

convert the sentences according to preset two-dimensional parameters and a filter-bank feature extraction algorithm to obtain two-dimensional voice information.
  10. 根据权利要求9所述构建语音识别模型的装置,所述处理模块还用于:According to the apparatus for constructing a speech recognition model according to claim 9, the processing module is further configured to:
    对所述二维语音信息进行离散傅里叶变换,以得到所述二维语音信息对应的线性频谱X(k);Performing discrete Fourier transform on the two-dimensional voice information to obtain a linear frequency spectrum X(k) corresponding to the two-dimensional voice information;
    filter the linear frequency spectrum through preset band-pass filters to obtain a target linear frequency spectrum, wherein, when the center frequency of a band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$$

    and the expression for f(m) is:

$$f(m) = \left(\frac{N}{f_s}\right) F_{mel}^{-1}\!\left(F_{mel}(f_l) + m\,\frac{F_{mel}(f_h) - F_{mel}(f_l)}{M+1}\right);$$
    the band-pass filters comprise a plurality of band-pass filters with triangular filtering characteristics, $f_l$ is the lowest frequency of the band-pass filter frequency range, $f_h$ is the highest frequency of the band-pass filter frequency range, N is the DFT length, $f_s$ is the sampling frequency of the band-pass filter, the $F_{mel}$ function is $F_{mel} = 1125\ln(1 + f/700)$, and the inverse of $F_{mel}$ is:

$$F_{mel}^{-1}(b) = 700\left(e^{b/1125} - 1\right),$$

    where b is an integer;
    compute the logarithmic energy corresponding to the target linear frequency spectrum according to

$$S(m) = \ln\!\left(\sum_{k=0}^{N-1} |X(k)|^2\, H_m(k)\right), \quad 0 \le m \le M,$$

    to obtain a spectrogram, where X(k) is the linear frequency spectrum.
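    Purely as an illustrative sketch of claims 9 and 10, the pipeline from a single frame to its log filter-bank energies S(m) can be written as below. The sampling rate, FFT size, filter count, and frequency range are hypothetical parameter choices, not values fixed by the claims.

```python
import numpy as np

def mel(f):
    # F_mel = 1125 ln(1 + f/700)
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(b):
    # Inverse of F_mel: 700 (e^(b/1125) - 1)
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def log_filterbank_energies(frame, sample_rate=16000, n_fft=512,
                            n_filters=26, f_low=0.0, f_high=8000.0):
    """Compute S(m) = ln(sum_k |X(k)|^2 H_m(k)) for one frame."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2   # |X(k)|^2 from the DFT

    # Center frequencies f(m), m = 0..M+1, equally spaced on the mel
    # scale between f_l and f_h, mapped back to DFT bin indices.
    mels = np.linspace(mel(f_low), mel(f_high), n_filters + 2)
    bins = np.floor((n_fft / sample_rate) * mel_inv(mels)).astype(int)

    energies = np.empty(n_filters)
    for m in range(1, n_filters + 1):
        h = np.zeros(len(power))                     # triangular H_m(k)
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        h[left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        h[center:right] = (right - np.arange(center, right)) / max(right - center, 1)
        energies[m - 1] = np.log(np.dot(power, h) + 1e-12)
    return energies
```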
  11. The apparatus for constructing a speech recognition model according to claim 8, wherein the fully connected layer includes a classification function defined as

$$\delta(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}},$$

    where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, such that each element lies in the range (0, 1) and all elements sum to 1.
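    The classification function of claim 11 matches the standard softmax. A brief sketch follows; the max-subtraction is added purely for numerical stability and does not change the result.

```python
import numpy as np

def softmax(z):
    """Map a K-dimensional vector z to a K-dimensional real vector
    whose elements lie in (0, 1) and sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659 0.242 0.099]
```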
  12. The apparatus for constructing a speech recognition model according to claim 8, wherein the processing module is further configured such that, when the input of the residual module is x and the output of the residual module is y, the residual module is expressed mathematically as:
    y = F(x, w_i) + w_s x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  13. The apparatus for constructing a speech recognition model according to claim 12, wherein F(x, w_i) adopts the ReLU function as the activation function of the independent convolutional layer, the ReLU function being expressed mathematically as ReLU(x) = max(0, x).
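    As an illustrative sketch of claims 12 and 13 only: modeling F(x, w_i) as two stacked weight layers with ReLU activations is a hypothetical minimal choice (the claims say only that the module stacks several hidden layers), and w_s here projects x on the bypass channel so that the shapes match.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # ReLU(x) = max(0, x)

def residual_module(x, w1, w2, w_s):
    """y = F(x, w_i) + w_s x, with F modeled as two weight layers."""
    f = relu(w2 @ relu(w1 @ x))        # F(x, w_i): the residual branch
    return f + w_s @ x                 # bypass channel added back

# Example: random weights mapping a 4-dim input to a 3-dim output.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
y = residual_module(x, rng.normal(size=(4, 4)),
                    rng.normal(size=(3, 4)), rng.normal(size=(3, 4)))
```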
  14. The apparatus for constructing a speech recognition model according to claim 8, wherein adjusting the neuron weights of the target model comprises:
    adjusting the neuron weights by stochastic gradient descent.
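    A minimal sketch of the weight adjustment of claim 14: repeat stochastic gradient descent updates until the evaluated error falls below the threshold. The callback names, learning rate, and step cap are hypothetical.

```python
def sgd_step(weights, gradients, learning_rate=0.01):
    """One stochastic gradient descent update: each weight moves
    against its gradient, estimated from a randomly drawn sample."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

def train_until_threshold(weights, grad_fn, error_fn, threshold,
                          learning_rate=0.01, max_steps=100_000):
    """Adjust the weights until error_fn(weights) < threshold."""
    for _ in range(max_steps):
        if error_fn(weights) < threshold:
            break
        weights = sgd_step(weights, grad_fn(weights), learning_rate)
    return weights
```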
  15. A device for constructing a speech recognition model, the device comprising: at least one processor, a memory, and an input/output unit;
    wherein the memory is configured to store program code, and the processor is configured to call the program code stored in the memory to perform the following steps:
    acquiring a plurality of training voice samples, the training voice samples including voice information and text labels corresponding to the voice information;
    constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, wherein the convolutional residual layer comprises a plurality of sequentially connected residual stacking layers, each residual stacking layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers;
    inputting the plurality of voice samples into the speech recognition model in sequence, using the voice information and the text label corresponding to the voice information as the input and the output of the speech recognition model respectively, and continuously training the neuron weights of the speech recognition model through the input and the output until all voice samples have been input into the speech recognition model, whereupon training of the speech recognition model ends; after training ends, taking the speech recognition model with the trained neuron weights as a target model;
    evaluating the error of the target model by

$$L(S) = -\ln \prod_{(h(x),\,z)\in S} p(z \mid h(x)) = -\sum_{(h(x),\,z)\in S} \ln p(z \mid h(x)),$$

    where L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, and S is the plurality of training voice samples; the predicted text refers to the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model;
    adjusting the neuron weights of the target model until the error is less than a threshold, and setting the neuron weights for which the error is less than the threshold as ideal weights;
    deploying the target model and the ideal weights to a client.
  16. The device for constructing a speech recognition model according to claim 15, wherein the processor is configured to call the program code stored in the memory to perform, before inputting the plurality of voice samples into the speech recognition model, the following steps:
    framing the training voice information according to preset framing parameters to obtain sentences corresponding to the training voice information, the preset framing parameters including a frame duration, a number of frames, and a repetition duration shared by adjacent frames;
    transforming the sentences according to preset two-dimensional parameters and a filter bank feature extraction algorithm to obtain two-dimensional voice information.
  17. The device for constructing a speech recognition model according to claim 16, wherein the processor is configured to call the program code stored in the memory to perform, when framing the training voice information according to the preset framing parameters, the following steps:
    performing a discrete Fourier transform on the two-dimensional voice information to obtain a linear frequency spectrum X(k) corresponding to the two-dimensional voice information;
    filtering the linear frequency spectrum through preset band-pass filters to obtain a target linear frequency spectrum, wherein, when the center frequency of a band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$$

    and the expression for f(m) is:

$$f(m) = \left(\frac{N}{f_s}\right) F_{mel}^{-1}\!\left(F_{mel}(f_l) + m\,\frac{F_{mel}(f_h) - F_{mel}(f_l)}{M+1}\right);$$

    the band-pass filters comprise a plurality of band-pass filters with triangular filtering characteristics, $f_l$ is the lowest frequency of the band-pass filter frequency range, $f_h$ is the highest frequency of the band-pass filter frequency range, N is the DFT length, $f_s$ is the sampling frequency of the band-pass filter, the $F_{mel}$ function is $F_{mel} = 1125\ln(1 + f/700)$, and the inverse of $F_{mel}$ is:

$$F_{mel}^{-1}(b) = 700\left(e^{b/1125} - 1\right),$$

    where b is an integer;
    computing the logarithmic energy corresponding to the target linear frequency spectrum according to

$$S(m) = \ln\!\left(\sum_{k=0}^{N-1} |X(k)|^2\, H_m(k)\right), \quad 0 \le m \le M,$$

    to obtain a spectrogram, where X(k) is the linear frequency spectrum.
  18. The device for constructing a speech recognition model according to claim 15, wherein the fully connected layer includes a classification function defined as

$$\delta(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}},$$

    where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolutional residual layer into another K-dimensional real vector δ(z)_j, such that each element lies in the range (0, 1) and all elements sum to 1.
  19. The device for constructing a speech recognition model according to claim 15, wherein the input of the residual module is x, the output of the residual module is y, and the residual module is expressed mathematically as:
    y = F(x, w_i) + w_s x, where F(x, w_i) is the output of the independent convolutional layer and w_s is the weight of the residual module.
  20. A computer storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    acquiring a plurality of training voice samples, the training voice samples including voice information and text labels corresponding to the voice information;
    constructing a speech recognition model from an independent convolutional layer, a convolutional residual layer, a fully connected layer, and an output layer, wherein the convolutional residual layer comprises a plurality of sequentially connected residual stacking layers, each residual stacking layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel bypassing the plurality of sequentially connected weight layers;
    inputting the plurality of voice samples into the speech recognition model in sequence, using the voice information and the text label corresponding to the voice information as the input and the output of the speech recognition model respectively, and continuously training the neuron weights of the speech recognition model through the input and the output until all voice samples have been input into the speech recognition model, whereupon training of the speech recognition model ends; after training ends, taking the speech recognition model with the trained neuron weights as a target model;
    evaluating the error of the target model by

$$L(S) = -\ln \prod_{(h(x),\,z)\in S} p(z \mid h(x)) = -\sum_{(h(x),\,z)\in S} \ln p(z \mid h(x)),$$

    where L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, and S is the plurality of training voice samples; the predicted text refers to the text information computed and output by the target model according to the neuron weights after the voice information is input into the target model;
    adjusting the neuron weights of the target model until the error is less than a threshold, and setting the neuron weights for which the error is less than the threshold as ideal weights;
    deploying the target model and the ideal weights to a client.
PCT/CN2019/119128 2019-09-19 2019-11-18 Method, apparatus and device for constructing speech recognition model, and storage medium WO2021051628A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910884620.9 2019-09-19
CN201910884620.9A CN110751944A (en) 2019-09-19 2019-09-19 Method, device, equipment and storage medium for constructing voice recognition model

Publications (1)

Publication Number Publication Date
WO2021051628A1 true WO2021051628A1 (en) 2021-03-25

Family

ID=69276643

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/119128 WO2021051628A1 (en) 2019-09-19 2019-11-18 Method, apparatus and device for constructing speech recognition model, and storage medium

Country Status (2)

Country Link
CN (1) CN110751944A (en)
WO (1) WO2021051628A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111354345B (en) * 2020-03-11 2021-08-31 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating speech model and speech recognition
CN111862942B (en) * 2020-07-28 2022-05-06 思必驰科技股份有限公司 Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN112597764B (en) * 2020-12-23 2023-07-25 青岛海尔科技有限公司 Text classification method and device, storage medium and electronic device
CN113012706B (en) * 2021-02-18 2023-06-27 联想(北京)有限公司 Data processing method and device and electronic equipment
CN113053361B (en) * 2021-03-18 2023-07-04 北京金山云网络技术有限公司 Speech recognition method, model training method, device, equipment and medium
CN113744729A (en) * 2021-09-17 2021-12-03 北京达佳互联信息技术有限公司 Speech recognition model generation method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108550375A (en) * 2018-03-14 2018-09-18 鲁东大学 A kind of emotion identification method, device and computer equipment based on voice signal
CN108694951A (en) * 2018-05-22 2018-10-23 华南理工大学 A kind of speaker's discrimination method based on multithread hierarchical fusion transform characteristics and long memory network in short-term
CN108847223A (en) * 2018-06-20 2018-11-20 陕西科技大学 A kind of audio recognition method based on depth residual error neural network
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106910497B (en) * 2015-12-22 2021-04-16 阿里巴巴集团控股有限公司 Chinese word pronunciation prediction method and device
KR102526103B1 (en) * 2017-10-16 2023-04-26 일루미나, 인코포레이티드 Deep learning-based splice site classification
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
US10573295B2 (en) * 2017-10-27 2020-02-25 Salesforce.Com, Inc. End-to-end speech recognition with policy learning
CN109346061B (en) * 2018-09-28 2021-04-20 腾讯音乐娱乐科技(深圳)有限公司 Audio detection method, device and storage medium
CN109919005A (en) * 2019-01-23 2019-06-21 平安科技(深圳)有限公司 Livestock personal identification method, electronic device and readable storage medium storing program for executing
CN110010133A (en) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 Vocal print detection method, device, equipment and storage medium based on short text

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108550375A (en) * 2018-03-14 2018-09-18 鲁东大学 A kind of emotion identification method, device and computer equipment based on voice signal
CN108694951A (en) * 2018-05-22 2018-10-23 华南理工大学 A kind of speaker's discrimination method based on multithread hierarchical fusion transform characteristics and long memory network in short-term
CN108847223A (en) * 2018-06-20 2018-11-20 陕西科技大学 A kind of audio recognition method based on depth residual error neural network
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE KAIMING; ZHANG XIANGYU; REN SHAOQING; SUN JIAN: "Deep Residual Learning for Image Recognition", 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 27 June 2016 (2016-06-27), pages 770 - 778, XP033021254, DOI: 10.1109/CVPR.2016.90 *
SAINATH TARA N.; VINYALS ORIOL; SENIOR ANDREW; SAK HASIM: "Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks", 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 19 April 2015 (2015-04-19), pages 4580 - 4584, XP033187628, DOI: 10.1109/ICASSP.2015.7178838 *

Also Published As

Publication number Publication date
CN110751944A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
WO2021051628A1 (en) Method, apparatus and device for constructing speech recognition model, and storage medium
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
WO2021139294A1 (en) Method and apparatus for training speech separation model, storage medium, and computer device
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
US20200402497A1 (en) Systems and Methods for Speech Generation
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
WO2018227780A1 (en) Speech recognition method and device, computer device and storage medium
WO2019227586A1 (en) Voice model training method, speaker recognition method, apparatus, device and medium
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN113488058B (en) Voiceprint recognition method based on short voice
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
Hasannezhad et al. PACDNN: A phase-aware composite deep neural network for speech enhancement
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN114203184A (en) Multi-state voiceprint feature identification method and device
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN116434758A (en) Voiceprint recognition model training method and device, electronic equipment and storage medium
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
CN112951270B (en) Voice fluency detection method and device and electronic equipment
Jaleel et al. Gender identification from speech recognition using machine learning techniques and convolutional neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945725

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945725

Country of ref document: EP

Kind code of ref document: A1