CN109754789B - Method and device for recognizing voice phonemes - Google Patents

Method and device for recognizing voice phonemes

Info

Publication number
CN109754789B
CN109754789B (application CN201711082646.9A)
Authority
CN
China
Prior art keywords
phoneme
recognized
speech
model
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711082646.9A
Other languages
Chinese (zh)
Other versions
CN109754789A (en)
Inventor
姜珂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201711082646.9A priority Critical patent/CN109754789B/en
Publication of CN109754789A publication Critical patent/CN109754789A/en
Application granted granted Critical
Publication of CN109754789B publication Critical patent/CN109754789B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a method and a device for recognizing a voice phoneme, relates to the technical field of voice recognition, and mainly aims to solve the problems of low phoneme-segmentation efficiency and falling into a locally optimal solution during voice recognition. The main technical scheme of the invention comprises the following steps: inputting a speech to be recognized into a phoneme recognition model, and obtaining an expected result corresponding to the speech to be recognized according to an output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden Markov models; training model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is smaller than a preset threshold value; and determining the output result of which the change rate is smaller than the preset threshold value as the final phoneme recognition result corresponding to the speech to be recognized. The invention is mainly applied in the process of speech recognition.

Description

Method and device for recognizing voice phonemes
Technical Field
The present invention relates to the field of speech recognition technology, and in particular, to a method and an apparatus for recognizing speech phonemes.
Background
In the field of speech recognition, a phoneme (phone) is the smallest unit of speech, so in order to improve overall recognition accuracy, the recognition accuracy of each individual phoneme must first be improved.
Currently, there are two main methods for training phoneme models: the Gaussian mixture hidden Markov model (GMM-HMM) and the deep neural network hidden Markov model (DNN-HMM). Both mainly use the HMM to fit the changing states of the frames corresponding to a phoneme, then use the GMM or the DNN to model the frames, and apply Viterbi decoding during recognition, so that the audio can be segmented by time frame.
In the process of implementing the invention, the inventor found that in the prior art, particularly in phoneme models that recognize the pinyin of Chinese characters, segmentation along the time frame must be accurate to the millisecond level in order to be precise, so the efficiency of segmenting phonemes is low. In addition, when the HMM is used, its inherent unigram and bigram assumptions may cause the recognized phonemes to fall into a locally optimal solution, reducing the accuracy of phoneme recognition; and if a trigram or quadrigram assumption is used instead, the amount of computation required to recognize the phonemes becomes enormous.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for recognizing speech phonemes, which mainly aim to solve the problems of low phoneme-segmentation efficiency and falling into a locally optimal solution during speech recognition.
In order to solve the above problems, the present invention mainly provides the following technical solutions:
in one aspect, the present invention provides a method for recognizing speech phonemes, the method comprising:
inputting a speech to be recognized into a phoneme recognition model, and obtaining an expected result corresponding to the speech to be recognized according to an output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden Markov models;
training model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is smaller than a preset threshold value;
and determining the output result of which the change rate is smaller than the preset threshold value as the final phoneme recognition result corresponding to the speech to be recognized.
Optionally, before inputting the speech to be recognized into the phoneme recognition model, the method further includes:
and constructing the phoneme recognition model.
Optionally, the constructing the phoneme recognition model includes:
constructing a convolutional neural network CNN and a long-short term memory network LSTM with a preset number of layers;
adding a deep neural network DNN and a hidden Markov model HMM;
and constructing the phoneme recognition model by using the convolutional neural network CNN, the long-short term memory network LSTM, the deep neural network DNN and the hidden Markov model HMM, and assigning an initialization value to the phoneme recognition model, wherein the convolutional neural network CNN is used as an input end of the speech to be recognized, and the deep neural network DNN is used as an output end of the speech to be recognized.
Optionally, obtaining an expected result corresponding to the speech to be recognized according to the output result includes:
inputting the voice to be recognized into the convolutional neural network CNN, and performing noise reduction processing on the voice to be recognized;
inputting the voice to be recognized after noise reduction into the long-short term memory network LSTM with the preset number of layers, and fitting the voice to be recognized, wherein the long-short term memory network LSTM filters invalid phonemes by activating a forgetting gate, and retains the valid phonemes by activating a memory gate;
inputting the fitted voice to be recognized into the deep neural network DNN;
recording the output phonemes and the corresponding probabilities at each moment as visible observation sequences in a probability output matrix in the Hidden Markov Model (HMM);
calculating according to the probability output matrix and a forward algorithm to obtain a first matrix; calculating according to the probability output matrix and a backward algorithm to obtain a second matrix;
calculating the maximum likelihood corresponding to each phoneme according to the first matrix and the second matrix, and recording the maximum likelihood into a third matrix;
and decoding the third matrix to obtain the maximum likelihood value of each phoneme so as to obtain the expected result.
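The sequence of operations above (CNN noise reduction, LSTM fitting, DNN classification, recording the probability output matrix, forward/backward computation, and decoding) can be sketched end to end as follows. This is only an illustrative sketch: the function names, array shapes and the stand-in network are assumptions made for readability, not the patent's implementation.

```python
# Illustrative pipeline sketch (assumed names and shapes, not the patented implementation).
import numpy as np

rng = np.random.default_rng(0)
T, F, P = 50, 40, 28                      # frames, feature dim, phoneme classes (assumed sizes)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def phoneme_recognition_model(features):
    """Stand-in for the CNN + LSTM + DNN stack: maps T x F features to a
    T x P probability output matrix whose rows each sum to 1."""
    W = rng.normal(size=(F, P))           # plays the role of the trained layers
    return softmax(features @ W, axis=1)

features = rng.normal(size=(T, F))        # the speech to be recognized, after framing
prob_matrix = phoneme_recognition_model(features)   # visible observations for the HMM
decoded = prob_matrix.argmax(axis=1)      # greedy stand-in for HMM forward/backward + decoding
print(decoded[:10])
```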
Optionally, training the model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is smaller than a preset threshold includes:
according to the expected result, performing gradient descent derivation operation on each phoneme layer by layer from the output end of the phoneme model;
and adjusting the neuron parameters of the phoneme model in each phoneme according to the derivation operation until the change rate of the output result of the phoneme model is smaller than the preset threshold value.
In a second aspect, the present invention provides an apparatus for recognizing speech phonemes, comprising:
the device comprises an input unit, a phoneme recognition model and a control unit, wherein the input unit is used for inputting a speech to be recognized into the phoneme recognition model, and the phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden Markov models;
the output unit is used for obtaining an expected result corresponding to the speech to be recognized according to an output result after the speech to be recognized is input into the phoneme recognition model by the input unit;
the training unit is used for training the model parameters in the phoneme recognition model according to the expected result output by the output unit until the change rate of the output result of the phoneme model is smaller than a preset threshold value;
and the determining unit is used for determining an output result of which the change rate is smaller than the preset threshold value as a final phoneme recognition result corresponding to the speech to be recognized.
Optionally, the apparatus further comprises:
and the construction unit is used for constructing the phoneme recognition model before the input unit inputs the speech to be recognized into the phoneme recognition model.
Optionally, the building unit includes:
the first construction module is used for constructing a convolutional neural network CNN and long-short term memory networks LSTM with a preset number of layers;
the adding module is used for adding a deep neural network DNN and a hidden Markov model HMM;
a second constructing module, configured to construct the phoneme recognition model by using the convolutional neural network CNN, the long-short term memory network LSTM, a deep neural network DNN, and the hidden markov model HMM, where the convolutional neural network CNN is used as an input end of the speech to be recognized, and the deep neural network DNN is used as an output end of the speech to be recognized;
and the assignment module is used for assigning an initialization value to the phoneme recognition model.
Optionally, the output unit includes:
the noise reduction module is used for inputting the voice to be recognized into the convolutional neural network CNN and performing noise reduction processing on the voice to be recognized;
the first input module is used for inputting the voice to be recognized after the noise reduction module performs noise reduction into the long-short term memory network LSTM with the preset number of layers, wherein the long-short term memory network LSTM filters invalid phonemes by activating a forgetting gate, and retains the valid phonemes by activating a memory gate;
the fitting module is used for fitting the voice to be recognized;
the second input module is used for inputting the fitted voice to be recognized into the deep neural network DNN;
a recording module, configured to record the phonemes and the corresponding probabilities output at each moment by the second input module as a visible observation sequence in a probability output matrix in the hidden markov model HMM;
the first calculation module is used for calculating according to the probability output matrix and a forward algorithm to obtain a first matrix;
the second calculation module is used for calculating according to the probability output matrix and a backward algorithm to obtain a second matrix;
the third calculation module is used for calculating the maximum likelihood corresponding to each phoneme according to the first matrix and the second matrix and recording the maximum likelihood into a third matrix;
and the processing module is used for decoding the third matrix to obtain the maximum likelihood value of each phoneme so as to obtain the expected result.
Optionally, the training unit includes:
the calculation module is used for performing gradient descent derivation operation on each phoneme layer by layer from the output end of the phoneme model according to the expected result;
and the adjusting module is used for adjusting the neuron parameters of the phoneme model in each phoneme according to the derivation operation until the change rate of the output result of the phoneme model is smaller than the preset threshold value.
In order to achieve the above object, according to another aspect of the present invention, there is provided a storage medium including a stored program, wherein, when the program runs, an apparatus on which the storage medium is located is controlled to perform the method for recognizing speech phonemes described above.
In order to achieve the above object, according to another aspect of the present invention, there is provided a processor for running a program, wherein, when running, the program performs the method for recognizing speech phonemes described above.
By means of the above technical solutions, the technical solution provided by the present invention has at least the following advantages:
The invention provides a method and a device for recognizing voice phonemes. First, a speech to be recognized is input into a phoneme recognition model, and an expected result corresponding to the speech to be recognized is obtained according to an output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden Markov models. Second, model parameters in the phoneme recognition model are trained according to the expected result until the change rate of the output result of the phoneme model is smaller than a preset threshold value. Finally, an output result whose change rate is smaller than the preset threshold value is determined as the final phoneme recognition result corresponding to the speech to be recognized. Compared with the prior art, the invention compensates for the shortcomings of the HMM's inherent unigram and bigram assumptions by applying the phoneme recognition model, avoids the problem of falling into a locally optimal solution, and, through the hidden Markov model in the phoneme recognition model, can decode the speech at the frame level, thereby reducing the scale of computation.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart illustrating a method for recognizing speech phonemes according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an architecture of a phoneme recognition model provided by an embodiment of the present invention;
FIG. 3 is a diagram illustrating the change rate of the recognition output result over time according to an embodiment of the present invention;
FIG. 4 is a flow chart illustrating another method for recognizing speech phonemes provided by an embodiment of the present invention;
FIG. 5 is a diagram illustrating a long short term memory network LSTM fitting result provided by an embodiment of the present invention;
FIG. 6 is a diagram illustrating a probability output matrix provided by an embodiment of the invention;
FIG. 7 is a diagram illustrating a portion of neurons in a phoneme recognition model according to an embodiment of the present invention;
fig. 8 is a block diagram illustrating a speech phoneme recognition apparatus according to an embodiment of the present invention;
fig. 9 is a block diagram illustrating another speech phoneme recognition apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
An embodiment of the present invention provides a method for recognizing a speech phoneme, as shown in fig. 1, where the method includes:
101. Inputting the speech to be recognized into the phoneme recognition model, and obtaining an expected result corresponding to the speech to be recognized according to an output result.
The speech to be recognized in the embodiment of the present invention may be any segment of speech (for example, a segment of Chinese). The speech to be recognized is input into the phoneme recognition model, which may be used to segment the speech to be recognized along the time sequence, recognize it, and convert it into pinyin; the pinyin is then converted into the corresponding Chinese characters by a language model. The embodiment of the invention focuses on the acoustic process of recognizing the speech as a phoneme sequence.
The phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden Markov models; the neural networks include a Convolutional Neural Network (CNN) and a Long Short-Term Memory network (LSTM). In the practical application of the embodiment of the present invention, the phoneme recognition model may comprise one layer of convolutional neural network CNN, 5 layers of long short-term memory network LSTM, 1 layer of deep neural network DNN, and 1 Hidden Markov Model (HMM). However, the specific number of LSTM layers is not limited in the embodiments of the present invention and may be as many as 170 layers.
As shown in fig. 2, which is a schematic diagram of the architecture of a phoneme recognition model provided by an embodiment of the present invention: the convolutional neural network CNN uses convolution to compensate for the influence of different speakers' voices; the 5 layers of long short-term memory network LSTM propagate the speech features output by the convolutional neural network CNN and fit dynamic data such as speech well, which makes it possible for information at future moments to intervene in and modify the input at the current moment; and the output is produced by the last deep neural network DNN layer. The connections shown in fig. 2 are only an exemplary illustration; in practical applications, the connection relationship is a transverse connection, which is not limited in the embodiment of the present invention.
102. Training model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is smaller than a preset threshold value.
In general, the expected result obtained in step 101 is not yet a correct phoneme recognition of the speech, and the model parameters in the phoneme recognition model need to be trained repeatedly according to the expected result. The essence of training is to adjust the model parameters in the phoneme recognition model in reverse along the architecture of fig. 2, where the model parameters are the connection relationships between neurons in the model; in a specific application, the model parameters may be the neuron parameters between neurons.
In practical applications, there is no unique final result for phoneme recognition; rather, in the process of training the phoneme recognition model, the change rate of the output result tends toward a range with small fluctuation. As shown in fig. 3, which illustrates the change rate of the recognition output result over time according to an embodiment of the present invention, the highest point of the change rate corresponds to the expected result; the model parameter w is trained starting from the expected result, and the change rate of the output result gradually decreases after each round of training until it fluctuates only slightly, at which point the training result is output. In terms of technical implementation, the training result is output when the change rate of the output result is smaller than the preset threshold; the preset threshold in the embodiment of the present invention may be a specific numerical value, such as 0.2, or an interval, such as 0.18 to 0.26, which is not limited in the embodiment of the present invention.
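A minimal sketch of this stopping rule, assuming a generic training step whose output can be compared between rounds; the step function and the 0.2 threshold are illustrative (the threshold value is taken from the example above), not the patent's implementation.

```python
# Hedged sketch of "train until the change rate of the output result is below a threshold".
def train_until_stable(training_step, threshold=0.2, max_rounds=1000):
    prev = None
    for round_idx in range(max_rounds):
        output = training_step()                          # one round of parameter training
        if prev is not None:
            change_rate = abs(output - prev) / (abs(prev) + 1e-8)
            if change_rate < threshold:                   # fluctuation is small enough: stop
                return output, round_idx
        prev = output
    return prev, max_rounds

# Toy training step whose output settles toward a fixed point, mimicking fig. 3.
state = {"y": 10.0}
def toy_step():
    state["y"] = 0.5 * state["y"] + 1.0
    return state["y"]

print(train_until_stable(toy_step))
```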
103. Determining the output result of which the change rate is smaller than the preset threshold value as the final phoneme recognition result corresponding to the speech to be recognized.
It should be noted that the phoneme recognition model described in the embodiment of the present invention includes a hidden Markov model HMM, so the output result has already been decoded by the Viterbi algorithm and can be passed directly to the language model for subsequent language-model recognition. As another implementation of the embodiment of the present invention, if the phoneme recognition model does not include the hidden Markov model HMM, the output of the deep neural network DNN is still a phoneme probability sequence and needs to be passed to a hidden Markov model HMM for the decoding operation. This is not specifically limited in the embodiment of the present invention.
The invention provides a method for recognizing voice phonemes. First, a speech to be recognized is input into a phoneme recognition model, and an expected result corresponding to the speech to be recognized is obtained according to an output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden Markov models. Second, model parameters in the phoneme recognition model are trained according to the expected result until the change rate of the output result of the phoneme model is smaller than a preset threshold value. Finally, an output result whose change rate is smaller than the preset threshold value is determined as the final phoneme recognition result corresponding to the speech to be recognized. Compared with the prior art, the invention compensates for the shortcomings of the HMM's inherent unigram and bigram assumptions by applying the phoneme recognition model, avoids the problem of falling into a locally optimal solution, and, through the hidden Markov model in the phoneme recognition model, can decode the speech at the frame level, thereby reducing the scale of computation.
Further, as a refinement and an extension to the embodiments of the present invention, an embodiment of the present invention provides another method for recognizing a speech phoneme, as shown in fig. 4, where the method includes:
201. Constructing the phoneme recognition model.
With reference to fig. 2, one layer of convolutional neural network CNN is constructed and used as the input end of the speech to be recognized; a preset number of layers of long short-term memory network LSTM are constructed; a deep neural network DNN is added and used as the output end of the speech to be recognized; a hidden Markov model HMM is added; the phoneme recognition model is then constructed from the convolutional neural network CNN, the long short-term memory network LSTM, the deep neural network DNN and the hidden Markov model HMM, and an initialization value is assigned to the phoneme recognition model.
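A minimal sketch of such a model, assuming PyTorch is used (the patent does not name a framework) and assuming illustrative layer sizes; the HMM stage is kept outside the network, and the framework's default weight initialization plays the role of the assigned initialization value.

```python
import torch
import torch.nn as nn

class PhonemeNet(nn.Module):
    """Sketch of fig. 2: one CNN layer at the input end, a preset number of LSTM
    layers, and one DNN (fully connected) layer at the output end."""
    def __init__(self, feat_dim=40, hidden=256, lstm_layers=5, num_phonemes=28):
        super().__init__()
        self.cnn = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)    # input end
        self.lstm = nn.LSTM(hidden, hidden, num_layers=lstm_layers,
                            batch_first=True, bidirectional=True)           # fits the speech dynamics
        self.dnn = nn.Linear(2 * hidden, num_phonemes)                      # output end

    def forward(self, x):                           # x: (batch, time, feat_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        return torch.softmax(self.dnn(h), dim=-1)   # per-frame phoneme probabilities

model = PhonemeNet()
print(model(torch.randn(1, 100, 40)).shape)         # torch.Size([1, 100, 28])
```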
202. Inputting a speech to be recognized into a phoneme recognition model, and obtaining an expected result corresponding to the speech to be recognized according to an output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden Markov models (same as step 101).
The speech to be recognized is input into the convolutional neural network CNN and noise reduction is performed on it, which effectively mitigates the problem of the speech coming from different channels. Illustratively, the forget gate in the LSTM reads the input and outputs, for each number in the cell state, a value between 0 and 1 for the corresponding phoneme, where 1 means "retain completely" and 0 means "discard completely". For example, the cell state may encode the gender of the current subject so that the correct pronoun can be chosen; when a new subject appears, the old subject should be forgotten and is therefore filtered out.
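The forget-gate behaviour described here matches the standard LSTM formulation (the patent does not spell out the equations); one common form is

$$f_t = \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right), \qquad C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$

where $\sigma$ maps every component into the interval (0, 1), so an entry of $f_t$ near 1 keeps the corresponding entry of the previous cell state $C_{t-1}$ and an entry near 0 discards it, while the input gate $i_t$ controls how much of the candidate state $\tilde{C}_t$ is written in.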
In the embodiment of the invention, fitting by the long short-term memory network LSTM essentially means determining the probability of each phoneme of the speech sequence to be recognized at each moment. Illustratively, if the speech to be recognized is "hello", the corresponding pinyin is "ni hao". As shown in fig. 5, fig. 5 contains the 26 letters a, b, c, d, e … z and the correspondence between the letters and the time t; at time t0, the long short-term memory network LSTM receives the speech sequence to be recognized, "n i h a o", as input and judges the probabilities of the letters a, b, c, d … z in turn. In practical applications, LSTMs in both the forward and backward directions (a bidirectional LSTM) can be used, so that information at future moments can intervene in and correct the interpretation of the input at the current moment.
Inputting the fitted speech to be recognized into the deep neural network DNN, wherein the deep neural network DNN is a classification algorithm, the output layer of the deep neural network DNN corresponds to different neurons, each neuron corresponds to one category, and the deep neural network DNN classification model requires that the value output by the neurons of the output layer is between 0 and 1, and the sum of all output values is 1, namely, in one row or one column, the phoneme with the sum of probability values of 1 is output, and is irrelevant to the number of the phonemes which are actually output. In practical applications, when the deep neural network DNN performs output, the phoneme output needs to be completed by means of an activation function, which includes, but is not limited to, a sigmoid activation function, a softmax activation function, a ReLU activation function, a tanh activation function, and the like, and is not limited in particular.
The phonemes output at each moment and the corresponding probabilities are recorded as a visible observation sequence in the probability output matrix of the hidden Markov model HMM. The probability output matrix is as shown in fig. 6, which is a schematic diagram of a probability output matrix provided by an embodiment of the present invention; the recorded probabilities P may not be consistent with the probabilities P recorded in fig. 5, but the probabilities corresponding to the phonemes in one row or one column sum to 1.
A first matrix is calculated from the probability output matrix using a forward algorithm; a second matrix is calculated from the probability output matrix using a backward algorithm; and the maximum likelihood corresponding to each phoneme is calculated from the first matrix and the second matrix and recorded in a third matrix. For the specific implementations of the forward algorithm and the backward algorithm and the method of calculating the maximum likelihood, refer to the detailed descriptions in the prior art; the calculation processes of the forward and backward algorithms are not described in detail in the embodiments of the present invention.
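A minimal numerical sketch of this step, assuming the probability output matrix serves as the HMM emission term and assuming illustrative initial and transition probabilities (neither is specified in the patent): the forward pass yields the first matrix, the backward pass the second, and their normalized product the third matrix of per-frame likelihoods.

```python
import numpy as np

def forward_backward(pi, A, B):
    """pi: (S,) initial probs, A: (S, S) transition probs, B: (T, S) per-frame
    phoneme probabilities (the probability output matrix)."""
    T, S = B.shape
    alpha = np.zeros((T, S))                     # first matrix (forward algorithm)
    beta = np.zeros((T, S))                      # second matrix (backward algorithm)
    alpha[0] = pi * B[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    gamma = alpha * beta                         # third matrix: per-frame likelihood of each phoneme
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma

rng = np.random.default_rng(0)
S, T = 5, 8                                      # toy sizes: 5 phoneme states, 8 frames
pi = np.full(S, 1.0 / S)
A = rng.dirichlet(np.ones(S), size=S)            # row-stochastic transition matrix (assumed)
B = rng.dirichlet(np.ones(S), size=T)            # toy probability output matrix
alpha, beta, gamma = forward_backward(pi, A, B)
print(gamma.argmax(axis=1))                      # most likely phoneme per frame
```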
And decoding the third matrix to obtain the maximum likelihood value of each phoneme so as to obtain the expected result. In performing decoding, embodiments of the present invention employ a Viterbi algorithm to decode phonemes.
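A corresponding sketch of the Viterbi decoding step, using the same illustrative pi, A and B as in the previous sketch; it returns the single most likely phoneme/state sequence rather than per-frame likelihoods.

```python
import numpy as np

def viterbi(pi, A, B):
    """Most likely state sequence for initial probs pi, transitions A and
    per-frame observation probabilities B (log domain for stability)."""
    T, S = B.shape
    delta = np.zeros((T, S))
    backptr = np.zeros((T, S), dtype=int)
    logA = np.log(A)
    delta[0] = np.log(pi) + np.log(B[0])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA     # (previous state, next state)
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[t])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                 # trace the best path backwards
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
S, T = 5, 8
pi = np.full(S, 1.0 / S)
A = rng.dirichlet(np.ones(S), size=S)
B = rng.dirichlet(np.ones(S), size=T)
print(viterbi(pi, A, B))
```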
It should be noted that step 202 is a detailed process for obtaining a desired result, and when step 203 is executed, the calculation is performed in the same manner, except that the parameters (model parameters) of each neuron are different for each calculation.
203. Training model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is smaller than a preset threshold (the same as step 102).
In the training process, a gradient-descent derivation operation is performed on each phoneme layer by layer, starting from the output end of the phoneme model, according to the expected result; the neuron parameters of the phoneme model are adjusted for each phoneme according to the derivation until the change rate of the output result of the phoneme model is smaller than the preset threshold. As shown in fig. 7, which illustrates a part of the neurons in a phoneme recognition model according to an embodiment of the present invention, the neurons in the model are fully connected; fig. 7 takes one of the neurons as an example, where w1, w2, w3, w4 are neuron parameters corresponding to different probability magnitudes. When the gradient-descent derivation is performed, derivatives are taken sequentially from the output end toward the input end: after the neurons in the column nearest the output end are differentiated, a group of values is obtained, the adjacent next column changes with the result of that differentiation, and each neuron is adjusted based on the result obtained by its differentiation.
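In symbols, the layer-by-layer derivation and adjustment described above is the usual chain-rule back-propagation with a gradient-descent update; writing $E$ for the error between the current output and the expected result (the concrete loss is not specified in the patent), $y$ for the output and $\eta$ for the learning rate,

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y}\cdot\frac{\partial y}{\partial w_i}, \qquad w_i \leftarrow w_i - \eta\,\frac{\partial E}{\partial w_i},$$

so the derivative is propagated column by column from the output end toward the input end, and each neuron parameter $w_i$ (such as w1, w2, w3, w4 in fig. 7) is adjusted using the result differentiated in the column nearer the output.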
204. Determining the output result of which the change rate is smaller than the preset threshold value as a final phoneme recognition result corresponding to the speech to be recognized (same as step 103).
If the change rate of the output result of the phoneme model is greater than or equal to the preset threshold, step 203 is executed in a loop.
Further, as an implementation of the method shown in fig. 1, another embodiment of the present invention further provides a device for recognizing speech phonemes. The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method.
An embodiment of the present invention provides a speech phoneme recognition apparatus, as shown in fig. 8, including:
an input unit 31, configured to input a speech to be recognized into a phoneme recognition model, where the phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden markov models;
an output unit 32, configured to obtain an expected result corresponding to the speech to be recognized according to an output result after the speech to be recognized is input into the phoneme recognition model by the input unit 31;
a training unit 33, configured to train model parameters in the phoneme recognition model according to the expected result output by the output unit 32 until a change rate of an output result of the phoneme model is smaller than a preset threshold;
a determining unit 34, configured to determine an output result of which the change rate is smaller than the preset threshold as a final phoneme recognition result corresponding to the speech to be recognized.
Further, as shown in fig. 9, the apparatus further includes:
a construction unit 35, configured to construct the phoneme recognition model before the input unit 31 inputs the speech to be recognized into the phoneme recognition model.
Further, as shown in fig. 9, the building unit 35 includes:
the first construction module 351 is used for constructing a convolutional neural network CNN and a long-short term memory network LSTM with a preset number of layers;
an adding module 352, configured to add a deep neural network DNN and a hidden markov model HMM;
a second constructing module 353, configured to construct the phoneme recognition model by using the convolutional neural network CNN, the long-short term memory network LSTM, a deep neural network DNN, and the hidden markov model HMM, where the convolutional neural network CNN is used as an input end of the speech to be recognized, and the deep neural network DNN is used as an output end of the speech to be recognized;
an assigning module 354, configured to assign an initialization value to the phoneme recognition model.
Further, as shown in fig. 9, the output unit 32 includes:
a noise reduction module 321, configured to input the speech to be recognized into the convolutional neural network CNN, and perform noise reduction processing on the speech to be recognized;
a first input module 322, configured to input the speech to be recognized after the noise reduction by the noise reduction module into the long-short term memory network LSTM with the preset number of layers, where the long-short term memory network LSTM filters invalid phonemes by activating a forgetting gate, and retains the valid phonemes by activating a memory gate;
a fitting module 323, configured to fit the speech to be recognized;
a second input module 324, configured to input the fitted speech to be recognized to the deep neural network DNN;
a recording module 325, configured to record the phonemes and the corresponding probabilities output at each moment by the second input module as a visible observation sequence in a probability output matrix in the hidden markov model HMM;
a first calculating module 326, configured to calculate a first matrix according to the probability output matrix and a forward algorithm;
a second calculating module 327, configured to calculate to obtain a second matrix according to the probability output matrix and a backward algorithm;
a third calculating module 328, configured to calculate, according to the first matrix and the second matrix, a maximum likelihood corresponding to each phoneme, and record the maximum likelihood into a third matrix;
the processing module 329 is configured to decode the third matrix, and obtain the maximum likelihood value of each phoneme to obtain the expected result.
Further, as shown in fig. 9, the training unit 33 includes:
the calculating module 331 is configured to perform a derivation operation of gradient descent on each phoneme layer by layer from an output end of the phoneme model according to the expected result;
an adjusting module 332, configured to adjust a neuron parameter of the phoneme model in each phoneme according to the derivation operation until a change rate of an output result of the phoneme model is smaller than the preset threshold.
The device for recognizing the speech phonemes comprises a processor and a memory, wherein the input unit, the output unit, the training unit, the determining unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels can be provided, and the problem of low phoneme-segmentation efficiency or falling into a locally optimal solution during speech recognition is solved by adjusting the kernel parameters.
The memory may include forms of volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium on which a program is stored, wherein the program, when executed by a processor, implements the method for recognizing speech phonemes described above.
An embodiment of the present invention provides a processor for running a program, wherein the program, when running, performs the method for recognizing speech phonemes described above.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps:
inputting a speech to be recognized into a phoneme recognition model, and obtaining an expected result corresponding to the speech to be recognized according to an output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden Markov models;
training model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is smaller than a preset threshold value;
and determining the output result of which the change rate is smaller than the preset threshold value as the final phoneme recognition result corresponding to the speech to be recognized.
Optionally, before inputting the speech to be recognized into the phoneme recognition model, the method further includes:
and constructing the phoneme recognition model.
Optionally, the constructing the phoneme recognition model includes:
constructing a convolutional neural network CNN and a long-short term memory network LSTM with a preset number of layers;
adding a deep neural network DNN and a hidden Markov model HMM;
and constructing the phoneme recognition model by using the convolutional neural network CNN, the long-short term memory network LSTM, the deep neural network DNN and the hidden Markov model HMM, and assigning an initialization value to the phoneme recognition model, wherein the convolutional neural network CNN is used as an input end of the speech to be recognized, and the deep neural network DNN is used as an output end of the speech to be recognized.
Optionally, obtaining an expected result corresponding to the speech to be recognized according to the output result includes:
inputting the voice to be recognized into the convolutional neural network CNN, and performing noise reduction processing on the voice to be recognized;
inputting the voice to be recognized after noise reduction into the long-short term memory network LSTM with the preset number of layers, and fitting the voice to be recognized, wherein the long-short term memory network LSTM filters invalid phonemes by activating a forgetting gate, and retains the valid phonemes by activating a memory gate;
inputting the fitted voice to be recognized into the deep neural network DNN;
recording the output phonemes and the corresponding probabilities at each moment as visible observation sequences in a probability output matrix in the Hidden Markov Model (HMM);
calculating according to the probability output matrix and a forward algorithm to obtain a first matrix; calculating according to the probability output matrix and a backward algorithm to obtain a second matrix;
calculating the maximum likelihood corresponding to each phoneme according to the first matrix and the second matrix, and recording the maximum likelihood into a third matrix;
and decoding the third matrix to obtain the maximum likelihood value of each phoneme so as to obtain the expected result.
Optionally, training the model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is smaller than a preset threshold includes:
according to the expected result, performing gradient descent derivation operation on each phoneme layer by layer from the output end of the phoneme model;
and adjusting the neuron parameters of the phoneme model in each phoneme according to the derivation operation until the change rate of the output result of the phoneme model is smaller than the preset threshold value.
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product adapted to execute, when run on a data processing device, program code initialized with the following method steps: inputting a speech to be recognized into a phoneme recognition model, and obtaining an expected result corresponding to the speech to be recognized according to an output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden Markov models;
training model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is smaller than a preset threshold value;
and determining the output result of which the change rate is smaller than the preset threshold value as the final phoneme recognition result corresponding to the speech to be recognized.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (5)

1. A method for recognizing speech phonemes, comprising:
inputting a speech to be recognized into a phoneme recognition model, and obtaining an expected result corresponding to the speech to be recognized according to an output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden Markov models;
training model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is smaller than a preset threshold value;
determining an output result of which the change rate is smaller than the preset threshold value as a final phoneme recognition result corresponding to the speech to be recognized;
before inputting the speech to be recognized into the phoneme recognition model, the method further comprises:
constructing the phoneme recognition model;
the constructing the phoneme recognition model comprises:
constructing a convolutional neural network CNN and a long-short term memory network LSTM with a preset number of layers;
adding a deep neural network DNN and a hidden Markov model HMM;
constructing the phoneme recognition model by using the convolutional neural network CNN, the long-short term memory network LSTM, the deep neural network DNN and the hidden Markov model HMM, and assigning an initialization value to the phoneme recognition model, wherein the convolutional neural network CNN is used as an input end of the speech to be recognized, and the deep neural network DNN is used as an output end of the speech to be recognized;
obtaining an expected result corresponding to the speech to be recognized according to the output result comprises:
inputting the voice to be recognized into the convolutional neural network CNN, and performing noise reduction processing on the voice to be recognized;
inputting the voice to be recognized after noise reduction into the long-short term memory network LSTM with the preset number of layers, and fitting the voice to be recognized, wherein the long-short term memory network LSTM filters invalid phonemes by activating a forgetting gate, and retains the valid phonemes by activating a memory gate;
inputting the fitted voice to be recognized into the deep neural network DNN;
recording the output phonemes and the corresponding probabilities at each moment as visible observation sequences in a probability output matrix in the Hidden Markov Model (HMM);
calculating according to the probability output matrix and a forward algorithm to obtain a first matrix; calculating according to the probability output matrix and a backward algorithm to obtain a second matrix;
calculating the maximum likelihood corresponding to each phoneme according to the first matrix and the second matrix, and recording the maximum likelihood into a third matrix;
and decoding the third matrix to obtain the maximum likelihood value of each phoneme so as to obtain the expected result.
2. The method of claim 1, wherein training model parameters in the phoneme recognition model according to the expected result until a rate of change of the phoneme model output result is less than a preset threshold comprises:
according to the expected result, performing gradient descent derivation operation on each phoneme layer by layer from the output end of the phoneme model;
and adjusting the neuron parameters of the phoneme recognition model in each phoneme according to the derivation operation until the change rate of the output result of the phoneme model is smaller than the preset threshold value.
3. An apparatus for recognizing speech phonemes, comprising:
the device comprises an input unit, a phoneme recognition model and a control unit, wherein the input unit is used for inputting a speech to be recognized into the phoneme recognition model, and the phoneme recognition model recognizes each phoneme in the speech to be recognized through a plurality of neural network models and hidden Markov models;
the output unit is used for obtaining an expected result corresponding to the speech to be recognized according to an output result after the speech to be recognized is input into the phoneme recognition model by the input unit;
the training unit is used for training the model parameters in the phoneme recognition model according to the expected result output by the output unit until the change rate of the output result of the phoneme model is smaller than a preset threshold value;
the determining unit is used for determining an output result of which the change rate is smaller than the preset threshold value as a final phoneme recognition result corresponding to the speech to be recognized;
the device further comprises:
the construction unit is used for constructing the phoneme recognition model before the input unit inputs the speech to be recognized into the phoneme recognition model;
the construction unit includes:
the first construction module is used for constructing a convolutional neural network CNN and long-short term memory networks LSTM with a preset number of layers;
the adding module is used for adding a deep neural network DNN and a hidden Markov model HMM;
a second constructing module, configured to construct the phoneme recognition model by using the convolutional neural network CNN, the long-short term memory network LSTM, a deep neural network DNN, and the hidden markov model HMM, where the convolutional neural network CNN is used as an input end of the speech to be recognized, and the deep neural network DNN is used as an output end of the speech to be recognized;
the assignment module is used for assigning an initialization value to the phoneme recognition model;
the output unit includes:
the noise reduction module is used for inputting the voice to be recognized into the convolutional neural network CNN and performing noise reduction processing on the voice to be recognized;
the first input module is used for inputting the voice to be recognized after the noise reduction module performs noise reduction into the long-short term memory network LSTM with the preset number of layers, wherein the long-short term memory network LSTM filters invalid phonemes by activating a forgetting gate, and retains the valid phonemes by activating a memory gate;
the fitting module is used for fitting the voice to be recognized;
the second input module is used for inputting the fitted voice to be recognized into the deep neural network DNN;
a recording module, configured to record the phonemes and the corresponding probabilities output at each moment by the second input module as a visible observation sequence in a probability output matrix in the hidden markov model HMM;
the first calculation module is used for calculating according to the probability output matrix and a forward algorithm to obtain a first matrix;
the second calculation module is used for calculating according to the probability output matrix and a backward algorithm to obtain a second matrix;
the third calculation module is used for calculating the maximum likelihood corresponding to each phoneme according to the first matrix and the second matrix and recording the maximum likelihood into a third matrix;
and the processing module is used for decoding the third matrix to obtain the maximum likelihood value of each phoneme so as to obtain the expected result.
4. A storage medium, characterized in that the storage medium comprises a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the method for recognizing speech phonemes of any one of claims 1 to 2.
5. A processor, characterized in that the processor is configured to execute a program, wherein the program executes the method for recognizing speech phonemes of one of claims 1 to 2.
CN201711082646.9A 2017-11-07 2017-11-07 Method and device for recognizing voice phonemes Active CN109754789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711082646.9A CN109754789B (en) 2017-11-07 2017-11-07 Method and device for recognizing voice phonemes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711082646.9A CN109754789B (en) 2017-11-07 2017-11-07 Method and device for recognizing voice phonemes

Publications (2)

Publication Number Publication Date
CN109754789A CN109754789A (en) 2019-05-14
CN109754789B true CN109754789B (en) 2021-06-08

Family

ID=66400936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711082646.9A Active CN109754789B (en) 2017-11-07 2017-11-07 Method and device for recognizing voice phonemes

Country Status (1)

Country Link
CN (1) CN109754789B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390929A (en) * 2019-08-05 2019-10-29 中国民航大学 Chinese and English civil aviaton land sky call acoustic model construction method based on CDNN-HMM
CN110600018B (en) * 2019-09-05 2022-04-26 腾讯科技(深圳)有限公司 Voice recognition method and device and neural network training method and device
CN110992942B (en) * 2019-11-29 2022-07-08 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition
CN112750425B (en) * 2020-01-22 2023-11-03 腾讯科技(深圳)有限公司 Speech recognition method, device, computer equipment and computer readable storage medium
CN111986653B (en) * 2020-08-06 2024-06-25 杭州海康威视数字技术股份有限公司 Voice intention recognition method, device and equipment
CN112669881B (en) * 2020-12-25 2023-02-28 北京融讯科创技术有限公司 Voice detection method, device, terminal and storage medium
CN112905024B (en) * 2021-01-21 2023-10-27 李博林 Syllable recording method and device for word
CN113838456B (en) * 2021-09-28 2024-05-31 中国科学技术大学 Phoneme extraction method, voice recognition method, device, equipment and storage medium
CN114203161A (en) * 2021-12-30 2022-03-18 深圳市慧鲤科技有限公司 Speech recognition method, apparatus, device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102122507A (en) * 2010-01-08 2011-07-13 龚澍 Speech error detection method by front-end processing using artificial neural network (ANN)
CN104681036A (en) * 2014-11-20 2015-06-03 苏州驰声信息科技有限公司 System and method for detecting language voice frequency
CN105513591A (en) * 2015-12-21 2016-04-20 百度在线网络技术(北京)有限公司 Method and device for speech recognition by use of LSTM recurrent neural network model
CN106098059A (en) * 2016-06-23 2016-11-09 上海交通大学 customizable voice awakening method and system
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network
CN106816147A (en) * 2017-01-25 2017-06-09 上海交通大学 Speech recognition system based on binary neural network acoustic model
CN106940998A (en) * 2015-12-31 2017-07-11 阿里巴巴集团控股有限公司 A kind of execution method and device of setting operation
CN107093422A (en) * 2017-01-10 2017-08-25 上海优同科技有限公司 A kind of audio recognition method and speech recognition system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070239444A1 (en) * 2006-03-29 2007-10-11 Motorola, Inc. Voice signal perturbation for speech recognition
US9620108B2 (en) * 2013-12-10 2017-04-11 Google Inc. Processing acoustic sequences using long short-term memory (LSTM) neural networks that include recurrent projection layers

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102122507A (en) * 2010-01-08 2011-07-13 龚澍 Speech error detection method by front-end processing using artificial neural network (ANN)
CN104681036A (en) * 2014-11-20 2015-06-03 苏州驰声信息科技有限公司 System and method for detecting language voice frequency
CN105513591A (en) * 2015-12-21 2016-04-20 百度在线网络技术(北京)有限公司 Method and device for speech recognition by use of LSTM recurrent neural network model
CN106940998A (en) * 2015-12-31 2017-07-11 阿里巴巴集团控股有限公司 A kind of execution method and device of setting operation
CN106098059A (en) * 2016-06-23 2016-11-09 上海交通大学 customizable voice awakening method and system
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network
CN107093422A (en) * 2017-01-10 2017-08-25 上海优同科技有限公司 A kind of audio recognition method and speech recognition system
CN106816147A (en) * 2017-01-25 2017-06-09 上海交通大学 Speech recognition system based on binary neural network acoustic model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Far-field speech recognition using CNN-DNN-HMM with convolution in time; Takuya Yoshioka et al.; ICASSP 2015 - 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2015-04-30; pp. 4360-4363 *
Chinese part-of-speech tagging based on a hybrid CNN and LSTM model; Xie Yi; Journal of Wuhan University (Natural Science Edition); 2017-06-30; pp. 246-250 *

Also Published As

Publication number Publication date
CN109754789A (en) 2019-05-14

Similar Documents

Publication Publication Date Title
CN109754789B (en) Method and device for recognizing voice phonemes
US11580964B2 (en) Electronic apparatus and control method thereof
CN106940998B (en) Execution method and device for setting operation
US11798535B2 (en) On-device custom wake word detection
US20210287663A1 (en) Method and apparatus with a personalized speech recognition model
US9177550B2 (en) Conservatively adapting a deep neural network in a recognition system
KR102399535B1 (en) Learning method and apparatus for speech recognition
US11158305B2 (en) Online verification of custom wake word
US10008197B2 (en) Keyword detector and keyword detection method
CN109410924B (en) Identification method and identification device
US10032463B1 (en) Speech processing with learned representation of user interaction history
WO2012036934A1 (en) Deep belief network for large vocabulary continuous speech recognition
CN113744727B (en) Model training method, system, terminal equipment and storage medium
CN113781995A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN113196385B (en) Method and system for audio signal processing and computer readable storage medium
WO2022028378A1 (en) Voice intention recognition method, apparatus and device
US20230153529A1 (en) Electronic apparatus and method for controlling thereof
CN112185382B (en) Method, device, equipment and medium for generating and updating wake-up model
CN111292725B (en) Voice decoding method and device
CN116664731B (en) Face animation generation method and device, computer readable storage medium and terminal
US20220319501A1 (en) Stochastic future context for speech processing
CN111883109B (en) Voice information processing and verification model training method, device, equipment and medium
KR20220155889A (en) Electronic apparatus and method for controlling thereof
CN113450800A (en) Method and device for determining activation probability of awakening words and intelligent voice product
Al-Mansoori et al. Automatic Speech Recognition (ASR) System using convolutional and Recurrent neural Network Approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: Floor 8A, Cuigong Hotel, No. 76 Zhichun Road, Shuangyushu, Haidian District, Beijing 100086

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant