CN109754789A

CN109754789A - Method and device for recognizing speech phonemes

Info

Publication number: CN109754789A
Application number: CN201711082646.9A
Authority: CN
Inventors: 姜珂
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2017-11-07
Filing date: 2017-11-07
Publication date: 2019-05-14
Anticipated expiration: 2037-11-07
Also published as: CN109754789B

Abstract

The invention discloses a method and device for recognizing speech phonemes, relating to the technical field of speech recognition. Its main purpose is to solve the problems of inefficient phoneme segmentation, or of falling into a locally optimal solution, during speech recognition. The main technical scheme of the invention comprises: inputting the speech to be recognized into a phoneme recognition model, and obtaining the expected result corresponding to the speech to be recognized from the output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through multiple neural network models and a hidden Markov model; training the model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is less than a preset threshold; and determining the output result whose change rate is less than the preset threshold as the final phoneme recognition result corresponding to the speech to be recognized.

Description

Method and device for recognizing speech phonemes

Technical field

The present invention relates to the technical field of speech recognition, and in particular to a method and device for recognizing speech phonemes.

Background art

In the field of speech recognition, the phoneme (phone) is the smallest unit of speech. To improve recognition accuracy, the discrimination of each phoneme must first be improved.

At present there are two mainstream approaches to training phoneme models: the Gaussian mixture-hidden Markov model (GMM-HMM) and the deep neural network-hidden Markov model (DNN-HMM). In these approaches, an HMM is mainly used to fit the state changes of the frames corresponding to a phoneme, a GMM or DNN is then used to converge on the frames, and at recognition time Viterbi decoding is used to segment the audio by time frame.

In the course of making the above invention, the inventor found that in the prior art, particularly in phoneme models for recognizing Chinese pinyin, segmentation by time frame can be accurate to the millisecond level in order to improve segmentation correctness, so phoneme segmentation is inefficient. In addition, because of the HMM's inherent unary and binary assumptions, the recognized phonemes may fall into a locally optimal solution when an HMM is used, which reduces the accuracy of phoneme recognition; under a ternary or quaternary assumption, the amount of computation required to recognize phonemes is enormous.

Summary of the invention

In view of this, the present invention provides a method and device for recognizing speech phonemes, whose main purpose is to solve the problems of inefficient phoneme segmentation, or of falling into a locally optimal solution, during speech recognition.

To solve the above problems, the present invention mainly provides the following technical solutions:

In one aspect, the present invention provides a method for recognizing speech phonemes, the method comprising:

Inputting the speech to be recognized into a phoneme recognition model, and obtaining the expected result corresponding to the speech to be recognized from the output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through multiple neural network models and a hidden Markov model;

Training the model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is less than a preset threshold;

Determining the output result whose change rate is less than the preset threshold as the final phoneme recognition result corresponding to the speech to be recognized.

Optionally, before the speech to be recognized is input into the phoneme recognition model, the method further comprises:

Constructing the phoneme recognition model.

Optionally, constructing the phoneme recognition model comprises:

Constructing a convolutional neural network CNN and a long short-term memory network LSTM with a preset number of layers;

Adding a deep neural network DNN and a hidden Markov model HMM;

Building the phoneme recognition model from the convolutional neural network CNN, the long short-term memory network LSTM, the deep neural network DNN and the hidden Markov model HMM, and assigning initialization values to the phoneme recognition model, wherein the convolutional neural network CNN serves as the input end for the speech to be recognized and the deep neural network DNN serves as the output end for the speech to be recognized.

Optionally, obtaining the expected result corresponding to the speech to be recognized from the output result comprises:

Inputting the speech to be recognized into the convolutional neural network CNN, and performing noise reduction on the speech to be recognized;

Inputting the noise-reduced speech to be recognized into the long short-term memory network LSTM with the preset number of layers, and fitting the speech to be recognized, wherein the LSTM filters out invalid phonemes by activating the forget gate and retains valid phonemes by activating the memory gate;

Inputting the fitted speech to be recognized into the deep neural network DNN;

Taking the phoneme output at each moment and its corresponding probability as the visible observation sequence, and recording them in the probability output matrix of the hidden Markov model HMM;

Calculating a first matrix from the probability output matrix and the forward algorithm, and calculating a second matrix from the probability output matrix and the backward algorithm;

Calculating the maximum likelihood corresponding to each phoneme from the first matrix and the second matrix, and recording it in a third matrix;

Decoding the third matrix to obtain the maximum likelihood value of each phoneme, thereby obtaining the expected result.

Optionally, training the model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is less than the preset threshold comprises:

Starting from the output end of the phoneme model, performing the derivative operation of gradient descent on each phoneme in turn according to the expected result;

Adjusting, according to the derivative operation, the neuron parameters for each phoneme in the phoneme model until the change rate of the output result of the phoneme model is less than the preset threshold.

In a second aspect, the present invention provides a device for recognizing speech phonemes, comprising:

An input unit, configured to input the speech to be recognized into a phoneme recognition model, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through multiple neural network models and a hidden Markov model;

An output unit, configured to obtain the expected result corresponding to the speech to be recognized from the output result after the input unit has input the speech to be recognized into the phoneme recognition model;

A training unit, configured to train the model parameters in the phoneme recognition model according to the expected result output by the output unit until the change rate of the output result of the phoneme model is less than a preset threshold;

A determination unit, configured to determine the output result whose change rate is less than the preset threshold as the final phoneme recognition result corresponding to the speech to be recognized.

Optionally, the device further comprises:

A construction unit, configured to construct the phoneme recognition model before the input unit inputs the speech to be recognized into the phoneme recognition model.

Optionally, the construction unit comprises:

A first construction module, configured to construct a convolutional neural network CNN and a long short-term memory network LSTM with a preset number of layers;

An adding module, configured to add a deep neural network DNN and a hidden Markov model HMM;

A second construction module, configured to build the phoneme recognition model from the convolutional neural network CNN, the long short-term memory network LSTM, the deep neural network DNN and the hidden Markov model HMM, wherein the convolutional neural network CNN serves as the input end for the speech to be recognized and the deep neural network DNN serves as the output end for the speech to be recognized;

An assignment module, configured to assign initialization values to the phoneme recognition model.

Optionally, the output unit comprises:

A noise reduction module, configured to input the speech to be recognized into the convolutional neural network CNN and perform noise reduction on the speech to be recognized;

A first input module, configured to input the speech to be recognized, after noise reduction by the noise reduction module, into the long short-term memory network LSTM with the preset number of layers, wherein the LSTM filters out invalid phonemes by activating the forget gate and retains valid phonemes by activating the memory gate;

A fitting module, configured to fit the speech to be recognized;

A second input module, configured to input the fitted speech to be recognized into the deep neural network DNN;

A recording module, configured to take the phoneme at each moment output by the second output module and its corresponding probability as the visible observation sequence, and to record them in the probability output matrix of the hidden Markov model HMM;

A first calculation module, configured to calculate a first matrix from the probability output matrix and the forward algorithm;

A second calculation module, configured to calculate a second matrix from the probability output matrix and the backward algorithm;

A third calculation module, configured to calculate the maximum likelihood corresponding to each phoneme from the first matrix and the second matrix, and to record it in a third matrix;

A processing module, configured to decode the third matrix to obtain the maximum likelihood value of each phoneme, thereby obtaining the expected result.

Optionally, the training unit comprises:

A calculation module, configured to perform, starting from the output end of the phoneme model, the derivative operation of gradient descent on each phoneme in turn according to the expected result;

An adjustment module, configured to adjust, according to the derivative operation, the neuron parameters for each phoneme in the phoneme model until the change rate of the output result of the phoneme model is less than the preset threshold.

To achieve the above object, according to another aspect of the present invention, a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the method for recognizing speech phonemes described above.

To achieve the above object, according to another aspect of the present invention, a processor is provided. The processor is configured to run a program, wherein the program, when running, executes the method for recognizing speech phonemes described above.

Through the above technical solutions, the technical solution provided by the present invention has at least the following advantages:

In the method and device for recognizing speech phonemes provided by the present invention, first, the speech to be recognized is input into a phoneme recognition model, and the expected result corresponding to the speech to be recognized is obtained from the output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through multiple neural network models and a hidden Markov model; second, the model parameters in the phoneme recognition model are trained according to the expected result until the change rate of the output result of the phoneme model is less than a preset threshold; finally, the output result whose change rate is less than the preset threshold is determined as the final phoneme recognition result corresponding to the speech to be recognized. Compared with the prior art, the present invention, through the use of the phoneme recognition model, compensates for the defects of the HMM's inherent unary and binary assumptions and avoids falling into a locally optimal solution. In addition, frame-level decoding of the speech can be performed by the hidden Markov model within the phoneme recognition model, which reduces the scale of computation.

The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and in order that the above and other objects, features and advantages of the present invention may be more readily apparent, specific embodiments of the present invention are set out below.

Brief description of the drawings

By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:

Fig. 1 shows a flow chart of a method for recognizing speech phonemes provided by an embodiment of the present invention;

Fig. 2 shows an architecture diagram of a phoneme recognition model provided by an embodiment of the present invention;

Fig. 3 shows a schematic diagram of the change rate of the output result of a recognition result over time, provided by an embodiment of the present invention;

Fig. 4 shows a flow chart of another method for recognizing speech phonemes provided by an embodiment of the present invention;

Fig. 5 shows a schematic diagram of a long short-term memory network LSTM fitting result provided by an embodiment of the present invention;

Fig. 6 shows a schematic diagram of a probability output matrix provided by an embodiment of the present invention;

Fig. 7 shows a schematic diagram of some of the neurons in a phoneme recognition model provided by an embodiment of the present invention;

Fig. 8 shows a block diagram of a device for recognizing speech phonemes provided by an embodiment of the present invention;

Fig. 9 shows a block diagram of another device for recognizing speech phonemes provided by an embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.

An embodiment of the present invention provides a method for recognizing speech phonemes. As shown in Fig. 1, the method comprises:

101. Input the speech to be recognized into a phoneme recognition model, and obtain the expected result corresponding to the speech to be recognized from the output result.

The speech to be recognized in the embodiment of the present invention may be any segment of speech (for example, a passage of spoken Chinese). The speech to be recognized is input into the phoneme recognition model, which can segment, recognize and convert the phonemes of the speech in time order, converting the speech to be recognized into Chinese pinyin; a language model then converts the pinyin into the corresponding Chinese text. The embodiment of the present invention focuses on the acoustic stage, in which speech is recognized as a phoneme sequence.

The phoneme recognition model recognizes each phoneme in the speech to be recognized through multiple neural network models and a hidden Markov model. The neural networks include a convolutional neural network (Convolutional Neural Network, CNN) and a long short-term memory network (Long Short-Term Memory, LSTM). In a practical application of the embodiment of the present invention, the phoneme recognition model may comprise one convolutional neural network CNN layer, five long short-term memory network LSTM layers, one deep neural network DNN layer and one hidden Markov model (Hidden Markov Model, HMM) layer. The embodiment of the present invention does not, however, limit the specific number of LSTM layers, which may be as high as 170.

Fig. 2 shows an architecture diagram of a phoneme recognition model provided by an embodiment of the present invention. The purpose of the single CNN layer is to adjust, by convolution, for the effect of differences between speakers' audio. The five LSTM layers propagate the speech input from the CNN; for dynamic data such as speech this gives a very good fit, and it allows information from future moments to intervene in and correct the input at the current moment. The final DNN layer produces the output. The connections shown in Fig. 2 are only illustrative; in practical applications the connections are lateral, and the embodiment of the present invention does not limit this.

102. Train the model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is less than a preset threshold.

Normally, the expected result obtained in step 101 is not yet a correct phoneme recognition, so the model parameters in the phoneme recognition model must be trained repeatedly according to the expected result. In essence, training adjusts each model parameter in the phoneme recognition model in the reverse direction through the architecture shown in Fig. 2. The model parameters are the connection relationships between neurons in the model; in a specific application, they may be the neuron parameters between neurons.

In practical applications, phoneme recognition has no unique final result, but while the phoneme recognition model is being trained the change rate of its output result tends to fluctuate within a small range. Fig. 3 shows a schematic diagram of the change rate of the output result of a recognition result over time, provided by an embodiment of the present invention. The change rate is highest at the expected result, which is where training of the model parameters w starts; after each round of training the change in the output result gradually becomes smaller, and once the change rate fluctuates only slightly, the training result is output. In terms of implementation, the training result is output when the change rate of the output result is less than the preset threshold. The preset threshold in the embodiment of the present invention may be a specific value, such as 0.2, or an interval, such as 0.18-0.26; the embodiment of the present invention does not limit this.
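The stopping rule described above can be sketched as a short loop. This is only an illustration under stated assumptions: the text does not define how the change rate is computed, so the sketch assumes it is the absolute difference between two successive scalar outputs, and `train_step` is a hypothetical callable standing in for one round of training.

```python
def train_until_stable(train_step, initial_output, threshold=0.2, max_rounds=1000):
    """Repeat training until the change rate of the output drops below threshold.

    train_step is a hypothetical callable that runs one training round and
    returns the new scalar output. The change rate is ASSUMED to be the
    absolute difference between two successive outputs.
    """
    previous = initial_output
    for round_index in range(1, max_rounds + 1):
        current = train_step(previous)
        change_rate = abs(current - previous)
        if change_rate < threshold:
            # Output is stable enough: treat it as the final result.
            return current, round_index
        previous = current
    raise RuntimeError("output did not stabilize within max_rounds")


# Toy training step: each round moves the output halfway toward 1.0.
final_output, rounds = train_until_stable(lambda y: y + 0.5 * (1.0 - y), 0.0)
```

The default threshold of 0.2 is the example value given in the text; an interval-valued threshold would need a different comparison.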

103. Determine the output result whose change rate is less than the preset threshold as the final phoneme recognition result corresponding to the speech to be recognized.

It should be noted that the phoneme recognition model in the embodiment of the present invention includes the hidden Markov model HMM, so its output is the result of Viterbi decoding and can be passed directly to the language model for subsequent language-model recognition. As another implementation of the embodiment of the present invention, if the phoneme recognition model does not include the hidden Markov model HMM, the output of the deep neural network DNN is still a speech sequence and must additionally be passed to a hidden Markov model HMM for decoding. The embodiment of the present invention does not limit this.

In the method for recognizing speech phonemes provided by the present invention, first, the speech to be recognized is input into a phoneme recognition model, and the expected result corresponding to the speech to be recognized is obtained from the output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through multiple neural network models and a hidden Markov model; second, the model parameters in the phoneme recognition model are trained according to the expected result until the change rate of the output result of the phoneme model is less than a preset threshold; finally, the output result whose change rate is less than the preset threshold is determined as the final phoneme recognition result corresponding to the speech to be recognized. Compared with the prior art, the present invention, through the use of the phoneme recognition model, compensates for the defects of the HMM's inherent unary and binary assumptions and avoids falling into a locally optimal solution. In addition, frame-level decoding of the speech can be performed by the hidden Markov model within the phoneme recognition model, which reduces the scale of computation.

Further, as a refinement and extension of the above embodiment, an embodiment of the present invention provides another method for recognizing speech phonemes. As shown in Fig. 4, the method comprises:

201. Construct the phoneme recognition model.

For the constructed phoneme recognition model, refer again to Fig. 2. One convolutional neural network CNN layer serves as the input end for the speech to be recognized; long short-term memory network LSTM layers of a preset number are constructed; a deep neural network DNN is added and serves as the output end for the speech to be recognized; a hidden Markov model HMM is added; the phoneme recognition model is then built from the convolutional neural network CNN, the long short-term memory network LSTM, the deep neural network DNN and the hidden Markov model HMM, and initialization values are assigned to it.
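The layer arrangement described in step 201 can be written down as a small configuration object. This is a minimal sketch for illustration only: the class name `PhonemeModelConfig` and its fields are hypothetical, and the defaults (one CNN layer, five LSTM layers, one DNN layer, one HMM stage) are the example values given for the embodiment, not fixed requirements.

```python
from dataclasses import dataclass


@dataclass
class PhonemeModelConfig:
    """Hypothetical description of the layer stack in the phoneme recognition model."""
    cnn_layers: int = 1    # input end
    lstm_layers: int = 5   # the "preset number" of LSTM layers
    dnn_layers: int = 1    # output end
    use_hmm: bool = True   # HMM stage for decoding

    def layer_order(self):
        """Return the stages in input-to-output order."""
        stack = (["CNN"] * self.cnn_layers
                 + ["LSTM"] * self.lstm_layers
                 + ["DNN"] * self.dnn_layers)
        if self.use_hmm:
            stack.append("HMM")
        return stack


default_stack = PhonemeModelConfig().layer_order()
```

A variant without the HMM stage, as mentioned for the alternative implementation, would be `PhonemeModelConfig(use_hmm=False)`.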

202. Input the speech to be recognized into the phoneme recognition model, and obtain the expected result corresponding to the speech to be recognized from the output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized through multiple neural network models and a hidden Markov model (same as step 101).

The speech to be recognized is input into the convolutional neural network CNN, which performs noise reduction on it and can effectively address the problem of differing speech channels. The noise-reduced speech is then input into the five long short-term memory network LSTM layers, which fit the speech to be recognized; the LSTM filters out invalid phonemes by activating the forget gate and retains valid phonemes by activating the memory gate. For example, the forget gate in the LSTM reads its input and outputs, for each phoneme, a value between 0 and 1 that is applied to the corresponding number in the cell state: 1 means "keep completely" and 0 means "discard completely". For instance, the cell state may hold the gender of the current subject so that the correct pronoun can be chosen; when a new subject appears, the old subject should be forgotten, and it is filtered out.
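The forget-gate behaviour described above can be sketched for a single scalar cell. This is a simplified illustration, not the model's actual implementation: the weights `w_hidden`, `w_input` and `bias` are hypothetical values chosen to push the gate toward 1 or toward 0, and a real LSTM layer uses weight matrices over vectors.

```python
import math


def sigmoid(x):
    """Logistic function; maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forget_gate(previous_hidden, current_input, w_hidden, w_input, bias):
    """Return the forget-gate activation in (0, 1) for one scalar cell."""
    return sigmoid(w_hidden * previous_hidden + w_input * current_input + bias)


def apply_forget_gate(cell_state, gate):
    """Scale the old cell state: a gate near 1 keeps it, a gate near 0 discards it."""
    return gate * cell_state


# Hypothetical weights: the first set drives the gate toward "keep completely",
# the second toward "discard completely".
keep = forget_gate(0.5, 1.0, w_hidden=2.0, w_input=3.0, bias=5.0)
drop = forget_gate(0.5, 1.0, w_hidden=-2.0, w_input=-3.0, bias=-5.0)
kept_state = apply_forget_gate(2.0, keep)
dropped_state = apply_forget_gate(2.0, drop)
```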

In the embodiment of the present invention, the LSTM fitting process essentially determines, for a speech sequence to be recognized, the probability that it belongs to each phoneme at a given moment. For example, if the speech to be recognized is "hello" in Chinese, the corresponding pinyin is "n i h a o". Fig. 5 shows the correspondence between the 26 letters a, b, c, d, e ... z and the moments t: at moment t0, the LSTM receives the input speech sequence to be recognized, "n i h a o", and determines in turn the probability that it belongs to each of the letters a, b, c, d ... z. In practical applications, LSTMs in both the forward and backward directions may be used so that information from future moments can intervene in and correct the input at the current moment.

The fitted speech to be recognized is input into the deep neural network DNN. The DNN is a classification algorithm: its output layer has different neurons, each corresponding to one class. The DNN classification model requires that each output-layer neuron's value lie between 0 and 1 and that all output values sum to 1; that is, within one row or one column the output probabilities of the phonemes sum to 1, regardless of how many phonemes are actually output. In practical applications, when the DNN produces its output, the phoneme output is completed through an activation function, including but not limited to the sigmoid, softmax, ReLU and tanh activation functions; no specific limitation is imposed.
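Of the activation functions listed above, softmax is the one that by construction yields values between 0 and 1 that sum to 1. The following minimal sketch shows that property for one frame; the scores in `frame_scores` are made-up numbers, not data from the embodiment.

```python
import math


def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for four candidate phonemes in a single frame.
frame_scores = [2.0, 1.0, 0.1, -1.0]
frame_probs = softmax(frame_scores)
```

Applying `softmax` to each frame in turn gives one row of per-phoneme probabilities per moment.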

The phoneme output at each moment and its corresponding probability are taken as the visible observation sequence and recorded in the probability output matrix of the hidden Markov model HMM. The probability output matrix is shown in Fig. 6, which is a schematic diagram of a probability output matrix provided by an embodiment of the present invention. The probabilities P recorded there may differ from those recorded in Fig. 5, but the probabilities of the phonemes in one row or one column sum to 1.

A first matrix is calculated from the probability output matrix and the forward algorithm; a second matrix is calculated from the probability output matrix and the backward algorithm; the maximum likelihood corresponding to each phoneme is calculated from the first matrix and the second matrix and recorded in a third matrix. For the specific implementation of the forward and backward algorithms and the method of calculating the maximum likelihood, refer to the detailed descriptions in the prior art; the embodiment of the present invention does not repeat the calculation process of the forward and backward algorithms.
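The three matrices can be illustrated with the standard forward-backward recursion. This is a sketch under assumptions the text does not state: `transition` and `initial` are hypothetical HMM parameters, each row of `output_probs` is treated as the per-state observation score for one frame, and the third matrix is read as the normalized product of the forward and backward matrices.

```python
def forward(output_probs, transition, initial):
    """First matrix: alpha[t][s], the forward probabilities."""
    n = len(initial)
    alpha = [[initial[s] * output_probs[0][s] for s in range(n)]]
    for t in range(1, len(output_probs)):
        prev = alpha[-1]
        alpha.append([
            output_probs[t][s] * sum(prev[r] * transition[r][s] for r in range(n))
            for s in range(n)
        ])
    return alpha


def backward(output_probs, transition):
    """Second matrix: beta[t][s], the backward probabilities."""
    n = len(transition)
    frames = len(output_probs)
    beta = [[1.0] * n for _ in range(frames)]
    for t in range(frames - 2, -1, -1):
        for s in range(n):
            beta[t][s] = sum(
                transition[s][r] * output_probs[t + 1][r] * beta[t + 1][r]
                for r in range(n)
            )
    return beta


def posterior(alpha, beta):
    """Third matrix (assumed reading): normalized alpha * beta per frame."""
    gamma = []
    for a_row, b_row in zip(alpha, beta):
        weights = [a * b for a, b in zip(a_row, b_row)]
        total = sum(weights)
        gamma.append([w / total for w in weights])
    return gamma


# Made-up example: 3 frames, 2 states.
output_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
transition = [[0.7, 0.3], [0.4, 0.6]]  # hypothetical
initial = [0.5, 0.5]                   # hypothetical
alpha = forward(output_probs, transition, initial)
beta = backward(output_probs, transition)
gamma = posterior(alpha, beta)
```

A useful consistency check is that the sum over states of `alpha[t][s] * beta[t][s]` is the same for every frame `t`.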

The third matrix is decoded to obtain the maximum likelihood value of each phoneme, thereby obtaining the expected result. When decoding, the embodiment of the present invention decodes the phonemes with the Viterbi algorithm.
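Viterbi decoding over a per-frame probability matrix can be sketched as follows. The function is generic: `frame_probs` stands for whichever matrix is being decoded, and `transition` and `initial` are hypothetical HMM parameters that the text does not specify. States are represented by integer indices rather than phoneme labels.

```python
def viterbi(frame_probs, transition, initial):
    """Return the most likely state index for each frame."""
    n = len(initial)
    score = [initial[s] * frame_probs[0][s] for s in range(n)]
    backpointers = []
    for t in range(1, len(frame_probs)):
        new_score, pointers = [], []
        for s in range(n):
            # Best predecessor state for arriving in state s at frame t.
            best_prev = max(range(n), key=lambda r: score[r] * transition[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] * transition[best_prev][s] * frame_probs[t][s])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Made-up example: 3 frames, 2 states.
frame_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
transition = [[0.7, 0.3], [0.4, 0.6]]  # hypothetical
initial = [0.5, 0.5]                   # hypothetical
best_path = viterbi(frame_probs, transition, initial)
```

For long utterances, an implementation would normally work with log probabilities to avoid numerical underflow.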

It should be noted that step 202 is the detailed process of obtaining the expected result. When step 203 is executed, the same calculation is performed; the difference is that the parameters of each neuron (the model parameters) differ on each calculation.

203. Train the model parameters in the phoneme recognition model according to the expected result until the change rate of the output result of the phoneme model is less than the preset threshold (same as step 102).

During training, starting from the output end of the phoneme model, the derivative operation of gradient descent is performed on each phoneme in turn according to the expected result; according to the derivative operation, the neuron parameters for each phoneme in the phoneme model are adjusted until the change rate of the output result of the phoneme model is less than the preset threshold. Fig. 7 shows a schematic diagram of some of the neurons in a phoneme recognition model provided by an embodiment of the present invention. In the model the neurons are fully connected; Fig. 7 illustrates this with one neuron as an example, where w1, w2, w3 and w4 are neuron parameters corresponding to different probability values. The derivative operation of gradient descent is performed in turn from the output end to the input end: differentiating the neurons nearest the output end yields a set of values, and the adjacent next column changes as those derivative results change. The adjustments are based on the result of differentiating each neuron.
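The parameter adjustment can be illustrated on a single neuron with four weights. This is a toy sketch, not the embodiment's training procedure: it assumes a sigmoid activation and a squared-error loss, and the inputs, target and learning rate are made-up values.

```python
import math


def neuron_output(weights, inputs):
    """Sigmoid neuron: weighted sum of the inputs passed through the logistic function."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(weights, inputs, target, learning_rate=0.5):
    """One gradient-descent update for the loss 0.5 * (output - target) ** 2."""
    y = neuron_output(weights, inputs)
    # Chain rule: d(loss)/dz = (y - target) * y * (1 - y); d(z)/d(w_i) = x_i.
    delta = (y - target) * y * (1.0 - y)
    return [w - learning_rate * delta * x for w, x in zip(weights, inputs)]


weights = [0.1, -0.2, 0.3, 0.05]   # w1..w4, hypothetical starting values
inputs = [1.0, 0.5, -1.0, 2.0]     # hypothetical inputs
target = 1.0
errors = []
for _ in range(20):
    errors.append(abs(neuron_output(weights, inputs) - target))
    weights = gradient_step(weights, inputs, target)
errors.append(abs(neuron_output(weights, inputs) - target))
```

In a multi-layer model the same chain rule is applied layer by layer from the output end back toward the input end, which is what the text describes as executing the derivative operation in turn.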

204. Determine the output result whose change rate is less than the preset threshold as the final phoneme recognition result corresponding to the speech to be recognized (same as step 103).

If the change rate of the output result of the phoneme model is greater than or equal to the preset threshold, step 203 is executed again in a loop.

Further, as an implementation of the method shown in Fig. 1 above, another embodiment of the present invention provides a device for recognizing speech phonemes. This device embodiment corresponds to the foregoing method embodiment. For ease of reading, the device embodiment does not repeat the details of the method embodiment one by one, but it should be clear that the device in this embodiment can implement everything in the foregoing method embodiment.

The embodiment of the present invention provides a kind of identification device of phoneme of speech sound, as shown in Figure 8, comprising:

Input unit 31, for voice to be identified to be inputted phoneme recognition model, wherein the phoneme recognition model passes through A variety of neural network pattern types and hidden Markov model identify each phoneme in the voice to be identified；

Output unit 32 is used for after voice to be identified is inputted phoneme recognition model by the input unit 31, according to defeated Result obtains the corresponding expected results of the voice to be identified out,

Training unit 33, the expected results training phoneme recognition mould for being exported according to the output unit 32 Model parameter in type, until the change rate of phoneme model output result is less than preset threshold；

Determination unit 34, for determining that the change rate is less than the output result of the preset threshold as described to be identified The corresponding final phoneme recognition result of voice.

Further, as shown in Fig. 9, the device further comprises:

Construction unit 35, configured to construct the phoneme recognition model before the input unit 31 inputs the speech to be recognized into the phoneme recognition model.

Further, as shown in Fig. 9, the construction unit 35 comprises:

First building module 351, configured to construct a convolutional neural network CNN and a long short-term memory network LSTM with a preset number of layers;

Adding module 352, configured to add a deep neural network DNN and a hidden Markov model HMM;

Second building module 353, configured to build the phoneme recognition model using the convolutional neural network CNN, the long short-term memory network LSTM, the deep neural network DNN and the hidden Markov model HMM, wherein the convolutional neural network CNN serves as the input end for the speech to be recognized and the deep neural network DNN serves as the output end for the speech to be recognized;

Assignment module 354, configured to assign initialization values to the phoneme recognition model.
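The construction order handled by modules 351 to 354 can be sketched as a plain data structure. Everything concrete here is an assumption for illustration: the `Layer` class, the `build_phoneme_model` function, the layer sizes, the number of LSTM layers, the uniform random initialization and the uniform HMM probabilities are not specified by the source.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Layer:
    kind: str                     # "CNN", "LSTM" or "DNN"
    size: int                     # illustrative number of parameters
    weights: list = field(default_factory=list)

def build_phoneme_model(num_lstm_layers=3, hidden=8, num_phonemes=5, seed=0):
    rng = random.Random(seed)
    # Module 351: a CNN at the input end and a preset number of LSTM layers.
    layers = [Layer("CNN", hidden)]
    layers += [Layer("LSTM", hidden) for _ in range(num_lstm_layers)]
    # Module 352: add the DNN (output end) and the HMM.
    layers.append(Layer("DNN", num_phonemes))
    uniform = 1.0 / num_phonemes
    hmm = {
        "initial": [uniform] * num_phonemes,
        "transition": [[uniform] * num_phonemes for _ in range(num_phonemes)],
    }
    # Module 354: assign initialization values (small random numbers assumed).
    for layer in layers:
        layer.weights = [rng.uniform(-0.1, 0.1) for _ in range(layer.size)]
    # Module 353: the assembled model, CNN first and DNN last.
    return {"layers": layers, "hmm": hmm}

model = build_phoneme_model()
```

The sketch records only the ordering and the initialization step; it does not implement the networks themselves.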

Further, as shown in Fig. 9, the output unit 32 comprises:

Noise reduction module 321, configured to input the speech to be recognized into the convolutional neural network CNN and perform noise reduction on the speech to be recognized;

First input module 322, configured to input the speech to be recognized, after noise reduction by the noise reduction module, into the long short-term memory network LSTM with the preset number of layers, wherein the long short-term memory network LSTM filters out invalid phonemes by activating the forget gate and retains valid phonemes by activating the memory gate;

Fitting module 323, configured to fit the speech to be recognized;

Second input module 324, configured to input the fitted speech to be recognized into the deep neural network DNN;

Recording module 325, configured to take the phoneme at each moment and its corresponding probability output by the second output module as a visible observation sequence, and record them in the probability output matrix of the hidden Markov model HMM;

First computing module 326, configured to calculate a first matrix according to the probability output matrix and the forward algorithm;

Second computing module 327, configured to calculate a second matrix according to the probability output matrix and the backward algorithm;

Third computing module 328, configured to calculate the maximum likelihood corresponding to each phoneme according to the first matrix and the second matrix, and record it in a third matrix;

Processing module 329, configured to decode the third matrix to obtain the maximum likelihood value of each phoneme, thereby obtaining the expected results.

Further, as shown in Fig. 9, the training unit 33 comprises:

Computing module 331, configured to perform, according to the expected results and starting from the output end of the phoneme model, a gradient-descent derivative operation on each phoneme in turn;

Adjustment module 332, configured to adjust, according to the derivative operation, the neuron parameters of the phoneme model for each phoneme, until the change rate of the output result of the phoneme model is less than the preset threshold.

The device for recognizing speech phonemes comprises a processor and a memory. The input unit, output unit, training unit, determination unit and so on are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.

The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the problem of low phoneme segmentation efficiency, or of locally optimal solutions, during speech recognition is addressed by adjusting kernel parameters.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

An embodiment of the present invention provides a storage medium on which a program is stored; when the program is executed by a processor, it implements the recognition of speech phonemes described above.

An embodiment of the present invention provides a processor for running a program, wherein the program, when run, performs the recognition of speech phonemes described above.

An embodiment of the present invention provides an apparatus comprising a processor, a memory, and a program stored in the memory and executable on the processor, wherein the processor performs the following steps when executing the program:

Constructing the phoneme recognition model.

Optionally, constructing the phoneme recognition model comprises:

Adding a deep neural network DNN and a hidden Markov model HMM;

The apparatus herein may be a server, a PC, a tablet (PAD), a mobile phone, or the like.

The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: inputting the speech to be recognized into the phoneme recognition model and obtaining the expected results corresponding to the speech to be recognized according to the output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized by means of multiple neural network models and a hidden Markov model;

Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and memory.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity or device that includes the element.

The above are merely embodiments of the present application and are not intended to limit the present application. Various modifications and variations of the present application are possible for those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims

1. A method for recognizing speech phonemes, characterized by comprising:

Inputting the speech to be recognized into a phoneme recognition model, and obtaining the expected results corresponding to the speech to be recognized according to the output result, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized by means of multiple neural network models and a hidden Markov model;

Training the model parameters in the phoneme recognition model according to the expected results, until the change rate of the output result of the phoneme model is less than a preset threshold;

Determining the output result whose change rate is less than the preset threshold as the final phoneme recognition result corresponding to the speech to be recognized.

2. The method according to claim 1, characterized in that, before the speech to be recognized is input into the phoneme recognition model, the method further comprises:

Constructing the phoneme recognition model.

3. The method according to claim 2, characterized in that constructing the phoneme recognition model comprises:

Adding a deep neural network DNN and a hidden Markov model HMM;

Building the phoneme recognition model using the convolutional neural network CNN, the long short-term memory network LSTM, the deep neural network DNN and the hidden Markov model HMM, and assigning initialization values to the phoneme recognition model, wherein the convolutional neural network CNN serves as the input end for the speech to be recognized and the deep neural network DNN serves as the output end for the speech to be recognized.

4. The method according to claim 3, characterized in that obtaining the expected results corresponding to the speech to be recognized according to the output result comprises:

Inputting the noise-reduced speech to be recognized into the long short-term memory network LSTM with the preset number of layers and fitting the speech to be recognized, wherein the long short-term memory network LSTM filters out invalid phonemes by activating the forget gate and retains valid phonemes by activating the memory gate;

Taking the phoneme at each moment of the output and its corresponding probability as a visible observation sequence, and recording them in the probability output matrix of the hidden Markov model HMM;

Calculating the maximum likelihood corresponding to each phoneme according to the first matrix and the second matrix, and recording it in a third matrix;

5. The method according to claim 4, characterized in that training the model parameters in the phoneme recognition model according to the expected results, until the change rate of the output result of the phoneme model is less than the preset threshold, comprises:

Performing, according to the expected results and starting from the output end of the phoneme model, a gradient-descent derivative operation on each phoneme in turn;

Adjusting, according to the derivative operation, the neuron parameters of the phoneme recognition model for each phoneme, until the change rate of the output result of the phoneme model is less than the preset threshold.

6. A device for recognizing speech phonemes, characterized by comprising:

An input unit, configured to input the speech to be recognized into a phoneme recognition model, wherein the phoneme recognition model recognizes each phoneme in the speech to be recognized by means of multiple neural network models and a hidden Markov model;

An output unit, configured to obtain the expected results corresponding to the speech to be recognized according to the output result after the input unit inputs the speech to be recognized into the phoneme recognition model;

A training unit, configured to train the model parameters in the phoneme recognition model according to the expected results output by the output unit, until the change rate of the output result of the phoneme model is less than a preset threshold;

A determination unit, configured to determine the output result whose change rate is less than the preset threshold as the final phoneme recognition result corresponding to the speech to be recognized.

7. The device according to claim 6, characterized in that the device further comprises:

A construction unit, configured to construct the phoneme recognition model before the input unit inputs the speech to be recognized into the phoneme recognition model.

8. The device according to claim 7, characterized in that the construction unit comprises:

A first building module, configured to construct a convolutional neural network CNN and a long short-term memory network LSTM with a preset number of layers;

A second building module, configured to build the phoneme recognition model using the convolutional neural network CNN, the long short-term memory network LSTM, the deep neural network DNN and the hidden Markov model HMM, wherein the convolutional neural network CNN serves as the input end for the speech to be recognized and the deep neural network DNN serves as the output end for the speech to be recognized;

9. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to perform the method for recognizing speech phonemes according to any one of claims 1 to 5.

10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, performs the method for recognizing speech phonemes according to any one of claims 1 to 5.