CN107871497A - Audio recognition method and device - Google Patents
- Publication number
- CN107871497A (application CN201610847843.4A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- network model
- training
- feature vector
- voice signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/18—Artificial neural networks; Connectionist approaches
Abstract
The invention discloses a speech recognition method and device. The method includes: determining a training speech signal and a semantic label corresponding to the training speech signal; inputting the training speech signal into a first neural network model to obtain a speech feature vector; inputting the semantic label into a second neural network model to obtain a semantic feature vector; training the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and recognizing a target speech signal according to the trained first neural network model, where the values of the target parameters in the trained first neural network model are the parameter values obtained by training. The invention solves the problem in the related art that training a speech recognition model converges slowly.
Description
Technical field
The present invention relates to the field of speech recognition, and in particular to a speech recognition method and device.
Background art
Speech recognition technology enables a machine to convert a speech signal into the corresponding text through a process of recognition and understanding. Traditional speech recognition techniques depend heavily on hand-crafted features and have low accuracy. Applying deep learning to speech recognition can imitate the way the brain learns and recognizes speech signals, and can greatly improve recognition accuracy.
Deep learning applied to speech recognition has already achieved significant progress. Several deep networks are introduced below:
Deep neural networks (DNNs): the features extracted by this kind of network are highly discriminative, so the trained model has strong discrimination ability. Such a network usually uses a deep belief network (DBN) for pre-training and trains the acoustic model with a DNN-HMM hybrid network; it is widely used in large-vocabulary speech recognition systems.
Convolutional neural networks (CNNs): compared with DNNs, they introduce the concepts of convolution and pooling. Convolution extracts local information from the speech features, and pooling then strengthens the model's robustness to those features. While greatly reducing model size, recognition performance is better and generalization ability is stronger.
Recurrent neural networks (RNNs): the deep network model most commonly used in speech recognition at present is the RNN. It is a sequence model that, on top of a neural network, considers the connections between the hidden-layer units of adjacent speech frames, and trains the network by back-propagating errors through time to adjust the network parameters. The distributed hidden state of an RNN can effectively store past information, and as a nonlinear dynamic system it can update its hidden units in complex ways; combining these two characteristics enables it to recognize latent time dependencies through the recurrent layer and carry out the speech recognition task.
Connectionist temporal classification (CTC): an alignment model that aligns the output of a deep network with the label text and takes the sum of the probabilities of all possible paths as the probability of the whole sentence. Using CTC removes the need to segment samples in advance or post-process them, which greatly improves efficiency.
However, speech recognition methods in the prior art still have certain problems:
(1) Shortcomings of DNNs: DNN methods assume that the speech frames are independent and do not consider the correlation between frames. Moreover, the hidden layers generally need many neurons, gradient diffusion can become very serious during training, and the sequence error can only be computed in combination with other models.
(2) Shortcomings of CNNs: a CNN alone can only handle isolated-word recognition, so using CNNs to process continuous speech requires segmenting the speech in advance, which is very time-consuming and tedious. CNNs can also handle continuous speech in combination with other models, but this undoubtedly increases the number of parameters, and manually tuning the model parameters is also very time-consuming.
(3) Shortcomings of RNNs: because a large amount of information has to be memorized, training is difficult, computation is expensive, and recognition is slow. Moreover, the recursive structure is prone to gradient explosion and gradient vanishing during error back-propagation, which can make training impossible to continue.
(4) Shortcomings of CTC: only acoustic information is considered during training, which destroys the implicit language model learned by the RNN; the resulting harm is that the word-based error rate and the phoneme-based error rate cannot decrease synchronously, which increases training difficulty.
As the above analysis shows, in the related art the acoustic model is trained independently, and the implicit semantic vector learned by the acoustic model can destroy the real semantic vector, so the loss function and the error rate cannot decrease synchronously during training, which makes training converge slowly.
No effective solution has yet been proposed for the slow convergence of training speech recognition models in the related art.
Summary of the invention
The primary object of the present invention is to provide a speech recognition method and device, so as to solve the problem in the related art that training a speech recognition model converges slowly.
To achieve the above goals, according to one aspect of the present invention, a speech recognition method is provided. The method includes: determining a training speech signal and a semantic label corresponding to the training speech signal; inputting the training speech signal into a first neural network model to obtain a speech feature vector; inputting the semantic label into a second neural network model to obtain a semantic feature vector; training the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and recognizing a target speech signal according to the trained first neural network model, where the values of the target parameters in the trained first neural network model are the parameter values obtained by training.
Further, training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector includes: aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result; calculating, through a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and adjusting the parameter values of the target parameters in the first neural network model according to the error.
Further, aligning the speech feature vector and the semantic feature vector through the alignment network model to obtain the training result includes: determining the joint probability distribution of the output speech feature vector and the semantic feature vector. Calculating, through the preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label includes: determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, where the joint model includes the first neural network model and the second neural network model; and determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
Further, the alignment network model is a CTC alignment network model, and aligning the speech feature vector and the semantic feature vector through the alignment network model includes: aligning the speech feature vector and the semantic feature vector through the CTC alignment network model.
Further, the training speech signals are multiple training speech signals Pn, and the semantic labels are multiple semantic labels Qn in one-to-one correspondence with the multiple training speech signals. Inputting the training speech signal into the first neural network model to obtain the speech feature vector includes: inputting the i-th training speech signal Pi into the first neural network model to obtain a speech feature vector Ri, where the parameter value of the target parameter of the current first neural network model is M(i-1). Inputting the semantic label into the second neural network model to obtain the semantic feature vector includes: inputting the i-th semantic label Qi into the second neural network model to obtain a semantic feature vector Ti, where the parameter value of the target parameter of the current second neural network model is S(i-1). Training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector includes: determining, according to the speech feature vector Ri and the semantic feature vector Ti, the parameter value Mi of the target parameter in the first neural network model and the parameter value Si of the target parameter of the second neural network model; the above steps are performed in turn until i = n.
Further, the first neural network model is an RNN model, and inputting the training speech signal into the first neural network model to obtain the speech feature vector includes: performing framing on the training speech signal to obtain a training speech sequence; and inputting the training speech sequence into the RNN model to obtain the speech feature vector.
Further, the second neural network model is an RNN model, and inputting the semantic label into the second neural network model to obtain the semantic feature vector includes: determining a semantic label sequence according to the semantic label; and inputting the semantic label sequence into the RNN model to obtain the semantic feature vector.
To achieve the above goals, according to another aspect of the present invention, a speech recognition device is provided. The device includes: a determining unit, for determining a training speech signal and a semantic label corresponding to the training speech signal; a first input unit, for inputting the training speech signal into a first neural network model to obtain a speech feature vector; a second input unit, for inputting the semantic label into a second neural network model to obtain a semantic feature vector; a training unit, for training the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and a recognition unit, for recognizing a target speech signal according to the trained first neural network model, where the values of the target parameters in the trained first neural network model are the parameter values obtained by training.
Further, the training unit includes: an alignment module, for aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result; a calculation module, for calculating, through a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and an adjusting module, for adjusting the parameter values of the target parameters in the first neural network model according to the error.
Further, the alignment module includes: a first determining submodule, for determining the joint probability distribution of the output speech feature vector and the semantic feature vector. The calculation module includes: a second determining submodule, for determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, where the joint model includes the first neural network model and the second neural network model; and a third determining submodule, for determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
By extracting the semantic feature vector from the semantic label corresponding to the training speech signal and taking the semantic feature vector into account during training, the present invention trains the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector, and recognizes the target speech signal with the trained first neural network model. This solves the problem in the related art that training a speech recognition model converges slowly: by considering the real semantic information carried in the semantic label during training and introducing the semantic feature vector to train the model's parameters, the real semantic vector trained in the second neural network model is preserved, which in turn accelerates the convergence of training the speech recognition model.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The schematic embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flow chart of a speech recognition method according to the first embodiment of the present invention;
Fig. 2 is a flow chart of a speech recognition method according to the second embodiment of the present invention;
Fig. 3 is a schematic diagram of the connection mode of the RNN network according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the connection mode of the Prediction network according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the CTC-Prediction alignment network according to an embodiment of the present invention; and
Fig. 6 is a schematic diagram of a speech recognition device according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, provided there is no conflict, the embodiments in this application and the features in the embodiments can be combined with each other. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
In order to enable those skilled in the art to better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below in conjunction with the accompanying drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the accompanying drawings of this application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way can be interchanged where appropriate, so that the embodiments of this application described herein can be implemented. In addition, the terms "comprising" and "having" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device containing a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to that process, method, product, or device.
An embodiment of the present invention provides a speech recognition method.
Fig. 1 is a flow chart of a speech recognition method according to the first embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S101: determine a training speech signal and a semantic label corresponding to the training speech signal.
Step S102: input the training speech signal into a first neural network model to obtain a speech feature vector.
Step S103: input the semantic label into a second neural network model to obtain a semantic feature vector.
Step S104: train the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector.
Step S105: recognize a target speech signal according to the trained first neural network model.
Here, the values of the target parameters in the trained first neural network model are the parameter values obtained by training.
Speech recognition technology converts a speech signal into the corresponding text: the speech signal is input into a speech recognition model, which outputs the semantics represented by that signal. A speech recognition model needs to be trained. It contains undetermined parameters, and training is the process of continuously adjusting those undetermined parameters with training samples so that the recognition rate of the speech recognition model becomes higher.
A training sample includes a training speech signal and a semantic label corresponding to the training speech signal. The undetermined parameters of the speech recognition model are trained repeatedly with multiple training samples, so that the error between the semantics the model recognizes from the training speech signal and the semantics of the corresponding semantic label is minimized.
First, it is determined that voice signal and semantic label corresponding with training voice signal are trained, then respectively to training language
Sound signal and corresponding semantic label are handled:
Training voice signal is inputted into first nerves network model, exports speech feature vector, wherein, first nerves network
Model includes target component undetermined, first can set an initial value to the target component in first nerves network model,
With determine input first training voice signal corresponding to speech feature vector, alternatively, target component can be one or
Multiple, the initial value of each target component can be numerical value set in advance, and the numerical value set in advance can be according to history
What the characteristics of data, experience, algorithm or neural network model determined, specifically, initial value can also may be used by being manually entered
To determine and input by algorithm statistical history data, for example, for some target component, its default value scope for [a,
B], a numerical value is randomly selected in the range of [a, b] by random algorithm;
Semantic label is inputted into nervus opticus network model, output semantic feature is vectorial, wherein, nervus opticus network model
Include target component undetermined, an initial value first can be set to the target component in nervus opticus network model, with true
Surely semantic feature vector corresponding to the first training semantic label inputted.
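For instance, drawing an initial parameter value uniformly at random from a default range [a, b], as the description suggests, could look like the following hypothetical helper (not part of the patent):

```python
import numpy as np

def init_target_parameter(a, b, shape, seed=None):
    # Pick every initial value uniformly at random within the
    # default range [a, b], one of the initialisation options above.
    rng = np.random.default_rng(seed)
    return rng.uniform(a, b, size=shape)

w0 = init_target_parameter(-0.1, 0.1, shape=(4, 16), seed=0)
assert w0.shape == (4, 16)
assert w0.min() >= -0.1 and w0.max() <= 0.1
```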
Here, the training speech signal is a speech signal used to train the speech recognition model provided by this embodiment, and the semantic label is the semantic label corresponding to the training speech signal. There can be multiple training speech signals and, correspondingly, multiple semantic labels. For example, there are n training speech signals S1, S2, ..., Sn, and correspondingly the semantic labels are X1, X2, ..., Xn. Optionally, before the speech recognition model is trained, each training speech signal can be pre-processed by framing; one training speech signal yields multiple sub speech signals after framing. Correspondingly, the semantic label corresponding to that training speech signal can also be framed, yielding multiple sub semantic labels. The framing of a training speech signal and the framing of its semantic label can be independent of each other: for example, framing training speech signal S2 yields m sub speech signals S2a, S2b, ..., S2m, while framing the semantic label X2 corresponding to training speech signal S2 yields p sub semantic labels X2a, X2b, ..., X2p.
Optionally, the first neural network model and the second neural network model can be identical neural network models. After training, the first neural network model can be used as the speech recognition model to recognize speech signals.
After the speech feature vector and the semantic feature vector are obtained, the parameter values of the target parameters in the first neural network model are trained according to the speech feature vector and the semantic feature vector.
Specifically, during training the first neural network model can extract the speech feature vector from the training speech signal, and the second neural network model can extract the semantic feature vector contained in the semantic label corresponding to the training speech signal. After the speech feature vector and the semantic feature vector are determined, the parameter values of the target parameters in the first neural network model and in the second neural network model can be optimized according to the error between the speech feature vector and the semantic feature vector. That is, the error between the speech feature vector and the semantic feature vector can be back-propagated to the first neural network model and the second neural network model, and the parameter values of the target parameters in both models are adjusted according to the back-propagated error.
After the training process ends, the target speech signal is recognized according to the trained first neural network model. Because the first neural network model takes the semantic features contained in the semantic label into account during training, it can converge quickly, and the recognition precision of the first neural network model is improved.
The speech recognition method provided by this embodiment extracts the semantic feature vector from the semantic label corresponding to the training speech signal, takes the semantic feature vector into account during training, trains the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector, and recognizes the target speech signal with the trained first neural network model. This solves the problem in the related art that the implicit semantic vector learned by the acoustic model while training a speech recognition model destroys the real semantic vector, so that the loss function and the error rate cannot decrease synchronously during training and the speech recognition model converges slowly. By considering the real semantic information carried in the semantic label during training and introducing the semantic feature vector to train the model's parameters, the real semantic vector trained in the second neural network model is preserved, which in turn accelerates the convergence of training the speech recognition model.
Preferably, training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector can be performed through the following steps: aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result; calculating, through a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and adjusting the parameter values of the target parameters in the first neural network model according to the error.
Because the speech feature vector output by the first neural network model from the input training speech signal may differ in dimension from the semantic label corresponding to the training speech signal, alignment can be performed through the alignment network model. The alignment step can be regarded as a mapping between vectors of two different dimensions: through the alignment network, the mapping from the speech feature vector to the semantic label can include multiple paths with different probabilities. The training result is determined according to these probabilities, the error between the semantics expressed by the training result and the semantics expressed by the semantic label is calculated, and the parameter values of the target parameters in the first neural network model are adjusted according to the error. Here, adjusting the parameter values of the target parameters in the first neural network model according to the error includes adjusting, according to the error, the parameter values of the target parameters in the first neural network model and the parameter values of the target parameters in the second neural network model.
The specific step of aligning the speech feature vector and the semantic feature vector through the alignment network model to obtain the training result can be to determine the joint probability distribution of the output speech feature vector and the semantic feature vector. Calculating, through the preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label can be to determine the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, where the joint model includes the first neural network model and the second neural network model, and then to determine, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
Optionally, the alignment network model can be a CTC alignment network model; that is, aligning the speech feature vector and the semantic feature vector through the alignment network model means aligning them through the CTC alignment network model.
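As a concrete illustration of what the CTC alignment computes, the sketch below implements the forward (alpha) half of the forward-backward recursion over the blank-interleaved label, summing the probabilities of all frame-level paths that collapse to the label sequence, and verifies it against brute-force path enumeration. The toy probabilities and the choice of 0 as the blank symbol are our own assumptions, not the patent's.

```python
import itertools
import numpy as np

def ctc_forward(probs, label, blank=0):
    """Probability that per-frame outputs `probs` (T x K, rows sum to 1)
    align to `label` under CTC (merge repeats, then drop blanks),
    computed with the forward (alpha) recursion."""
    T, _ = probs.shape
    ext = [blank]                  # blank-interleaved label l'
    for c in label:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # the diagonal skip is allowed unless l'_s is blank
            # or repeats l'_{s-2}
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)

def collapse(path, blank=0):
    # CTC collapse rule: merge consecutive repeats, then remove blanks.
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# Sanity check against brute-force enumeration of all 3^4 paths.
rng = np.random.default_rng(1)
probs = rng.random((4, 3))
probs /= probs.sum(axis=1, keepdims=True)
label = [1, 2]
brute = sum(
    np.prod([probs[t, p[t]] for t in range(4)])
    for p in itertools.product(range(3), repeat=4)
    if collapse(p) == label
)
assert abs(ctc_forward(probs, label) - brute) < 1e-12
```

Summing path probabilities this way is exactly why CTC needs no pre-segmented samples: every possible frame-to-label alignment contributes to the sentence probability.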
As a preferred implementation of the above embodiment, the training process can include multiple rounds of training. Specifically, the training speech signals are multiple training speech signals Pn, and the semantic labels are multiple semantic labels Qn in one-to-one correspondence with the multiple training speech signals. Inputting the training speech signal into the first neural network model to obtain the speech feature vector includes: inputting the i-th training speech signal Pi into the first neural network model to obtain a speech feature vector Ri, where the parameter value of the target parameter of the current first neural network model is M(i-1). Inputting the semantic label into the second neural network model to obtain the semantic feature vector includes: inputting the i-th semantic label Qi into the second neural network model to obtain a semantic feature vector Ti, where the parameter value of the target parameter of the current second neural network model is S(i-1). Training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector includes: determining, according to the speech feature vector Ri and the semantic feature vector Ti, the parameter value Mi of the target parameter in the first neural network model and the parameter value Si of the target parameter of the second neural network model; the above steps are performed in turn until i = n. Here, when i = 1 (that is, when i-1 = 0), initial values can be assigned to the parameter value M0 of the target parameter of the first neural network model and the parameter value S0 of the target parameter of the second neural network model.
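The round-by-round notation (Pi, Qi, Ri, Ti, and parameter sequences M0...Mn, S0...Sn) can be sketched as a plain loop. Toy linear layers again stand in for the two networks, and the squared distance between Ri and Ti stands in for the alignment error; everything below is our own illustrative scaffolding, not the patent's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
P = rng.standard_normal((n, 16))  # n training speech signals P1..Pn (toy)
Q = rng.standard_normal((n, 8))   # the matching semantic labels Q1..Qn (toy)

# M0 and S0: initial values of the two models' target parameters.
M = rng.standard_normal((4, 16)) * 0.1   # first (speech) network
S = rng.standard_normal((4, 8)) * 0.1    # second (semantic) network
M0, S0 = M.copy(), S.copy()

lr = 0.01
for i in range(n):            # the i-th round starts from M(i-1) and S(i-1)
    Ri = M @ P[i]             # speech feature vector Ri
    Ti = S @ Q[i]             # semantic feature vector Ti
    err = Ri - Ti             # error between the two feature vectors
    # both models' target parameters are adjusted, giving Mi and Si
    M -= lr * np.outer(err, P[i])
    S += lr * np.outer(err, Q[i])

assert not np.allclose(M, M0) and not np.allclose(S, S0)
```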
Optionally, the first neural network model can be an RNN model. Inputting the training speech signal into the first neural network model to obtain the speech feature vector includes: performing framing on the training speech signal to obtain a training speech sequence; and inputting the training speech sequence into the RNN model to obtain the speech feature vector.
Optionally, the second neural network model can be an RNN model; preferably, this RNN model can be a Prediction recurrent neural network. Inputting the semantic label into the second neural network model to obtain the semantic feature vector includes: determining a semantic label sequence according to the semantic label; and inputting the semantic label sequence into the RNN model to obtain the semantic feature vector.
Fig. 2 is a flow chart of a speech recognition method according to the second embodiment of the present invention. This embodiment can serve as a preferred implementation of the above first embodiment. As shown in Fig. 2, the method includes the following steps:
Step S201: determine training samples. A training sample can include a training speech signal and a semantic label corresponding to the training speech signal.
Step S202: pre-process the speech signal. Pre-processing of the speech signal includes framing, pre-emphasis, de-noising and so on; preferably, this embodiment only performs framing. For speech sampled at 8000 Hz, the frame length used is 20 ms and the frame shift is 10 ms. After framing the training speech signal, multiple frames of speech are obtained.
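As a concrete illustration of the framing step, the following minimal NumPy sketch (the function name `frame_signal` and its interface are ours, not the patent's) splits a signal into 20 ms frames with a 10 ms shift at 8000 Hz, i.e. 160-sample frames overlapping by 80 samples:

```python
import numpy as np

def frame_signal(signal, sample_rate=8000, frame_ms=20, shift_ms=10):
    """Split a 1-D speech signal into overlapping frames.

    With the parameters of this embodiment (8000 Hz sampling, 20 ms
    frame length, 10 ms frame shift) each frame holds 160 samples and
    consecutive frames overlap by 80 samples.
    """
    frame_len = sample_rate * frame_ms // 1000   # 160 samples
    shift = sample_rate * shift_ms // 1000       # 80 samples
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(8000))  # one second of (silent) audio
print(frames.shape)  # (99, 160)
```

One second of audio thus yields 99 frames, each frame starting 80 samples after the previous one.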
Step S203: extract a high-level speech feature of each frame with an RNN. The RNN neural network model extracts a high-level speech feature from every frame of speech obtained by framing.
An RNN is a sequence model that, on top of an ordinary neural network, adds connections between the hidden-layer units of adjacent times t and t-1, giving it a strong ability to represent the useful information in non-linear time-series signals.
Let x = (x1, x2, ..., xT) be an input sequence of length T, where xt denotes the speech vector of frame t.
The connection pattern of the RNN used in this embodiment is shown in Fig. 3, and forward propagation is computed as follows:

h_t^(i) = f(W^(i) h_t^(i-1) + W_hh h_(t-1)^(i) + b^(i)), with h_t^(0) = x_t,

where h_t^(i) is the output vector of the i-th hidden layer (i = 1, 2, 3) at time t, W^(i) is the weight matrix connecting layer i with layer i-1, W_hh is the weight matrix of the recurrent layer, b^(i) is the bias vector of layer i, and f is the non-linear activation function of the hidden layer, usually taken to be the sigmoid function. h_t^(3) is the high-level speech feature finally obtained.
Optionally, the RNN that extracts the high-level speech feature of each frame may be a unidirectional RNN with three hidden layers; a deeper network may be chosen to extract more abstract features, or a bidirectional RNN may be selected to fully learn both past and future context. It should be noted that every RNN model in this embodiment may be replaced by an LSTM model or a GRU model.
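The stacked forward propagation described above can be sketched in NumPy as follows (a minimal sketch: the per-layer parameter layout, the random weights and the dimensions are illustrative assumptions, not the patent's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(x, Ws, Whs, bs):
    """Forward pass of a stacked unidirectional RNN.

    x : (T, D) input frames.  Ws[i], Whs[i], bs[i] are the input,
    recurrent and bias parameters of hidden layer i.  Returns the
    top-layer hidden sequence, i.e. the high-level speech features.
    """
    T = x.shape[0]
    h_below = x
    for W, Wh, b in zip(Ws, Whs, bs):
        H = Wh.shape[0]
        h = np.zeros((T, H))
        h_prev = np.zeros(H)
        for t in range(T):
            # h_t^(i) = f(W^(i) h_t^(i-1) + W_hh h_(t-1)^(i) + b^(i))
            h[t] = sigmoid(W @ h_below[t] + Wh @ h_prev + b)
            h_prev = h[t]
        h_below = h
    return h_below

rng = np.random.default_rng(0)
D, H, T = 160, 32, 5  # frame size, hidden size, number of frames
Ws  = [rng.normal(size=(H, D))] + [rng.normal(size=(H, H)) for _ in range(2)]
Whs = [rng.normal(size=(H, H)) for _ in range(3)]
bs  = [np.zeros(H) for _ in range(3)]
feats = rnn_forward(rng.normal(size=(T, D)), Ws, Whs, bs)
print(feats.shape)  # (5, 32)
```

Each of the three layers applies the same recurrence, feeding its hidden sequence to the layer above, so the returned array corresponds to h_t^(3) in the text.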
Step S204: align the predicted result with the label of the actual speech sequence via CTC-Prediction. The result predicted by the CTC-Prediction model is aligned with the label of the actual speech sequence. The CTC-Prediction model comprises a Prediction network and a CTC network.
The Prediction network is a recurrent neural network with one input layer, one hidden layer and one output layer. Semantic features can be extracted by the Prediction network by feeding the semantic labels into it. An input sequence ŷ = (φ, y_1, ..., y_U) of length U+1, with the blank symbol φ prepended, is mapped by the Prediction network to an output sequence g. The input vectors are one-hot encoded: if y_u = k, then ŷ_u is a vector of length K whose k-th dimension is 1 and whose remaining dimensions are all 0. The dimension of each time step of the input layer is therefore K, and the dimension of each time step of the output layer is K+1.
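The one-hot encoding just described can be sketched as follows (a minimal NumPy sketch; the example labels and the convention of encoding the prepended φ as an all-zero row are our illustrative assumptions):

```python
import numpy as np

K = 5                       # size of the label alphabet
labels = [None, 2, 0, 4]    # None stands for the prepended blank φ
onehot = np.zeros((len(labels), K))
for u, k in enumerate(labels):
    if k is not None:       # φ is represented here by the all-zero vector
        onehot[u, k] = 1.0
print(onehot.shape)  # (4, 5)
```

Each real label y_u = k becomes a length-K row with a single 1 in dimension k, matching the input-layer dimension K stated above.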
The connection pattern of the Prediction network is shown in Fig. 4. Given ŷ, the hidden-layer vector sequence (h_0, ..., h_U) and the prediction sequence (g_0, ..., g_U) are computed by iterating the following formulas for u = 0, ..., U:

h_u = f(W_ih ŷ_u + W_hh h_(u-1) + b_h)
g_u = W_ho h_u + b_o

where W_ih is the input-hidden weight matrix, W_ho is the hidden-output weight matrix, b_h and b_o are the corresponding bias vectors, and f is the activation function of the hidden layer, here the sigmoid. The resulting g_u is the language-model vector.
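The two formulas above can be sketched directly in NumPy (a minimal sketch; the random weights, zero biases and dimensions are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prediction_net(y_hat, Wih, Whh, Who, bh, bo):
    """Single-hidden-layer recurrent Prediction network.

    y_hat : (U+1, K) one-hot inputs, row 0 being the blank φ
    (all zeros here).  Returns g of shape (U+1, K+1), the
    language-model vectors g_0 ... g_U.
    """
    h = np.zeros(Whh.shape[0])                       # h_(-1) = 0
    g = np.zeros((y_hat.shape[0], Who.shape[0]))
    for u in range(y_hat.shape[0]):
        h = sigmoid(Wih @ y_hat[u] + Whh @ h + bh)   # h_u
        g[u] = Who @ h + bo                          # g_u = W_ho h_u + b_o
    return g

rng = np.random.default_rng(1)
K, U, H = 5, 3, 8
y_hat = np.zeros((U + 1, K))
for u, k in enumerate([2, 0, 4], start=1):           # labels y_1 ... y_U
    y_hat[u, k] = 1.0
g = prediction_net(y_hat,
                   rng.normal(size=(H, K)), rng.normal(size=(H, H)),
                   rng.normal(size=(K + 1, H)),
                   np.zeros(H), np.zeros(K + 1))
print(g.shape)  # (4, 6)
```

As stated in the text, the input dimension per step is K and the output dimension per step is K+1 (the extra dimension accommodating φ).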
The high-level speech feature extracted by the RNN is the acoustic-model vector l_t, 1 ≤ t ≤ T.
Next, the acoustic-model vector l_t and the language-model vector g_u are combined to compute the density function of the output:

h(k, t, u) = exp(l_t^k + g_u^k)

which is normalized to give the output distribution:

Pr(k | t, u) = h(k, t, u) / Σ_k' h(k', t, u)

To simplify the formulas, write:

y(t, u) = Pr(y_(u+1) | t, u)
φ(t, u) = Pr(φ | t, u)
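A minimal NumPy sketch of this combine-and-normalize step (the max-subtraction for numerical stability and the random test vectors are our additions; the source defines only the plain softmax):

```python
import numpy as np

def joint_distribution(l, g):
    """Pr(k | t, u): softmax over k of l_t^k + g_u^k.

    l : (T, K+1) acoustic-model vectors, g : (U+1, K+1)
    language-model vectors.  Returns a (T, U+1, K+1) grid of
    output distributions.
    """
    s = l[:, None, :] + g[None, :, :]      # broadcast to (T, U+1, K+1)
    s -= s.max(axis=-1, keepdims=True)     # numerical stability
    h = np.exp(s)
    return h / h.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
T, U, K = 4, 3, 5
P = joint_distribution(rng.normal(size=(T, K + 1)),
                       rng.normal(size=(U + 1, K + 1)))
print(P.shape)  # (4, 4, 6)
```

With a convention such as blank at index 0 (an assumption, not fixed by the text), φ(t, u) would be `P[t, u, 0]` and y(t, u) would be `P[t, u, label[u + 1]]`.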
The CTC-Prediction alignment lattice is shown in Fig. 5: a horizontal arrow means that time t outputs φ (i.e. nothing is emitted), a vertical arrow means that time t outputs the (u+1)-th element of y, and each path from the bottom-left corner to the terminal node in the top-right corner (the red arrows mark one such path) represents one alignment of the input and output sequences.
Define the forward variable α(t, u) as the probability that l_[1:t] outputs y_[1:u]. For 1 ≤ t ≤ T, 0 ≤ u ≤ U it can be computed iteratively:

α(t, u) = α(t-1, u) φ(t-1, u) + α(t, u-1) y(t, u-1)

with the initialization α(1, 0) = 1.

The probability of the whole output sequence is the forward variable at the terminal node:

Pr(y | x) = α(T, U) φ(T, U)
Correspondingly, define the backward variable β(t, u) as the probability that l_[t:T] outputs y_[u+1:U]. It is computed iteratively as:

β(t, u) = β(t+1, u) φ(t, u) + β(t, u+1) y(t, u)

with the initialization β(T, U) = φ(T, U).
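The α and β recursions above can be sketched as follows (0-based indices, so `alpha[0, 0]` plays the role of α(1, 0); the tiny uniform 2×2 example is illustrative):

```python
import numpy as np

def forward_backward(y_prob, phi_prob):
    """alpha/beta recursions over the alignment lattice.

    y_prob[t, u]   = y(t, u)   : probability of emitting y_(u+1) at (t, u)
    phi_prob[t, u] = phi(t, u) : probability of emitting blank at (t, u)
    """
    T, U1 = phi_prob.shape
    alpha = np.zeros((T, U1))
    alpha[0, 0] = 1.0                        # alpha(1, 0) = 1
    for t in range(T):
        for u in range(U1):
            if t > 0:
                alpha[t, u] += alpha[t - 1, u] * phi_prob[t - 1, u]
            if u > 0:
                alpha[t, u] += alpha[t, u - 1] * y_prob[t, u - 1]
    beta = np.zeros((T, U1))
    beta[-1, -1] = phi_prob[-1, -1]          # beta(T, U) = phi(T, U)
    for t in range(T - 1, -1, -1):
        for u in range(U1 - 1, -1, -1):
            if t == T - 1 and u == U1 - 1:
                continue
            b = 0.0
            if t + 1 < T:
                b += beta[t + 1, u] * phi_prob[t, u]
            if u + 1 < U1:
                b += beta[t, u + 1] * y_prob[t, u]
            beta[t, u] = b
    return alpha, beta, alpha[-1, -1] * phi_prob[-1, -1]   # Pr(y | x)

# Tiny uniform example: T = 2 frames, U = 1 label, all emissions 0.5.
y_prob = np.full((2, 2), 0.5)
phi_prob = np.full((2, 2), 0.5)
alpha, beta, p = forward_backward(y_prob, phi_prob)
print(p)  # 0.25
```

On this example the two possible paths each have probability 0.125, so Pr(y | x) = 0.25, and `beta[0, 0]` agrees with `p`, as it should, since both equal the total sequence probability.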
The loss function of the whole network model is the negative log-probability of the target sequence:

L = -ln Pr(y | x)

For each training sample, the derivatives of the loss function with respect to the acoustic-model vector l_t and the language-model vector g_u are obtained from the forward and backward variables, and error back-propagation is carried out to update the parameters.
Steps S203 and S204 constitute the training process: the RNN extracts a high-level speech feature of every frame, the CTC-Prediction model aligns the predicted result with the label of the actual speech sequence, and the parameters of the RNN model, the Prediction model and the CTC model are adjusted.
Step S205: the trained RNN. The trained RNN is obtained by performing steps S203 and S204 multiple times.
Step S206: determine a test sample. The test sample includes a test speech signal; the semantic features contained in the test speech signal are recognized by the trained RNN.
Step S207: pre-process the speech signal. Pre-processing, for example framing, is performed on the speech signal in the test sample.
Step S208: output the classification result. The pre-processed speech signal is recognized by the RNN trained in step S205, and the classification result is output.
In the speech recognition method based on CTC-Prediction-model-aligned RNNs provided by this embodiment, a deep RNN first extracts the high-level speech feature of each frame, then the CTC-Prediction network aligns the predicted result with the label of the actual speech sequence to obtain the error of this pass, which is back-propagated for training. Compared with the existing RNN training method in which only a CTC network aligns the predicted result with the label of the actual speech sequence, this method considers during alignment not only the acoustic model but also the influence of the language model on the error, which makes the algorithm converge faster and improves test accuracy.
It should be noted that the steps illustrated in the flow charts of the accompanying drawings may be executed in a computer system such as a set of computer-executable instructions, and, although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one given here.
An embodiment of the invention further provides a speech recognition device. It should be noted that the speech recognition device of the embodiment of the present invention may be used to perform the speech recognition method of the present invention.
Fig. 6 is a schematic diagram of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 6, the device includes a determining unit 10, a first input unit 20, a second input unit 30, a training unit 40 and a recognition unit 50.
The determining unit 10 is used to determine a training speech signal and a semantic label corresponding to the training speech signal; the first input unit 20 is used to input the training speech signal into a first neural network model to obtain a speech feature vector; the second input unit 30 is used to input the semantic label into a second neural network model to obtain a semantic feature vector; the training unit 40 is used to train the parameter value of a target parameter in the first neural network model according to the speech feature vector and the semantic feature vector; and the recognition unit 50 is used to recognize a target speech signal according to the trained first neural network model, wherein the value of the target parameter in the trained first neural network model is the trained parameter value.
The speech recognition device provided by this embodiment solves the problem in the related art that, when training a speech recognition model, the implicit semantic vector learned by the acoustic model can corrupt the real semantic vector, so that the loss function and the error rate cannot decrease synchronously during training and the trained speech recognition model converges slowly. By taking into account during training the real semantic information carried in the semantic label, and introducing the semantic feature vector into the training of the model parameters, the real semantic vector learned by the second neural network model is preserved, which accelerates the convergence of speech recognition model training.
Preferably, the training unit 40 may include: an alignment module, for aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result; a calculation module, for calculating, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and an adjusting module, for adjusting the parameter value of the target parameter in the first neural network model according to the error.
Preferably, the alignment module may include a first determination sub-module, for determining the joint probability distribution of outputting the speech feature vector and the semantic feature vector; and the calculation module may include a second determination sub-module, for determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, wherein the joint model includes the first neural network model and the second neural network model, and a third determination sub-module, for determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
Obviously, those skilled in the art should understand that each of the above modules or steps of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices; optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can be made into individual integrated-circuit modules, or multiple modules or steps among them can be made into a single integrated-circuit module. Thus, the present invention is not restricted to any specific combination of hardware and software.
The foregoing are only preferred embodiments of the present invention and are not intended to limit the invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.
Claims (10)
- 1. A speech recognition method, characterized by comprising:
determining a training speech signal and a semantic label corresponding to the training speech signal;
inputting the training speech signal into a first neural network model to obtain a speech feature vector;
inputting the semantic label into a second neural network model to obtain a semantic feature vector;
training the parameter value of a target parameter in the first neural network model according to the speech feature vector and the semantic feature vector; and
recognizing a target speech signal according to the trained first neural network model, wherein the value of the target parameter in the trained first neural network model is the trained parameter value.
- 2. The method according to claim 1, characterized in that training the parameter value of the target parameter in the first neural network model according to the speech feature vector and the semantic feature vector comprises:
aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result;
calculating, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and
adjusting the parameter value of the target parameter in the first neural network model according to the error.
- 3. The method according to claim 2, characterized in that
aligning the speech feature vector and the semantic feature vector through the alignment network model to obtain the training result comprises: determining the joint probability distribution of outputting the speech feature vector and the semantic feature vector; and
calculating, by the preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label comprises: determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, wherein the joint model comprises the first neural network model and the second neural network model; and determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
- 4. The method according to claim 2, characterized in that the alignment network model is a CTC alignment network model, and aligning the speech feature vector and the semantic feature vector through the alignment network model comprises: aligning the speech feature vector and the semantic feature vector through the CTC alignment network model.
- 5. The method according to claim 1, characterized in that the training speech signal is a plurality of training speech signals Pn, and the semantic labels are a plurality of semantic labels Qn in one-to-one correspondence with the plurality of training speech signals;
inputting the training speech signal into the first neural network model to obtain the speech feature vector comprises: inputting the i-th training speech signal Pi into the first neural network model to obtain a speech feature vector Ri, wherein the current parameter value of the target parameter of the first neural network model is M(i-1);
inputting the semantic label into the second neural network model to obtain the semantic feature vector comprises: inputting the i-th semantic label Qi into the second neural network model to obtain a semantic feature vector Ti, wherein the current parameter value of the target parameter of the second neural network model is S(i-1);
training the parameter value of the target parameter in the first neural network model according to the speech feature vector and the semantic feature vector comprises: determining, according to the speech feature vector Ri and the semantic feature vector Ti, the parameter value Mi of the target parameter of the first neural network model and the parameter value Si of the target parameter of the second neural network model; and
performing the above steps in turn until i = n.
- 6. The method according to claim 1, characterized in that the first neural network model is an RNN model, and inputting the training speech signal into the first neural network model to obtain the speech feature vector comprises:
framing the training speech signal to obtain a training speech sequence; and
inputting the training speech sequence into the RNN model to obtain the speech feature vector.
- 7. The method according to claim 1, characterized in that the second neural network model is an RNN model, and inputting the semantic label into the second neural network model to obtain the semantic feature vector comprises:
determining a semantic label sequence according to the semantic label; and
inputting the semantic label sequence into the RNN model to obtain the semantic feature vector.
- 8. A speech recognition device, characterized by comprising:
a determining unit, for determining a training speech signal and a semantic label corresponding to the training speech signal;
a first input unit, for inputting the training speech signal into a first neural network model to obtain a speech feature vector;
a second input unit, for inputting the semantic label into a second neural network model to obtain a semantic feature vector;
a training unit, for training the parameter value of a target parameter in the first neural network model according to the speech feature vector and the semantic feature vector; and
a recognition unit, for recognizing a target speech signal according to the trained first neural network model, wherein the value of the target parameter in the trained first neural network model is the trained parameter value.
- 9. The device according to claim 8, characterized in that the training unit comprises:
an alignment module, for aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result;
a calculation module, for calculating, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and
an adjusting module, for adjusting the parameter value of the target parameter in the first neural network model according to the error.
- 10. The device according to claim 9, characterized in that
the alignment module comprises: a first determination sub-module, for determining the joint probability distribution of outputting the speech feature vector and the semantic feature vector; and
the calculation module comprises: a second determination sub-module, for determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, wherein the joint model comprises the first neural network model and the second neural network model; and a third determination sub-module, for determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610847843.4A CN107871497A (en) | 2016-09-23 | 2016-09-23 | Audio recognition method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107871497A true CN107871497A (en) | 2018-04-03 |
Family
ID=61751546
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610847843.4A Pending CN107871497A (en) | 2016-09-23 | 2016-09-23 | Audio recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107871497A (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7280963B1 (en) * | 2003-09-12 | 2007-10-09 | Nuance Communications, Inc. | Method for learning linguistically valid word pronunciations from acoustic data |
CN102810311A (en) * | 2011-06-01 | 2012-12-05 | 株式会社理光 | Speaker estimation method and speaker estimation equipment |
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN103021418A (en) * | 2012-12-13 | 2013-04-03 | 南京邮电大学 | Voice conversion method facing to multi-time scale prosodic features |
US20130085756A1 (en) * | 2005-11-30 | 2013-04-04 | At&T Corp. | System and Method of Semi-Supervised Learning for Spoken Language Understanding Using Semantic Role Labeling |
CN103531205A (en) * | 2013-10-09 | 2014-01-22 | 常州工学院 | Asymmetrical voice conversion method based on deep neural network feature mapping |
CN103984959A (en) * | 2014-05-26 | 2014-08-13 | 中国科学院自动化研究所 | Data-driven and task-driven image classification method |
CN104575519A (en) * | 2013-10-17 | 2015-04-29 | 清华大学 | Feature extraction method and device as well as stress detection method and device |
CN105139864A (en) * | 2015-08-17 | 2015-12-09 | 北京天诚盛业科技有限公司 | Voice recognition method and voice recognition device |
CN102831184B (en) * | 2012-08-01 | 2016-03-02 | 中国科学院自动化研究所 | According to the method and system text description of social event being predicted to social affection |
CN105469785A (en) * | 2015-11-25 | 2016-04-06 | 南京师范大学 | Voice activity detection method in communication-terminal double-microphone denoising system and apparatus thereof |
CN105551483A (en) * | 2015-12-11 | 2016-05-04 | 百度在线网络技术(北京)有限公司 | Speech recognition modeling method and speech recognition modeling device |
CN105895082A (en) * | 2016-05-30 | 2016-08-24 | 乐视控股(北京)有限公司 | Acoustic model training method and device as well as speech recognition method and device |
Non-Patent Citations (5)
Title |
---|
BRETT MATTHEWS: "Fast audio search using vector space modelling", 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU) *
FERREIRA, EMMANUEL: "Adversarial bandit for online interactive active learning of zero-shot spoken language understanding", IEEE International Conference on Acoustics, Speech, and Signal Processing *
任纪生: "Chinese speech recognition method based on latent semantic information", 2004 Chinese Information Processing Technology Symposium *
杨南: "Research on statistical machine translation based on neural network learning", China Masters' and Doctoral Dissertations Full-text Database *
贾永红: "Digital Image Processing", 31 July 2015 *
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108833722A (en) * | 2018-05-29 | 2018-11-16 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN108833722B (en) * | 2018-05-29 | 2021-05-11 | 平安科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
CN108922513A (en) * | 2018-06-04 | 2018-11-30 | 平安科技(深圳)有限公司 | Speech differentiation method, apparatus, computer equipment and storage medium |
CN108847224A (en) * | 2018-07-05 | 2018-11-20 | 广州势必可赢网络科技有限公司 | A kind of sound mural painting plane display method and device |
CN110895935A (en) * | 2018-09-13 | 2020-03-20 | 阿里巴巴集团控股有限公司 | Speech recognition method, system, device and medium |
CN110895935B (en) * | 2018-09-13 | 2023-10-27 | 阿里巴巴集团控股有限公司 | Speech recognition method, system, equipment and medium |
CN109559735A (en) * | 2018-10-11 | 2019-04-02 | 平安科技(深圳)有限公司 | A kind of audio recognition method neural network based, terminal device and medium |
CN109559735B (en) * | 2018-10-11 | 2023-10-27 | 平安科技(深圳)有限公司 | Voice recognition method, terminal equipment and medium based on neural network |
CN109326299B (en) * | 2018-11-14 | 2023-04-25 | 平安科技(深圳)有限公司 | Speech enhancement method, device and storage medium based on full convolution neural network |
CN109326299A (en) * | 2018-11-14 | 2019-02-12 | 平安科技(深圳)有限公司 | Sound enhancement method, device and storage medium based on full convolutional neural networks |
CN111477212B (en) * | 2019-01-04 | 2023-10-24 | 阿里巴巴集团控股有限公司 | Content identification, model training and data processing method, system and equipment |
CN111477212A (en) * | 2019-01-04 | 2020-07-31 | 阿里巴巴集团控股有限公司 | Content recognition, model training and data processing method, system and equipment |
CN110517666A (en) * | 2019-01-29 | 2019-11-29 | 腾讯科技(深圳)有限公司 | Audio identification methods, system, machinery equipment and computer-readable medium |
CN110517666B (en) * | 2019-01-29 | 2021-03-02 | 腾讯科技(深圳)有限公司 | Audio recognition method, system, machine device and computer readable medium |
CN111768761B (en) * | 2019-03-14 | 2024-03-01 | 京东科技控股股份有限公司 | Training method and device for speech recognition model |
CN111768761A (en) * | 2019-03-14 | 2020-10-13 | 京东数字科技控股有限公司 | Training method and device of voice recognition model |
CN109866713A (en) * | 2019-03-21 | 2019-06-11 | 斑马网络技术有限公司 | Safety detection method and device, vehicle |
CN109887497A (en) * | 2019-04-12 | 2019-06-14 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
CN109887497B (en) * | 2019-04-12 | 2021-01-29 | 北京百度网讯科技有限公司 | Modeling method, device and equipment for speech recognition |
CN110033760A (en) * | 2019-04-15 | 2019-07-19 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
US11688391B2 (en) | 2019-04-15 | 2023-06-27 | Beijing Baidu Netcom Science And Technology Co. | Mandarin and dialect mixed modeling and speech recognition |
CN110033760B (en) * | 2019-04-15 | 2021-01-29 | 北京百度网讯科技有限公司 | Modeling method, device and equipment for speech recognition |
CN109887511A (en) * | 2019-04-24 | 2019-06-14 | 武汉水象电子科技有限公司 | A kind of voice wake-up optimization method based on cascade DNN |
CN111862985A (en) * | 2019-05-17 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | Voice recognition device, method, electronic equipment and storage medium |
CN110379407A (en) * | 2019-07-22 | 2019-10-25 | 出门问问(苏州)信息科技有限公司 | Adaptive voice synthetic method, device, readable storage medium storing program for executing and calculating equipment |
CN112949107A (en) * | 2019-12-10 | 2021-06-11 | 通用汽车环球科技运作有限责任公司 | Composite neural network architecture for stress distribution prediction |
CN113129867A (en) * | 2019-12-28 | 2021-07-16 | 中移(上海)信息通信科技有限公司 | Training method of voice recognition model, voice recognition method, device and equipment |
CN111128137A (en) * | 2019-12-30 | 2020-05-08 | 广州市百果园信息技术有限公司 | Acoustic model training method and device, computer equipment and storage medium |
CN113112993A (en) * | 2020-01-10 | 2021-07-13 | 阿里巴巴集团控股有限公司 | Audio information processing method and device, electronic equipment and storage medium |
CN113112993B (en) * | 2020-01-10 | 2024-04-02 | 阿里巴巴集团控股有限公司 | Audio information processing method and device, electronic equipment and storage medium |
CN111223476A (en) * | 2020-04-23 | 2020-06-02 | 深圳市友杰智新科技有限公司 | Method and device for extracting voice feature vector, computer equipment and storage medium |
CN111739537A (en) * | 2020-06-08 | 2020-10-02 | 北京灵蚌科技有限公司 | Semantic recognition method and device, storage medium and processor |
CN111739537B (en) * | 2020-06-08 | 2023-01-24 | 北京灵蚌科技有限公司 | Semantic recognition method and device, storage medium and processor |
CN113129869B (en) * | 2021-03-22 | 2022-01-28 | 北京百度网讯科技有限公司 | Method and device for training and recognizing voice recognition model |
CN113129869A (en) * | 2021-03-22 | 2021-07-16 | 北京百度网讯科技有限公司 | Method and device for training and recognizing voice recognition model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107871497A (en) | Audio recognition method and device | |
CN110534132A (en) | Speech emotion recognition method using a parallel convolutional recurrent neural network based on spectrogram features | |
CN107885853A (en) | Hybrid text classification method based on deep learning | |
CN108133038A (en) | Entity-level sentiment classification system and method based on dynamic memory networks | |
CN108280064A (en) | Joint processing method for word segmentation, part-of-speech tagging, entity recognition, and syntactic analysis | |
CN109062939A (en) | Intelligent tutoring method for international Chinese-language education | |
CN107330444A (en) | Automatic image text annotation method based on generative adversarial networks | |
CN107705806A (en) | Method for speech emotion recognition using spectrograms and deep convolutional neural networks | |
CN112395945A (en) | Graph convolution behavior recognition method and device based on skeletal joint points | |
CN107239446A (en) | Intelligent relation extraction method based on neural networks and attention mechanisms | |
CN108960419A (en) | Device and method for student-teacher transfer learning networks using a knowledge bridge | |
CN108764292 (en) | Deep learning image object mapping and localization method based on weakly supervised information | |
CN106202044A (en) | Entity relation extraction method based on deep neural networks | |
CN107679462A (en) | Deep multi-feature fusion classification method based on wavelets | |
CN106980858A (en) | Language text detection and localization system, and text detection and localization method using the system | |
CN106897738A (en) | Pedestrian detection method based on semi-supervised learning | |
CN111581966B (en) | Context feature-fused aspect-level emotion classification method and device | |
CN110222163A (en) | Intelligent question answering method and system combining CNN and bidirectional LSTM | |
CN109934261A (en) | Knowledge-driven parameter transformation model and its few-shot learning method | |
CN106469560A (en) | Speech emotion recognition method based on unsupervised domain adaptation | |
CN109753567A (en) | Text classification method combining title and body attention mechanisms | |
CN112990296B (en) | Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation | |
CN109410974A (en) | Speech enhancement method, device, equipment and storage medium | |
CN112364719A (en) | Method for rapidly detecting remote sensing image target | |
CN108765383A (en) | Video description method based on deep transfer learning | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2018-04-03 |