CN107871497A - Audio recognition method and device - Google Patents
- Publication number
- CN107871497A (application CN201610847843.4A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- network model
- training
- feature vector
- voice signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/18—Artificial neural networks; Connectionist approaches
Abstract
The invention discloses a speech recognition method and device. The method includes: determining a training speech signal and a semantic label corresponding to the training speech signal; inputting the training speech signal into a first neural network model to obtain a speech feature vector; inputting the semantic label into a second neural network model to obtain a semantic feature vector; training the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and recognizing a target speech signal according to the trained first neural network model, where the values of the target parameters in the trained first neural network model are the parameter values obtained by training. The invention solves the problem in the related art that training a speech recognition model converges slowly.
Description
Technical field
The present invention relates to the field of speech recognition, and in particular to a speech recognition method and device.
Background art
Speech recognition technology enables a machine to convert a speech signal into the corresponding text through a process of recognition and understanding. Traditional speech recognition techniques depend heavily on hand-crafted features and have low accuracy. Applying deep learning to speech recognition can imitate the way the brain learns and recognizes speech signals, and can greatly improve recognition accuracy.
Deep learning applied to speech recognition has already achieved significant progress. Several deep networks are introduced below:
Deep neural networks (DNNs): the features extracted by this kind of network are highly discriminative, so the trained model has strong discrimination ability. Such a network usually uses a deep belief network (DBN) for pre-training and trains the acoustic model with a DNN-HMM hybrid network; it is widely used in large-vocabulary speech recognition systems.
Convolutional neural networks (CNNs): compared with DNNs, they introduce the concepts of convolution and pooling. Convolution extracts local information from the speech features, and pooling then strengthens the model's robustness to those features. While greatly reducing model size, recognition performance is better and generalization ability is stronger.
Recurrent neural networks (RNNs): the deep network model most commonly used in speech recognition at present is the RNN. It is a sequence model that, on top of a neural network, considers the connections between the hidden-layer units of adjacent speech frames, and trains the network by back-propagating errors through time to adjust the network parameters. The distributed hidden state of an RNN can effectively store past information, and as a nonlinear dynamic system it can update its hidden units in complex ways; combining these two characteristics enables it to recognize latent time dependencies through the recurrent layer and carry out the speech recognition task.
Connectionist temporal classification (CTC): an alignment model that aligns the output of a deep network with the label text and takes the sum of the probabilities of all possible paths as the probability of the whole sentence. Using CTC removes the need to segment samples in advance or post-process them, which greatly improves efficiency.
However, speech recognition methods in the prior art still have certain problems:
(1) Shortcomings of DNNs: DNN methods assume that the speech frames are independent and do not consider the correlation between frames. Moreover, the hidden layers generally need many neurons, gradient diffusion can become very serious during training, and the sequence error can only be computed in combination with other models.
(2) Shortcomings of CNNs: a CNN alone can only handle isolated-word recognition, so using CNNs to process continuous speech requires segmenting the speech in advance, which is very time-consuming and tedious. CNNs can also handle continuous speech in combination with other models, but this undoubtedly increases the number of parameters, and manually tuning the model parameters is also very time-consuming.
(3) Shortcomings of RNNs: because a large amount of information has to be memorized, training is difficult, computation is expensive, and recognition is slow. Moreover, the recursive structure is prone to gradient explosion and gradient vanishing during error back-propagation, which can make training impossible to continue.
(4) Shortcomings of CTC: only acoustic information is considered during training, which destroys the implicit language model learned by the RNN; the resulting harm is that the word-based error rate and the phoneme-based error rate cannot decrease synchronously, which increases training difficulty.
As the above analysis shows, in the related art the acoustic model is trained independently, and the implicit semantic vector learned by the acoustic model can destroy the real semantic vector, so the loss function and the error rate cannot decrease synchronously during training, which makes training converge slowly.
No effective solution has yet been proposed for the slow convergence of training speech recognition models in the related art.
Summary of the invention
The primary object of the present invention is to provide a speech recognition method and device, so as to solve the problem in the related art that training a speech recognition model converges slowly.
To achieve the above goals, according to one aspect of the present invention, a speech recognition method is provided. The method includes: determining a training speech signal and a semantic label corresponding to the training speech signal; inputting the training speech signal into a first neural network model to obtain a speech feature vector; inputting the semantic label into a second neural network model to obtain a semantic feature vector; training the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and recognizing a target speech signal according to the trained first neural network model, where the values of the target parameters in the trained first neural network model are the parameter values obtained by training.
Further, training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector includes: aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result; calculating, through a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and adjusting the parameter values of the target parameters in the first neural network model according to the error.
Further, aligning the speech feature vector and the semantic feature vector through the alignment network model to obtain the training result includes: determining the joint probability distribution of the output speech feature vector and the semantic feature vector. Calculating, through the preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label includes: determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, where the joint model includes the first neural network model and the second neural network model; and determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
Further, the alignment network model is a CTC alignment network model, and aligning the speech feature vector and the semantic feature vector through the alignment network model includes: aligning the speech feature vector and the semantic feature vector through the CTC alignment network model.
Further, the training speech signals are multiple training speech signals Pn, and the semantic labels are multiple semantic labels Qn in one-to-one correspondence with the multiple training speech signals. Inputting the training speech signal into the first neural network model to obtain the speech feature vector includes: inputting the i-th training speech signal Pi into the first neural network model to obtain a speech feature vector Ri, where the parameter value of the target parameter of the current first neural network model is M(i-1). Inputting the semantic label into the second neural network model to obtain the semantic feature vector includes: inputting the i-th semantic label Qi into the second neural network model to obtain a semantic feature vector Ti, where the parameter value of the target parameter of the current second neural network model is S(i-1). Training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector includes: determining, according to the speech feature vector Ri and the semantic feature vector Ti, the parameter value Mi of the target parameter in the first neural network model and the parameter value Si of the target parameter of the second neural network model; the above steps are performed in turn until i = n.
Further, the first neural network model is an RNN model, and inputting the training speech signal into the first neural network model to obtain the speech feature vector includes: performing framing on the training speech signal to obtain a training speech sequence; and inputting the training speech sequence into the RNN model to obtain the speech feature vector.
Further, the second neural network model is an RNN model, and inputting the semantic label into the second neural network model to obtain the semantic feature vector includes: determining a semantic label sequence according to the semantic label; and inputting the semantic label sequence into the RNN model to obtain the semantic feature vector.
To achieve the above goals, according to another aspect of the present invention, a speech recognition device is provided. The device includes: a determining unit, for determining a training speech signal and a semantic label corresponding to the training speech signal; a first input unit, for inputting the training speech signal into a first neural network model to obtain a speech feature vector; a second input unit, for inputting the semantic label into a second neural network model to obtain a semantic feature vector; a training unit, for training the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and a recognition unit, for recognizing a target speech signal according to the trained first neural network model, where the values of the target parameters in the trained first neural network model are the parameter values obtained by training.
Further, the training unit includes: an alignment module, for aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result; a calculation module, for calculating, through a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and an adjusting module, for adjusting the parameter values of the target parameters in the first neural network model according to the error.
Further, the alignment module includes: a first determining submodule, for determining the joint probability distribution of the output speech feature vector and the semantic feature vector. The calculation module includes: a second determining submodule, for determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, where the joint model includes the first neural network model and the second neural network model; and a third determining submodule, for determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
By extracting the semantic feature vector from the semantic label corresponding to the training speech signal and taking the semantic feature vector into account during training, the present invention trains the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector, and recognizes the target speech signal with the trained first neural network model. This solves the problem in the related art that training a speech recognition model converges slowly: by considering the real semantic information carried in the semantic label during training and introducing the semantic feature vector to train the model's parameters, the real semantic vector trained in the second neural network model is preserved, which in turn accelerates the convergence of training the speech recognition model.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The schematic embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flow chart of a speech recognition method according to the first embodiment of the present invention;
Fig. 2 is a flow chart of a speech recognition method according to the second embodiment of the present invention;
Fig. 3 is a schematic diagram of the connection mode of the RNN network according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the connection mode of the Prediction network according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the CTC-Prediction alignment network according to an embodiment of the present invention; and
Fig. 6 is a schematic diagram of a speech recognition device according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, provided there is no conflict, the embodiments in this application and the features in the embodiments can be combined with each other. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
In order to enable those skilled in the art to better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below in conjunction with the accompanying drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the accompanying drawings of this application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way can be interchanged where appropriate, so that the embodiments of this application described herein can be implemented. In addition, the terms "comprising" and "having" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device containing a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to that process, method, product, or device.
An embodiment of the present invention provides a speech recognition method.
Fig. 1 is a flow chart of a speech recognition method according to the first embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S101: determine a training speech signal and a semantic label corresponding to the training speech signal.
Step S102: input the training speech signal into a first neural network model to obtain a speech feature vector.
Step S103: input the semantic label into a second neural network model to obtain a semantic feature vector.
Step S104: train the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector.
Step S105: recognize a target speech signal according to the trained first neural network model.
Here, the values of the target parameters in the trained first neural network model are the parameter values obtained by training.
Speech recognition technology converts a speech signal into the corresponding text: the speech signal is input into a speech recognition model, which outputs the semantics represented by that signal. A speech recognition model needs to be trained. It contains undetermined parameters, and training is the process of continuously adjusting those undetermined parameters with training samples so that the recognition rate of the speech recognition model becomes higher.
A training sample includes a training speech signal and a semantic label corresponding to the training speech signal. The undetermined parameters of the speech recognition model are trained repeatedly with multiple training samples, so that the error between the semantics the model recognizes from the training speech signal and the semantics of the corresponding semantic label is minimized.
First, it is determined that voice signal and semantic label corresponding with training voice signal are trained, then respectively to training language
Sound signal and corresponding semantic label are handled:
Training voice signal is inputted into first nerves network model, exports speech feature vector, wherein, first nerves network
Model includes target component undetermined, first can set an initial value to the target component in first nerves network model,
With determine input first training voice signal corresponding to speech feature vector, alternatively, target component can be one or
Multiple, the initial value of each target component can be numerical value set in advance, and the numerical value set in advance can be according to history
What the characteristics of data, experience, algorithm or neural network model determined, specifically, initial value can also may be used by being manually entered
To determine and input by algorithm statistical history data, for example, for some target component, its default value scope for [a,
B], a numerical value is randomly selected in the range of [a, b] by random algorithm;
Semantic label is inputted into nervus opticus network model, output semantic feature is vectorial, wherein, nervus opticus network model
Include target component undetermined, an initial value first can be set to the target component in nervus opticus network model, with true
Surely semantic feature vector corresponding to the first training semantic label inputted.
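For instance, drawing an initial parameter value uniformly at random from a default range [a, b], as the description suggests, could look like the following hypothetical helper (not part of the patent):

```python
import numpy as np

def init_target_parameter(a, b, shape, seed=None):
    # Pick every initial value uniformly at random within the
    # default range [a, b], one of the initialisation options above.
    rng = np.random.default_rng(seed)
    return rng.uniform(a, b, size=shape)

w0 = init_target_parameter(-0.1, 0.1, shape=(4, 16), seed=0)
assert w0.shape == (4, 16)
assert w0.min() >= -0.1 and w0.max() <= 0.1
```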
Here, the training speech signal is a speech signal used to train the speech recognition model provided by this embodiment, and the semantic label is the semantic label corresponding to the training speech signal. There can be multiple training speech signals and, correspondingly, multiple semantic labels. For example, there are n training speech signals S1, S2, ..., Sn, and correspondingly the semantic labels are X1, X2, ..., Xn. Optionally, before the speech recognition model is trained, each training speech signal can be pre-processed by framing; one training speech signal yields multiple sub speech signals after framing. Correspondingly, the semantic label corresponding to that training speech signal can also be framed, yielding multiple sub semantic labels. The framing of a training speech signal and the framing of its semantic label can be independent of each other: for example, framing training speech signal S2 yields m sub speech signals S2a, S2b, ..., S2m, while framing the semantic label X2 corresponding to training speech signal S2 yields p sub semantic labels X2a, X2b, ..., X2p.
Optionally, the first neural network model and the second neural network model can be identical neural network models. After training, the first neural network model can be used as the speech recognition model to recognize speech signals.
After the speech feature vector and the semantic feature vector are obtained, the parameter values of the target parameters in the first neural network model are trained according to the speech feature vector and the semantic feature vector.
Specifically, during training the first neural network model can extract the speech feature vector from the training speech signal, and the second neural network model can extract the semantic feature vector contained in the semantic label corresponding to the training speech signal. After the speech feature vector and the semantic feature vector are determined, the parameter values of the target parameters in the first neural network model and in the second neural network model can be optimized according to the error between the speech feature vector and the semantic feature vector. That is, the error between the speech feature vector and the semantic feature vector can be back-propagated to the first neural network model and the second neural network model, and the parameter values of the target parameters in both models are adjusted according to the back-propagated error.
After the training process ends, the target speech signal is recognized according to the trained first neural network model. Because the first neural network model takes the semantic features contained in the semantic label into account during training, it can converge quickly, and the recognition precision of the first neural network model is improved.
The speech recognition method provided by this embodiment extracts the semantic feature vector from the semantic label corresponding to the training speech signal, takes the semantic feature vector into account during training, trains the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector, and recognizes the target speech signal with the trained first neural network model. This solves the problem in the related art that the implicit semantic vector learned by the acoustic model while training a speech recognition model destroys the real semantic vector, so that the loss function and the error rate cannot decrease synchronously during training and the speech recognition model converges slowly. By considering the real semantic information carried in the semantic label during training and introducing the semantic feature vector to train the model's parameters, the real semantic vector trained in the second neural network model is preserved, which in turn accelerates the convergence of training the speech recognition model.
Preferably, training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector can be performed through the following steps: aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result; calculating, through a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and adjusting the parameter values of the target parameters in the first neural network model according to the error.
Because the speech feature vector output by the first neural network model from the input training speech signal may differ in dimension from the semantic label corresponding to the training speech signal, alignment can be performed through the alignment network model. The alignment step can be regarded as a mapping between vectors of two different dimensions: through the alignment network, the mapping from the speech feature vector to the semantic label can include multiple paths with different probabilities. The training result is determined according to these probabilities, the error between the semantics expressed by the training result and the semantics expressed by the semantic label is calculated, and the parameter values of the target parameters in the first neural network model are adjusted according to the error. Here, adjusting the parameter values of the target parameters in the first neural network model according to the error includes adjusting, according to the error, the parameter values of the target parameters in the first neural network model and the parameter values of the target parameters in the second neural network model.
The specific step of aligning the speech feature vector and the semantic feature vector through the alignment network model to obtain the training result can be to determine the joint probability distribution of the output speech feature vector and the semantic feature vector. Calculating, through the preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label can be to determine the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, where the joint model includes the first neural network model and the second neural network model, and then to determine, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
Optionally, the alignment network model can be a CTC alignment network model; that is, aligning the speech feature vector and the semantic feature vector through the alignment network model means aligning them through the CTC alignment network model.
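As a concrete illustration of what the CTC alignment computes, the sketch below implements the forward (alpha) half of the forward-backward recursion over the blank-interleaved label, summing the probabilities of all frame-level paths that collapse to the label sequence, and verifies it against brute-force path enumeration. The toy probabilities and the choice of 0 as the blank symbol are our own assumptions, not the patent's.

```python
import itertools
import numpy as np

def ctc_forward(probs, label, blank=0):
    """Probability that per-frame outputs `probs` (T x K, rows sum to 1)
    align to `label` under CTC (merge repeats, then drop blanks),
    computed with the forward (alpha) recursion."""
    T, _ = probs.shape
    ext = [blank]                  # blank-interleaved label l'
    for c in label:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # the diagonal skip is allowed unless l'_s is blank
            # or repeats l'_{s-2}
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)

def collapse(path, blank=0):
    # CTC collapse rule: merge consecutive repeats, then remove blanks.
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# Sanity check against brute-force enumeration of all 3^4 paths.
rng = np.random.default_rng(1)
probs = rng.random((4, 3))
probs /= probs.sum(axis=1, keepdims=True)
label = [1, 2]
brute = sum(
    np.prod([probs[t, p[t]] for t in range(4)])
    for p in itertools.product(range(3), repeat=4)
    if collapse(p) == label
)
assert abs(ctc_forward(probs, label) - brute) < 1e-12
```

Summing path probabilities this way is exactly why CTC needs no pre-segmented samples: every possible frame-to-label alignment contributes to the sentence probability.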
As a preferred implementation of the above embodiment, the training process can include multiple rounds of training. Specifically, the training speech signals are multiple training speech signals Pn, and the semantic labels are multiple semantic labels Qn in one-to-one correspondence with the multiple training speech signals. Inputting the training speech signal into the first neural network model to obtain the speech feature vector includes: inputting the i-th training speech signal Pi into the first neural network model to obtain a speech feature vector Ri, where the parameter value of the target parameter of the current first neural network model is M(i-1). Inputting the semantic label into the second neural network model to obtain the semantic feature vector includes: inputting the i-th semantic label Qi into the second neural network model to obtain a semantic feature vector Ti, where the parameter value of the target parameter of the current second neural network model is S(i-1). Training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector includes: determining, according to the speech feature vector Ri and the semantic feature vector Ti, the parameter value Mi of the target parameter in the first neural network model and the parameter value Si of the target parameter of the second neural network model; the above steps are performed in turn until i = n. Here, when i = 1 (that is, when i-1 = 0), initial values can be assigned to the parameter value M0 of the target parameter of the first neural network model and the parameter value S0 of the target parameter of the second neural network model.
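The round-by-round notation (Pi, Qi, Ri, Ti, and parameter sequences M0...Mn, S0...Sn) can be sketched as a plain loop. Toy linear layers again stand in for the two networks, and the squared distance between Ri and Ti stands in for the alignment error; everything below is our own illustrative scaffolding, not the patent's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
P = rng.standard_normal((n, 16))  # n training speech signals P1..Pn (toy)
Q = rng.standard_normal((n, 8))   # the matching semantic labels Q1..Qn (toy)

# M0 and S0: initial values of the two models' target parameters.
M = rng.standard_normal((4, 16)) * 0.1   # first (speech) network
S = rng.standard_normal((4, 8)) * 0.1    # second (semantic) network
M0, S0 = M.copy(), S.copy()

lr = 0.01
for i in range(n):            # the i-th round starts from M(i-1) and S(i-1)
    Ri = M @ P[i]             # speech feature vector Ri
    Ti = S @ Q[i]             # semantic feature vector Ti
    err = Ri - Ti             # error between the two feature vectors
    # both models' target parameters are adjusted, giving Mi and Si
    M -= lr * np.outer(err, P[i])
    S += lr * np.outer(err, Q[i])

assert not np.allclose(M, M0) and not np.allclose(S, S0)
```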
Optionally, the first neural network model can be an RNN model. Inputting the training speech signal into the first neural network model to obtain the speech feature vector includes: performing framing on the training speech signal to obtain a training speech sequence; and inputting the training speech sequence into the RNN model to obtain the speech feature vector.
Optionally, the second neural network model can be an RNN model; preferably, this RNN model can be a Prediction recurrent neural network. Inputting the semantic label into the second neural network model to obtain the semantic feature vector includes: determining a semantic label sequence according to the semantic label; and inputting the semantic label sequence into the RNN model to obtain the semantic feature vector.
Fig. 2 is a flow chart of a speech recognition method according to the second embodiment of the present invention. This embodiment can serve as a preferred implementation of the above first embodiment. As shown in Fig. 2, the method includes the following steps:
Step S201: determine training samples. A training sample can include a training speech signal and a semantic label corresponding to the training speech signal.
Step S202: pre-process the speech signal. Pre-processing of the speech signal includes framing, pre-emphasis, de-noising and so on; preferably, this embodiment only performs framing. For speech sampled at 8000 Hz, the frame length used is 20 ms and the frame shift is 10 ms. After framing the training speech signal, multiple frames of speech are obtained.
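As a concrete illustration of the framing step, the following minimal NumPy sketch (the function name `frame_signal` and its interface are ours, not the patent's) splits a signal into 20 ms frames with a 10 ms shift at 8000 Hz, i.e. 160-sample frames overlapping by 80 samples:

```python
import numpy as np

def frame_signal(signal, sample_rate=8000, frame_ms=20, shift_ms=10):
    """Split a 1-D speech signal into overlapping frames.

    With the parameters of this embodiment (8000 Hz sampling, 20 ms
    frame length, 10 ms frame shift) each frame holds 160 samples and
    consecutive frames overlap by 80 samples.
    """
    frame_len = sample_rate * frame_ms // 1000   # 160 samples
    shift = sample_rate * shift_ms // 1000       # 80 samples
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(8000))  # one second of (silent) audio
print(frames.shape)  # (99, 160)
```

One second of audio thus yields 99 frames, each frame starting 80 samples after the previous one.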
Step S203: extract a high-level speech feature of each frame with an RNN. The RNN neural network model extracts a high-level speech feature from every frame of speech obtained by framing.
An RNN is a sequence model that, on top of an ordinary neural network, adds connections between the hidden-layer units of adjacent times t and t-1, giving it a strong ability to represent the useful information in non-linear time-series signals.
Let x = (x1, x2, ..., xT) be an input sequence of length T, where xt denotes the speech vector of frame t.
The connection pattern of the RNN used in this embodiment is shown in Fig. 3, and forward propagation is computed as follows:

h_t^(i) = f(W^(i) h_t^(i-1) + W_hh h_(t-1)^(i) + b^(i)), with h_t^(0) = x_t,

where h_t^(i) is the output vector of the i-th hidden layer (i = 1, 2, 3) at time t, W^(i) is the weight matrix connecting layer i with layer i-1, W_hh is the weight matrix of the recurrent layer, b^(i) is the bias vector of layer i, and f is the non-linear activation function of the hidden layer, usually taken to be the sigmoid function. h_t^(3) is the high-level speech feature finally obtained.
Optionally, the RNN that extracts the high-level speech feature of each frame may be a unidirectional RNN with three hidden layers; a deeper network may be chosen to extract more abstract features, or a bidirectional RNN may be selected to fully learn both past and future context. It should be noted that every RNN model in this embodiment may be replaced by an LSTM model or a GRU model.
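The stacked forward propagation described above can be sketched in NumPy as follows (a minimal sketch: the per-layer parameter layout, the random weights and the dimensions are illustrative assumptions, not the patent's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(x, Ws, Whs, bs):
    """Forward pass of a stacked unidirectional RNN.

    x : (T, D) input frames.  Ws[i], Whs[i], bs[i] are the input,
    recurrent and bias parameters of hidden layer i.  Returns the
    top-layer hidden sequence, i.e. the high-level speech features.
    """
    T = x.shape[0]
    h_below = x
    for W, Wh, b in zip(Ws, Whs, bs):
        H = Wh.shape[0]
        h = np.zeros((T, H))
        h_prev = np.zeros(H)
        for t in range(T):
            # h_t^(i) = f(W^(i) h_t^(i-1) + W_hh h_(t-1)^(i) + b^(i))
            h[t] = sigmoid(W @ h_below[t] + Wh @ h_prev + b)
            h_prev = h[t]
        h_below = h
    return h_below

rng = np.random.default_rng(0)
D, H, T = 160, 32, 5  # frame size, hidden size, number of frames
Ws  = [rng.normal(size=(H, D))] + [rng.normal(size=(H, H)) for _ in range(2)]
Whs = [rng.normal(size=(H, H)) for _ in range(3)]
bs  = [np.zeros(H) for _ in range(3)]
feats = rnn_forward(rng.normal(size=(T, D)), Ws, Whs, bs)
print(feats.shape)  # (5, 32)
```

Each of the three layers applies the same recurrence, feeding its hidden sequence to the layer above, so the returned array corresponds to h_t^(3) in the text.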
Step S204: align the predicted result with the label of the actual speech sequence via CTC-Prediction. The result predicted by the CTC-Prediction model is aligned with the label of the actual speech sequence. The CTC-Prediction model comprises a Prediction network and a CTC network.
The Prediction network is a recurrent neural network with one input layer, one hidden layer and one output layer. Semantic features can be extracted by the Prediction network by feeding the semantic labels into it. An input sequence ŷ = (φ, y_1, ..., y_U) of length U+1, with the blank symbol φ prepended, is mapped by the Prediction network to an output sequence g. The input vectors are one-hot encoded: if y_u = k, then ŷ_u is a vector of length K whose k-th dimension is 1 and whose remaining dimensions are all 0. The dimension of each time step of the input layer is therefore K, and the dimension of each time step of the output layer is K+1.
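The one-hot encoding just described can be sketched as follows (a minimal NumPy sketch; the example labels and the convention of encoding the prepended φ as an all-zero row are our illustrative assumptions):

```python
import numpy as np

K = 5                       # size of the label alphabet
labels = [None, 2, 0, 4]    # None stands for the prepended blank φ
onehot = np.zeros((len(labels), K))
for u, k in enumerate(labels):
    if k is not None:       # φ is represented here by the all-zero vector
        onehot[u, k] = 1.0
print(onehot.shape)  # (4, 5)
```

Each real label y_u = k becomes a length-K row with a single 1 in dimension k, matching the input-layer dimension K stated above.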
The connection pattern of the Prediction network is shown in Fig. 4. Given ŷ, the hidden-layer vector sequence (h_0, ..., h_U) and the prediction sequence (g_0, ..., g_U) are computed by iterating the following formulas for u = 0, ..., U:

h_u = f(W_ih ŷ_u + W_hh h_(u-1) + b_h)
g_u = W_ho h_u + b_o

where W_ih is the input-hidden weight matrix, W_ho is the hidden-output weight matrix, b_h and b_o are the corresponding bias vectors, and f is the activation function of the hidden layer, here the sigmoid. The resulting g_u is the language-model vector.
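The two formulas above can be sketched directly in NumPy (a minimal sketch; the random weights, zero biases and dimensions are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prediction_net(y_hat, Wih, Whh, Who, bh, bo):
    """Single-hidden-layer recurrent Prediction network.

    y_hat : (U+1, K) one-hot inputs, row 0 being the blank φ
    (all zeros here).  Returns g of shape (U+1, K+1), the
    language-model vectors g_0 ... g_U.
    """
    h = np.zeros(Whh.shape[0])                       # h_(-1) = 0
    g = np.zeros((y_hat.shape[0], Who.shape[0]))
    for u in range(y_hat.shape[0]):
        h = sigmoid(Wih @ y_hat[u] + Whh @ h + bh)   # h_u
        g[u] = Who @ h + bo                          # g_u = W_ho h_u + b_o
    return g

rng = np.random.default_rng(1)
K, U, H = 5, 3, 8
y_hat = np.zeros((U + 1, K))
for u, k in enumerate([2, 0, 4], start=1):           # labels y_1 ... y_U
    y_hat[u, k] = 1.0
g = prediction_net(y_hat,
                   rng.normal(size=(H, K)), rng.normal(size=(H, H)),
                   rng.normal(size=(K + 1, H)),
                   np.zeros(H), np.zeros(K + 1))
print(g.shape)  # (4, 6)
```

As stated in the text, the input dimension per step is K and the output dimension per step is K+1 (the extra dimension accommodating φ).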
The high-level speech feature extracted by the RNN is the acoustic-model vector l_t, 1 ≤ t ≤ T.
Next, the acoustic-model vector l_t and the language-model vector g_u are combined to compute the density function of the output:

h(k, t, u) = exp(l_t^k + g_u^k)

which is normalized to give the output distribution:

Pr(k | t, u) = h(k, t, u) / Σ_k' h(k', t, u)

To simplify the formulas, write:

y(t, u) = Pr(y_(u+1) | t, u)
φ(t, u) = Pr(φ | t, u)
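A minimal NumPy sketch of this combine-and-normalize step (the max-subtraction for numerical stability and the random test vectors are our additions; the source defines only the plain softmax):

```python
import numpy as np

def joint_distribution(l, g):
    """Pr(k | t, u): softmax over k of l_t^k + g_u^k.

    l : (T, K+1) acoustic-model vectors, g : (U+1, K+1)
    language-model vectors.  Returns a (T, U+1, K+1) grid of
    output distributions.
    """
    s = l[:, None, :] + g[None, :, :]      # broadcast to (T, U+1, K+1)
    s -= s.max(axis=-1, keepdims=True)     # numerical stability
    h = np.exp(s)
    return h / h.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
T, U, K = 4, 3, 5
P = joint_distribution(rng.normal(size=(T, K + 1)),
                       rng.normal(size=(U + 1, K + 1)))
print(P.shape)  # (4, 4, 6)
```

With a convention such as blank at index 0 (an assumption, not fixed by the text), φ(t, u) would be `P[t, u, 0]` and y(t, u) would be `P[t, u, label[u + 1]]`.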
The CTC-Prediction alignment lattice is shown in Fig. 5: a horizontal arrow means that time t outputs φ (i.e. nothing is emitted), a vertical arrow means that time t outputs the (u+1)-th element of y, and each path from the bottom-left corner to the terminal node in the top-right corner (the red arrows mark one such path) represents one alignment of the input and output sequences.
Define the forward variable α(t, u) as the probability that l_[1:t] outputs y_[1:u]. For 1 ≤ t ≤ T, 0 ≤ u ≤ U it can be computed iteratively:

α(t, u) = α(t-1, u) φ(t-1, u) + α(t, u-1) y(t, u-1)

with the initialization α(1, 0) = 1.

The probability of the whole output sequence is the forward variable at the terminal node:

Pr(y | x) = α(T, U) φ(T, U)
Correspondingly, define the backward variable β(t, u) as the probability that l_[t:T] outputs y_[u+1:U]. It is computed iteratively as:

β(t, u) = β(t+1, u) φ(t, u) + β(t, u+1) y(t, u)

with the initialization β(T, U) = φ(T, U).
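The α and β recursions above can be sketched as follows (0-based indices, so `alpha[0, 0]` plays the role of α(1, 0); the tiny uniform 2×2 example is illustrative):

```python
import numpy as np

def forward_backward(y_prob, phi_prob):
    """alpha/beta recursions over the alignment lattice.

    y_prob[t, u]   = y(t, u)   : probability of emitting y_(u+1) at (t, u)
    phi_prob[t, u] = phi(t, u) : probability of emitting blank at (t, u)
    """
    T, U1 = phi_prob.shape
    alpha = np.zeros((T, U1))
    alpha[0, 0] = 1.0                        # alpha(1, 0) = 1
    for t in range(T):
        for u in range(U1):
            if t > 0:
                alpha[t, u] += alpha[t - 1, u] * phi_prob[t - 1, u]
            if u > 0:
                alpha[t, u] += alpha[t, u - 1] * y_prob[t, u - 1]
    beta = np.zeros((T, U1))
    beta[-1, -1] = phi_prob[-1, -1]          # beta(T, U) = phi(T, U)
    for t in range(T - 1, -1, -1):
        for u in range(U1 - 1, -1, -1):
            if t == T - 1 and u == U1 - 1:
                continue
            b = 0.0
            if t + 1 < T:
                b += beta[t + 1, u] * phi_prob[t, u]
            if u + 1 < U1:
                b += beta[t, u + 1] * y_prob[t, u]
            beta[t, u] = b
    return alpha, beta, alpha[-1, -1] * phi_prob[-1, -1]   # Pr(y | x)

# Tiny uniform example: T = 2 frames, U = 1 label, all emissions 0.5.
y_prob = np.full((2, 2), 0.5)
phi_prob = np.full((2, 2), 0.5)
alpha, beta, p = forward_backward(y_prob, phi_prob)
print(p)  # 0.25
```

On this example the two possible paths each have probability 0.125, so Pr(y | x) = 0.25, and `beta[0, 0]` agrees with `p`, as it should, since both equal the total sequence probability.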
The loss function of the whole network model is the negative log-probability of the target sequence:

L = -ln Pr(y | x)

For each training sample, the derivatives of the loss function with respect to the acoustic-model vector l_t and the language-model vector g_u are obtained from the forward and backward variables, and error back-propagation is carried out to update the parameters.
Steps S203 and S204 constitute the training process: the RNN extracts a high-level speech feature of every frame, the CTC-Prediction model aligns the predicted result with the label of the actual speech sequence, and the parameters of the RNN model, the Prediction model and the CTC model are adjusted.
Step S205: the trained RNN. The trained RNN is obtained by performing steps S203 and S204 multiple times.
Step S206: determine a test sample. The test sample includes a test speech signal; the semantic features contained in the test speech signal are recognized by the trained RNN.
Step S207: pre-process the speech signal. Pre-processing, for example framing, is performed on the speech signal in the test sample.
Step S208: output the classification result. The pre-processed speech signal is recognized by the RNN trained in step S205, and the classification result is output.
In the speech recognition method based on CTC-Prediction-model-aligned RNNs provided by this embodiment, a deep RNN first extracts the high-level speech feature of each frame, then the CTC-Prediction network aligns the predicted result with the label of the actual speech sequence to obtain the error of this pass, which is back-propagated for training. Compared with the existing RNN training method in which only a CTC network aligns the predicted result with the label of the actual speech sequence, this method considers during alignment not only the acoustic model but also the influence of the language model on the error, which makes the algorithm converge faster and improves test accuracy.
It should be noted that the steps illustrated in the flow charts of the accompanying drawings may be executed in a computer system such as a set of computer-executable instructions, and, although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one given here.
An embodiment of the invention further provides a speech recognition device. It should be noted that the speech recognition device of the embodiment of the present invention may be used to perform the speech recognition method of the present invention.
Fig. 6 is a schematic diagram of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 6, the device includes a determining unit 10, a first input unit 20, a second input unit 30, a training unit 40 and a recognition unit 50.
The determining unit 10 is used to determine a training speech signal and a semantic label corresponding to the training speech signal; the first input unit 20 is used to input the training speech signal into a first neural network model to obtain a speech feature vector; the second input unit 30 is used to input the semantic label into a second neural network model to obtain a semantic feature vector; the training unit 40 is used to train the parameter value of a target parameter in the first neural network model according to the speech feature vector and the semantic feature vector; and the recognition unit 50 is used to recognize a target speech signal according to the trained first neural network model, wherein the value of the target parameter in the trained first neural network model is the trained parameter value.
The speech recognition device provided by this embodiment solves the problem in the related art that, when training a speech recognition model, the implicit semantic vector learned by the acoustic model can corrupt the real semantic vector, so that the loss function and the error rate cannot decrease synchronously during training and the trained speech recognition model converges slowly. By taking into account during training the real semantic information carried in the semantic label, and introducing the semantic feature vector into the training of the model parameters, the real semantic vector learned by the second neural network model is preserved, which accelerates the convergence of speech recognition model training.
Preferably, the training unit 40 may include: an alignment module, for aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result; a calculation module, for calculating, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and an adjusting module, for adjusting the parameter value of the target parameter in the first neural network model according to the error.
Preferably, the alignment module may include a first determination sub-module, for determining the joint probability distribution of outputting the speech feature vector and the semantic feature vector; and the calculation module may include a second determination sub-module, for determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, wherein the joint model includes the first neural network model and the second neural network model, and a third determination sub-module, for determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
Obviously, those skilled in the art should understand that each of the above modules or steps of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices; optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can be made into individual integrated-circuit modules, or multiple modules or steps among them can be made into a single integrated-circuit module. Thus, the present invention is not restricted to any specific combination of hardware and software.
The foregoing are only preferred embodiments of the present invention and are not intended to limit the invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.
Claims (10)
- 1. A speech recognition method, characterized by comprising:
determining a training speech signal and a semantic label corresponding to the training speech signal;
inputting the training speech signal into a first neural network model to obtain a speech feature vector;
inputting the semantic label into a second neural network model to obtain a semantic feature vector;
training the parameter value of a target parameter in the first neural network model according to the speech feature vector and the semantic feature vector; and
recognizing a target speech signal according to the trained first neural network model, wherein the value of the target parameter in the trained first neural network model is the trained parameter value.
- 2. The method according to claim 1, characterized in that training the parameter value of the target parameter in the first neural network model according to the speech feature vector and the semantic feature vector comprises:
aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result;
calculating, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and
adjusting the parameter value of the target parameter in the first neural network model according to the error.
- 3. The method according to claim 2, characterized in that
aligning the speech feature vector and the semantic feature vector through the alignment network model to obtain the training result comprises: determining the joint probability distribution of outputting the speech feature vector and the semantic feature vector; and
calculating, by the preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label comprises: determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, wherein the joint model comprises the first neural network model and the second neural network model; and determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
- 4. The method according to claim 2, characterized in that the alignment network model is a CTC alignment network model, and aligning the speech feature vector and the semantic feature vector through the alignment network model comprises: aligning the speech feature vector and the semantic feature vector through the CTC alignment network model.
- 5. The method according to claim 1, characterized in that the training speech signal is a plurality of training speech signals Pn, and the semantic labels are a plurality of semantic labels Qn in one-to-one correspondence with the plurality of training speech signals;
inputting the training speech signal into the first neural network model to obtain the speech feature vector comprises: inputting the i-th training speech signal Pi into the first neural network model to obtain a speech feature vector Ri, wherein the current parameter value of the target parameter of the first neural network model is M(i-1);
inputting the semantic label into the second neural network model to obtain the semantic feature vector comprises: inputting the i-th semantic label Qi into the second neural network model to obtain a semantic feature vector Ti, wherein the current parameter value of the target parameter of the second neural network model is S(i-1);
training the parameter value of the target parameter in the first neural network model according to the speech feature vector and the semantic feature vector comprises: determining, according to the speech feature vector Ri and the semantic feature vector Ti, the parameter value Mi of the target parameter of the first neural network model and the parameter value Si of the target parameter of the second neural network model; and
performing the above steps in turn until i = n.
- 6. The method according to claim 1, characterized in that the first neural network model is an RNN model, and inputting the training speech signal into the first neural network model to obtain the speech feature vector comprises:
framing the training speech signal to obtain a training speech sequence; and
inputting the training speech sequence into the RNN model to obtain the speech feature vector.
- 7. The method according to claim 1, characterized in that the second neural network model is an RNN model, and inputting the semantic label into the second neural network model to obtain the semantic feature vector comprises:
determining a semantic label sequence according to the semantic label; and
inputting the semantic label sequence into the RNN model to obtain the semantic feature vector.
- 8. A speech recognition device, characterized by comprising:
a determining unit, for determining a training speech signal and a semantic label corresponding to the training speech signal;
a first input unit, for inputting the training speech signal into a first neural network model to obtain a speech feature vector;
a second input unit, for inputting the semantic label into a second neural network model to obtain a semantic feature vector;
a training unit, for training the parameter value of a target parameter in the first neural network model according to the speech feature vector and the semantic feature vector; and
a recognition unit, for recognizing a target speech signal according to the trained first neural network model, wherein the value of the target parameter in the trained first neural network model is the trained parameter value.
- 9. The device according to claim 8, characterized in that the training unit comprises:
an alignment module, for aligning the speech feature vector and the semantic feature vector through an alignment network model to obtain a training result;
a calculation module, for calculating, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and
an adjusting module, for adjusting the parameter value of the target parameter in the first neural network model according to the error.
- 10. The device according to claim 9, characterized in that
the alignment module comprises: a first determination sub-module, for determining the joint probability distribution of outputting the speech feature vector and the semantic feature vector; and
the calculation module comprises: a second determination sub-module, for determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, wherein the joint model comprises the first neural network model and the second neural network model; and a third determination sub-module, for determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610847843.4A CN107871497A (en) | 2016-09-23 | 2016-09-23 | Audio recognition method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107871497A true CN107871497A (en) | 2018-04-03 |
Family
ID=61751546
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610847843.4A Pending CN107871497A (en) | 2016-09-23 | 2016-09-23 | Audio recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107871497A (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7280963B1 (en) * | 2003-09-12 | 2007-10-09 | Nuance Communications, Inc. | Method for learning linguistically valid word pronunciations from acoustic data |
CN102810311A (en) * | 2011-06-01 | 2012-12-05 | 株式会社理光 | Speaker estimation method and speaker estimation equipment |
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN103021418A (en) * | 2012-12-13 | 2013-04-03 | 南京邮电大学 | Voice conversion method facing to multi-time scale prosodic features |
US20130085756A1 (en) * | 2005-11-30 | 2013-04-04 | At&T Corp. | System and Method of Semi-Supervised Learning for Spoken Language Understanding Using Semantic Role Labeling |
CN103531205A (en) * | 2013-10-09 | 2014-01-22 | 常州工学院 | Asymmetrical voice conversion method based on deep neural network feature mapping |
CN103984959A (en) * | 2014-05-26 | 2014-08-13 | 中国科学院自动化研究所 | Data-driven and task-driven image classification method |
CN104575519A (en) * | 2013-10-17 | 2015-04-29 | 清华大学 | Feature extraction method and device as well as stress detection method and device |
CN105139864A (en) * | 2015-08-17 | 2015-12-09 | 北京天诚盛业科技有限公司 | Voice recognition method and voice recognition device |
CN102831184B (en) * | 2012-08-01 | 2016-03-02 | 中国科学院自动化研究所 | According to the method and system text description of social event being predicted to social affection |
CN105469785A (en) * | 2015-11-25 | 2016-04-06 | 南京师范大学 | Voice activity detection method in communication-terminal double-microphone denoising system and apparatus thereof |
CN105551483A (en) * | 2015-12-11 | 2016-05-04 | 百度在线网络技术(北京)有限公司 | Speech recognition modeling method and speech recognition modeling device |
CN105895082A (en) * | 2016-05-30 | 2016-08-24 | 乐视控股(北京)有限公司 | Acoustic model training method and device as well as speech recognition method and device |
Non-Patent Citations (5)
Title |
---|
BRETT MATTHEWS: "Fast audio search using vector space modelling", 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU) *
FERREIRA, EMMANUEL: "Adversarial bandit for online interactive active learning of zero-shot spoken language understanding", IEEE International Conference on Acoustics, Speech, and Signal Processing *
任纪生: "Chinese speech recognition method based on latent semantic information", 2004 Chinese Information Processing Technology Symposium *
杨南: "Research on statistical machine translation based on neural network learning", China Masters' and Doctoral Dissertations Full-text Database *
贾永红: "Digital Image Processing", 31 July 2015 *
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108833722A (en) * | 2018-05-29 | 2018-11-16 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN108833722B (en) * | 2018-05-29 | 2021-05-11 | 平安科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
CN108922513A (en) * | 2018-06-04 | 2018-11-30 | 平安科技(深圳)有限公司 | Speech differentiation method, apparatus, computer equipment and storage medium |
CN108847224A (en) * | 2018-07-05 | 2018-11-20 | 广州势必可赢网络科技有限公司 | A kind of sound mural painting plane display method and device |
CN110895935A (en) * | 2018-09-13 | 2020-03-20 | 阿里巴巴集团控股有限公司 | Speech recognition method, system, device and medium |
CN110895935B (en) * | 2018-09-13 | 2023-10-27 | 阿里巴巴集团控股有限公司 | Speech recognition method, system, equipment and medium |
CN109559735A (en) * | 2018-10-11 | 2019-04-02 | 平安科技(深圳)有限公司 | A kind of audio recognition method neural network based, terminal device and medium |
CN109559735B (en) * | 2018-10-11 | 2023-10-27 | 平安科技(深圳)有限公司 | Voice recognition method, terminal equipment and medium based on neural network |
CN109326299B (en) * | 2018-11-14 | 2023-04-25 | 平安科技(深圳)有限公司 | Speech enhancement method, device and storage medium based on full convolution neural network |
CN109326299A (en) * | 2018-11-14 | 2019-02-12 | 平安科技(深圳)有限公司 | Sound enhancement method, device and storage medium based on full convolutional neural networks |
CN111477212B (en) * | 2019-01-04 | 2023-10-24 | 阿里巴巴集团控股有限公司 | Content identification, model training and data processing method, system and equipment |
CN111477212A (en) * | 2019-01-04 | 2020-07-31 | 阿里巴巴集团控股有限公司 | Content recognition, model training and data processing method, system and equipment |
CN110517666A (en) * | 2019-01-29 | 2019-11-29 | 腾讯科技(深圳)有限公司 | Audio identification methods, system, machinery equipment and computer-readable medium |
CN110517666B (en) * | 2019-01-29 | 2021-03-02 | 腾讯科技(深圳)有限公司 | Audio recognition method, system, machine device and computer readable medium |
CN111768761B (en) * | 2019-03-14 | 2024-03-01 | 京东科技控股股份有限公司 | Training method and device for speech recognition model |
CN111768761A (en) * | 2019-03-14 | 2020-10-13 | 京东数字科技控股有限公司 | Training method and device of voice recognition model |
CN109866713A (en) * | 2019-03-21 | 2019-06-11 | 斑马网络技术有限公司 | Safety detection method and device, vehicle |
CN109887497A (en) * | 2019-04-12 | 2019-06-14 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
CN109887497B (en) * | 2019-04-12 | 2021-01-29 | 北京百度网讯科技有限公司 | Modeling method, device and equipment for speech recognition |
CN110033760A (en) * | 2019-04-15 | 2019-07-19 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
US11688391B2 (en) | 2019-04-15 | 2023-06-27 | Beijing Baidu Netcom Science And Technology Co. | Mandarin and dialect mixed modeling and speech recognition |
CN110033760B (en) * | 2019-04-15 | 2021-01-29 | 北京百度网讯科技有限公司 | Modeling method, device and equipment for speech recognition |
CN109887511A (en) * | 2019-04-24 | 2019-06-14 | 武汉水象电子科技有限公司 | A kind of voice wake-up optimization method based on cascade DNN |
CN111862985A (en) * | 2019-05-17 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | Voice recognition device, method, electronic equipment and storage medium |
CN110379407A (en) * | 2019-07-22 | 2019-10-25 | 出门问问(苏州)信息科技有限公司 | Adaptive voice synthetic method, device, readable storage medium storing program for executing and calculating equipment |
CN112949107A (en) * | 2019-12-10 | 2021-06-11 | 通用汽车环球科技运作有限责任公司 | Composite neural network architecture for stress distribution prediction |
CN113129867A (en) * | 2019-12-28 | 2021-07-16 | 中移(上海)信息通信科技有限公司 | Training method of voice recognition model, voice recognition method, device and equipment |
CN111128137A (en) * | 2019-12-30 | 2020-05-08 | 广州市百果园信息技术有限公司 | Acoustic model training method and device, computer equipment and storage medium |
CN113112993A (en) * | 2020-01-10 | 2021-07-13 | 阿里巴巴集团控股有限公司 | Audio information processing method and device, electronic equipment and storage medium |
CN113112993B (en) * | 2020-01-10 | 2024-04-02 | 阿里巴巴集团控股有限公司 | Audio information processing method and device, electronic equipment and storage medium |
CN111223476A (en) * | 2020-04-23 | 2020-06-02 | 深圳市友杰智新科技有限公司 | Method and device for extracting voice feature vector, computer equipment and storage medium |
CN111739537A (en) * | 2020-06-08 | 2020-10-02 | 北京灵蚌科技有限公司 | Semantic recognition method and device, storage medium and processor |
CN111739537B (en) * | 2020-06-08 | 2023-01-24 | 北京灵蚌科技有限公司 | Semantic recognition method and device, storage medium and processor |
CN113129869B (en) * | 2021-03-22 | 2022-01-28 | 北京百度网讯科技有限公司 | Method and device for training and recognizing voice recognition model |
CN113129869A (en) * | 2021-03-22 | 2021-07-16 | 北京百度网讯科技有限公司 | Method and device for training and recognizing voice recognition model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107871497A (en) | Audio recognition method and device | |
CN110534132A (en) | Speech emotion recognition method using a parallel convolutional recurrent neural network based on spectrogram features | |
CN107885853A (en) | Hybrid text classification method based on deep learning | |
CN108133038A (en) | Entity-level sentiment classification system and method based on dynamic memory networks | |
CN108280064A (en) | Joint processing method for word segmentation, part-of-speech tagging, entity recognition, and syntactic analysis | |
CN109062939A (en) | Intelligent tutoring method for international Chinese-language education | |
CN107330444A (en) | Automatic image text annotation method based on generative adversarial networks | |
CN107705806A (en) | Method for speech emotion recognition using spectrograms and deep convolutional neural networks | |
CN112395945A (en) | Graph convolution behavior recognition method and device based on skeletal joint points | |
CN107239446A (en) | Intelligent relation extraction method based on neural networks and attention mechanisms | |
CN108960419A (en) | Device and method for student-teacher transfer learning networks using a knowledge bridge | |
CN108764292 (en) | Deep learning image object mapping and localization method based on weakly supervised information | |
CN106202044A (en) | Entity relation extraction method based on deep neural networks | |
CN107679462A (en) | Deep multi-feature fusion classification method based on wavelets | |
CN106980858A (en) | Language text detection and localization system, and text detection and localization method using the system | |
CN106897738A (en) | Pedestrian detection method based on semi-supervised learning | |
CN111581966B (en) | Context feature-fused aspect-level emotion classification method and device | |
CN110222163A (en) | Intelligent question answering method and system combining CNN and bidirectional LSTM | |
CN109934261A (en) | Knowledge-driven parameter transformation model and its few-shot learning method | |
CN106469560A (en) | Speech emotion recognition method based on unsupervised domain adaptation | |
CN109753567A (en) | Text classification method combining title and body attention mechanisms | |
CN112990296B (en) | Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation | |
CN109410974A (en) | Speech enhancement method, device, equipment and storage medium | |
CN112364719A (en) | Method for rapidly detecting remote sensing image target | |
CN108765383A (en) | Video description method based on deep transfer learning | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2018-04-03 |