CN107871497A - Speech recognition method and device - Google Patents

Speech recognition method and device

Info

Publication number: CN107871497A
Authority: CN (China)
Prior art keywords: semantic, network model, training, feature vector, voice signal
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN201610847843.4A
Other languages: Chinese (zh)
Inventors: 刘孟竹, 唐青松, 张祥德
Current assignee: Beijing Eyecool Technology Co Ltd
Original assignee: Beijing Eyecool Technology Co Ltd
Application filed by: Beijing Eyecool Technology Co Ltd
Priority and filing date: 2016-09-23 (priority to CN201610847843.4A)
Publication date: 2018-04-03
Classifications

    • G: Physics
    • G10: Musical instruments; acoustics
    • G10L: Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/26: Speech to text systems
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G10L17/18: Artificial neural networks; connectionist approaches

Abstract

The invention discloses a speech recognition method and device. The method includes: determining a training voice signal and a semantic label corresponding to the training voice signal; inputting the training voice signal into a first neural network model to obtain a speech feature vector; inputting the semantic label into a second neural network model to obtain a semantic feature vector; training the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and recognizing a target voice signal according to the trained first neural network model, where the values of the target parameters in the trained first neural network model are the parameter values after training. The invention solves the problem in the related art that the convergence of training a speech recognition model is slow.

Description

Speech recognition method and device
Technical field
The present invention relates to the field of speech recognition, and in particular to a speech recognition method and device.
Background technology
Speech recognition technology is the technology of letting a machine convert a voice signal into the corresponding text through a process of recognition and understanding. Traditional speech recognition technology depends heavily on manually selected features and has low accuracy. Applying deep learning technology to the field of speech recognition imitates the way the brain learns and recognizes voice signals, and can substantially improve the accuracy of speech recognition.
Deep learning has already made significant progress in speech recognition. Several deep networks are introduced below:
Deep neural networks (DNNs): the features extracted by this kind of network are highly discriminative, so the trained model has strong separating capability. Such networks generally use a deep belief network (DBN) for pre-training, and the acoustic model is trained with a DNN-HMM hybrid network, which is widely applied in large-vocabulary speech recognition systems.
Convolutional neural networks (CNNs): compared with DNNs, CNNs introduce the concepts of convolution and pooling. Convolution extracts local information from the speech features, and pooling then strengthens the robustness of the model to those features. While the model scale is substantially reduced, recognition performance is better and generalization ability is stronger.
Recurrent neural networks (RNNs): at present the most commonly used deep network model in speech recognition is the RNN. It is a sequence model that, on the basis of an ordinary neural network, considers the connections between the hidden-layer units of adjacent speech frames and trains the network by back-propagating errors through time to adjust the network parameters. The distributed hidden state of an RNN can effectively store past information, and as a nonlinear dynamic system it can update its hidden units in complex ways. Combining these two characteristics enables it to identify latent temporal dependencies through the recurrent layer and carry out the task of speech recognition.
Connectionist temporal classification (CTC): CTC is an alignment model. It can align the output of a deep network with the label text, computing the probability of all possible paths as the probability of the whole sentence, so that we no longer need to pre-segment the samples or post-process the output, which greatly improves efficiency (a minimal usage sketch follows below).
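As a concrete illustration of how a CTC loss consumes per-frame network outputs and a label sequence, here is a minimal sketch using PyTorch's built-in torch.nn.CTCLoss; the tensor sizes are illustrative assumptions, not values from the patent.

    import torch

    T, N, C, U = 50, 4, 28, 12   # frames, batch size, classes (index 0 = blank), label length
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
    targets = torch.randint(1, C, (N, U))              # label indices, never the blank
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), U, dtype=torch.long)

    ctc = torch.nn.CTCLoss(blank=0)                    # sums over all alignment paths
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()                                    # gradient flows to the network outputs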
However, the speech recognition methods in the prior art still have certain problems:
(1) Shortcomings of DNNs: DNN methods assume that each speech frame is independent and do not consider the correlation between frames. In general the hidden layers need many neurons, the gradient-diffusion phenomenon can be very serious in the later phase of training, and the sequence error can only be calculated by combining DNNs with other models.
(2) Shortcomings of CNNs: a CNN alone can only handle isolated-word speech recognition, so processing continuous speech with CNNs requires segmenting the speech in advance, which is very time-consuming and tedious. CNNs can also handle continuous speech in combination with other models, but this undoubtedly increases the number of parameters, and manually tuning the model parameters is also very time-consuming.
(3) Shortcomings of RNNs: because an RNN needs to remember a large amount of information, training is difficult, computation is costly, and recognition is slow; moreover, its recursive structure is prone to gradient explosion and gradient vanishing during error back-propagation, which makes it hard for training to proceed.
(4) Shortcomings of CTC: only the influence of acoustic information is considered during training, which destroys the implicit language model the RNN has learned. The harm is that the word-based error rate and the phoneme-based error rate cannot be reduced synchronously, and training is difficult to advance.
As the above analysis shows, the related art trains the acoustic model independently, and the implicit semantic vector learned by the acoustic model can destroy the real semantic vector, so that the loss function and the error rate cannot decrease synchronously during training, which makes the convergence of training slow.
No effective solution has yet been proposed for the problem of slow convergence when training a speech recognition model in the related art.
Summary of the invention
The main object of the present invention is to provide a speech recognition method and device, so as to solve the problem of slow convergence when training a speech recognition model in the related art.
To achieve this goal, according to one aspect of the invention, a speech recognition method is provided. The method includes: determining a training voice signal and a semantic label corresponding to the training voice signal; inputting the training voice signal into a first neural network model to obtain a speech feature vector; inputting the semantic label into a second neural network model to obtain a semantic feature vector; training the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and recognizing a target voice signal according to the trained first neural network model, where the values of the target parameters in the trained first neural network model are the parameter values after training.
Further, training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector includes: aligning the speech feature vector and the semantic feature vector by an alignment network model to obtain a training result; calculating, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and adjusting the parameter values of the target parameters in the first neural network model according to the error.
Further, aligning the speech feature vector and the semantic feature vector by the alignment network model to obtain a training result includes determining the joint probability distribution of outputting the speech feature vector and the semantic feature vector. Calculating the error between the semantics represented by the training result and the semantics represented by the semantic label by the preset algorithm includes: determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, where the joint model includes the first neural network model and the second neural network model; and determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
Further, the alignment network model is a CTC alignment network model, and aligning the speech feature vector and the semantic feature vector by the alignment network model includes: aligning the speech feature vector and the semantic feature vector by the CTC alignment network model.
Further, the training voice signals are a plurality of training voice signals Pn, and the semantic labels are a plurality of semantic labels Qn in one-to-one correspondence with the plurality of training voice signals. Inputting the training voice signal into the first neural network model to obtain a speech feature vector includes: inputting the i-th training voice signal Pi into the first neural network model to obtain a speech feature vector Ri, where the parameter value of the target parameter of the current first neural network model is M(i-1). Inputting the semantic label into the second neural network model to obtain a semantic feature vector includes: inputting the i-th semantic label Qi into the second neural network model to obtain a semantic feature vector Ti, where the parameter value of the target parameter of the current second neural network model is S(i-1). Training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector includes: determining, according to the speech feature vector Ri and the semantic feature vector Ti, the parameter value Mi of the target parameter in the first neural network model and the parameter value Si of the target parameter of the second neural network model, and performing the above steps in turn until i = n.
Further, the first neural network model is an RNN model, and inputting the training voice signal into the first neural network model to obtain a speech feature vector includes: performing framing on the training voice signal to obtain a training voice sequence; and inputting the training voice sequence into the RNN model to obtain the speech feature vector.
Further, the second neural network model is an RNN model, and inputting the semantic label into the second neural network model to obtain a semantic feature vector includes: determining a semantic label sequence according to the semantic label; and inputting the semantic label sequence into the RNN model to obtain the semantic feature vector.
To achieve this goal, according to another aspect of the invention, a speech recognition device is provided. The device includes: a determining unit for determining a training voice signal and a semantic label corresponding to the training voice signal; a first input unit for inputting the training voice signal into a first neural network model to obtain a speech feature vector; a second input unit for inputting the semantic label into a second neural network model to obtain a semantic feature vector; a training unit for training the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and a recognition unit for recognizing a target voice signal according to the trained first neural network model, where the values of the target parameters in the trained first neural network model are the parameter values after training.
Further, the training unit includes: an alignment module for aligning the speech feature vector and the semantic feature vector by an alignment network model to obtain a training result; a calculation module for calculating, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and an adjustment module for adjusting the parameter values of the target parameters in the first neural network model according to the error.
Further, the alignment module includes a first determination submodule for determining the joint probability distribution of outputting the speech feature vector and the semantic feature vector, and the calculation module includes: a second determination submodule for determining the loss function of the joint model according to the forward-backward algorithm and the joint probability distribution, where the joint model includes the first neural network model and the second neural network model; and a third determination submodule for determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
The present invention extracts the semantic feature vector from the semantic label corresponding to the training voice signal and brings it into the training process: the parameter values of the target parameters in the first neural network model are trained according to both the speech feature vector and the semantic feature vector, and the target voice signal is recognized by the trained first neural network model. This solves the problem of slow convergence when training a speech recognition model in the related art. By considering the real semantic information carried in the semantic label during training and introducing the semantic feature vector into the training of the model parameters, the real semantic vector learned by the second neural network model is preserved, which accelerates the convergence of training the speech recognition model.
Brief description of the drawings
The accompanying drawings, which form a part of the application, provide a further understanding of the present invention; the schematic embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a speech recognition method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a speech recognition method according to a second embodiment of the present invention;
Fig. 3 is a schematic diagram of the RNN connection pattern according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the Prediction network connection pattern according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the CTC-Prediction alignment network according to an embodiment of the present invention; and
Fig. 6 is a schematic diagram of a speech recognition device according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with each other. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
To enable those skilled in the art to better understand the solution of the application, the technical solutions in the embodiments of the application are described below clearly and completely with reference to the drawings of those embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the application without creative effort shall fall within the scope of protection of the application.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of the application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the application described here can be implemented. In addition, the terms "comprising" and "having" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device containing a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
An embodiment of the invention provides a speech recognition method.
Fig. 1 is a flowchart of a speech recognition method according to the first embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S101: determine a training voice signal and a semantic label corresponding to the training voice signal.
Step S102: input the training voice signal into a first neural network model to obtain a speech feature vector.
Step S103: input the semantic label into a second neural network model to obtain a semantic feature vector.
Step S104: train the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector.
Step S105: recognize a target voice signal according to the trained first neural network model.
Here, the values of the target parameters in the trained first neural network model are the parameter values after training.
Speech recognition technology converts a voice signal into corresponding text: a voice signal is input into a speech recognition model, and the model outputs the semantics the voice signal represents. A speech recognition model needs to be trained. The model contains undetermined parameters, and training is the process of continually adjusting those undetermined parameters with training samples so that the recognition rate of the speech recognition model increases.
A training sample includes a training voice signal and a semantic label corresponding to the training voice signal. The undetermined parameters in the speech recognition model are trained repeatedly with multiple training samples, so that the error between the semantics the speech recognition model recognizes from a training voice signal and the semantics represented by the corresponding semantic label is minimized.
First, the training voice signal and the semantic label corresponding to the training voice signal are determined, and then the training voice signal and the corresponding semantic label are each processed:
The training voice signal is input into the first neural network model, which outputs the speech feature vector. The first neural network model contains undetermined target parameters, and an initial value can first be assigned to the target parameters of the first neural network model in order to determine the speech feature vector corresponding to the first input training voice signal. Optionally, there may be one or more target parameters, and the initial value of each target parameter may be a preset number determined from the characteristics of historical data, from experience, by an algorithm, or by the neural network model. Specifically, the initial value may be entered manually or determined by an algorithm from statistics over historical data; for example, for a target parameter whose preset value range is [a, b], a value is picked at random within [a, b] by a random algorithm, as in the sketch below.
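A trivial sketch of the random initialisation just described, drawing an initial parameter value uniformly from a preset range [a, b]; the concrete range is an illustrative assumption.

    import random

    a, b = -0.1, 0.1                       # preset value range for the target parameter (assumed)
    initial_value = random.uniform(a, b)   # the random algorithm picks a value in [a, b]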
The semantic label is input into the second neural network model, which outputs the semantic feature vector. The second neural network model also contains undetermined target parameters, and an initial value can first be assigned to the target parameters of the second neural network model in order to determine the semantic feature vector corresponding to the first input semantic label.
Here, the training voice signal is the voice signal used to train the speech recognition model provided by this embodiment, and the semantic label is the semantic label corresponding to the training voice signal. There may be multiple training voice signals, and correspondingly there may be multiple semantic labels. For example, there are n training voice signals S1, S2, ..., Sn, and the corresponding semantic labels are X1, X2, ..., Xn. Optionally, before the speech recognition model is trained, each training voice signal can be pre-processed by framing: one training voice signal yields multiple sub voice signals after framing, and correspondingly the semantic label of that training voice signal can also be framed to obtain multiple sub semantic labels. The framing of a training voice signal and the framing of its semantic label can be independent of each other; for example, framing the training voice signal S2 yields m sub voice signals S2a, S2b, ..., S2m, while framing the semantic label X2 corresponding to S2 yields p sub semantic labels X2a, X2b, ..., X2p.
Optionally, the first neural network model and the second neural network model can be identical neural network models. After training, the first neural network model can be used as the speech recognition model to recognize voice signals.
After the speech feature vector and the semantic feature vector are obtained, the parameter values of the target parameters in the first neural network model are trained according to the speech feature vector and the semantic feature vector.
Specifically, during training the first neural network model can extract the speech feature vector from the training voice signal, and the second neural network model can extract the semantic feature vector contained in the semantic label corresponding to the training voice signal. After the speech feature vector and the semantic feature vector are determined, the parameter values of the target parameters in the first neural network model and in the second neural network model can be optimized according to the error between the speech feature vector and the semantic feature vector. That is, the error between the speech feature vector and the semantic feature vector can be back-propagated to the first and second neural network models, and the parameter values of the target parameters in both models are adjusted according to the back-propagated error, as in the sketch below.
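The following is a minimal sketch of one such training step, assuming PyTorch, two recurrent encoders acoustic_net and label_net standing in for the first and second neural network models, and an alignment criterion joint_loss standing in for the CTC-Prediction objective described later; all of these names are illustrative assumptions.

    import torch

    def train_step(acoustic_net, label_net, joint_loss, optimizer, speech, labels):
        speech_feats = acoustic_net(speech)   # speech feature vectors (first network)
        label_feats = label_net(labels)       # semantic feature vectors (second network)
        error = joint_loss(speech_feats, label_feats, labels)  # alignment error
        optimizer.zero_grad()
        error.backward()                      # error back-propagates into both networks
        optimizer.step()                      # adjusts the target parameters of both models
        return error.item()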
After the training process ends, the target voice signal is recognized according to the trained first neural network model. Because the first neural network model takes the semantic features contained in the semantic label into account during training, it can converge quickly, and the recognition precision of the first neural network model is improved.
The speech recognition method provided by this embodiment extracts the semantic feature vector from the semantic label corresponding to the training voice signal, brings the semantic feature vector into the training process, trains the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector, and recognizes the target voice signal with the trained first neural network model. In the related art, the implicit semantic vector learned by the acoustic model while training a speech recognition model can destroy the real semantic vector, so that the loss function and the error rate cannot decrease synchronously during training and the convergence of training the speech recognition model is slow. By considering the real semantic information carried in the semantic label during training and introducing the semantic feature vector into the training of the model parameters, the real semantic vector learned by the second neural network model is preserved, which accelerates the convergence of training the speech recognition model.
Preferably, training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector can be performed through the following steps: aligning the speech feature vector and the semantic feature vector by an alignment network model to obtain a training result; calculating, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and adjusting the parameter values of the target parameters in the first neural network model according to the error.
Because the speech feature vector output by the first neural network model from the input training voice signal may differ in vector dimension from the semantic label corresponding to the training voice signal, the two can be aligned by the alignment network model. The alignment step can be understood as a mapping between vectors of two different dimensions: through alignment, the mapping from the speech feature vector to the semantic label can include multiple paths of different probabilities. The training result is determined according to these probabilities, the error between the semantics expressed by the training result and the semantics expressed by the semantic label is calculated, and the parameter values of the target parameters in the first neural network model are adjusted according to the error, where adjusting the parameter values of the target parameters in the first neural network model according to the error includes adjusting, according to the error, the parameter values of the target parameters in both the first neural network model and the second neural network model.
Aligning the speech feature vector and the semantic feature vector by the alignment network model to obtain a training result can specifically be determining the joint probability distribution of outputting the speech feature vector and the semantic feature vector. Calculating the error between the semantics represented by the training result and the semantics represented by the semantic label by the preset algorithm can be determining the loss function of the joint model according to the forward-backward algorithm and the joint probability distribution, where the joint model includes the first neural network model and the second neural network model, and then determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
Optionally, the alignment network model can be a CTC alignment network model; that is, aligning the speech feature vector and the semantic feature vector by the alignment network model is aligning the speech feature vector and the semantic feature vector by the CTC alignment network model.
As a preferred implementation of the above embodiment, the training process can include multiple training rounds. Specifically, the training voice signals are multiple training voice signals Pn, and the semantic labels are multiple semantic labels Qn in one-to-one correspondence with them. Inputting the training voice signal into the first neural network model to obtain a speech feature vector includes: inputting the i-th training voice signal Pi into the first neural network model to obtain the speech feature vector Ri, where the parameter value of the target parameter of the current first neural network model is M(i-1). Inputting the semantic label into the second neural network model to obtain a semantic feature vector includes: inputting the i-th semantic label Qi into the second neural network model to obtain the semantic feature vector Ti, where the parameter value of the target parameter of the current second neural network model is S(i-1). Training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector includes: determining, according to the speech feature vector Ri and the semantic feature vector Ti, the parameter value Mi of the target parameter in the first neural network model and the parameter value Si of the target parameter of the second neural network model, and performing the above steps in turn until i = n. When i = 1, that is, when i - 1 = 0, initial values can be assigned to the parameter value M0 of the target parameter of the first neural network model and the parameter value S0 of the target parameter of the second neural network model.
Optionally, the first neural network model can be an RNN model, and inputting the training voice signal into the first neural network model to obtain a speech feature vector includes: performing framing on the training voice signal to obtain a training voice sequence; and inputting the training voice sequence into the RNN model to obtain the speech feature vector.
Optionally, the second neural network model can be an RNN model; preferably, this RNN model can be a Prediction recurrent neural network. Inputting the semantic label into the second neural network model to obtain a semantic feature vector includes: determining a semantic label sequence according to the semantic label; and inputting the semantic label sequence into the RNN model to obtain the semantic feature vector.
Fig. 2 is a flowchart of a speech recognition method according to a second embodiment of the present invention. This embodiment can serve as a preferred implementation of the first embodiment above. As shown in Fig. 2, the method includes the following steps:
Step S201, determine the training samples. A training sample can include a training voice signal and a semantic label corresponding to the training voice signal.
Step S202, pre-process the voice signal. Pre-processing of the voice signal includes framing, pre-emphasis, denoising, and so on. Preferably, this embodiment only performs framing: for speech sampled at 8000 Hz, the frame length used is 20 ms and the frame shift is 10 ms. After the training voice signal is framed, multiple frames of speech are obtained (a minimal framing sketch follows below).
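A minimal framing sketch matching the numbers in this step (8000 Hz audio, 20 ms frames of 160 samples, 10 ms shift of 80 samples); NumPy only, with no pre-emphasis or denoising.

    import numpy as np

    def frame_signal(signal, sample_rate=8000, frame_ms=20, shift_ms=10):
        frame_len = sample_rate * frame_ms // 1000   # 160 samples per frame
        shift = sample_rate * shift_ms // 1000       # 80-sample frame shift
        n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
        return np.stack([signal[i * shift : i * shift + frame_len]
                         for i in range(n_frames)])

    frames = frame_signal(np.random.randn(8000))     # 1 s of audio -> shape (99, 160)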
Step S203, extract the high-level speech features of each frame with the RNN. The high-level speech features of each frame obtained after framing are extracted by the RNN neural network model.
The RNN is a sequence model: on the basis of an ordinary neural network, it considers the connections between the hidden-layer units of adjacent times t and t-1, and it has an outstanding ability to represent the effective information in nonlinear time-series signals.
Let x = (x_1, x_2, ..., x_T) be an input sequence of length T, where x_t denotes the speech vector of frame t.
The connection pattern of the RNN is shown in Fig. 3; in this embodiment the forward pass is computed as:

    h_t^(i) = f(W^(i) h_t^(i-1) + W_hh h_{t-1}^(i) + b^(i)),   with h_t^(0) = x_t

where h_t^(i) is the output vector of the i-th hidden layer (i = 1, 2, 3) at time t, W^(i) is the weight matrix connecting layer i to layer i-1, W_hh is the weight matrix of the recurrent layer, b^(i) is the bias vector of layer i, and f is the nonlinear activation function of the hidden layer, typically taken as the sigmoid function. The final high-level speech feature is h_t^(3).
Optionally, the RNN used to extract the high-level speech features of each frame can be a unidirectional RNN with three hidden layers; a deeper network can also be chosen to extract more abstract features, or a bidirectional RNN can be selected to fully learn past and future context (a minimal NumPy sketch of the forward pass follows below). It should be noted that every RNN model in this embodiment can be replaced by an LSTM model or a GRU model.
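A NumPy sketch of the three-hidden-layer unidirectional forward pass given by the equations above; the layer sizes are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def rnn_forward(x, Ws, Whhs, bs):
        """x: (T, d_in) frame vectors; Ws[i], Whhs[i], bs[i]: weights of hidden layer i."""
        layer_in = x
        for W, Whh, b in zip(Ws, Whhs, bs):
            h = np.zeros(W.shape[0])
            outs = []
            for t in range(layer_in.shape[0]):
                h = sigmoid(W @ layer_in[t] + Whh @ h + b)   # h_t^(i)
                outs.append(h)
            layer_in = np.stack(outs)
        return layer_in          # (T, d_h): high-level speech features h_t^(3)

    d_in, d_h, T = 40, 128, 99   # illustrative sizes
    Ws = [np.random.randn(d_h, d_in) * 0.01] + [np.random.randn(d_h, d_h) * 0.01 for _ in range(2)]
    Whhs = [np.random.randn(d_h, d_h) * 0.01 for _ in range(3)]
    bs = [np.zeros(d_h) for _ in range(3)]
    feats = rnn_forward(np.random.randn(T, d_in), Ws, Whhs, bs)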
Step S204, CTC-Prediction aligns the predicted result with the label of the actual speech sequence. The predicted result is aligned with the label of the actual speech sequence through the CTC-Prediction forecast model, which includes a Prediction network and a CTC network.
The Prediction network is a recurrent neural network with one input layer, one output layer, and one hidden layer. Semantic features can be extracted by the Prediction network by feeding it the semantic label. An input sequence ŷ = (φ, y_1, ..., y_U) of length U + 1, with the blank φ prepended, is mapped by the Prediction network to an output sequence g. The input vectors are one-hot encoded: if y_u = k, then ŷ_u is a vector of length K whose k-th dimension is 1 and whose remaining dimensions are all 0. The dimension of each time step of the input layer is therefore K, and the dimension of each time step of the output layer is K + 1.
The connection pattern of the Prediction network is shown in Fig. 4. Given ŷ, the hidden-layer vector sequence (h_0, ..., h_U) and the predicted sequence (g_0, ..., g_U) are computed by iterating the following formulas for u = 0, ..., U:

    h_u = f(W_ih ŷ_u + W_hh h_{u-1} + b_h)
    g_u = W_ho h_u + b_o

where W_ih is the input-to-hidden weight matrix, W_ho is the hidden-to-output weight matrix, b_h and b_o are the corresponding bias vectors, and f is the activation function of the hidden layer, sigmoid. The resulting g_u are the language-model vectors (a sketch of this network follows below).
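A NumPy sketch of the Prediction network just described: one-hot inputs with φ prepended, a single recurrent hidden layer, and linear outputs g_u; the sizes are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def prediction_forward(labels, K, Wih, Whh, Who, bh, bo):
        """labels: U label indices in [0, K); returns g of shape (U+1, K+1)."""
        inputs = [np.zeros(K)]                 # y^0 encodes the prepended blank φ
        for y in labels:
            v = np.zeros(K)
            v[y] = 1.0                         # one-hot encoding of y_u
            inputs.append(v)
        h = np.zeros(Whh.shape[0])
        g = []
        for y_hat in inputs:
            h = sigmoid(Wih @ y_hat + Whh @ h + bh)   # hidden recursion h_u
            g.append(Who @ h + bo)                    # g_u = W_ho h_u + b_o
        return np.stack(g)                     # language-model vectors g_0 ... g_U

    K, H = 28, 64                              # label classes and hidden size (assumed)
    Wih, Whh = np.random.randn(H, K) * 0.01, np.random.randn(H, H) * 0.01
    Who, bh, bo = np.random.randn(K + 1, H) * 0.01, np.zeros(H), np.zeros(K + 1)
    g = prediction_forward([3, 7, 1], K, Wih, Whh, Who, bh, bo)   # shape (4, 29)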
The high-level speech features extracted by the RNN are the acoustic-model vectors l_t, 1 ≤ t ≤ T.
Next, the acoustic-model vector l_t and the language-model vector g_u are combined to compute the density function of the output:

    h(k, t, u) = exp(l_t^k + g_u^k)

which is normalized to obtain the output distribution:

    Pr(k | t, u) = h(k, t, u) / Σ_k' h(k', t, u)

Simplifying the above yields:

    y(t, u) = Pr(y_{u+1} | t, u)
    φ(t, u) = Pr(φ | t, u)
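A sketch of the combination step, computing Pr(k | t, u) from the acoustic-model vectors l_t and the language-model vectors g_u exactly as in the density and normalization formulas above (with a standard max-subtraction for numerical stability, which cancels in the ratio).

    import numpy as np

    def output_distribution(l, g):
        """l: (T, K+1) acoustic vectors; g: (U+1, K+1) language-model vectors.
        Returns Pr of shape (T, U+1, K+1) with Pr[t, u, k] = Pr(k | t, u)."""
        z = l[:, None, :] + g[None, :, :]           # l_t^k + g_u^k
        z = z - z.max(axis=-1, keepdims=True)       # numerical stability
        h = np.exp(z)                               # density h(k, t, u)
        return h / h.sum(axis=-1, keepdims=True)    # normalize over k

    # y(t, u) = Pr(y_{u+1} | t, u) and φ(t, u) = Pr(φ | t, u) are then slices of
    # this array at the label indices and at the blank index respectively.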
The CTC-Prediction alignment network is shown in Fig. 5: a horizontal arrow represents outputting φ at time t (i.e. outputting nothing), and a vertical arrow represents outputting the (u+1)-th element of y at time t. A path starts at the bottom-left corner and runs to the terminal node at the top-right corner (the red arrows mark one such path), representing an alignment between the input and output sequences.
Define the forward variable α(t, u) as the probability of outputting y_[1:u] given l_[1:t]. For 1 ≤ t ≤ T, 0 ≤ u ≤ U it can be computed iteratively:

    α(t, u) = α(t-1, u) φ(t-1, u) + α(t, u-1) y(t, u-1)

initialized with α(1, 0) = 1.
The probability of the whole output sequence is the forward variable at the terminal node:

    Pr(y | x) = α(T, U) φ(T, U)
Correspondingly, define the backward variable β(t, u) as the probability of outputting y_[u+1:U] given l_[t:T]. It is computed iteratively as:

    β(t, u) = β(t+1, u) φ(t, u) + β(t, u+1) y(t, u)

with the initialization condition β(T, U) = φ(T, U).
The loss function of the whole network model is:

    L = -ln Pr(y | x) = -ln(α(T, U) φ(T, U))
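A NumPy sketch of the forward recursion and loss over the alignment lattice defined above; y_prob and phi_prob are (T, U+1) arrays holding y(t, u) and φ(t, u), and the indices are 0-based where the text is 1-based. The backward variable β follows symmetrically.

    import numpy as np

    def transducer_loss(y_prob, phi_prob):
        """y_prob[t, u] = y(t, u); phi_prob[t, u] = φ(t, u). Returns -ln Pr(y | x)."""
        T, U1 = phi_prob.shape
        alpha = np.zeros((T, U1))
        alpha[0, 0] = 1.0                              # α(1, 0) = 1 in the text
        for t in range(T):
            for u in range(U1):
                if t == 0 and u == 0:
                    continue
                stay = alpha[t - 1, u] * phi_prob[t - 1, u] if t > 0 else 0.0
                emit = alpha[t, u - 1] * y_prob[t, u - 1] if u > 0 else 0.0
                alpha[t, u] = stay + emit
        pr = alpha[T - 1, U1 - 1] * phi_prob[T - 1, U1 - 1]   # Pr(y | x)
        return -np.log(pr)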
For each training sample, the derivatives of the loss function with respect to the acoustic-model vector l_t and the language-model vector g_u follow from the lattice derivative

    ∂Pr(y | x) / ∂Pr(k | t, u) = α(t, u) β(t, u+1)   if k = y_{u+1}
    ∂Pr(y | x) / ∂Pr(k | t, u) = α(t, u) β(t+1, u)   if k = φ
    ∂Pr(y | x) / ∂Pr(k | t, u) = 0                    otherwise

by back-propagating through the softmax normalization, summing over u for l_t and over t for g_u.
The error is then back-propagated and the parameters are updated.
Steps S203 and S204 constitute the training process: the RNN extracts the high-level speech features of each frame, CTC-Prediction aligns the predicted result with the label of the actual speech sequence, and the parameters of the RNN model, the Prediction model, and the CTC model are adjusted.
Step S205, the trained RNN. The trained RNN is obtained by performing steps S203 and S204 multiple times.
Step S206, determine the test samples. A test sample includes a test voice signal; the semantic features contained in the test voice signal are recognized by the trained RNN.
Step S207, pre-process the voice signal. Pre-processing, such as framing, is performed on the voice signal in the test sample.
Step S208, output the classification result. The pre-processed voice signal is recognized by the RNN trained in step S205, and the classification result is output (a hedged decoding sketch follows below).
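For the recognition step, one simple possibility is greedy decoding over the trained networks: at each lattice node take the most probable symbol, consuming a frame on φ and emitting otherwise. This is a hedged sketch, not a decoder prescribed by the patent; prediction_step is an assumed helper that runs one step of the prediction network and returns its output vector and state.

    import numpy as np

    def greedy_decode(l, prediction_step, blank=0, max_symbols=100):
        """l: (T, K+1) acoustic vectors from the trained RNN."""
        g, state = prediction_step(None, None)   # g_0 from the prepended blank
        out, t = [], 0
        while t < l.shape[0] and len(out) < max_symbols:
            z = l[t] + g
            probs = np.exp(z - z.max())
            probs /= probs.sum()                 # Pr(k | t, u)
            k = int(np.argmax(probs))
            if k == blank:
                t += 1                           # φ: move on to the next frame
            else:
                out.append(k)                    # emit symbol, advance prediction net
                g, state = prediction_step(k, state)
        return out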
The speech recognition method provided by this embodiment, based on aligning an RNN with the CTC-Prediction model, first uses a deep RNN to extract the high-level speech features of each frame and then aligns the predicted result with the label of the actual speech sequence through the CTC-Prediction network, obtaining the error of the current iteration, which is back-propagated for training. Compared with the existing method of training an RNN by aligning the predicted result with the label of the actual speech sequence using only a CTC network, this method considers not only the acoustic model during alignment but also the influence of the language model on the error, which makes the algorithm converge faster and improves test precision.
It should be noted that the steps illustrated in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.
The embodiments of the invention also provide a speech recognition device. It should be noted that the speech recognition device of the embodiments of the invention can be used to perform the speech recognition method of the invention.
Fig. 6 is a schematic diagram of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 6, the device includes a determining unit 10, a first input unit 20, a second input unit 30, a training unit 40, and a recognition unit 50.
The determining unit 10 is used to determine a training voice signal and a semantic label corresponding to the training voice signal; the first input unit 20 is used to input the training voice signal into a first neural network model to obtain a speech feature vector; the second input unit 30 is used to input the semantic label into a second neural network model to obtain a semantic feature vector; the training unit 40 is used to train the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and the recognition unit 50 is used to recognize a target voice signal according to the trained first neural network model, where the values of the target parameters in the trained first neural network model are the parameter values after training.
The speech recognition device provided by this embodiment solves the problem in the related art that the implicit semantic vector learned by the acoustic model while training a speech recognition model can destroy the real semantic vector, so that the loss function and the error rate cannot decrease synchronously during training and the convergence of training the speech recognition model is slow. By considering the real semantic information carried in the semantic label during training and introducing the semantic feature vector into the training of the model parameters, the real semantic vector learned by the second neural network model is preserved, which accelerates the convergence of training the speech recognition model.
Preferably, the training unit 40 can include: an alignment module for aligning the speech feature vector and the semantic feature vector by an alignment network model to obtain a training result; a calculation module for calculating, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and an adjustment module for adjusting the parameter values of the target parameters in the first neural network model according to the error.
Preferably, the alignment module can include a first determination submodule for determining the joint probability distribution of outputting the speech feature vector and the semantic feature vector, and the calculation module includes: a second determination submodule for determining the loss function of the joint model according to the forward-backward algorithm and the joint probability distribution, where the joint model includes the first neural network model and the second neural network model; and a third determination submodule for determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
Obviously, those skilled in the art should understand that each of the above modules or steps of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they can be made into individual integrated-circuit modules, or multiple modules or steps among them can be made into a single integrated-circuit module. Thus, the present invention is not restricted to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit the invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the invention shall be included within the scope of protection of the invention.

Claims (10)

  1. A speech recognition method, characterized in that it comprises:
    determining a training voice signal and a semantic label corresponding to the training voice signal;
    inputting the training voice signal into a first neural network model to obtain a speech feature vector;
    inputting the semantic label into a second neural network model to obtain a semantic feature vector;
    training the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and
    recognizing a target voice signal according to the trained first neural network model, wherein the values of the target parameters in the trained first neural network model are the parameter values after training.
  2. The method according to claim 1, characterized in that training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector comprises:
    aligning the speech feature vector and the semantic feature vector by an alignment network model to obtain a training result;
    calculating, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and
    adjusting the parameter values of the target parameters in the first neural network model according to the error.
  3. The method according to claim 2, characterized in that
    aligning the speech feature vector and the semantic feature vector by the alignment network model to obtain a training result comprises: determining the joint probability distribution of outputting the speech feature vector and the semantic feature vector; and
    calculating, by the preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label comprises: determining the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, wherein the joint model comprises the first neural network model and the second neural network model; and determining, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
  4. The method according to claim 2, characterized in that the alignment network model is a CTC alignment network model, and aligning the speech feature vector and the semantic feature vector by the alignment network model comprises:
    aligning the speech feature vector and the semantic feature vector by the CTC alignment network model.
  5. The method according to claim 1, characterized in that the training voice signals are a plurality of training voice signals Pn, and the semantic labels are a plurality of semantic labels Qn in one-to-one correspondence with the plurality of training voice signals,
    inputting the training voice signal into the first neural network model to obtain a speech feature vector comprises: inputting the i-th training voice signal Pi into the first neural network model to obtain a speech feature vector Ri, wherein the parameter value of the target parameter of the current first neural network model is M(i-1);
    inputting the semantic label into the second neural network model to obtain a semantic feature vector comprises: inputting the i-th semantic label Qi into the second neural network model to obtain a semantic feature vector Ti, wherein the parameter value of the target parameter of the current second neural network model is S(i-1);
    training the parameter values of the target parameters in the first neural network model according to the speech feature vector and the semantic feature vector comprises: determining, according to the speech feature vector Ri and the semantic feature vector Ti, the parameter value Mi of the target parameter in the first neural network model and the parameter value Si of the target parameter of the second neural network model,
    the above steps being performed in turn until i = n.
  6. The method according to claim 1, characterized in that the first neural network model is an RNN model, and inputting the training voice signal into the first neural network model to obtain a speech feature vector comprises:
    performing framing on the training voice signal to obtain a training voice sequence; and
    inputting the training voice sequence into the RNN model to obtain the speech feature vector.
  7. The method according to claim 1, characterized in that the second neural network model is an RNN model, and inputting the semantic label into the second neural network model to obtain a semantic feature vector comprises:
    determining a semantic label sequence according to the semantic label; and
    inputting the semantic label sequence into the RNN model to obtain the semantic feature vector.
  8. A speech recognition device, characterized in that it comprises:
    a determining unit, configured to determine a training voice signal and a semantic label corresponding to the training voice signal;
    a first input unit, configured to input the training voice signal into a first neural network model to obtain a speech feature vector;
    a second input unit, configured to input the semantic label into a second neural network model to obtain a semantic feature vector;
    a training unit, configured to train the parameter values of target parameters in the first neural network model according to the speech feature vector and the semantic feature vector; and
    a recognition unit, configured to recognize a target voice signal according to the trained first neural network model, wherein the values of the target parameters in the trained first neural network model are the parameter values after training.
  9. The device according to claim 8, characterized in that the training unit comprises:
    an alignment module, configured to align the speech feature vector and the semantic feature vector by an alignment network model to obtain a training result;
    a calculation module, configured to calculate, by a preset algorithm, the error between the semantics represented by the training result and the semantics represented by the semantic label; and
    an adjustment module, configured to adjust the parameter values of the target parameters in the first neural network model according to the error.
  10. The device according to claim 9, characterized in that
    the alignment module comprises: a first determination submodule, configured to determine the joint probability distribution of outputting the speech feature vector and the semantic feature vector; and
    the calculation module comprises: a second determination submodule, configured to determine the loss function of a joint model according to the forward-backward algorithm and the joint probability distribution, wherein the joint model comprises the first neural network model and the second neural network model; and a third determination submodule, configured to determine, according to the loss function, the error between the semantics represented by the training result and the semantics represented by the semantic label.
CN201610847843.4A 2016-09-23 2016-09-23 Audio recognition method and device Pending CN107871497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610847843.4A CN107871497A (en) 2016-09-23 2016-09-23 Audio recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610847843.4A CN107871497A (en) 2016-09-23 2016-09-23 Audio recognition method and device

Publications (1)

Publication Number Publication Date
CN107871497A true CN107871497A (en) 2018-04-03

Family

ID=61751546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610847843.4A Pending CN107871497A (en) 2016-09-23 2016-09-23 Audio recognition method and device

Country Status (1)

Country Link
CN (1) CN107871497A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108833722A (en) * 2018-05-29 2018-11-16 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN108847224A (en) * 2018-07-05 2018-11-20 广州势必可赢网络科技有限公司 A kind of sound mural painting plane display method and device
CN108922513A (en) * 2018-06-04 2018-11-30 平安科技(深圳)有限公司 Speech differentiation method, apparatus, computer equipment and storage medium
CN109326299A (en) * 2018-11-14 2019-02-12 平安科技(深圳)有限公司 Sound enhancement method, device and storage medium based on full convolutional neural networks
CN109559735A (en) * 2018-10-11 2019-04-02 平安科技(深圳)有限公司 A kind of audio recognition method neural network based, terminal device and medium
CN109866713A (en) * 2019-03-21 2019-06-11 斑马网络技术有限公司 Safety detection method and device, vehicle
CN109887511A (en) * 2019-04-24 2019-06-14 武汉水象电子科技有限公司 A kind of voice wake-up optimization method based on cascade DNN
CN109887497A (en) * 2019-04-12 2019-06-14 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN110033760A (en) * 2019-04-15 2019-07-19 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN110379407A (en) * 2019-07-22 2019-10-25 出门问问(苏州)信息科技有限公司 Adaptive voice synthetic method, device, readable storage medium storing program for executing and calculating equipment
CN110517666A (en) * 2019-01-29 2019-11-29 腾讯科技(深圳)有限公司 Audio identification methods, system, machinery equipment and computer-readable medium
CN110895935A (en) * 2018-09-13 2020-03-20 阿里巴巴集团控股有限公司 Speech recognition method, system, device and medium
CN111128137A (en) * 2019-12-30 2020-05-08 广州市百果园信息技术有限公司 Acoustic model training method and device, computer equipment and storage medium
CN111223476A (en) * 2020-04-23 2020-06-02 深圳市友杰智新科技有限公司 Method and device for extracting voice feature vector, computer equipment and storage medium
CN111477212A (en) * 2019-01-04 2020-07-31 阿里巴巴集团控股有限公司 Content recognition, model training and data processing method, system and equipment
CN111739537A (en) * 2020-06-08 2020-10-02 北京灵蚌科技有限公司 Semantic recognition method and device, storage medium and processor
CN111768761A (en) * 2019-03-14 2020-10-13 京东数字科技控股有限公司 Training method and device of voice recognition model
CN111862985A (en) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition device, method, electronic equipment and storage medium
CN112949107A (en) * 2019-12-10 2021-06-11 通用汽车环球科技运作有限责任公司 Composite neural network architecture for stress distribution prediction
CN113112993A (en) * 2020-01-10 2021-07-13 阿里巴巴集团控股有限公司 Audio information processing method and device, electronic equipment and storage medium
CN113129867A (en) * 2019-12-28 2021-07-16 中移(上海)信息通信科技有限公司 Training method of voice recognition model, voice recognition method, device and equipment
CN113129869A (en) * 2021-03-22 2021-07-16 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7280963B1 (en) * 2003-09-12 2007-10-09 Nuance Communications, Inc. Method for learning linguistically valid word pronunciations from acoustic data
CN102810311A (en) * 2011-06-01 2012-12-05 株式会社理光 Speaker estimation method and speaker estimation equipment
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN103021418A (en) * 2012-12-13 2013-04-03 南京邮电大学 Voice conversion method facing to multi-time scale prosodic features
US20130085756A1 (en) * 2005-11-30 2013-04-04 At&T Corp. System and Method of Semi-Supervised Learning for Spoken Language Understanding Using Semantic Role Labeling
CN103531205A (en) * 2013-10-09 2014-01-22 常州工学院 Asymmetrical voice conversion method based on deep neural network feature mapping
CN103984959A (en) * 2014-05-26 2014-08-13 中国科学院自动化研究所 Data-driven and task-driven image classification method
CN104575519A (en) * 2013-10-17 2015-04-29 清华大学 Feature extraction method and device as well as stress detection method and device
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
CN102831184B (en) * 2012-08-01 2016-03-02 中国科学院自动化研究所 According to the method and system text description of social event being predicted to social affection
CN105469785A (en) * 2015-11-25 2016-04-06 南京师范大学 Voice activity detection method in communication-terminal double-microphone denoising system and apparatus thereof
CN105551483A (en) * 2015-12-11 2016-05-04 百度在线网络技术(北京)有限公司 Speech recognition modeling method and speech recognition modeling device
CN105895082A (en) * 2016-05-30 2016-08-24 乐视控股(北京)有限公司 Acoustic model training method and device as well as speech recognition method and device

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7280963B1 (en) * 2003-09-12 2007-10-09 Nuance Communications, Inc. Method for learning linguistically valid word pronunciations from acoustic data
US20130085756A1 (en) * 2005-11-30 2013-04-04 AT&T Corp. System and Method of Semi-Supervised Learning for Spoken Language Understanding Using Semantic Role Labeling
CN102810311A (en) * 2011-06-01 2012-12-05 Ricoh Co., Ltd. Speaker estimation method and speaker estimation equipment
CN102831184B (en) * 2012-08-01 2016-03-02 Institute of Automation, Chinese Academy of Sciences Method and system for predicting social emotion from text descriptions of social events
CN102982809A (en) * 2012-12-11 2013-03-20 University of Science and Technology of China Method for converting a speaker's voice
CN103021418A (en) * 2012-12-13 2013-04-03 Nanjing University of Posts and Telecommunications Voice conversion method oriented to multi-timescale prosodic features
CN103531205A (en) * 2013-10-09 2014-01-22 Changzhou Institute of Technology Asymmetrical voice conversion method based on deep neural network feature mapping
CN104575519A (en) * 2013-10-17 2015-04-29 Tsinghua University Feature extraction method and device, and stress detection method and device
CN103984959A (en) * 2014-05-26 2014-08-13 Institute of Automation, Chinese Academy of Sciences Data-driven and task-driven image classification method
CN105139864A (en) * 2015-08-17 2015-12-09 Beijing Techshino Technology Co., Ltd. Voice recognition method and voice recognition device
CN105469785A (en) * 2015-11-25 2016-04-06 Nanjing Normal University Voice activity detection method and apparatus for a dual-microphone denoising system in a communication terminal
CN105551483A (en) * 2015-12-11 2016-05-04 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition modeling method and speech recognition modeling device
CN105895082A (en) * 2016-05-30 2016-08-24 LeTV Holdings (Beijing) Co., Ltd. Acoustic model training method and device, and speech recognition method and device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Brett Matthews: "Fast audio search using vector space modelling", 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU) *
Ferreira, Emmanuel: "Adversarial bandit for online interactive active learning of zero-shot spoken language understanding", IEEE International Conference on Acoustics, Speech, and Signal Processing *
Ren Jisheng: "Chinese speech recognition method based on latent semantic information", 2004 Symposium on Chinese Information Processing Technology *
Yang Nan: "Research on statistical machine translation based on neural network learning", China Excellent Master's and Doctoral Theses Full-text Database *
Jia Yonghong: "Digital Image Processing", 31 July 2015 *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108833722A (en) * 2018-05-29 2018-11-16 Ping An Technology (Shenzhen) Co., Ltd. Audio recognition method, device, computer equipment and storage medium
CN108833722B (en) * 2018-05-29 2021-05-11 Ping An Technology (Shenzhen) Co., Ltd. Speech recognition method, speech recognition device, computer equipment and storage medium
CN108922513A (en) * 2018-06-04 2018-11-30 Ping An Technology (Shenzhen) Co., Ltd. Speech differentiation method, apparatus, computer equipment and storage medium
CN108847224A (en) * 2018-07-05 2018-11-20 Guangzhou SpeakIn Network Technology Co., Ltd. Sound mural plane display method and device
CN110895935A (en) * 2018-09-13 2020-03-20 Alibaba Group Holding Ltd. Speech recognition method, system, device and medium
CN110895935B (en) * 2018-09-13 2023-10-27 Alibaba Group Holding Ltd. Speech recognition method, system, equipment and medium
CN109559735A (en) * 2018-10-11 2019-04-02 Ping An Technology (Shenzhen) Co., Ltd. Neural-network-based speech recognition method, terminal device and medium
CN109559735B (en) * 2018-10-11 2023-10-27 Ping An Technology (Shenzhen) Co., Ltd. Speech recognition method, terminal device and medium based on neural network
CN109326299B (en) * 2018-11-14 2023-04-25 Ping An Technology (Shenzhen) Co., Ltd. Speech enhancement method, device and storage medium based on a fully convolutional neural network
CN109326299A (en) * 2018-11-14 2019-02-12 Ping An Technology (Shenzhen) Co., Ltd. Speech enhancement method, device and storage medium based on fully convolutional neural networks
CN111477212B (en) * 2019-01-04 2023-10-24 Alibaba Group Holding Ltd. Content identification, model training and data processing method, system and equipment
CN111477212A (en) * 2019-01-04 2020-07-31 Alibaba Group Holding Ltd. Content recognition, model training and data processing method, system and equipment
CN110517666A (en) * 2019-01-29 2019-11-29 Tencent Technology (Shenzhen) Co., Ltd. Audio recognition method, system, machine device and computer-readable medium
CN110517666B (en) * 2019-01-29 2021-03-02 Tencent Technology (Shenzhen) Co., Ltd. Audio recognition method, system, machine device and computer-readable medium
CN111768761B (en) * 2019-03-14 2024-03-01 JD Technology Holding Co., Ltd. Training method and device for a speech recognition model
CN111768761A (en) * 2019-03-14 2020-10-13 JD Digital Technology Holdings Co., Ltd. Training method and device for a speech recognition model
CN109866713A (en) * 2019-03-21 2019-06-11 Banma Network Technology Co., Ltd. Safety detection method and device, and vehicle
CN109887497A (en) * 2019-04-12 2019-06-14 Beijing Baidu Netcom Science and Technology Co., Ltd. Modeling method, device and equipment for speech recognition
CN109887497B (en) * 2019-04-12 2021-01-29 Beijing Baidu Netcom Science and Technology Co., Ltd. Modeling method, device and equipment for speech recognition
CN110033760A (en) * 2019-04-15 2019-07-19 Beijing Baidu Netcom Science and Technology Co., Ltd. Modeling method, device and equipment for speech recognition
US11688391B2 (en) 2019-04-15 2023-06-27 Beijing Baidu Netcom Science and Technology Co., Ltd. Mandarin and dialect mixed modeling and speech recognition
CN110033760B (en) * 2019-04-15 2021-01-29 Beijing Baidu Netcom Science and Technology Co., Ltd. Modeling method, device and equipment for speech recognition
CN109887511A (en) * 2019-04-24 2019-06-14 Wuhan Shuixiang Electronic Technology Co., Ltd. Voice wake-up optimization method based on cascaded DNNs
CN111862985A (en) * 2019-05-17 2020-10-30 Beijing Didi Infinity Technology and Development Co., Ltd. Voice recognition device and method, electronic equipment and storage medium
CN110379407A (en) * 2019-07-22 2019-10-25 Mobvoi Information Technology (Suzhou) Co., Ltd. Adaptive speech synthesis method, device, readable storage medium and computing device
CN112949107A (en) * 2019-12-10 2021-06-11 GM Global Technology Operations LLC Composite neural network architecture for stress distribution prediction
CN113129867A (en) * 2019-12-28 2021-07-16 China Mobile (Shanghai) Information Communication Technology Co., Ltd. Training method for a speech recognition model, and speech recognition method, device and equipment
CN111128137A (en) * 2019-12-30 2020-05-08 Guangzhou Baiguoyuan Information Technology Co., Ltd. Acoustic model training method and device, computer equipment and storage medium
CN113112993A (en) * 2020-01-10 2021-07-13 Alibaba Group Holding Ltd. Audio information processing method and device, electronic equipment and storage medium
CN113112993B (en) * 2020-01-10 2024-04-02 Alibaba Group Holding Ltd. Audio information processing method and device, electronic equipment and storage medium
CN111223476A (en) * 2020-04-23 2020-06-02 Shenzhen Youjie Zhixin Technology Co., Ltd. Method and device for extracting speech feature vectors, computer equipment and storage medium
CN111739537A (en) * 2020-06-08 2020-10-02 Beijing Lingbang Technology Co., Ltd. Semantic recognition method and device, storage medium and processor
CN111739537B (en) * 2020-06-08 2023-01-24 Beijing Lingbang Technology Co., Ltd. Semantic recognition method and device, storage medium and processor
CN113129869B (en) * 2021-03-22 2022-01-28 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and device for training a speech recognition model and for speech recognition
CN113129869A (en) * 2021-03-22 2021-07-16 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and device for training a speech recognition model and for speech recognition

Similar Documents

Publication Publication Date Title
CN107871497A (en) Audio recognition method and device
CN110534132A (en) Speech emotion recognition method using a parallel convolutional recurrent neural network based on spectrogram features
CN107885853A (en) Combined text classification method based on deep learning
CN108133038A (en) Entity-level sentiment classification system and method based on a dynamic memory network
CN108280064A (en) Combined processing method for word segmentation, part-of-speech tagging, entity recognition and syntactic analysis
CN109062939A (en) Intelligent tutoring method for international Chinese education
CN107330444A (en) Automatic image text annotation method based on generative adversarial networks
CN107705806A (en) Method for speech emotion recognition using spectrograms and deep convolutional neural networks
CN112395945A (en) Graph convolution action recognition method and device based on skeletal joint points
CN107239446A (en) Intelligent relation extraction method based on neural networks and an attention mechanism
CN108960419A (en) Device and method for a student-teacher transfer learning network using a knowledge bridge
CN108764292A (en) Deep learning image object mapping and localization method based on weakly supervised information
CN106202044A (en) Entity relation extraction method based on deep neural networks
CN107679462A (en) Wavelet-based deep multi-feature fusion classification method
CN106980858A (en) Language text detection and localization system, and language text detection and localization method using the system
CN106897738A (en) Pedestrian detection method based on semi-supervised learning
CN111581966B (en) Aspect-level sentiment classification method and device fusing context features
CN110222163A (en) Intelligent question answering method and system fusing CNN and bidirectional LSTM
CN109934261A (en) Knowledge-driven parameter transformation model and its few-shot learning method
CN106469560A (en) Speech emotion recognition method based on unsupervised domain adaptation
CN109753567A (en) Text classification method combining title and body attention mechanisms
CN112990296B (en) Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation
CN109410974A (en) Speech enhancement method, device, equipment and storage medium
CN112364719A (en) Method for rapidly detecting targets in remote sensing images
CN108765383A (en) Video description method based on deep transfer learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180403