CN110517664A - Multi-dialect speech recognition method, apparatus, device and readable storage medium - Google Patents
Multi-dialect speech recognition method, apparatus, device and readable storage medium Download PDF Info
- Publication number
- CN110517664A CN110517664A CN201910852557.0A CN201910852557A CN110517664A CN 110517664 A CN110517664 A CN 110517664A CN 201910852557 A CN201910852557 A CN 201910852557A CN 110517664 A CN110517664 A CN 110517664A
- Authority
- CN
- China
- Prior art keywords
- dialect
- feature
- languages
- loss function
- discrimination layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
Abstract
The embodiments of the present application disclose a multi-dialect speech recognition method, apparatus, device, and readable storage medium. Dialects are recognized with a pre-constructed dialect recognition model, where the model is trained on a corpus containing multiple dialects. During training, the model is not limited to the speech content of the corpus; the dialect type of each utterance is also introduced, and the model is optimized jointly with these dialect types. As a result, the dialect recognition model can accurately recognize multiple dialects, the user no longer needs to switch speech recognition modes, user operation is simplified, and the accuracy and efficiency of multi-dialect recognition are improved.
Description
Technical field
This application relates to the technical field of speech recognition, and more specifically to a multi-dialect speech recognition method, apparatus, device, and readable storage medium.
Background technique
At present, more and more artificial-intelligence applications rely on speech recognition as their entry point: translators that allow people speaking different languages in different countries to communicate freely, customer-service robots that greatly reduce human labor, voice input methods that free both hands, and smart-home appliances that make household control more convenient and natural. Since all of these depend on speech recognition, its accuracy is particularly important.
However, existing speech recognition schemes usually support only Mandarin; if the user speaks a dialect, recognition accuracy degrades severely. Even when dialect recognition is supported, the user must manually select the recognition mode corresponding to the dialect, which requires active cooperation. If the user mixes Mandarin and dialect, it is hard to remember to switch modes, and in a multi-party meeting with speakers of several dialects, frequent switching is obviously inefficient and degrades the user experience.
Therefore, how to improve the accuracy and efficiency of dialect recognition has become an urgent technical problem.
Summary of the invention
In view of this, the present application provides a multi-dialect speech recognition method, apparatus, device, and readable storage medium.
To achieve the above goal, the following schemes are proposed:
A multi-dialect speech recognition method, comprising:
receiving speech data;
extracting dialect recognition features from the speech data;
inputting the dialect recognition features into a pre-constructed dialect recognition model to obtain a recognition result for the speech data, the dialect recognition model being trained on a corpus annotated with at least the speech content and the dialect type of each sample.
In the above method, preferably, the dialect recognition model is trained on a corpus annotated with at least the speech content, the dialect type, and the dialect attribute category of each sample.
In the above method, preferably, the dialect recognition model comprises a feature extractor, a classifier, and a discriminator, wherein:
the input of the feature extractor is the dialect recognition features, and its output is a representation feature that is more discriminative than the dialect recognition features;
the input of the classifier is the representation feature, and its output is the recognition result of the speech data;
the input of the discriminator is the representation feature, and its output is the dialect type of the speech data, or alternatively both the dialect type and the dialect attribute category of the speech data.
In the above method, preferably, the discriminator comprises a gradient reversal layer and a language discrimination layer; or alternatively a gradient reversal layer, a language discrimination layer, and an attribute discrimination layer, wherein:
the input of the gradient reversal layer is the representation feature, and its output is the same representation feature;
the input of the language discrimination layer is the representation feature output by the gradient reversal layer, and its output is the dialect type of the speech data;
the input of the attribute discrimination layer is the representation feature output by the gradient reversal layer, and its output is the dialect attribute category of the speech data.
In the above method, preferably, when the dialect recognition model is trained, the gradient reversal layer negates the gradient of the language discrimination layer (or the gradients of both the language discrimination layer and the attribute discrimination layer) before passing it to the feature extractor, in order to update the parameters of the feature extractor.
In the above method, preferably, when the dialect recognition model is trained, its loss function is a weighted combination of the loss function of the classifier and the loss function of the discriminator.
In the above method, preferably, if the discriminator comprises a gradient reversal layer and a language discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier and the loss function of the language discrimination layer;
alternatively,
if the discriminator comprises a gradient reversal layer, a language discrimination layer, and an attribute discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss functions of the classifier, the language discrimination layer, and the attribute discrimination layer.
In the above method, preferably, if the discriminator comprises a gradient reversal layer, a language discrimination layer, and an attribute discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier, the loss function of the language discrimination layer, the loss function of the attribute discrimination layer, and a language-attribute consistency loss function between the language discrimination layer and the attribute discrimination layer.
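The weighted combinations described in the preceding paragraphs can be sketched as a simple weighted sum. A minimal sketch in plain Python follows; the individual weight values are illustrative assumptions, since the application only states that the terms are weighted:

```python
def combined_loss(cls_loss, lang_loss, attr_loss, consistency_loss,
                  w_cls=1.0, w_lang=0.1, w_attr=0.1, w_cons=0.05):
    """Weighted combination of the four loss terms: classifier loss,
    language-discrimination loss, attribute-discrimination loss, and
    the language-attribute consistency loss. The default weights are
    assumed values for illustration only."""
    return (w_cls * cls_loss
            + w_lang * lang_loss
            + w_attr * attr_loss
            + w_cons * consistency_loss)
```

The two-term and three-term variants in the earlier paragraphs correspond to setting the unused weights to zero.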
In the above method, preferably, the language discrimination layer is a neural network containing control gates, and the number of layers of the neural network is greater than 1;
the input of each layer of the neural network is obtained from the output of the control gate and the features output by the previous layer;
the input of the control gate is the classifier output vector corresponding to the features output by the previous layer.
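One possible reading of this gating scheme can be sketched as follows. The sigmoid gate form, the weight shapes, and the elementwise-product combination are all assumptions: the text only states that each layer's input is derived from the gate output and the previous layer's features, with the gate driven by the corresponding classifier output vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_layer_input(prev_features, classifier_vector, gate_weights):
    """Hypothetical control gate: the classifier's output vector is mapped
    to one gate value per feature dimension, and the gate scales the
    previous layer's features to form the next layer's input."""
    gate = sigmoid(gate_weights @ classifier_vector)  # one gate per feature dim
    return gate * prev_features                       # elementwise gating
```

With zero gate weights the gate saturates at 0.5, i.e. the previous layer's features are passed through at half strength.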
A multi-dialect speech recognition apparatus, comprising:
a receiving module for receiving speech data;
an extraction module for extracting dialect recognition features from the speech data;
a recognition module for inputting the dialect recognition features into a pre-constructed dialect recognition model to obtain the recognition result of the speech data, the dialect recognition model being trained on a corpus annotated with at least the speech content and the dialect type of each sample.
In the above apparatus, preferably, the dialect recognition model is trained on a corpus annotated with at least the speech content, the dialect type, and the dialect attribute category of each sample.
In the above apparatus, preferably, the dialect recognition model comprises a feature extractor, a classifier, and a discriminator, wherein:
the feature extractor receives the dialect recognition features and outputs a representation feature that is more discriminative than the dialect recognition features;
the classifier receives the representation feature and outputs the recognition result of the speech data;
the discriminator receives the representation feature and outputs the dialect type of the speech data, or alternatively both the dialect type and the dialect attribute category of the speech data.
In the above apparatus, preferably, the discriminator comprises a gradient reversal layer and a language discrimination layer; or alternatively a gradient reversal layer, a language discrimination layer, and an attribute discrimination layer, wherein:
the gradient reversal layer receives the representation feature and outputs the same representation feature;
the language discrimination layer receives the representation feature output by the gradient reversal layer and outputs the dialect type of the speech data;
the attribute discrimination layer receives the representation feature output by the gradient reversal layer and outputs the dialect attribute category of the speech data.
In the above apparatus, preferably, when the dialect recognition model is trained, the gradient reversal layer negates the gradient of the language discrimination layer (or the gradients of both the language discrimination layer and the attribute discrimination layer) before passing it to the feature extractor, in order to update the parameters of the feature extractor.
In the above apparatus, preferably, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier and the loss function of the discriminator.
In the above apparatus, preferably, if the discriminator comprises a gradient reversal layer and a language discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier and the loss function of the language discrimination layer;
alternatively,
if the discriminator comprises a gradient reversal layer, a language discrimination layer, and an attribute discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss functions of the classifier, the language discrimination layer, and the attribute discrimination layer.
In the above apparatus, preferably, if the discriminator comprises a gradient reversal layer, a language discrimination layer, and an attribute discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier, the loss function of the language discrimination layer, the loss function of the attribute discrimination layer, and a language-attribute consistency loss function between the language discrimination layer and the attribute discrimination layer.
In the above apparatus, preferably, the language discrimination layer is a neural network containing control gates, and the number of layers of the neural network is greater than 1;
the input of each layer of the neural network is obtained from the output of the control gate and the features output by the previous layer;
the input of the control gate is the classifier output vector corresponding to the features output by the previous layer.
A multi-dialect speech recognition device, comprising a memory and a processor;
the memory is used to store a program;
the processor is used to execute the program to implement each step of the multi-dialect speech recognition method according to any one of the above.
A readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing each step of the multi-dialect speech recognition method according to any one of the above.
It can be seen from the above technical schemes that the multi-dialect speech recognition method, apparatus, device, and readable storage medium provided by the embodiments of the present application recognize dialects with a pre-constructed dialect recognition model, where the model is trained on a corpus containing multiple dialects. During training, the model is not limited to the speech content of the corpus; the dialect type of each utterance is also introduced, and the model is optimized jointly with these dialect types. As a result, the dialect recognition model can accurately recognize multiple dialects, the user no longer needs to switch speech recognition modes, user operation is simplified, and the accuracy and efficiency of multi-dialect recognition are improved.
Brief description of the drawings
In order to explain the technical schemes in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is an implementation flowchart of a multi-dialect speech recognition method disclosed in an embodiment of the present application;
Fig. 2 is a structural schematic diagram of a dialect recognition model disclosed in an embodiment of the present application;
Fig. 3 is a structural schematic diagram of a first discriminator disclosed in an embodiment of the present application;
Fig. 4 is another structural schematic diagram of a dialect recognition model disclosed in an embodiment of the present application;
Fig. 5 is a structural schematic diagram of a second discriminator disclosed in an embodiment of the present application;
Fig. 6 is a structural schematic diagram of a multi-dialect speech recognition apparatus disclosed in an embodiment of the present application;
Fig. 7 is a hardware block diagram of a multi-dialect speech recognition device disclosed in an embodiment of the present application.
Specific embodiment
The technical schemes in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
The inventors have found through study that existing speech recognition schemes use an independent model for each dialect. For example, to recognize speech in a first dialect, a first dialect recognition model is needed, and to recognize speech in a second, different dialect, a second dialect recognition model is needed; the first model is trained on a corpus of the first dialect, and the second model on a corpus of the second dialect. Thus, to support recognition of N dialects, N dialect recognition models must be trained. This scheme has the following disadvantages:
1. Long development time and high cost. In the training stage, a large amount of dialect audio data must be collected for each dialect and transcribed manually, and for dialects the difficulty of audio collection and manual transcription is relatively high, so the cost is high. Therefore, adding recognition capability for a new dialect usually requires a long development time and a high development cost.
2. Poor convenience for the user. When speech recognition is needed, the user must switch the recognition mode according to the dialect used by the speaker, that is, active cooperation is required. If the user mixes Mandarin and dialect, it is hard to remember to switch modes, and in a multi-party meeting with speakers of several dialects, frequent switching is obviously inefficient and degrades the user experience.
To overcome the above deficiencies, or at least partially overcome them, the basic idea of the present application is to train a single dialect recognition model on a corpus containing multiple dialects, so that the speech of multiple dialects can be recognized with one model. On the one hand, compared with training a separate model for each dialect, less training data per dialect is required; on the other hand, in actual use the user no longer needs to switch between multiple dialect modes, improving convenience.
The scheme of the application is described in detail below.
The multi-dialect speech recognition method provided by the present application can be applied to an electronic device, which may include but is not limited to any of the following: a smartphone, a computer, a translator, a robot, a smart-home appliance, a remote control, etc.
Please refer to Fig. 1, which is an implementation flowchart of a multi-dialect speech recognition method provided by an embodiment of the present application. The method may include:
Step S11: receive speech data.
The speech data is the speech to be recognized. It may be dialect speech input by the user and received by the electronic device through a pickup device (such as a microphone or a microphone array), or it may be Mandarin speech, or speech in which a dialect and Mandarin are mixed.
Step S12: extract dialect recognition features from the speech data.
The dialect recognition features may be acoustic features, generally spectral features of the speech data, for example Mel Frequency Cepstrum Coefficient (MFCC) features or FBank features.
When extracting the dialect recognition features, the speech data may first be divided into several speech frames, and the dialect recognition features of each frame are then extracted.
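The framing step described above can be sketched as follows. The frame and hop sizes (400 and 160 samples, i.e. 25 ms windows with a 10 ms hop at 16 kHz) are common defaults, not values taken from the application; MFCC or FBank features would then be computed per frame.

```python
import numpy as np

def frame_signal(samples, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames, the usual first step
    before per-frame spectral feature extraction. Returns an array of
    shape (n_frames, frame_len)."""
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop)
    return np.stack([samples[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])
```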
Step S13: input the dialect recognition features into the pre-constructed dialect recognition model to obtain the recognition result of the speech data (i.e., the concrete speech content). The dialect recognition model is trained on a corpus annotated with at least the speech content and the dialect type (which may also be abbreviated as the language type) of each sample.
When the speech data is divided into several speech frames, the dialect recognition features of each frame are input into the pre-constructed dialect recognition model to obtain a per-frame recognition result, and the recognition results of all frames together constitute the recognition result of the speech data.
In China there are many kinds of dialects; only a few are listed here as examples, such as Sichuan dialect, Henan dialect, Fuzhou dialect, Nanchang dialect, Cantonese (Guangzhou dialect), and Changsha dialect. In the embodiments of the present application, the training corpus may include training samples of the above dialects, of more dialect types, or of course of all dialect types. Which dialects are included in the corpus is determined by the dialect types that actually need to be supported; for example, to support Sichuan dialect, Cantonese, and Changsha dialect, the corpus must simultaneously include several samples of Sichuan speech, several samples of Cantonese speech, and several samples of Changsha speech.
During training of the dialect recognition model, for each training sample, in addition to recognizing the speech content of the sample, the dialect type of the sample is also discriminated, and the model is then optimized based on both the content recognition result and the dialect-type discrimination result.
The multi-dialect speech recognition method provided by the embodiments of the present application recognizes dialects with a pre-constructed dialect recognition model trained on a corpus containing multiple dialects. During training, the model is not limited to the speech content of the corpus; the dialect type of each utterance is also introduced and used to optimize the model, so the model can accurately recognize multiple dialects (including Mandarin). The user therefore no longer needs to switch speech recognition modes, user operation is simplified, and the accuracy and efficiency of multi-dialect recognition are improved.
In addition, because training the model does not require a large amount of annotated data for each individual dialect (compared with training a model dedicated to recognizing a single dialect, fewer samples per dialect are needed), the difficulty of audio collection and manual transcription decreases, and the cost drops accordingly. Therefore, when recognition capability for a new dialect is desired, it can be added within a short time at a small cost.
The concrete implementation of the dialect recognition model provided by the embodiments of the present application is explained below.
Please refer to Fig. 2, a structural schematic diagram of a dialect recognition model provided by an embodiment of the present application. As shown in Fig. 2, the model may include a first feature extractor 21, a first classifier 22, and a first discriminator 23, wherein:
The input of the first feature extractor 21 is the dialect recognition features of each speech frame extracted in step S12, and its output is the representation feature of each frame, a feature more discriminative than the dialect recognition features. That is, the first feature extractor 21 extracts from the dialect recognition features a feature characterizing the intrinsic properties of the input speech data (i.e., the speech data received in step S11); this feature is a high-level feature used for dialect recognition. Specifically, for any speech frame (denoted the first speech frame for convenience of description), when the first feature extractor 21 receives the dialect recognition features of the first speech frame, it extracts from them the representation feature of that frame, which characterizes the frame's intrinsic properties.
The concrete form of the first feature extractor 21 can be a deep neural network such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN).
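The three-component structure of Fig. 2 can be sketched structurally as follows. The three callables stand in for the CNN/RNN feature extractor, the classifier, and the discriminator; their internals are left open here, so this is only a sketch of how the components are wired, not of any specific network.

```python
class DialectRecognitionModel:
    """Structural sketch of Fig. 2: a shared feature extractor feeds both a
    speech-content classifier and a dialect-type discriminator."""

    def __init__(self, extractor, classifier, discriminator):
        self.extractor = extractor          # dialect features -> representation feature
        self.classifier = classifier        # representation feature -> content result
        self.discriminator = discriminator  # representation feature -> dialect type

    def forward(self, recognition_features):
        shared = self.extractor(recognition_features)
        content = self.classifier(shared)       # the recognition result proper
        dialect = self.discriminator(shared)    # used mainly during training
        return content, dialect
```

For example, with toy stand-in functions, `DialectRecognitionModel(lambda x: x * 2, lambda f: f + 1, lambda f: f - 1).forward(3)` returns both outputs computed from the shared feature.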
The input of the first classifier 22 is the representation feature output by the first feature extractor 21, and its output is the recognition result of the speech data; that is, for the first speech frame, the first classifier 22 determines the speech content of that frame. Specifically, for the first speech frame, the input of the first classifier 22 is the frame's representation feature, and its output is the state representation of the speech content corresponding to that frame.
The concrete form of the first classifier 22 can be a shallow neural network, for example a two-layer DNN (Deep Neural Network); the present application does not specifically limit it. The output of the first classifier 22 can take any of the following forms: a word, a syllable, a phoneme, or a phoneme state (a unit with finer granularity than a phoneme). Which form is used depends on the modeling unit of the first classifier 22:
If the first classifier 22 is modeled with words as the modeling unit, its output is the state representation of a word; that is, for the first speech frame, the classifier determines which word the frame's dialect recognition features characterize.
If the first classifier 22 is modeled with syllables as the modeling unit, its output is the state representation of a syllable; that is, the classifier determines which syllable the frame's dialect recognition features characterize.
If the first classifier 22 is modeled with phonemes as the modeling unit, its output is the state representation of a phoneme; that is, the classifier determines which phoneme the frame's dialect recognition features characterize.
If the first classifier 22 is modeled with phoneme states as the modeling unit, its output is the state representation of a phoneme state; that is, the classifier determines which phoneme state the frame's dialect recognition features characterize.
The input of first arbiter 23 is the characteristic feature that fisrt feature extractor 21 exports, the output of the first arbiter 23
For dialect type belonging to voice data.Specifically, corresponding first speech frame, the input of the first arbiter 23 is the first speech frame
Corresponding characteristic feature, the output of the first arbiter 23 are that the state of dialect type corresponding with the first speech frame indicates, i.e., the
One arbiter 23 is for which dialect type of the accent recognition characteristic present of the first speech frame of input accent recognition model to be determined.
It should be noted that the first arbiter 23 is mainly used for the training in accent recognition model in the embodiment of the present application
Stage optimizes training to dialect model, thus, in the process for carrying out speech recognition using trained accent recognition model
In, the differentiation result of the first arbiter 23 output can be exported to user, can not also be exported to user.Alternatively, can be use
Family provides and checks interface, and when user checks that interface operates to this, then the differentiation result that the first arbiter 23 is exported is defeated
Out to user.
In this embodiment, a backpropagation algorithm is used to train the dialect recognition model. The algorithm consists of two processes: forward propagation of the signal and backward propagation of the error. Forward propagation refers to the process in which the dialect recognition model receives the dialect recognition features of a sample and outputs its speech recognition result; the signal propagates from the first feature extractor 21 to the first classifier 22, and from the first feature extractor 21 to the first discriminator 23. Backward propagation of the error (characterized by the gradient) refers to the process in which the error between the dialect-type discrimination result output by the first discriminator 23 and the true dialect type of the sample is returned to the input end of the dialect recognition model; the signal propagates from the first discriminator 23 to the first feature extractor 21.
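The effect of this error path on the feature extractor can be illustrated with a scalar toy example. The values of alpha and the learning rate are assumptions, and the scalar stands in for a per-parameter update; the sign flip corresponds to the negation performed by the gradient reversal layer inside the discriminator, described below.

```python
def extractor_update(discriminator_grad, alpha=0.3, lr=0.1):
    """Toy illustration: the discriminator's error gradient is scaled by
    -alpha before reaching the feature extractor, so a plain SGD step then
    pushes the extractor in the direction that makes dialect types HARDER
    for the discriminator to tell apart."""
    reversed_grad = -alpha * discriminator_grad  # gradient reversal
    return -lr * reversed_grad                   # SGD step applied to the extractor
```

Without the reversal, the same step would be `-lr * discriminator_grad`, i.e. of the opposite sign.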
A specific implementation of the optimization training of the dialect recognition model by the first discriminator 23 is described below with reference to the specific structure of the first discriminator 23.

Referring to Fig. 3, Fig. 3 is a structural schematic diagram of a first discriminator 23 provided by an embodiment of the present application, which may include:

a first gradient reversal layer 31 (for convenience of description, the first gradient reversal layer 31 is denoted by R) and a first language discrimination layer 32; wherein,
the first gradient reversal layer 31 is defined as follows:

R(z) = z  (1)

Formula (1) is the forward-propagation formula of the first gradient reversal layer 31, where z is the input of the first gradient reversal layer 31, i.e., the characterization feature f output by the first feature extractor 21, and R(z) is the output of the gradient reversal layer; here R(·) denotes passing through the R layer without any processing. It can be seen that in the forward propagation process, the output of the first gradient reversal layer 31 is its input, i.e., the first gradient reversal layer 31 performs no processing on the input feature and passes it directly to the next layer (i.e., the first language discrimination layer 32). Specifically, for the first speech frame, the input of the first gradient reversal layer 31 is the characterization feature of the first speech frame, and its output remains the characterization feature of the first speech frame.

∂R(z)/∂z = -αE  (2)

Formula (2) is the back-propagation formula of the first gradient reversal layer 31, where ∂R(z)/∂z is the gradient of the first gradient reversal layer 31, E is an identity matrix, and α is a preset hyperparameter. It can be seen that the gradient of the first gradient reversal layer 31 is the product of the hyperparameter and a negative identity matrix.
According to the chain rule, the gradient of the output equals the gradient of the input multiplied by the layer's own gradient (expressed by formula: if h(x) = f(g(x)), then h'(x) = f'(g(x))g'(x)). The output gradient of the first gradient reversal layer 31 therefore equals the input gradient (i.e., the gradient of the first language discrimination layer 32, which characterizes the output error of the first language discrimination layer 32) multiplied by -αE. Owing to the negative sign, this can be regarded as negating the value of the input gradient before transmitting it to the preceding layer (i.e., the first feature extractor 21). The first gradient reversal layer 31 does not process the input feature during forward propagation, and negates the input gradient during backward propagation (i.e., multiplies the input gradient by -αE), so that the sign of the result is opposite to that of the input gradient; hence this layer is called a gradient reversal layer.
According to the gradient descent method, when the update direction of the model parameters is the gradient direction (i.e., the gradient is not negated), the model reaches the optimal solution at the fastest speed. In the embodiment of the present application, however, the first gradient reversal layer 31 reverses the gradient of the first language discrimination layer 32 before transmitting it to the first feature extractor 21, so that the update direction of the first feature extractor 21 is opposite to the gradient direction of the first language discrimination layer 32. That is, the training objective of the first language discrimination layer 32 is to identify the dialect type of a sample as accurately as possible, while the training objective of the first feature extractor 21 is to make the dialect type of the sample as hard to identify as possible. Adversarial training is thus introduced through the first gradient reversal layer 31.
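The identity forward pass of formula (1) and the sign-flipped backward pass of formula (2) can be sketched in plain Python as follows; the function names and the value of α are illustrative only, not part of the claimed implementation:

```python
# Hypothetical sketch of a gradient reversal layer: the forward pass is the
# identity (formula (1)), and the backward pass multiplies the incoming
# gradient by -alpha (formula (2), with E the identity matrix).

ALPHA = 1.0  # preset hyperparameter alpha (illustrative value)

def grl_forward(z):
    """Forward propagation: pass the characterization feature through unchanged."""
    return z

def grl_backward(upstream_grad, alpha=ALPHA):
    """Backward propagation: negate and scale the gradient coming from the
    language discrimination layer before it reaches the feature extractor."""
    return [-alpha * g for g in upstream_grad]

features = [0.2, -0.5, 1.3]              # characterization feature f
assert grl_forward(features) == features  # identity in the forward pass
assert grl_backward([1.0, -2.0, 0.5]) == [-1.0, 2.0, -0.5]  # sign flipped
```

Because the forward pass is the identity, the layer changes nothing at inference time; its only effect is on the direction of the parameter updates during training.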
The first language discrimination layer 32 may be a shallow neural network, for example, a two-layer DNN; the present application does not specifically limit the concrete network form of the first language discrimination layer 32. The input of the first language discrimination layer 32 is the characterization feature output by the first gradient reversal layer 31, and its output is the dialect type to which the voice data belongs. Specifically, for the first speech frame, the input of the first language discrimination layer 32 is the characterization feature of the first speech frame output by the first gradient reversal layer 31, and the output of the first language discrimination layer 32 is a state representation of the dialect type to which the first speech frame belongs.
As can be seen from the foregoing, the present application introduces adversarial training through the first gradient reversal layer 31 for two purposes. On the one hand, the first language discrimination layer 32 is trained to judge more accurately which dialect the feature input to the dialect recognition model belongs to; on the other hand, the gradient of the first language discrimination layer 32 is passed backward to the front through the first gradient reversal layer 31, training the first feature extractor 21 to extract features with less language distinction, i.e., making the conditional probability distributions of the voice content characterized by the extracted features over the different dialect types consistent. That these conditional probability distributions are consistent means that the pronunciations of the voice content in the different types of dialect are similar or identical; for example, the feature distributions of the phoneme a of the Sichuan dialect, the phoneme a of the Northeastern dialect and the phoneme a of the Henan dialect are consistent, i.e., the pronunciations of these three phonemes are similar or identical.
In order that the feature distribution learned by the first language discrimination layer 32 is related to the conditional probability distribution of the dialect type given the voice content, a control gate is introduced into the first language discrimination layer 32 in the embodiment of the present application; through the control gate, the first language discrimination layer 32 is controlled to learn the conditional probability distribution of the dialect type for different voice contents. In the embodiment of the present application, the input of the control gate is the output of the first classifier 22. For ease of description, the following explanation takes as an example the case where the first classifier 22 uses phonemes as the modeling unit.
For any layer of the first language discrimination layer 32 (denoted as the k-th layer for convenience of description), the input of the k-th layer is obtained from the output of the control gate and the feature output by the (k-1)-th layer; the input of the control gate is the vector output by the first classifier 22 corresponding to the feature output by the (k-1)-th layer. This can be expressed by the following formulas:

h̃_i = g(c_i) ⊙ h_i  (3)

g(c_i) = σ(Vc_i + b)  (4)

where h_i is the output of the (k-1)-th layer of the first language discrimination layer 32 corresponding to the i-th speech frame, and c_i is the one-hot vector output by the first classifier 22 corresponding to h_i, i.e., the phoneme vector corresponding to the i-th speech frame. For example, assuming the first classifier 22 uses 83 phonemes as its modeling units, c_i is an 83-dimensional vector in which each dimension corresponds to one phoneme; if the phoneme corresponding to the i-th speech frame is a, the dimension corresponding to a is 1 and the other dimensions are all 0. g(c_i) is the control gate, where σ is an activation function, V is a weight matrix and b is a bias. That is to say, the phoneme vector c_i is subjected to a matrix transformation to obtain the control gate, and the control gate fuses the phoneme corresponding to the feature of the (k-1)-th layer into the input of the k-th layer, so that the first language discrimination layer 32 learns information related to the conditional probability distribution of the dialect type given the phoneme. It should be noted that if k is 1, the (k-1)-th layer refers to the first gradient reversal layer 31.
The output layer of the first language discrimination layer 32 has M nodes, M = N*C, where N is the number of dialect types and C is the total number of modeling units of the first classifier 22, e.g., the total number of phonemes. The M nodes are divided into C groups; each group of nodes corresponds to one phoneme and is used to characterize the discrimination situation of that phoneme over the dialect types, typically the probability that the phoneme belongs to each dialect type. Each time the node parameters of the output layer of the first language discrimination layer 32 are updated, only the parameters of the group of nodes corresponding to the recognition result of the dialect recognition feature input to the dialect recognition model are updated.

For example, assuming the model is trained on 20 dialects in total and the pronunciation modeling units are 83 phonemes, then M = 20*83 = 1660 nodes, with 20 nodes per group; each group of 20 nodes corresponds to one phoneme and characterizes the probability that the phoneme belongs to each of the 20 dialects. Each time the node parameters of the output layer of the first language discrimination layer 32 are updated, the phoneme prediction result output by the first classifier 22 corresponding to the dialect recognition feature of the i-th speech frame input to the dialect recognition model is determined, and then only the parameters of the group of 20 nodes corresponding to that phoneme prediction result are updated.
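Assuming the groups are laid out consecutively (an illustrative assumption; the patent does not fix a node ordering), the selection of the group of nodes to update can be sketched as:

```python
# With N dialect types and C phonemes, the output layer has M = N*C nodes
# arranged in C groups of N; only the group of the predicted phoneme is updated.

N_DIALECTS = 20
N_PHONEMES = 83
M = N_DIALECTS * N_PHONEMES          # 1660 output nodes, as in the example

def group_slice(phoneme_index):
    """Indices of the N_DIALECTS nodes belonging to one phoneme's group."""
    start = phoneme_index * N_DIALECTS
    return range(start, start + N_DIALECTS)

predicted_phoneme = 5                 # phoneme predicted by the first classifier
nodes = list(group_slice(predicted_phoneme))
assert M == 1660
assert nodes[0] == 100 and nodes[-1] == 119 and len(nodes) == 20
```

A gradient step would then touch only `nodes`, leaving the other C-1 groups unchanged for that frame.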
During model training, the loss function is an indispensable part of the model. In the embodiment of the present application, the first classifier 22 and the first discriminator 23 are each provided with a loss function, and the loss function of the dialect recognition model is a weighted combination of the loss function of the first classifier 22 and the loss function of the first discriminator 23.

The loss function of the first classifier 22 characterizes the difference between the sample voice content predicted by the first classifier 22 and the true voice content of the sample. The loss function of the first discriminator 23 characterizes the difference between the language category of the sample predicted by the first discriminator 23 and the true language category of the sample.

The loss function of the first classifier 22 and the loss function of the first discriminator 23 may be the same or different. When the two loss functions are weighted, the weight of the loss function of the first classifier 22 and the weight of the loss function of the first discriminator 23 may also be the same or different.
Optionally, the loss function of the first classifier 22 and the loss function of the first discriminator 23 may each be a cross-entropy function. Cross entropy is an important concept in information theory and is mainly used to measure the difference between two probability distributions; when the two distributions are identical, the cross entropy is minimal. Taking the first classifier 22 as an example, the cross-entropy function (denoted by L1) is explained below:

L1 = Σ_{i=1}^{I} L_y(Y(F(x_i)), ŷ_i)

where I is the total number of dialect recognition features of the speech frames input to the dialect recognition model at one time (i.e., the dialect recognition model can process the dialect recognition features of I speech frames simultaneously each time), i denotes the i-th speech frame, F denotes the first feature extractor 21, F(x_i) denotes the output of the first feature extractor 21 for the dialect recognition feature of the i-th speech frame x_i, Y denotes the first classifier 22, Y(F(x_i)) denotes the output of the first classifier 22 for the dialect recognition feature of the i-th speech frame x_i, ŷ_i denotes the true voice content corresponding to the dialect recognition feature of the i-th speech frame x_i, and L_y is the cross entropy, here the cross entropy of Y(F(x_i)) and ŷ_i.

By minimizing this loss function, i.e., minimizing the cross entropy between the output of the first classifier 22 and the true result, the model can be trained so that its output is closer to the true result, i.e., the recognition result of the model is closer to the true result and the recognition rate is higher.
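A minimal sketch of the cross-entropy computation, with a one-hot true label as in the phoneme example above (the epsilon guard is an implementation convenience, not part of the formula):

```python
import math

def cross_entropy(true_dist, pred_dist):
    """Cross entropy between the true distribution (here a one-hot label)
    and the distribution predicted by the classifier."""
    eps = 1e-12                      # avoid log(0) for zero-probability entries
    return -sum(t * math.log(p + eps) for t, p in zip(true_dist, pred_dist))

true_label = [1.0, 0.0, 0.0]         # true voice content (one-hot)
good_pred  = [0.9, 0.05, 0.05]
bad_pred   = [0.2, 0.4, 0.4]
# the closer the prediction is to the true label, the smaller the loss
assert cross_entropy(true_label, good_pred) < cross_entropy(true_label, bad_pred)
```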
In the embodiment of the present application, the prediction of the language category of a sample by the first discriminator 23 is realized by the first language discrimination layer 32; therefore, the loss function corresponding to the first discriminator 23 is the loss function corresponding to the first language discrimination layer 32.

Assuming the loss function of the dialect recognition model is denoted by L, the loss function of the first classifier 22 by L1, and the loss function of the first language discrimination layer 32 by L2, then:

L = a × L1 + b × L2.

Optionally, L = L1 + L2, i.e., a = b = 1.
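The weighted combination can be expressed directly; the weight values below are illustrative:

```python
def total_loss(l1, l2, a=1.0, b=1.0):
    """L = a*L1 + b*L2: weighted sum of the classifier loss L1 and the
    language discrimination layer loss L2; a = b = 1 gives L = L1 + L2."""
    return a * l1 + b * l2

assert abs(total_loss(0.4, 0.6) - 1.0) < 1e-9                 # a = b = 1
assert abs(total_loss(1.0, 2.0, a=0.5, b=0.25) - 1.0) < 1e-9  # unequal weights
```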
During model training, the model parameters are updated by minimizing L, L1 and L2. Minimizing L gives the dialect recognition model the ability to recognize multiple dialects; minimizing L1 gives the first classifier 22 a stronger acoustic discrimination capability; and minimizing L2 gives the first discriminator 23 a stronger dialect discrimination capability. Meanwhile, owing to the effect of the gradient reversal layer in the first discriminator 23, the features generated by the first feature extractor 21 acquire dialect confusability. Dialect confusability means that the distributions of the characterization features generated by the first feature extractor 21 for the dialect features of different dialect types are consistent, so that the first discriminator 23 cannot discriminate which dialect the input feature belongs to. In the above adversarial training process, the capability of the first discriminator 23 becomes stronger and stronger, which pushes the dialect confusability of the features generated by the first feature extractor 21 to improve until the first discriminator 23 cannot discriminate; as the dialect confusability of the generated features improves, the first discriminator 23 in turn further improves its discrimination capability in order to discriminate accurately. Finally an equilibrium state is reached: when the features extracted by the first feature extractor 21 are good enough, the first discriminator 23 cannot discriminate, and at that point the feature distributions extracted by the first feature extractor 21 are almost consistent. It is then no longer necessary to distinguish the dialects of different languages during speech recognition; speech recognition is performed directly, achieving the effect of multi-dialect recognition.
In the foregoing embodiments, the dialect recognition model is described as being trained with training corpora labeled with voice content and the dialect type to which they belong. In the course of realizing the present application, the inventors found that if dialect attribute information is introduced into the training process of the dialect recognition model, the recognition effect of the dialect recognition model can be further improved. The dialect attribute information may specifically be the dialect attribute category to which the voice data belongs, such as a dialect group. Taking Chinese as an example, Chinese dialects can be divided into seven major groups: Mandarin, Xiang (Hunan), Gan (Jiangxi), Wu, Min (Fujian), Yue (Guangdong) and Hakka. Mandarin can be further subdivided into Northern Mandarin (the collective name for Beijing Mandarin, Northeastern Mandarin, Jiao-Liao Mandarin, Ji-Lu Mandarin, Central Plains Mandarin and Lan-Yin Mandarin), Southwestern Mandarin and Jianghuai Mandarin.

Based on this, the dialect recognition model provided by the embodiment of the present application can be trained with training corpora labeled with at least the voice content, the dialect type and the dialect attribute category.
That is, in the training process of the dialect recognition model, in addition to recognizing the voice content of the training sample, the dialect type to which the training sample belongs and the dialect attribute category to which the training sample belongs are also discriminated respectively, and the dialect recognition model is optimized based on the recognition result of the voice content, the discrimination result of the dialect type and the discrimination result of the dialect attribute category, so as to further improve the accuracy of the recognition result of the dialect recognition model.

Based on this, another structural schematic diagram of the dialect recognition model provided by the embodiment of the present application is shown in Fig. 4, which may include:
a second feature extractor 41, a second classifier 42 and a second discriminator 43; wherein,

the input of the second feature extractor 41 is the dialect recognition feature of each speech frame extracted in step S12, and the output of the second feature extractor 41 is the characterization feature corresponding to each speech frame; the characterization feature has more distinctiveness than the dialect recognition feature. That is, the second feature extractor 41 is used to extract, from the dialect recognition feature, a feature characterizing the intrinsic characteristics of the input voice data (i.e., the voice data received in step S11); this feature is a high-level feature for dialect recognition. Specifically, for the first speech frame, when the second feature extractor 41 receives the dialect recognition feature of the first speech frame, it extracts the characterization feature corresponding to the first speech frame from the dialect recognition feature of the first speech frame; the characterization feature corresponding to the first speech frame characterizes the intrinsic characteristics of the first speech frame.

The concrete form of the second feature extractor 41 may be a CNN, or an RNN or other deep neural network.
The input of the second classifier 42 is the characterization feature output by the second feature extractor 41, and the output of the second classifier 42 is the recognition result of the voice data; i.e., for the first speech frame, the second classifier 42 is used to determine the voice content of the first speech frame. Specifically, for the first speech frame, the input of the second classifier 42 is the characterization feature corresponding to the first speech frame, and the output of the second classifier 42 is a state representation of the voice content corresponding to the first speech frame.

The concrete form of the second classifier 42 may be a shallow neural network, for example, a two-layer DNN; the present application does not specifically limit the concrete form of the second classifier 42. The concrete form of the output of the second classifier 42 may be any one of a word, a syllable, a phoneme and a phoneme state; which form is used is related to the modeling unit of the second classifier 42. The specific implementation of the modeling unit of the second classifier 42 may refer to the implementation of the modeling unit of the aforementioned first classifier 22, and is not repeated here.
The input of the second discriminator 43 is the characterization feature output by the second feature extractor 41, and the output of the second discriminator 43 is the dialect type to which the voice data belongs and the dialect attribute category to which the voice data belongs. Specifically, for the first speech frame, the input of the second discriminator 43 is the characterization feature corresponding to the first speech frame, and the output of the second discriminator 43 is a state representation of the dialect type corresponding to the first speech frame and a state representation of the dialect attribute category to which the first speech frame belongs; i.e., the second discriminator 43 is used to determine which dialect type and which dialect attribute category the dialect recognition feature of the first speech frame input to the dialect recognition model characterizes.

Similar to the aforementioned first discriminator 23, in the embodiment of the present application the second discriminator 43 is mainly used in the training stage of the dialect recognition model to perform optimization training on the model. Therefore, in the process of performing speech recognition with the trained dialect recognition model, the discrimination result output by the second discriminator 43 may or may not be output to the user. Alternatively, a viewing interface may be provided for the user, and when the user operates the viewing interface, the discrimination result output by the second discriminator 43 is output to the user.
In this embodiment, a back-propagation algorithm is used when training the dialect recognition model. The algorithm consists of two processes: forward propagation of the signal and backward propagation of the error. Forward propagation of the signal refers to the process in which the dialect recognition model receives the dialect recognition feature of a sample and outputs the speech recognition result of the sample; the signal propagation direction is from the second feature extractor 41 to the second classifier 42, and from the second feature extractor 41 to the second discriminator 43. Backward propagation of the error refers to the process in which the error between the dialect type discrimination result and dialect attribute category discrimination result of the sample output by the second discriminator 43 and the true dialect type and dialect attribute category of the sample is returned to the input end of the dialect recognition model; the signal transmission direction is from the second discriminator 43 to the second feature extractor 41.
A specific implementation of the optimization training of the dialect recognition model by the second discriminator 43 is described below with reference to the specific structure of the second discriminator 43.

Referring to Fig. 5, Fig. 5 is a structural schematic diagram of a second discriminator 43 provided by an embodiment of the present application, which may include:

a second gradient reversal layer 51 (for convenience of description, the second gradient reversal layer 51 is denoted by R), a second language discrimination layer 52 and an attribute discrimination layer 53; wherein,
the definition of the second gradient reversal layer 51 is identical to that of the first gradient reversal layer 31, namely:

R(z) = z  (1)

Formula (1) is the forward-propagation formula of the second gradient reversal layer 51, where z is the input of the second gradient reversal layer 51, i.e., the characterization feature f output by the second feature extractor 41, and R(z) is the output of the second gradient reversal layer 51; here R(·) denotes passing through the R layer without any processing. It can be seen that in the forward propagation process, the output of the second gradient reversal layer 51 is its input, i.e., the second gradient reversal layer 51 performs no processing on the input feature and passes it directly to the next layers (i.e., the second language discrimination layer 52 and the attribute discrimination layer 53). Specifically, for the first speech frame, the input of the second gradient reversal layer 51 is the characterization feature of the first speech frame, and its output remains the characterization feature of the first speech frame.

∂R(z)/∂z = -αE  (2)

Formula (2) is the back-propagation formula of the second gradient reversal layer 51, where ∂R(z)/∂z is the gradient of the second gradient reversal layer 51, E is an identity matrix, and α is a preset hyperparameter. It can be seen that the gradient of the second gradient reversal layer 51 is the product of the hyperparameter and a negative identity matrix.
According to the chain rule, the gradient of the output equals the gradient of the input multiplied by the layer's own gradient; the output gradient of the second gradient reversal layer 51 therefore equals the input gradient (i.e., the sum of the gradient of the second language discrimination layer 52 and the gradient of the attribute discrimination layer 53) multiplied by -αE. Owing to the negative sign, this can be regarded as negating the value of the input gradient before transmitting it to the preceding layer (i.e., the second feature extractor 41). The second gradient reversal layer 51 does not process the input feature during forward propagation, and negates the input gradient during backward propagation (i.e., multiplies the input gradient by -αE), so that the sign of the result is opposite to that of the input gradient; hence this layer is called a gradient reversal layer.

According to the gradient descent method, when the update direction of the model parameters is the gradient direction (i.e., the gradient is not negated), the model reaches the optimal solution at the fastest speed. In the embodiment of the present application, however, the second gradient reversal layer 51 reverses the gradients of the second language discrimination layer 52 and the attribute discrimination layer 53 before transmitting them to the second feature extractor 41, so that the parameter update direction of the second feature extractor 41 is opposite to the gradient directions of the second language discrimination layer 52 and the attribute discrimination layer 53. That is, the training objective of the second language discrimination layer 52 is to identify the dialect type of a sample as accurately as possible, the training objective of the attribute discrimination layer 53 is to identify the dialect attribute category of the sample as accurately as possible, and the training objective of the second feature extractor 41 is to make the dialect type and dialect attribute category of the sample as hard to identify as possible. Adversarial training is thus introduced through the second gradient reversal layer 51.
The second language discrimination layer 52 may be a shallow neural network, for example, a two-layer DNN; the present application does not specifically limit the concrete network form of the second language discrimination layer 52. The input of the second language discrimination layer 52 is the characterization feature output by the second gradient reversal layer 51, and its output is the dialect type to which the voice data belongs. Specifically, for the first speech frame, the input of the second language discrimination layer 52 is the characterization feature of the first speech frame output by the second gradient reversal layer 51, and the output of the second language discrimination layer 52 is a state representation of the dialect type to which the first speech frame belongs.

The attribute discrimination layer 53 may also be a shallow neural network, for example, a two-layer DNN; the present application does not specifically limit the concrete network form of the attribute discrimination layer 53. The input of the attribute discrimination layer 53 is the characterization feature output by the second gradient reversal layer 51, and its output is the dialect attribute category to which the voice data belongs. Specifically, for the first speech frame, the input of the attribute discrimination layer 53 is the characterization feature of the first speech frame output by the second gradient reversal layer 51, and the output of the attribute discrimination layer 53 is a state representation of the dialect attribute category to which the first speech frame belongs.
As can be seen from the foregoing, the present application introduces adversarial training through the second gradient reversal layer 51 for two purposes. On the one hand, the second language discrimination layer 52 and the attribute discrimination layer 53 are trained to judge more accurately which dialect, and which dialect attribute category, the feature input to the dialect recognition model belongs to; on the other hand, the gradients of the second language discrimination layer 52 and the attribute discrimination layer 53 are passed backward to the front through the second gradient reversal layer 51, training the second feature extractor 41 to generate features with less language distinction and less attribute category distinction, i.e., making the conditional probability distributions of the voice content characterized by the extracted features over the different dialect types consistent, and making the conditional probability distributions over the dialect attribute categories of the dialect to which the voice content belongs consistent. That the conditional probability distributions over the dialect attribute categories are consistent means that different dialects belong to the same attribute category; for example, the Henan dialect and the Northeastern dialect both belong to Northern Mandarin.
In order that the feature distribution learned by the second language discrimination layer 52 is related to the conditional probability distribution of the dialect type given the voice content, a control gate is introduced into the second language discrimination layer 52 in the embodiment of the present application; through the control gate, the second language discrimination layer 52 is controlled to learn the conditional probability distribution of the dialect type for different voice contents. In the embodiment of the present application, the input of the control gate is the output of the second classifier 42; the specific implementation of the control gate may refer to the implementation of the control gate in the first language discrimination layer 32, and is not repeated here.

In order that the feature distribution learned by the attribute discrimination layer 53 is related to the conditional probability distribution of the dialect attribute category given the voice content, a control gate is also introduced into the attribute discrimination layer 53 in the embodiment of the present application; through the control gate, the attribute discrimination layer 53 is controlled to learn the conditional probability distribution of the dialect attribute category for different voice contents. In the embodiment of the present application, the input of the control gate is the output of the second classifier 42, and the structure of the control gate in the attribute discrimination layer 53 is identical to that of the control gate in the second language discrimination layer 52, as shown in formulas (3)-(4):

h̃_i = g(c_i) ⊙ h_i  (3)

g(c_i) = σ(Vc_i + b)  (4)

For ease of description, the following explanation still takes as an example the case where the second classifier 42 uses phonemes as the modeling unit.
In the attribute discrimination layer 53, the meaning of formulas (3)-(4) is as follows: for any layer of the attribute discrimination layer 53 (denoted the k-th layer for ease of description), the input of the k-th layer is obtained from the output of the control gate together with the feature output by the (k-1)-th layer; the input of the control gate is the vector output by the second classifier 42 for the feature output by the (k-1)-th layer.
Specifically, in the attribute discrimination layer 53, h_i is the feature output by the (k-1)-th layer of the attribute discrimination layer 53 corresponding to the i-th speech frame, and c_i is the one-hot vector output by the second classifier 42 for h_i, i.e. the phoneme vector corresponding to the i-th speech frame. For example, assuming that the second classifier 42 uses 83 phonemes as its modeling units, c_i is an 83-dimensional vector in which each dimension corresponds to one phoneme; if the phoneme corresponding to the i-th speech frame is a, then in the 83-dimensional vector the dimension corresponding to a is 1 and all other dimensions are 0. g(c_i) is the control gate, where σ is the activation function, V is the matrix weight, and b is the bias weight. That is, the phoneme vector c_i passes through a matrix transformation to yield the control gate, and the control gate fuses the phoneme corresponding to the (k-1)-th layer's feature into the input of the k-th layer, so that the attribute discrimination layer 53 learns information related to the conditional probability distribution of dialect attribute categories given the phoneme. It should be noted that if k is 1, the (k-1)-th layer refers to the second gradient reversal layer 51.
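The gate computation above can be sketched in a few lines of numpy. This is an illustrative sketch, not the patent's implementation: the element-wise fusion of the gate output with the (k-1)-th layer feature is an assumption (formula (3) is not reproduced legibly in this text), and all dimensions, names and random values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def control_gate(c_i, V, b):
    # Formula (4): g(c_i) = sigma(V c_i + b)
    return sigmoid(V @ c_i + b)

def gated_input(h_i, c_i, V, b):
    # Fuse the (k-1)-th layer feature h_i with the phoneme one-hot vector c_i.
    # Element-wise gating is an assumed fusion form for illustration only.
    return control_gate(c_i, V, b) * h_i

rng = np.random.default_rng(0)
num_phonemes, dim = 83, 16
c_i = np.zeros(num_phonemes)
c_i[4] = 1.0                                  # one-hot phoneme vector for frame i
V = rng.standard_normal((dim, num_phonemes)) * 0.1
b = np.zeros(dim)
h_i = rng.standard_normal(dim)                # feature from layer k-1
x_k = gated_input(h_i, c_i, V, b)             # input to layer k
```

Because c_i is one-hot, V @ c_i simply selects one column of V, so the gate is a learned per-phoneme vector squashed through the sigmoid.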
The output layer of the attribute discrimination layer 53 has Q nodes, where Q = P*C, P is the number of dialect attribute categories, and C is the total number of modeling units of the second classifier 42, for example the total number of phonemes. The Q nodes are divided into C groups; each group of nodes corresponds to one phoneme and characterizes how that phoneme is discriminated among the dialect attribute categories, typically the probability that the phoneme belongs to each dialect attribute category. Each time the node parameters of the output layer of the attribute discrimination layer 53 are updated, only the parameters of the group of nodes corresponding to the recognition result of the dialect recognition feature input to the dialect recognition model are updated.
For example, assume that the model is trained on 7 dialect attribute categories in total (corresponding to 7 major dialect sections) and the pronunciation modeling units are 83 phonemes; then Q = 7*83 = 581 nodes, with 7 nodes per group, i.e. every 7 nodes correspond to one phoneme and characterize the probability that the phoneme belongs to each of the 7 dialect attribute categories. Each time the node parameters of the output layer of the attribute discrimination layer 53 are updated, the phoneme prediction result output by the second classifier 42 for the dialect recognition feature of the i-th speech frame input to the dialect recognition model is determined first, and then only the parameters of the group of 7 nodes corresponding to that phoneme prediction result are updated.
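The grouped output layer and its selective update can be illustrated as follows. The group layout inside the weight matrix, the input dimension, the learning rate and the dummy gradient are all assumptions for illustration only.

```python
import numpy as np

P, C = 7, 83          # dialect attribute classes, phonemes
Q = P * C             # 581 output nodes, one group of P nodes per phoneme

rng = np.random.default_rng(1)
W = rng.standard_normal((Q, 32)) * 0.01   # output-layer weights, 32-dim input (assumed)

def group_slice(phoneme_id):
    # Nodes laid out group-by-group: group p covers rows [p*P, p*P + P).
    return slice(phoneme_id * P, phoneme_id * P + P)

def update_group(W, phoneme_id, grad, lr=0.1):
    # Only the P-node group matching the second classifier's phoneme
    # prediction receives a gradient step; all other groups stay frozen.
    W = W.copy()
    W[group_slice(phoneme_id)] -= lr * grad
    return W

phoneme_pred = 10                  # predicted phoneme for frame i (hypothetical)
grad = np.ones((P, 32))            # dummy gradient for that group
W_new = update_group(W, phoneme_pred, grad)
```

Freezing the other C-1 groups keeps each update targeted at the phoneme actually observed in the frame, matching the selective-update rule described above.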
The following describes how the loss function of the dialect recognition model is set when the attribute discrimination layer 53 is introduced.
When the attribute discrimination layer 53 is introduced, one implementation of setting the loss function for the dialect recognition model in the embodiment of the present application may be as follows: a loss function is set for each of the second classifier 42, the second language discrimination layer 52 and the attribute discrimination layer 53, and the loss function of the dialect recognition model is a weighted combination of the loss function of the second classifier 42, the loss function of the second language discrimination layer 52 and the loss function of the attribute discrimination layer 53.
The loss function of the second classifier 42 characterizes the difference between the speech content of a sample predicted by the second classifier 42 and the true speech content of the sample. The loss function of the second language discrimination layer 52 characterizes the difference between the language category of a sample predicted by the second language discrimination layer 52 and the true language category of the sample. The loss function of the attribute discrimination layer 53 characterizes the difference between the dialect attribute category of a sample predicted by the attribute discrimination layer 53 and the true dialect attribute category of the sample.
The loss function of the second classifier 42, the loss function of the second language discrimination layer 52 and the loss function of the attribute discrimination layer 53 may be the same or different. When the loss function of the second classifier 42, the loss function of the second language discrimination layer 52 and the loss function of the attribute discrimination layer 53 are weighted, the weight of the loss function of the second classifier 42, the weight of the loss function of the second language discrimination layer 52 and the weight of the loss function of the attribute discrimination layer 53 may likewise be the same or different.
Optionally, the loss function of the second classifier 42, the loss function of the second language discrimination layer 52 and the loss function of the attribute discrimination layer 53 may each be a cross-entropy function.
Assume that the loss function of the dialect recognition model is denoted L, the loss function of the second classifier 42 is denoted L1, the loss function of the second language discrimination layer 52 is denoted L2, and the loss function of the attribute discrimination layer 53 is denoted L3; then:
L = a×L1 + b×L2 + c×L3.
Optionally, L = L1 + L2 + L3, i.e. a = b = c = 1.
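A minimal sketch of the weighted loss L = a×L1 + b×L2 + c×L3, with each term a cross-entropy as suggested above. The probability vectors and target indices are made up purely for illustration.

```python
import numpy as np

def cross_entropy(pred, target_idx, eps=1e-12):
    # Cross-entropy of a single softmax prediction against a class index.
    return -np.log(pred[target_idx] + eps)

def total_loss(L1, L2, L3, a=1.0, b=1.0, c=1.0):
    # L = a*L1 + b*L2 + c*L3; a = b = c = 1 recovers the unweighted sum.
    return a * L1 + b * L2 + c * L3

p_phone = np.array([0.7, 0.2, 0.1])   # second classifier output (illustrative)
p_lang  = np.array([0.6, 0.4])        # language discrimination layer output
p_attr  = np.array([0.5, 0.3, 0.2])   # attribute discrimination layer output
L1 = cross_entropy(p_phone, 0)
L2 = cross_entropy(p_lang, 0)
L3 = cross_entropy(p_attr, 1)
L = total_loss(L1, L2, L3)            # a = b = c = 1 case
```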
During model training, the model parameters are updated by minimizing L, L1 and L2+L3. Minimizing L gives the dialect recognition model the ability to recognize multiple dialects; minimizing L1 gives the second classifier 42 a stronger acoustic discrimination capability; minimizing L2+L3 gives the second discriminator 43 a stronger dialect discrimination capability. At the same time, owing to the effect of the gradient reversal layer in the second discriminator 43, the features generated by the second feature extractor 41 become dialect-confusable, where dialect confusability means that the distributions of the characteristic features generated by the second feature extractor 41 for the dialect features of different dialect types are consistent, so that the second discriminator 43 cannot tell which dialect an input feature belongs to. During this adversarial training, the second discriminator 43 becomes stronger and stronger, which in turn drives the dialect confusability of the features generated by the second feature extractor 41 to keep improving, until the second discriminator 43 cannot discriminate; as the dialect confusability of the features generated by the second feature extractor 41 improves, the second discriminator 43 further improves its discrimination capability in order to discriminate accurately, until an equilibrium state is finally reached: when the features extracted by the second feature extractor 41 are good enough, the second discriminator 43 cannot discriminate. At this point the distributions of the features extracted by the second feature extractor 41 are nearly identical, so there is no longer any need to distinguish the dialects of different languages during speech recognition; speech recognition can be performed directly, achieving the effect of multi-dialect recognition.
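The gradient reversal behaviour described above can be sketched as a layer that is the identity in the forward pass and flips the sign of the incoming gradient in the backward pass. This is a schematic numpy illustration of the mechanism, not the patent's implementation.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; negates the gradient during backprop."""

    def forward(self, x):
        # Features pass through unchanged on the way to the discriminator.
        return x

    def backward(self, grad_from_discriminator):
        # The discriminator's gradient is sign-flipped before it reaches the
        # feature extractor, pushing the extractor toward features the
        # discriminator cannot tell apart (dialect-confusable features).
        return -grad_from_discriminator

grl = GradientReversal()
features = np.array([0.5, -1.2, 3.0])   # illustrative extractor output
grad = np.array([0.1, -0.2, 0.3])       # illustrative discriminator gradient
out = grl.forward(features)
back = grl.backward(grad)
```

With this single sign flip, minimizing the discriminator's loss simultaneously maximizes it with respect to the extractor's parameters, which is what produces the adversarial equilibrium described above.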
Furthermore, the dialect attribute category has a certain correlation with the dialect type: there is a one-to-one or one-to-many relationship between the dialect attribute category and the dialect type. For example, if the dialect section to which a dialect belongs is used as the dialect attribute category, then there is a one-to-many relationship between dialect sections and dialect types: Sichuan dialect belongs to Southwestern Mandarin, while Henan dialect and Northeastern dialect belong to Northern Mandarin. If a sample is judged to be of the dialect type Sichuan dialect, then the attribute category judgment result should be Southwestern Mandarin; if it is not, the dialect type judgment result and the dialect attribute judgment result are inconsistent and need to be optimized. In order to optimize this error, when setting the loss function for the dialect recognition model, the present application introduces a language-attribute consistency loss function, through which the consistency learning of the feature distributions is further reinforced. Here the language-attribute consistency loss L4 is defined as follows:
L4 = (1/I) Σ_{i=1}^{I} D_KL(q_out_i ∥ q'_out_i)
where I denotes the total number of dialect features of the speech frames input to the dialect recognition model at one time, D_KL is the KL divergence (Kullback-Leibler divergence), q_out_i is the output of the attribute discrimination layer 53 for the feature of the i-th speech frame, and q'_out_i is the output obtained by converting the output of the second language discrimination layer 52 for the feature of the i-th speech frame. What the second language discrimination layer 52 outputs is a state representation characterizing the dialect type to which the i-th speech frame belongs, while what the attribute discrimination layer 53 outputs is a state representation characterizing the dialect attribute category to which the i-th speech frame belongs; therefore, when computing the KL divergence, the two need to be normalized. In the embodiment of the present application, normalization means converting the state representation output by the second language discrimination layer 52, which characterizes the dialect type to which the i-th speech frame belongs, into a state representation characterizing the dialect attribute category to which the i-th speech frame belongs. This conversion may be performed according to a preset transformation rule.
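The consistency loss and the normalization (type-to-section conversion) step can be sketched as follows. The mapping table, the argument order of the KL divergence, and the probability values are assumptions for illustration; the patent only specifies that the conversion follows a preset transformation rule.

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    # D_KL(q || p) for discrete distributions; eps guards against log(0).
    q = np.asarray(q) + eps
    p = np.asarray(p) + eps
    return float(np.sum(q * np.log(q / p)))

def consistency_loss(q_out, q_out_conv):
    # Average per-frame KL divergence between the attribute discrimination
    # layer's output and the converted language-layer output.
    return sum(kl_divergence(a, b) for a, b in zip(q_out, q_out_conv)) / len(q_out)

# Hypothetical transformation rule: dialect type -> dialect section, e.g.
# Sichuan -> Southwestern Mandarin, Henan/Northeastern -> Northern Mandarin.
TYPE_TO_SECTION = {0: 0, 1: 1, 2: 1}

def convert_language_output(lang_probs, num_sections=2):
    # Normalization step: fold the per-type distribution into a
    # per-section distribution using the fixed mapping.
    out = np.zeros(num_sections)
    for t, p in enumerate(lang_probs):
        out[TYPE_TO_SECTION[t]] += p
    return out

lang_probs = np.array([0.6, 0.3, 0.1])   # language discrimination layer output
attr_probs = np.array([0.55, 0.45])      # attribute discrimination layer output
q_conv = convert_language_output(lang_probs)
L4 = consistency_loss([attr_probs], [q_conv])
```

When the two layers agree (the predicted type maps to the predicted section), q_conv and attr_probs coincide and L4 approaches zero; disagreement inflates L4, which is exactly the inconsistency the loss penalizes.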
When the language-attribute consistency loss function is introduced, the loss function of the dialect recognition model is a weighted combination of the loss function of the second classifier 42, the loss function of the second language discrimination layer 52, the loss function of the attribute discrimination layer 53, and the language-attribute consistency loss function of the second language discrimination layer 52 and the attribute discrimination layer 53. This may be expressed by the formula:
L = a×L1 + b×L2 + c×L3 + d×L4.
Optionally, L = L1 + L2 + L3 + L4, i.e. a = b = c = d = 1.
During model training, the model parameters are updated by minimizing L, L1 and L2+L3+L4. Minimizing L gives the dialect recognition model the ability to recognize multiple dialects; minimizing L1 gives the second classifier 42 a stronger acoustic discrimination capability; minimizing L2+L3+L4 gives the second discriminator 43 a stronger dialect discrimination capability. At the same time, owing to the effect of the gradient reversal layer in the second discriminator 43, the features generated by the second feature extractor 41 become dialect-confusable, where dialect confusability means that the distributions of the characteristic features generated by the second feature extractor 41 for the dialect features of different dialect types are consistent, so that the second discriminator 43 cannot tell which dialect an input feature belongs to. During this adversarial training, the second discriminator 43 becomes stronger and stronger, which in turn drives the dialect confusability of the features generated by the second feature extractor 41 to keep improving, until the second discriminator 43 cannot discriminate; as the dialect confusability of the features generated by the second feature extractor 41 improves, the second discriminator 43 further improves its discrimination capability in order to discriminate accurately, until an equilibrium state is finally reached: when the features extracted by the second feature extractor 41 are good enough, the second discriminator 43 cannot discriminate. At this point the distributions of the features extracted by the second feature extractor 41 are nearly identical, so there is no longer any need to distinguish the dialects of different languages during speech recognition; speech recognition can be performed directly, achieving the effect of multi-dialect recognition.
Corresponding to the method embodiments, the embodiment of the present application also provides a multi-dialect speech recognition apparatus. A schematic structural diagram of the multi-dialect speech recognition apparatus provided by the embodiment of the present application is shown in Fig. 6, and may include:
a receiving module 61, an extraction module 62 and a recognition module 63; wherein
the receiving module 61 is configured to receive speech data;
the extraction module 62 is configured to extract a dialect recognition feature from the speech data;
the recognition module 63 is configured to input the dialect recognition feature into a pre-built dialect recognition model to obtain a recognition result of the speech data; the dialect recognition model is trained using training corpora labeled with at least speech content and the dialect type to which they belong.
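The three-module data flow (receive, extract features, recognize) can be sketched as below. The feature computation (per-frame log-energy) and the stand-in model are placeholders to show the flow only; the patent does not prescribe a particular feature or model interface.

```python
import numpy as np

class ReceivingModule:
    def receive(self, audio):
        # Accept raw speech data as a numeric array.
        return np.asarray(audio, dtype=np.float64)

class ExtractionModule:
    def extract(self, audio, frame_len=4):
        # Placeholder "dialect recognition feature": per-frame log-energy.
        # Real systems would use filterbank/MFCC-style features.
        n = len(audio) // frame_len * frame_len
        frames = audio[:n].reshape(-1, frame_len)
        return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

class RecognitionModule:
    def __init__(self, model):
        self.model = model                    # pre-built dialect recognition model
    def recognize(self, feats):
        return self.model(feats)

# Hypothetical stand-in "model": maps any non-empty feature array to a token.
model = lambda feats: "hello" if feats.size > 0 else ""
receiver, extractor, recognizer = ReceivingModule(), ExtractionModule(), RecognitionModule(model)

audio = receiver.receive([0.1, -0.2, 0.3, 0.05] * 4)
feats = extractor.extract(audio)
result = recognizer.recognize(feats)
```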
The multi-dialect speech recognition apparatus provided by the embodiment of the present application recognizes dialects through a pre-built dialect recognition model, where the dialect recognition model is trained on training corpora covering multiple dialects. The training process of the dialect recognition model is not limited to the speech content of the corpora; it also introduces the dialect type to which each dialect belongs and optimizes the dialect recognition model in combination with that dialect type, so that the dialect recognition model can accurately recognize multiple dialects. The user therefore no longer needs to switch speech recognition modes, which simplifies user operation and improves the accuracy and efficiency of multi-dialect recognition.
In an optional embodiment, the dialect recognition model is trained using training corpora labeled with at least speech content, the dialect type to which they belong and the dialect attribute category to which they belong.
In an optional embodiment, the dialect recognition model includes: a feature extractor, a classifier and a discriminator; wherein
the feature extractor is configured to obtain the dialect recognition feature and output a characteristic feature, the characteristic feature being a feature more discriminative than the dialect recognition feature;
the classifier is configured to obtain the characteristic feature and output the recognition result of the speech data;
the discriminator is configured to obtain the characteristic feature and output the dialect type to which the speech data belongs, or output the dialect type to which the speech data belongs and the dialect attribute category to which the speech data belongs.
In an optional embodiment, the discriminator includes: a gradient reversal layer and a language discrimination layer; or, the discriminator includes: a gradient reversal layer, a language discrimination layer and an attribute discrimination layer; wherein
the gradient reversal layer is configured to obtain the characteristic feature and output the characteristic feature;
the language discrimination layer is configured to obtain the characteristic feature output by the gradient reversal layer and output the dialect type to which the speech data belongs;
the attribute discrimination layer is configured to obtain the characteristic feature output by the gradient reversal layer and output the dialect attribute category to which the speech data belongs.
In an optional embodiment, the gradient reversal layer is configured, when the dialect recognition model is trained, to negate the gradient of the language discrimination layer and transmit it to the feature extractor; or, the gradient reversal layer is configured, when the dialect recognition model is trained, to negate the gradients of the language discrimination layer and the attribute discrimination layer and transmit them to the feature extractor, so as to update the parameters of the feature extractor.
In an optional embodiment, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier and the loss function of the discriminator.
In an optional embodiment, if the discriminator includes a gradient reversal layer and a language discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier and the loss function of the language discrimination layer;
or,
if the discriminator includes a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier, the loss function of the language discrimination layer and the loss function of the attribute discrimination layer.
In an optional embodiment, if the discriminator includes a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier, the loss function of the language discrimination layer, the loss function of the attribute discrimination layer, and the language-attribute consistency loss function of the language discrimination layer and the attribute discrimination layer.
In an optional embodiment, the language discrimination layer is a neural network including a control gate; the number of layers of the neural network is greater than 1;
the input of each layer of the neural network is obtained from the output of the control gate and the feature output by the previous layer;
the input of the control gate is the vector output by the classifier for the feature output by the previous layer.
The multi-dialect speech recognition apparatus provided by the embodiment of the present application may be applied to multi-dialect recognition devices such as PC terminals, smartphones, translators, robots, smart home devices (household appliances), remote controls, cloud platforms, servers and server clusters. Optionally, Fig. 7 shows a hardware block diagram of a multi-dialect recognition device. Referring to Fig. 7, the hardware configuration of the multi-dialect recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4.
In the embodiment of the present application, the number of processors 1, communication interfaces 2, memories 3 and communication buses 4 is each at least one, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4.
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC, Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present invention, etc.
The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one magnetic disk memory.
The memory stores a program, and the processor may call the program stored in the memory, the program being configured to:
receive speech data;
extract a dialect recognition feature from the speech data;
input the dialect recognition feature into a pre-built dialect recognition model to obtain a recognition result of the speech data; the dialect recognition model is trained using training corpora labeled with at least speech content and the dialect type to which they belong.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
The embodiment of the present application also provides a storage medium storing a program executable by a processor, the program being configured to:
receive speech data;
extract a dialect recognition feature from the speech data;
input the dialect recognition feature into a pre-built dialect recognition model to obtain a recognition result of the speech data; the dialect recognition model is trained using training corpora labeled with at least speech content and the dialect type to which they belong.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or device. In the absence of further limitations, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device including that element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (12)
1. A multi-dialect speech recognition method, characterized by comprising:
receiving speech data;
extracting a dialect recognition feature from the speech data;
inputting the dialect recognition feature into a pre-built dialect recognition model to obtain a recognition result of the speech data;
wherein the dialect recognition model is trained using training corpora labeled with at least speech content and the dialect type to which they belong.
2. The method according to claim 1, characterized in that the dialect recognition model is trained using training corpora labeled with at least speech content, the dialect type to which they belong and the dialect attribute category to which they belong.
3. The method according to claim 1 or 2, characterized in that the dialect recognition model includes: a feature extractor, a classifier and a discriminator; wherein
the input of the feature extractor is the dialect recognition feature and its output is a characteristic feature, the characteristic feature being a feature more discriminative than the dialect recognition feature;
the input of the classifier is the characteristic feature and its output is the recognition result of the speech data;
the input of the discriminator is the characteristic feature and its output is the dialect type to which the speech data belongs, or its output is the dialect type to which the speech data belongs and the dialect attribute category to which the speech data belongs.
4. The method according to claim 3, characterized in that the discriminator includes: a gradient reversal layer and a language discrimination layer; or, the discriminator includes: a gradient reversal layer, a language discrimination layer and an attribute discrimination layer; wherein
the input of the gradient reversal layer is the characteristic feature and its output is the characteristic feature;
the input of the language discrimination layer is the characteristic feature output by the gradient reversal layer and its output is the dialect type to which the speech data belongs;
the input of the attribute discrimination layer is the characteristic feature output by the gradient reversal layer and its output is the dialect attribute category to which the speech data belongs.
5. The method according to claim 4, characterized in that, when the dialect recognition model is trained, the gradient reversal layer negates the gradient of the language discrimination layer and transmits it to the feature extractor; or, the gradient reversal layer negates the gradients of the language discrimination layer and the attribute discrimination layer and transmits them to the feature extractor, so as to update the parameters of the feature extractor.
6. The method according to claim 4, characterized in that, when the dialect recognition model is trained, the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier and the loss function of the discriminator.
7. The method according to claim 6, characterized in that, if the discriminator includes a gradient reversal layer and a language discrimination layer, then when the dialect recognition model is trained, the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier and the loss function of the language discrimination layer;
or,
if the discriminator includes a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, then when the dialect recognition model is trained, the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier, the loss function of the language discrimination layer and the loss function of the attribute discrimination layer.
8. The method according to claim 6, characterized in that, if the discriminator includes a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, then when the dialect recognition model is trained, the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier, the loss function of the language discrimination layer, the loss function of the attribute discrimination layer, and the language-attribute consistency loss function of the language discrimination layer and the attribute discrimination layer.
9. The method according to any one of claims 4-8, characterized in that the language discrimination layer is a neural network including a control gate; the number of layers of the neural network is greater than 1;
the input of each layer of the neural network is obtained from the output of the control gate and the feature output by the previous layer;
the input of the control gate is the vector output by the classifier for the feature output by the previous layer.
10. A multi-dialect speech recognition apparatus, characterized by comprising:
a receiving module, configured to receive speech data;
an extraction module, configured to extract a dialect recognition feature from the speech data;
a recognition module, configured to input the dialect recognition feature into a pre-built dialect recognition model to obtain a recognition result of the speech data; wherein the dialect recognition model is trained using training corpora labeled with at least speech content and the dialect type to which they belong.
11. A multi-dialect recognition device, characterized by including a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the multi-dialect speech recognition method according to any one of claims 1-9.
12. A readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the multi-dialect speech recognition method according to any one of claims 1-9 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910852557.0A CN110517664B (en) | 2019-09-10 | 2019-09-10 | Multi-party identification method, device, equipment and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910852557.0A CN110517664B (en) | 2019-09-10 | 2019-09-10 | Multi-party identification method, device, equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110517664A true CN110517664A (en) | 2019-11-29 |
CN110517664B CN110517664B (en) | 2022-08-05 |
Family
ID=68632012
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910852557.0A Active CN110517664B (en) | 2019-09-10 | 2019-09-10 | Multi-party identification method, device, equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110517664B (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111105786A (en) * | 2019-12-26 | 2020-05-05 | 苏州思必驰信息科技有限公司 | Multi-sampling-rate voice recognition method, device, system and storage medium |
CN111292727A (en) * | 2020-02-03 | 2020-06-16 | 北京声智科技有限公司 | Voice recognition method and electronic equipment |
CN111369981A (en) * | 2020-03-02 | 2020-07-03 | 北京远鉴信息技术有限公司 | Dialect region identification method and device, electronic equipment and storage medium |
CN111460214A (en) * | 2020-04-02 | 2020-07-28 | 北京字节跳动网络技术有限公司 | Classification model training method, audio classification method, device, medium and equipment |
CN111653274A (en) * | 2020-04-17 | 2020-09-11 | 北京声智科技有限公司 | Method, device and storage medium for awakening word recognition |
CN111798836A (en) * | 2020-08-03 | 2020-10-20 | 上海茂声智能科技有限公司 | Method, device, system, equipment and storage medium for automatically switching languages |
CN111833844A (en) * | 2020-07-28 | 2020-10-27 | 苏州思必驰信息科技有限公司 | Training method and system of mixed model for speech recognition and language classification |
CN112017630A (en) * | 2020-08-19 | 2020-12-01 | 北京字节跳动网络技术有限公司 | Language identification method and device, electronic equipment and storage medium |
CN112908296A (en) * | 2021-02-18 | 2021-06-04 | 上海工程技术大学 | Dialect identification method |
CN112951240A (en) * | 2021-05-14 | 2021-06-11 | 北京世纪好未来教育科技有限公司 | Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium |
CN113053367A (en) * | 2021-04-16 | 2021-06-29 | 北京百度网讯科技有限公司 | Speech recognition method, model training method and device for speech recognition |
CN111460214B (en) * | 2020-04-02 | 2024-04-19 | 北京字节跳动网络技术有限公司 | Classification model training method, audio classification method, device, medium and equipment |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120109649A1 (en) * | 2010-11-01 | 2012-05-03 | General Motors Llc | Speech dialect classification for automatic speech recognition |
CN103578465A (en) * | 2013-10-18 | 2014-02-12 | 威盛电子股份有限公司 | Speech recognition method and electronic device |
US20150221305A1 (en) * | 2014-02-05 | 2015-08-06 | Google Inc. | Multiple speech locale-specific hotword classifiers for selection of a speech locale |
CN105632501A (en) * | 2015-12-30 | 2016-06-01 | 中国科学院自动化研究所 | Deep-learning-based automatic accent classification method and apparatus |
US20160284344A1 (en) * | 2013-12-19 | 2016-09-29 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speech data recognition method, apparatus, and server for distinguishing regional accent |
CN106537493A (en) * | 2015-09-29 | 2017-03-22 | 深圳市全圣时代科技有限公司 | Speech recognition system and method, client device and cloud server |
CN106887226A (en) * | 2017-04-07 | 2017-06-23 | 天津中科先进技术研究院有限公司 | Speech recognition algorithm based on artificial intelligence recognition |
US20180053500A1 (en) * | 2016-08-22 | 2018-02-22 | Google Inc. | Multi-accent speech recognition |
CN108281137A (en) * | 2017-01-03 | 2018-07-13 | 中国科学院声学研究所 | Universal speech wake-up recognition method and system under a whole-phoneme framework |
CN108510976A (en) * | 2017-02-24 | 2018-09-07 | 芋头科技(杭州)有限公司 | Multilingual mixed speech recognition method |
CN108682420A (en) * | 2018-05-14 | 2018-10-19 | 平安科技(深圳)有限公司 | Accent recognition method for voice and video calls, and terminal device |
US20180350343A1 (en) * | 2017-05-31 | 2018-12-06 | Lenovo (Singapore) Pte. Ltd. | Provide output associated with a dialect |
CN109979432A (en) * | 2019-04-02 | 2019-07-05 | 科大讯飞股份有限公司 | Dialect translation method and device |
CN110033760A (en) * | 2019-04-15 | 2019-07-19 | 北京百度网讯科技有限公司 | Speech recognition modeling method, device and equipment |
CN110033756A (en) * | 2019-04-15 | 2019-07-19 | 北京达佳互联信息技术有限公司 | Language identification method, device, electronic equipment and storage medium |
- 2019-09-10 CN CN201910852557.0A patent/CN110517664B/en active Active
Non-Patent Citations (2)
Title |
---|
CHUNLEI ZHANG: "Semi-supervised Learning with Generative Adversarial Networks for Arabic Dialect Identification", ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) * |
王慧勇 (WANG, Huiyong): "Research on a Multi-Dialect Accented Mandarin Speech Recognition System Based on Neural Networks", China Masters' Theses Full-text Database * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111105786A (en) * | 2019-12-26 | 2020-05-05 | 苏州思必驰信息科技有限公司 | Multi-sampling-rate voice recognition method, device, system and storage medium |
CN111292727B (en) * | 2020-02-03 | 2023-03-24 | 北京声智科技有限公司 | Voice recognition method and electronic equipment |
CN111292727A (en) * | 2020-02-03 | 2020-06-16 | 北京声智科技有限公司 | Voice recognition method and electronic equipment |
CN111369981A (en) * | 2020-03-02 | 2020-07-03 | 北京远鉴信息技术有限公司 | Dialect region identification method and device, electronic equipment and storage medium |
CN111369981B (en) * | 2020-03-02 | 2024-02-23 | 北京远鉴信息技术有限公司 | Dialect region identification method and device, electronic equipment and storage medium |
CN111460214A (en) * | 2020-04-02 | 2020-07-28 | 北京字节跳动网络技术有限公司 | Classification model training method, audio classification method, device, medium and equipment |
CN111460214B (en) * | 2020-04-02 | 2024-04-19 | 北京字节跳动网络技术有限公司 | Classification model training method, audio classification method, device, medium and equipment |
CN111653274A (en) * | 2020-04-17 | 2020-09-11 | 北京声智科技有限公司 | Wake-up word recognition method, device and storage medium |
CN111653274B (en) * | 2020-04-17 | 2023-08-04 | 北京声智科技有限公司 | Wake-up word recognition method, device and storage medium |
CN111833844A (en) * | 2020-07-28 | 2020-10-27 | 苏州思必驰信息科技有限公司 | Training method and system of mixed model for speech recognition and language classification |
CN111798836B (en) * | 2020-08-03 | 2023-12-05 | 上海茂声智能科技有限公司 | Method, device, system, equipment and storage medium for automatically switching languages |
CN111798836A (en) * | 2020-08-03 | 2020-10-20 | 上海茂声智能科技有限公司 | Method, device, system, equipment and storage medium for automatically switching languages |
CN112017630B (en) * | 2020-08-19 | 2022-04-01 | 北京字节跳动网络技术有限公司 | Language identification method and device, electronic equipment and storage medium |
CN112017630A (en) * | 2020-08-19 | 2020-12-01 | 北京字节跳动网络技术有限公司 | Language identification method and device, electronic equipment and storage medium |
CN112908296A (en) * | 2021-02-18 | 2021-06-04 | 上海工程技术大学 | Dialect identification method |
CN113053367A (en) * | 2021-04-16 | 2021-06-29 | 北京百度网讯科技有限公司 | Speech recognition method, model training method and device for speech recognition |
CN113053367B (en) * | 2021-04-16 | 2023-10-10 | 北京百度网讯科技有限公司 | Speech recognition method, speech recognition model training method and device |
CN112951240A (en) * | 2021-05-14 | 2021-06-11 | 北京世纪好未来教育科技有限公司 | Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110517664B (en) | 2022-08-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110517664A (en) | Multi-dialect speech recognition method, device, equipment and readable storage medium | |
CN108597541B (en) | Speech emotion recognition method and system for enhancing anger and happiness recognition | |
Cai et al. | A novel learnable dictionary encoding layer for end-to-end language identification | |
WO2018227780A1 (en) | Speech recognition method and device, computer device and storage medium | |
CN108711421A (en) | Acoustic model building method and device for speech recognition, and electronic equipment | |
CN103578471B (en) | Speech recognition method and electronic device thereof | |
CN110473523A (en) | Speech recognition method, device, storage medium and terminal | |
CN108831445A (en) | Sichuan dialect recognition method, acoustic model training method, device and equipment | |
CN107492382A (en) | Voiceprint extraction method and device based on neural network | |
CN107818164A (en) | Intelligent question answering method and system | |
CN107704482A (en) | Method, apparatus and program | |
CN107195296A (en) | Speech recognition method, device, terminal and system | |
CN109036391A (en) | Speech recognition method, apparatus and system | |
CN106935239A (en) | Pronunciation dictionary construction method and device | |
CN104575497B (en) | Acoustic model building method and speech decoding method based on the model | |
CN108899013A (en) | Voice search method, device and speech recognition system | |
CN107146615A (en) | Speech recognition method and system based on matching-model secondary recognition | |
WO2022178969A1 (en) | Voice conversation data processing method and apparatus, and computer device and storage medium | |
CN110349597A (en) | Speech detection method and device | |
CN107437417A (en) | Speech data enhancement method and device for recurrent-neural-network-based speech recognition | |
CN111694940A (en) | User report generation method and terminal equipment | |
CN106875936A (en) | Speech recognition method and device | |
CN111833845A (en) | Multi-language speech recognition model training method, device, equipment and storage medium | |
CN109741735A (en) | Modeling method, and acoustic model acquisition method and device | |
CN107679225A (en) | Keyword-based reply generation method | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||