CN110517664A - Multi-dialect speech recognition method, apparatus, device and readable storage medium - Google Patents

Multi-dialect speech recognition method, apparatus, device and readable storage medium Download PDF

Info

Publication number
CN110517664A
CN110517664A (application CN201910852557.0A; granted as CN110517664B)
Authority
CN
China
Prior art keywords
dialect
feature
dialect type
loss function
discrimination layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910852557.0A
Other languages
Chinese (zh)
Other versions
CN110517664B (en)
Inventor
许丽
潘嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201910852557.0A priority Critical patent/CN110517664B/en
Publication of CN110517664A publication Critical patent/CN110517664A/en
Application granted granted Critical
Publication of CN110517664B publication Critical patent/CN110517664B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search

Abstract

The embodiments of the present application disclose a multi-dialect speech recognition method, apparatus, device and readable storage medium. Dialect speech is recognized with a pre-built dialect recognition model, where the dialect recognition model is trained on a training corpus containing multiple dialects. During training, the model is optimized not only with the speech content of the corpus but also with the dialect type to which each utterance belongs, so that the dialect recognition model can accurately recognize multiple dialects. As a result, the user no longer needs to switch between speech recognition modes, which simplifies user operation and improves the accuracy and efficiency of multi-dialect recognition.

Description

Multi-dialect speech recognition method, apparatus, device and readable storage medium
Technical field
This application relates to the technical field of speech recognition, and more particularly to a multi-dialect speech recognition method, apparatus, device and readable storage medium.
Background art
At present, more and more artificial-intelligence applications rely on speech recognition as their entry point, for example translators that allow people speaking different languages in different countries to communicate without barriers, customer-service robots that greatly reduce the need for human labor, voice input methods that free the hands, and smart-home appliances that make controlling household devices more convenient and natural. Since all of these depend on speech recognition, the accuracy of speech recognition is particularly important.
However, existing speech recognition solutions usually only support the recognition of Mandarin; if the user speaks a dialect, the recognition accuracy degrades sharply. Even when dialect recognition is supported, the user must manually select the recognition mode corresponding to the dialect, which requires active cooperation. If the user mixes Mandarin and dialect, it is hard to remember to switch modes actively, and in multi-speaker meeting scenarios with several dialect speakers, frequent switching is clearly inefficient and degrades the user experience.
Therefore, how to improve the accuracy and efficiency of dialect recognition has become an urgent technical problem.
Summary of the invention
In view of this, the present application provides a multi-dialect speech recognition method, apparatus, device and readable storage medium.
To achieve the above goal, the following solutions are proposed:
A multi-dialect speech recognition method, comprising:
receiving speech data;
extracting dialect recognition features from the speech data;
inputting the dialect recognition features into a pre-built dialect recognition model to obtain a recognition result of the speech data, wherein the dialect recognition model is trained using a training corpus annotated at least with speech content and the dialect type to which each utterance belongs.
In the above method, preferably, the dialect recognition model is trained using a training corpus annotated at least with speech content, the dialect type and the dialect attribute category to which each utterance belongs.
In the above method, preferably, the dialect recognition model comprises a feature extractor, a classifier and a discriminator, wherein:
the input of the feature extractor is the dialect recognition features, and its output is characterization features, the characterization features being more discriminative than the dialect recognition features;
the input of the classifier is the characterization features, and its output is the recognition result of the speech data;
the input of the discriminator is the characterization features, and its output is the dialect type to which the speech data belongs, or the dialect type and the dialect attribute category to which the speech data belongs.
In the above method, preferably, the discriminator comprises a gradient reversal layer and a dialect-type discrimination layer; or the discriminator comprises a gradient reversal layer, a dialect-type discrimination layer and an attribute discrimination layer, wherein:
the input of the gradient reversal layer is the characterization features, and its output is the characterization features;
the input of the dialect-type discrimination layer is the characterization features output by the gradient reversal layer, and its output is the dialect type to which the speech data belongs;
the input of the attribute discrimination layer is the characterization features output by the gradient reversal layer, and its output is the dialect attribute category to which the speech data belongs.
In the above method, preferably, when the dialect recognition model is being trained,
the gradient reversal layer negates the gradient of the dialect-type discrimination layer and passes it to the feature extractor, or negates the gradients of the dialect-type discrimination layer and the attribute discrimination layer and passes them to the feature extractor, so as to update the parameters of the feature extractor.
In the above method, preferably, when the dialect recognition model is being trained, the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier and the loss function of the discriminator.
In the above method, preferably, if the discriminator comprises a gradient reversal layer and a dialect-type discrimination layer, then when the dialect recognition model is being trained, the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier and the loss function of the dialect-type discrimination layer;
alternatively,
if the discriminator comprises a gradient reversal layer, a dialect-type discrimination layer and an attribute discrimination layer, then when the dialect recognition model is being trained, the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier, the loss function of the dialect-type discrimination layer and the loss function of the attribute discrimination layer.
In the above method, preferably, if the discriminator comprises a gradient reversal layer, a dialect-type discrimination layer and an attribute discrimination layer, then when the dialect recognition model is being trained, the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier, the loss function of the dialect-type discrimination layer, the loss function of the attribute discrimination layer, and a dialect-type/attribute consistency loss function between the dialect-type discrimination layer and the attribute discrimination layer.
In the above method, preferably, the dialect-type discrimination layer is a neural network containing control gates, and the number of layers of the neural network is greater than 1;
the input of each layer of the neural network is obtained from the output of the control gate and the features output by the previous layer;
the input of the control gate is the classifier output vector corresponding to the features output by the previous layer.
A multi-dialect speech recognition apparatus, comprising:
a receiving module, configured to receive speech data;
an extraction module, configured to extract dialect recognition features from the speech data;
a recognition module, configured to input the dialect recognition features into a pre-built dialect recognition model to obtain a recognition result of the speech data, wherein the dialect recognition model is trained using a training corpus annotated at least with speech content and the dialect type to which each utterance belongs.
In the above apparatus, preferably, the dialect recognition model is trained using a training corpus annotated at least with speech content, the dialect type and the dialect attribute category to which each utterance belongs.
In the above apparatus, preferably, the dialect recognition model comprises a feature extractor, a classifier and a discriminator, wherein:
the feature extractor is configured to receive the dialect recognition features and output characterization features, the characterization features being more discriminative than the dialect recognition features;
the classifier is configured to receive the characterization features and output the recognition result of the speech data;
the discriminator is configured to receive the characterization features and output the dialect type to which the speech data belongs, or the dialect type and the dialect attribute category to which the speech data belongs.
In the above apparatus, preferably, the discriminator comprises a gradient reversal layer and a dialect-type discrimination layer; or the discriminator comprises a gradient reversal layer, a dialect-type discrimination layer and an attribute discrimination layer, wherein:
the gradient reversal layer is configured to receive the characterization features and output the characterization features;
the dialect-type discrimination layer is configured to receive the characterization features output by the gradient reversal layer and output the dialect type to which the speech data belongs;
the attribute discrimination layer is configured to receive the characterization features output by the gradient reversal layer and output the dialect attribute category to which the speech data belongs.
In the above apparatus, preferably, when the dialect recognition model is being trained, the gradient reversal layer is configured to negate the gradient of the dialect-type discrimination layer and pass it to the feature extractor, or to negate the gradients of the dialect-type discrimination layer and the attribute discrimination layer and pass them to the feature extractor, so as to update the parameters of the feature extractor.
In the above apparatus, preferably, during training the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier and the loss function of the discriminator.
In the above apparatus, preferably, if the discriminator comprises a gradient reversal layer and a dialect-type discrimination layer, then during training the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier and the loss function of the dialect-type discrimination layer;
alternatively,
if the discriminator comprises a gradient reversal layer, a dialect-type discrimination layer and an attribute discrimination layer, then during training the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier, the loss function of the dialect-type discrimination layer and the loss function of the attribute discrimination layer.
In the above apparatus, preferably, if the discriminator comprises a gradient reversal layer, a dialect-type discrimination layer and an attribute discrimination layer, then during training the loss function of the dialect recognition model is a weighted combination of the loss function of the classifier, the loss function of the dialect-type discrimination layer, the loss function of the attribute discrimination layer, and a dialect-type/attribute consistency loss function between the dialect-type discrimination layer and the attribute discrimination layer.
In the above apparatus, preferably, the dialect-type discrimination layer is a neural network containing control gates, and the number of layers of the neural network is greater than 1;
the input of each layer of the neural network is obtained from the output of the control gate and the features output by the previous layer;
the input of the control gate is the classifier output vector corresponding to the features output by the previous layer.
A multi-dialect speech recognition device, comprising a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the multi-dialect speech recognition method according to any one of the above.
A readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the multi-dialect speech recognition method according to any one of the above.
It can be seen from the above technical solutions that the multi-dialect speech recognition method, apparatus, device and readable storage medium provided by the embodiments of the present application recognize dialect speech with a pre-built dialect recognition model, where the dialect recognition model is trained on a training corpus containing multiple dialects. During training, the model is optimized not only with the speech content of the corpus but also with the dialect type to which each utterance belongs, so that the dialect recognition model can accurately recognize multiple dialects. As a result, the user no longer needs to switch between speech recognition modes, which simplifies user operation and improves the accuracy and efficiency of multi-dialect recognition.
Brief description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is an implementation flowchart of a multi-dialect speech recognition method disclosed in an embodiment of the present application;
Fig. 2 is a structural schematic diagram of a dialect recognition model disclosed in an embodiment of the present application;
Fig. 3 is a structural schematic diagram of a first discriminator disclosed in an embodiment of the present application;
Fig. 4 is another structural schematic diagram of a dialect recognition model disclosed in an embodiment of the present application;
Fig. 5 is a structural schematic diagram of a second discriminator disclosed in an embodiment of the present application;
Fig. 6 is a structural schematic diagram of a multi-dialect speech recognition apparatus disclosed in an embodiment of the present application;
Fig. 7 is a hardware block diagram of a multi-dialect speech recognition device disclosed in an embodiment of the present application.
Specific embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
The inventors found through research that existing speech recognition solutions use independent dialect recognition models for dialect recognition. For example, recognizing speech in a first dialect requires a first dialect recognition model, and recognizing speech in a second dialect requires a second dialect recognition model, where the first and second dialects are different dialects, the first dialect recognition model is trained on a training corpus of the first dialect, and the second dialect recognition model is trained on a training corpus of the second dialect. Thus, to support speech recognition of N dialects, N dialect recognition models must be trained. This speech recognition scheme has the following disadvantages:
1. Long development time and high cost: in the training stage of a dialect recognition model, a large amount of dialect audio data must be collected for each dialect and transcribed manually. For dialects, both audio collection and manual transcription are relatively difficult, so the cost is high. Therefore, adding recognition capability for a new dialect often requires a long development time and a high development cost.
2. Poor convenience for the user: when speech recognition is needed, the user has to switch the dialect recognition mode according to the dialect the speaker uses, i.e. the user must cooperate actively. If the user mixes Mandarin and dialect, it is hard to remember to switch modes, and in multi-speaker meeting scenarios with several dialect speakers, frequent switching is clearly inefficient and degrades the user experience.
In order to overcome the above deficiencies, or at least partially overcome them, the basic idea of the present application is to train a single dialect recognition model using a training corpus containing multiple dialects, so that speech of multiple dialects can be recognized with one dialect recognition model. On the one hand, compared with training a separate recognition model for each dialect, the amount of training corpus required for each dialect is smaller; on the other hand, in practical applications the user no longer has to switch between multiple dialect modes, which improves convenience of use.
The solution of the present application is described in detail below.
The multi-dialect speech recognition method provided by the present application can be applied to an electronic device, which may include, but is not limited to, any of the following: a smartphone, a computer, a translator, a robot, a smart-home appliance, a remote control, etc.
Please refer to Fig. 1, which is an implementation flowchart of a multi-dialect speech recognition method provided by an embodiment of the present application; the method may include:
Step S11: receiving speech data.
The speech data is the speech data to be recognized. It may be dialect speech data input by a user and received by the electronic device through a pick-up device (such as a microphone or a microphone array), or it may be Mandarin speech data, or speech data in which dialect and Mandarin are mixed.
Step S12: extracting dialect recognition features from the speech data.
The dialect recognition features may be acoustic features, which are usually spectral features of the speech data, for example Mel Frequency Cepstrum Coefficient (MFCC) features or FBank features.
When extracting the dialect recognition features, the speech data may first be divided into several speech frames, and the dialect recognition features of each speech frame are extracted.
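As one possible way to realize the framing and per-frame feature extraction described above, the sketch below uses the librosa library to compute MFCC features; the 16 kHz sampling rate, 25 ms frame length, 10 ms hop and 13 coefficients are illustrative assumptions, not values mandated by the embodiment.

```python
import librosa

def extract_dialect_features(wav_path, n_mfcc=13):
    """Split speech data into frames and extract MFCC features per frame."""
    y, sr = librosa.load(wav_path, sr=16000)          # load audio at 16 kHz
    # 25 ms frames with a 10 ms hop, a common choice in speech recognition
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    return mfcc.T                                     # shape: (num_frames, n_mfcc)
```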
Step S13: inputting the dialect recognition features into a pre-built dialect recognition model to obtain the recognition result of the speech data (i.e. the concrete speech content of the speech data); the dialect recognition model is trained using a training corpus annotated at least with speech content and the dialect type (also referred to simply as the language variety) to which each utterance belongs.
When the speech data is divided into several speech frames, the dialect recognition features of each speech frame are input into the pre-built dialect recognition model to obtain the recognition result of each speech frame, and the recognition results of all the speech frames of the speech data together constitute the recognition result of the speech data.
In China there are many kinds of dialects; only a few are listed here as examples, such as Sichuan dialect, Henan dialect, Fuzhou dialect, Nanchang dialect, Guangzhou dialect and Changsha dialect. In the embodiments of the present application, the training corpus may include training samples of the above dialects, of more dialect types, or of course of all dialect types. Which dialects are included in the training corpus can be determined according to the dialect types that actually need to be supported. For example, to support Sichuan, Guangzhou and Changsha dialects, the training corpus needs to include speech data of several Sichuan-dialect utterances, several Guangzhou-dialect utterances and several Changsha-dialect utterances at the same time.
During training of the dialect recognition model, for each training sample, in addition to recognizing the speech content of the sample, the dialect type to which the sample belongs is also discriminated, and the dialect recognition model is then optimized based on both the recognition result of the speech content and the discrimination result of the dialect type.
The multi-dialect speech recognition method provided by the embodiments of the present application recognizes dialect speech with a pre-built dialect recognition model, where the dialect recognition model is trained on a training corpus containing multiple dialects. During training, the model is optimized not only with the speech content of the corpus but also with the dialect type to which each utterance belongs, so that the dialect recognition model can accurately recognize multiple dialects (including Mandarin). As a result, the user no longer needs to switch between speech recognition modes, which simplifies user operation and improves the accuracy and efficiency of multi-dialect recognition.
In addition, since training the dialect recognition model does not require a large amount of annotated data for each individual dialect (compared with training a recognition model dedicated to a single dialect, the number of samples required per dialect is smaller), the difficulty of audio data collection and manual transcription is reduced and the cost decreases accordingly. Therefore, when recognition capability for a new dialect is desired, it can be added within a short time and at a small additional cost.
The specific implementation of the dialect recognition model provided by the embodiments of the present application is described below.
Please refer to Fig. 2, which is a structural schematic diagram of a dialect recognition model provided by an embodiment of the present application. As shown in Fig. 2, the model may include:
a first feature extractor 21, a first classifier 22 and a first discriminator 23; wherein,
the input of the first feature extractor 21 is the dialect recognition features of each speech frame extracted in step S12, and its output is the characterization features corresponding to each speech frame, the characterization features being more discriminative than the dialect recognition features. That is, the first feature extractor 21 extracts from the dialect recognition features the features that characterize the intrinsic properties of the input speech data (i.e. the speech data received in step S11); these are high-level features for dialect recognition. Specifically, for any speech frame (denoted as the first speech frame for convenience of description), when the first feature extractor 21 receives the dialect recognition features of the first speech frame, it extracts from them the characterization features of the first speech frame, which characterize the intrinsic properties of that frame.
The concrete form of the first feature extractor 21 may be a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), or another deep neural network.
The input of the first classifier 22 is the characterization features output by the first feature extractor 21, and its output is the recognition result of the speech data; i.e., for the first speech frame, the first classifier 22 determines the speech content of the first speech frame. Specifically, the input of the first classifier 22 is the characterization features of the first speech frame, and its output is a state representation of the speech content corresponding to the first speech frame.
The concrete form of the first classifier 22 may be a shallow neural network, for example a two-layer DNN (Deep Neural Network); the present application does not limit the concrete form of the first classifier 22. The concrete form of its output may be any one of a word, a syllable, a phoneme, or a phoneme state (a phoneme state being a unit of smaller granularity than a phoneme). Which form is used depends on the modeling unit of the first classifier 22, as listed below (a combined sketch of the feature extractor and classifier follows this list):
If the first classifier 22 is modeled with words as the modeling unit, its output is a state representation of a word; i.e., for the first speech frame, the first classifier 22 determines which word the dialect recognition features of the first speech frame input to the dialect recognition model characterize.
If the first classifier 22 is modeled with syllables as the modeling unit, its output is a state representation of a syllable; i.e., the first classifier 22 determines which syllable the dialect recognition features of the first speech frame characterize.
If the first classifier 22 is modeled with phonemes as the modeling unit, its output is a state representation of a phoneme; i.e., the first classifier 22 determines which phoneme the dialect recognition features of the first speech frame characterize.
If the first classifier 22 is modeled with phoneme states as the modeling unit, its output is a state representation of a phoneme state; i.e., the first classifier 22 determines which phoneme state the dialect recognition features of the first speech frame characterize.
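The following is a minimal sketch of how the feature extractor and classifier described above might be laid out. It assumes PyTorch, an RNN-based extractor, 40-dimensional per-frame input features and a phoneme modeling unit with 83 phonemes; none of these specifics are fixed by the embodiment.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Maps per-frame dialect recognition features to characterization features."""
    def __init__(self, feat_dim=40, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        feats, _ = self.rnn(x)                 # (batch, frames, hidden_dim)
        return feats

class ContentClassifier(nn.Module):
    """Shallow two-layer DNN predicting the modeling unit (here: phonemes) per frame."""
    def __init__(self, hidden_dim=256, num_phonemes=83):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_phonemes),
        )

    def forward(self, feats):                  # feats: (batch, frames, hidden_dim)
        return self.net(feats)                 # per-frame phoneme logits

# Example forward pass on dummy data
extractor, classifier = FeatureExtractor(), ContentClassifier()
frames = torch.randn(2, 100, 40)               # 2 utterances, 100 frames each
logits = classifier(extractor(frames))          # (2, 100, 83)
```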
The input of first arbiter 23 is the characteristic feature that fisrt feature extractor 21 exports, the output of the first arbiter 23 For dialect type belonging to voice data.Specifically, corresponding first speech frame, the input of the first arbiter 23 is the first speech frame Corresponding characteristic feature, the output of the first arbiter 23 are that the state of dialect type corresponding with the first speech frame indicates, i.e., the One arbiter 23 is for which dialect type of the accent recognition characteristic present of the first speech frame of input accent recognition model to be determined.
It should be noted that the first arbiter 23 is mainly used for the training in accent recognition model in the embodiment of the present application Stage optimizes training to dialect model, thus, in the process for carrying out speech recognition using trained accent recognition model In, the differentiation result of the first arbiter 23 output can be exported to user, can not also be exported to user.Alternatively, can be use Family provides and checks interface, and when user checks that interface operates to this, then the differentiation result that the first arbiter 23 is exported is defeated Out to user.
In the present embodiment, back-propagation algorithm (Backpropagation is used when to the training of dialect identification model Algorithm), which is made of the forward-propagating of signal and two processes of backpropagation of error.Wherein, the forward direction of signal It propagates and refers to the process of that accent recognition model receives the accent recognition feature of sample and exports the speech recognition result of sample, letter Number direction of propagation is differentiated from the 21 to the first classifier of fisrt feature extractor 22, and from fisrt feature extractor 21 to the first Device 23.And the backpropagation of error (being characterized with gradient) refers to that the dialect type for the sample for exporting the first arbiter 23 differentiates knot The error of the true dialect type of fruit and sample return to accent recognition mode input end process, signal transmission direction be from First arbiter 23 arrives fisrt feature extractor 21.
Illustrate optimization of first arbiter 23 for accent recognition model below with reference to the specific structure of the first arbiter 23 Trained specific implementation.
Please refer to Fig. 3, which is a structural schematic diagram of a first discriminator 23 provided by an embodiment of the present application. It may include:
a first gradient reversal layer 31 (denoted R for convenience of description) and a first dialect-type discrimination layer 32; wherein,
the first gradient reversal layer 31 is defined as follows:
R(z) = z    (1)
Formula (1) is the forward-propagation formula of the first gradient reversal layer 31, where z is the input of the first gradient reversal layer 31, i.e. the characterization features f output by the first feature extractor 21, and R(z) is the output of the gradient reversal layer; here R(·) applies no processing. Thus, in forward propagation, the output of the first gradient reversal layer 31 equals its input: the layer does not process the input features at all and passes them directly to the next layer (the first dialect-type discrimination layer 32). Specifically, for the first speech frame, the input of the first gradient reversal layer 31 is the characterization features of the first speech frame, and its output is still those characterization features.
∂R/∂z = -αE    (2)
Formula (2) is the backpropagation formula of the first gradient reversal layer 31, where ∂R/∂z is the gradient of the first gradient reversal layer 31, E is the identity matrix, and α is a preset hyperparameter; the gradient of the first gradient reversal layer 31 is thus the product of a hyperparameter and a negated identity matrix.
According to the chain rule, the output gradient equals the input gradient multiplied by the layer's own gradient (in formula form: if h(x) = f(g(x)), then h'(x) = f'(g(x))·g'(x)). The output gradient of the first gradient reversal layer 31 therefore equals the input gradient (i.e. the gradient of the first dialect-type discrimination layer 32, which characterizes the output error of that layer) multiplied by -αE. Because of the negative sign, this can be regarded as negating the value of the input gradient before passing it to the previous layer (the first feature extractor 21). The first gradient reversal layer 31 leaves the input features untouched in forward propagation and negates the input gradient in backpropagation (i.e. multiplies it by -αE), so that the sign of the result is opposite to that of the input gradient; hence the layer is called a gradient reversal layer.
According to gradient descent, when the update direction of the model parameters is the gradient direction (i.e. the gradient is not negated), the model converges to the optimum at the fastest speed. In the embodiments of the present application, the first gradient reversal layer 31 passes the reversed gradient of the first dialect-type discrimination layer 32 to the first feature extractor 21, so that the update direction of the first feature extractor 21 is opposite to that of the first dialect-type discrimination layer 32. That is, the training objective of the first dialect-type discrimination layer 32 is to identify the dialect type of the sample as accurately as possible, while the training objective of the first feature extractor 21 is to make the dialect type of the sample as hard to identify as possible; adversarial training is thereby introduced through the first gradient reversal layer 31.
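As a concrete illustration of the gradient reversal layer defined by formulas (1) and (2), the sketch below is a minimal PyTorch implementation; the helper name grad_reverse and the default alpha value are assumptions made here for illustration, with alpha corresponding to α above.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient by -alpha in the backward pass."""
    @staticmethod
    def forward(ctx, z, alpha):
        ctx.alpha = alpha
        return z.clone()                 # formula (1): R(z) = z

    @staticmethod
    def backward(ctx, grad_output):
        # formula (2): pass -alpha * gradient back to the preceding layer (the feature extractor)
        return -ctx.alpha * grad_output, None

def grad_reverse(z, alpha=1.0):
    return GradientReversal.apply(z, alpha)
```

During training, the discrimination layer reads grad_reverse(feats) instead of feats, which leaves the forward computation unchanged while flipping the sign of the gradient that reaches the feature extractor.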
The first dialect-type discrimination layer 32 may be a shallow neural network, for example a two-layer DNN; the present application does not limit its specific network form. Its input is the characterization features output by the first gradient reversal layer 31, and its output is the dialect type to which the speech data belongs. Specifically, for the first speech frame, the input of the first dialect-type discrimination layer 32 is the characterization features of the first speech frame output by the first gradient reversal layer 31, and its output is a state representation of the dialect type to which the first speech frame belongs.
As described above, the present application introduces adversarial training through the first gradient reversal layer 31 for two purposes. On the one hand, the first dialect-type discrimination layer 32 is trained to judge more accurately which dialect the features input to the dialect recognition model belong to; on the other hand, by passing the gradient of the first dialect-type discrimination layer 32 backward through the first gradient reversal layer 31, the first feature extractor 21 is trained to extract features with less dialect-type discrimination, i.e. features for which the conditional probability distribution of the speech content over the different dialect types is consistent. A consistent conditional probability distribution means that the same speech content is pronounced similarly or identically in different dialect types; for example, the phoneme a of Sichuan dialect, the phoneme a of Northeastern dialect and the phoneme a of Henan dialect have consistent feature distributions, i.e. they are pronounced similarly or identically.
In order that the feature distribution learned by the first dialect-type discrimination layer 32 is related to the conditional probability distribution of the dialect type given the speech content, a control gate is introduced into the first dialect-type discrimination layer 32 in the embodiments of the present application; the control gate drives the first dialect-type discrimination layer 32 to learn the conditional distribution of dialect types for different speech contents. In the embodiments of the present application, the input of the control gate is the output of the first classifier 22. For ease of description, the following takes the case where the first classifier 22 is modeled with phonemes as the modeling unit.
For any layer of the first dialect-type discrimination layer 32 (denoted as the k-th layer for convenience of description), the input of the k-th layer is obtained from the output of the control gate and the features output by the (k-1)-th layer; the input of the control gate is the output vector of the first classifier 22 corresponding to the features output by the (k-1)-th layer. This can be expressed as (a sketch of such a gated layer follows):
h̃_i = g(c_i) ⊙ h_i    (3)
g(c_i) = σ(V c_i + b)    (4)
where h̃_i is the input of the k-th layer, h_i is the output of the (k-1)-th layer of the first dialect-type discrimination layer 32 corresponding to the i-th speech frame, and c_i is the one-hot vector output by the first classifier 22 corresponding to h_i, i.e. the phoneme vector corresponding to the i-th speech frame. For example, assuming the first classifier 22 uses 83 phonemes as its modeling units, c_i is an 83-dimensional vector in which each dimension corresponds to one phoneme; if the phoneme corresponding to the i-th speech frame is a, the dimension corresponding to a is 1 and all other dimensions are 0. g(c_i) is the control gate, where σ is an activation function, V is a weight matrix and b is a bias. In other words, the phoneme vector c_i is transformed into the control gate, and the control gate fuses the features of the (k-1)-th layer with their corresponding phoneme before feeding them into the k-th layer, so that the first dialect-type discrimination layer 32 learns information related to the conditional probability distribution of the dialect type given the phoneme. Note that when k = 1, the (k-1)-th layer refers to the first gradient reversal layer 31.
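The following is a minimal sketch of one control-gated layer of the dialect-type discrimination layer, implementing formulas (3) and (4). It assumes PyTorch, element-wise gating, 256-dimensional features and an 83-phoneme one-hot classifier output; these specifics are illustrative assumptions rather than details given by the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDiscriminationLayer(nn.Module):
    """One layer of the dialect-type discrimination layer, modulated by a phoneme control gate."""
    def __init__(self, feat_dim=256, num_phonemes=83):
        super().__init__()
        self.gate = nn.Linear(num_phonemes, feat_dim)   # V and b of formula (4)
        self.linear = nn.Linear(feat_dim, feat_dim)

    def forward(self, h, phoneme_onehot):
        g = torch.sigmoid(self.gate(phoneme_onehot))     # formula (4): g(c_i) = sigma(V c_i + b)
        gated = g * h                                    # formula (3): fuse the gate with the previous layer's features
        return F.relu(self.linear(gated))

# Usage: the phoneme one-hot vector can be built from the classifier's per-frame prediction
layer = GatedDiscriminationLayer()
h = torch.randn(2, 100, 256)                             # previous-layer features per frame
pred = torch.randint(0, 83, (2, 100))                    # assumed classifier phoneme predictions
c = F.one_hot(pred, num_classes=83).float()
out = layer(h, c)                                        # (2, 100, 256)
```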
The output layer of the first dialect-type discrimination layer 32 has M nodes, M = N*C, where N is the number of dialect types and C is the total number of modeling units of the first classifier 22, such as the total number of phonemes. The M nodes are divided into C groups, each group corresponding to one phoneme and used to characterize how that phoneme is discriminated among the dialect types, typically the probability that the phoneme belongs to each dialect type. Each time the node parameters of the output layer of the first dialect-type discrimination layer 32 are updated, only the parameters of the group of nodes corresponding to the recognition result of the dialect recognition features input to the dialect recognition model are updated.
For example, assume the model is trained on 20 dialects in total and the pronunciation modeling units are 83 phonemes; then M = 20*83 = 1660 nodes, with 20 nodes per group, i.e. each group of 20 nodes corresponds to one phoneme and characterizes the probability that the phoneme belongs to each of the 20 dialects. Each time the node parameters of the output layer of the first dialect-type discrimination layer 32 are updated, the phoneme prediction result output by the first classifier 22 for the dialect recognition features of the i-th speech frame input to the dialect recognition model is determined, and only the parameters of the group of 20 nodes corresponding to that phoneme prediction are updated.
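A minimal sketch of such a grouped output layer, assuming PyTorch and the 20-dialect / 83-phoneme example above. Restricting the update to one group of nodes is expressed here by computing the loss only over the slice selected by the predicted phoneme, which is one possible reading of the selective update, not the only one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedDialectOutput(nn.Module):
    """Output layer with M = N*C nodes: C phoneme groups of N dialect-type scores each."""
    def __init__(self, feat_dim=256, num_dialects=20, num_phonemes=83):
        super().__init__()
        self.num_dialects = num_dialects
        self.out = nn.Linear(feat_dim, num_dialects * num_phonemes)   # M = 20*83 = 1660 nodes

    def forward(self, h, predicted_phoneme):
        # h: (frames, feat_dim); predicted_phoneme: (frames,) indices from the classifier
        logits = self.out(h).view(h.size(0), -1, self.num_dialects)   # (frames, C, N)
        # select the group of N nodes belonging to each frame's predicted phoneme
        idx = predicted_phoneme.view(-1, 1, 1).expand(-1, 1, self.num_dialects)
        return logits.gather(1, idx).squeeze(1)                        # (frames, N)

# Only the selected groups contribute to the loss, so only their parameters receive gradient
layer = GroupedDialectOutput()
h = torch.randn(100, 256)
pred_phoneme = torch.randint(0, 83, (100,))
dialect_logits = layer(h, pred_phoneme)
loss = F.cross_entropy(dialect_logits, torch.randint(0, 20, (100,)))
```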
During model training, the loss function is an indispensable part of the model. In the embodiments of the present application, the first classifier 22 and the first discriminator 23 each have a loss function, and the loss function of the dialect recognition model is a weighted combination of the loss function of the first classifier 22 and the loss function of the first discriminator 23.
The loss function of the first classifier 22 characterizes the difference between the speech content of the sample predicted by the first classifier 22 and the true speech content of the sample. The loss function of the first discriminator 23 characterizes the difference between the dialect type of the sample predicted by the first discriminator 23 and the true dialect type of the sample.
The loss function of the first classifier 22 and the loss function of the first discriminator 23 may be the same or different. When the two loss functions are weighted, the weight of the loss function of the first classifier 22 and the weight of the loss function of the first discriminator 23 may also be the same or different.
Optionally, the loss function of the first classifier 22 and the loss function of the first discriminator 23 may both be cross-entropy functions. Cross entropy is an important concept in information theory, mainly used to measure the difference between two probability distributions; when the two distributions are identical, the cross entropy is minimal. Taking the first classifier 22 as an example, its cross-entropy loss (denoted L1) is explained as follows:
L1 = Σ_{i=1}^{I} L_y( Y(F(x_i)), ŷ_i )    (5)
where I is the total number of dialect recognition feature frames input to the dialect recognition model at one time (i.e. the dialect recognition model can process the dialect recognition features of I speech frames at a time), i denotes the i-th speech frame, F denotes the first feature extractor 21, F(x_i) is the output of the first feature extractor 21 for the dialect recognition features of the i-th speech frame x_i, Y denotes the first classifier 22, Y(F(x_i)) is the output of the first classifier 22 for the dialect recognition features of the i-th speech frame x_i, ŷ_i denotes the true speech content corresponding to the dialect recognition features of the i-th speech frame x_i, and L_y is the cross entropy, here the cross entropy between Y(F(x_i)) and ŷ_i.
By minimizing this loss function, i.e. minimizing the cross entropy between the output of the first classifier 22 and the ground truth, the output of the trained model can be made closer to the ground truth; that is, the recognition result of the model is closer to the ground truth and the recognition rate is higher.
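A minimal sketch of computing this classifier loss over a batch of frames, assuming PyTorch and integer phoneme indices as the "true speech content"; F.cross_entropy applies the per-frame cross entropy of formula (5) and here sums it over the I frames.

```python
import torch
import torch.nn.functional as F

def classifier_loss(phoneme_logits, true_content):
    """L1 of formula (5): cross entropy between classifier outputs Y(F(x_i)) and the true content per frame.

    phoneme_logits: (I, num_phonemes) classifier outputs for I frames
    true_content:   (I,) integer index of the true modeling unit per frame
    """
    return F.cross_entropy(phoneme_logits, true_content, reduction="sum")

# toy example with I = 100 frames and 83 phonemes
logits = torch.randn(100, 83)
labels = torch.randint(0, 83, (100,))
L1 = classifier_loss(logits, labels)
```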
In the embodiments of the present application, the dialect type of the sample is predicted by the first dialect-type discrimination layer 32 within the first discriminator 23; therefore, the loss function corresponding to the first discriminator 23 is the loss function of the first dialect-type discrimination layer 32.
Assume the loss function of the dialect recognition model is denoted L, the loss function of the first classifier 22 is denoted L1, and the loss function of the first dialect-type discrimination layer 32 is denoted L2; then:
L = a × L1 + b × L2.
Optionally, L = L1 + L2, i.e. a = b = 1.
During model training, the model parameters are updated by minimizing L, L1 and L2. Minimizing L gives the dialect recognition model the ability to recognize multiple dialects; minimizing L1 gives the first classifier 22 stronger acoustic discrimination ability; minimizing L2 gives the first discriminator 23 stronger dialect discrimination ability. At the same time, because of the gradient reversal layer in the first discriminator 23, the features generated by the first feature extractor 21 become dialect-confusable: the distributions of the characterization features generated by the first feature extractor 21 for dialect features of different dialect types are consistent, so that the first discriminator 23 cannot tell which dialect the input features belong to. During this adversarial training, the ability of the first discriminator 23 keeps growing, which pushes the dialect confusability of the features generated by the first feature extractor 21 to keep improving so that the first discriminator 23 cannot discriminate; as the dialect confusability of the features improves, the first discriminator 23 in turn further improves its discrimination ability in order to discriminate accurately. Eventually an equilibrium is reached: when the features extracted by the first feature extractor 21 are good enough, the first discriminator 23 cannot discriminate, and at that point the feature distributions extracted by the first feature extractor 21 are almost identical, so that it is no longer necessary to distinguish between dialects of different language varieties during speech recognition; speech recognition can be performed directly, achieving the effect of multi-dialect recognition.
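Putting the pieces together, the following sketch shows one possible training step combining L = L1 + L2 with the gradient reversal layer. It reuses the hypothetical FeatureExtractor, ContentClassifier and grad_reverse helpers sketched above, and the discriminator shape, optimizer and batch layout are assumptions rather than details given by the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# assumed to be defined as in the earlier sketches
extractor = FeatureExtractor()
classifier = ContentClassifier()
discriminator = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 20))  # 20 dialect types
optimizer = torch.optim.Adam(
    list(extractor.parameters()) + list(classifier.parameters()) + list(discriminator.parameters()))

def train_step(frames, content_labels, dialect_labels, alpha=1.0):
    """One adversarial update minimizing L = L1 + L2; the GRL flips the dialect gradient for the extractor."""
    feats = extractor(frames)                                   # (batch, T, 256)
    content_logits = classifier(feats)                          # (batch, T, 83)
    dialect_logits = discriminator(grad_reverse(feats, alpha))  # (batch, T, 20)

    L1 = F.cross_entropy(content_logits.flatten(0, 1), content_labels.flatten())
    L2 = F.cross_entropy(dialect_logits.flatten(0, 1), dialect_labels.flatten())
    loss = L1 + L2                                              # a = b = 1

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```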
In the foregoing embodiments, the dialect recognition model is trained using a training corpus annotated with speech content and dialect type. The inventors found, in the course of implementing the present application, that if dialect attribute information is further introduced during training, the recognition effect of the dialect recognition model can be further improved. The dialect attribute information may specifically be the dialect attribute category to which the speech data belongs, such as the dialect group. Taking Chinese as an example, Chinese dialects can be divided into seven major groups: Mandarin dialects, Xiang (Hunan) dialects, Gan (Jiangxi) dialects, Wu dialects, Min (Fujian) dialects, Yue (Guangdong) dialects and Hakka dialects. Mandarin dialects can be further subdivided into Northern Mandarin (a collective term for Beijing Mandarin, Northeastern Mandarin, Jiao-Liao Mandarin, Ji-Lu Mandarin, Central Plains Mandarin and Lan-Yin Mandarin), Southwestern Mandarin and Jianghuai Mandarin.
Based on this, the dialect recognition model provided by the embodiments of the present application may be trained using a training corpus annotated at least with speech content, dialect type and dialect attribute category.
That is, during training of the dialect recognition model, in addition to recognizing the speech content of each training sample, the dialect type and the dialect attribute category to which the training sample belongs are also discriminated, and the dialect recognition model is optimized based on the recognition result of the speech content, the discrimination result of the dialect type and the discrimination result of the dialect attribute category, so as to further improve the accuracy of the recognition result of the dialect recognition model.
Based on this, another structural schematic diagram of a dialect recognition model provided by the embodiments of the present application is shown in Fig. 4, which may include:
a second feature extractor 41, a second classifier 42 and a second discriminator 43; wherein,
the input of the second feature extractor 41 is the dialect recognition features of each speech frame extracted in step S12, and its output is the characterization features corresponding to each speech frame, the characterization features being more discriminative than the dialect recognition features. That is, the second feature extractor 41 extracts from the dialect recognition features the features that characterize the intrinsic properties of the input speech data (i.e. the speech data received in step S11); these are high-level features for dialect recognition. Specifically, for the first speech frame, when the second feature extractor 41 receives the dialect recognition features of the first speech frame, it extracts from them the characterization features of the first speech frame, which characterize the intrinsic properties of that frame.
The concrete form of the second feature extractor 41 may be a CNN, an RNN, or another deep neural network.
The input of the second classifier 42 is the characterization features output by the second feature extractor 41, and its output is the recognition result of the speech data; i.e., for the first speech frame, the second classifier 42 determines the speech content of the first speech frame. Specifically, the input of the second classifier 42 is the characterization features of the first speech frame, and its output is a state representation of the speech content corresponding to the first speech frame.
The concrete form of the second classifier 42 may be a shallow neural network, for example a two-layer DNN; the present application does not limit it. The concrete form of its output may be any one of a word, a syllable, a phoneme, or a phoneme state, which depends on the modeling unit of the second classifier 42; the specific implementation of the modeling unit of the second classifier 42 may refer to that of the first classifier 22 described above and is not repeated here.
The input of the second discriminator 43 is the characterization features output by the second feature extractor 41, and its output is the dialect type and the dialect attribute category to which the speech data belongs. Specifically, for the first speech frame, the input of the second discriminator 43 is the characterization features of the first speech frame, and its output is a state representation of the dialect type corresponding to the first speech frame and a state representation of the dialect attribute category to which the first speech frame belongs; i.e., the second discriminator 43 determines which dialect type and which dialect attribute category the dialect recognition features of the first speech frame input to the dialect recognition model characterize.
Similarly to the first discriminator 23, in the embodiments of the present application the second discriminator 43 is mainly used in the training stage of the dialect recognition model to optimize the model. Therefore, when speech recognition is performed with the trained dialect recognition model, the discrimination result output by the second discriminator 43 may or may not be presented to the user; alternatively, a viewing interface may be provided so that the discrimination result of the second discriminator 43 is output to the user only when the user operates that interface.
In this embodiment, the dialect recognition model is trained with the backpropagation algorithm, which consists of two processes: forward propagation of the signal and backward propagation of the error. Forward propagation refers to the process in which the dialect recognition model receives the dialect recognition features of a sample and outputs the speech recognition result of the sample; the signal flows from the second feature extractor 41 to the second classifier 42, and from the second feature extractor 41 to the second discriminator 43. Backward propagation of the error refers to the process in which the errors between the dialect-type and dialect-attribute-category discrimination results output by the second discriminator 43 and the true dialect type and dialect attribute category of the sample are propagated back to the input of the dialect recognition model; the signal flows from the second discriminator 43 to the second feature extractor 41.
Illustrate optimization of second arbiter 43 for accent recognition model below with reference to the specific structure of the second arbiter 43 Trained specific implementation.
Referring to Fig. 5, Fig. 5 is a kind of structural schematic diagram of the second arbiter 43 provided by the embodiments of the present application, can wrap It includes:
Second gradient inversion layer 51 (for convenient for narration, the second gradient inversion layer 51 is indicated with R), the second languages diagnostic horizon 52 and attribute diagnostic horizon 53;Wherein,
The definition of second gradient inversion layer 51 is identical with the definition of first gradient inversion layer 31, it may be assumed that
Z=R (z) (1)
Formula (1) is the calculation formula of 51 forward-propagating of the second gradient inversion layer, wherein z is the second gradient inversion layer 51 Input, i.e., the characteristic feature f, R (z) that second feature extractor 41 exports are the output of the second gradient inversion layer 51, here R () indicates to pass through R layers, not deal with, it is seen then that in the forward propagation process, the output of the second gradient inversion layer 51 is the second ladder The input of inversion layer 51 is spent, i.e. the second gradient inversion layer 51 is without any processing to input feature vector and is directly transmitted to next layer (i.e. Second languages diagnostic horizon 52 and attribute diagnostic horizon 53).Specifically, corresponding first speech frame, the input of the second gradient inversion layer 51 For the characteristic feature of the first speech frame, the output of the second gradient inversion layer 51 remains as the characteristic feature of the first speech frame.
Formula (2) is the calculation formula of 51 backpropagation of the second gradient inversion layer, whereinIt is the second gradient inversion layer 51 gradient, E are unit matrixs, and α is preset hyper parameter, it is seen that the gradient of the second gradient inversion layer 51 be hyper parameter with The product of one negative unit matrix.
According to the chain rule, the gradient of the output equals the gradient of the input multiplied by the layer's own gradient. The output gradient of the second gradient reversal layer 51 therefore equals its input gradient (the sum of the gradients of the second language discrimination layer 52 and the attribute discrimination layer 53) multiplied by -αE. Because of the negative sign, this can be regarded as negating the input gradient before passing it to the previous layer (the second feature extractor 41). The second gradient reversal layer 51 leaves the input feature untouched in forward propagation and negates the input gradient in backward propagation (i.e. multiplies it by -αE), so that the sign of the result is opposite to that of the input gradient; this is why the layer is called a gradient reversal layer.
According to gradient descent, when the update direction of the model parameters is the gradient direction (i.e. the gradient is not negated), the model reaches an optimal solution fastest. In the embodiment of the present application, the second gradient reversal layer 51 negates the gradients of the second language discrimination layer 52 and the attribute discrimination layer 53 before passing them to the second feature extractor 41, so that the parameter update direction of the second feature extractor 41 is opposite to the gradient direction of the second language discrimination layer 52 and the attribute discrimination layer 53. In other words, the training objective of the second language discrimination layer 52 is to identify the dialect type of a sample as accurately as possible, the training objective of the attribute discrimination layer 53 is to identify the dialect attribute category of a sample as accurately as possible, while the training objective of the second feature extractor 41 is to make the dialect type and dialect attribute category of a sample as hard to identify as possible. Adversarial training is thus introduced through the second gradient reversal layer 51.
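Purely as an illustrative sketch (not part of the patent), a gradient reversal layer of this kind could be written, for example in PyTorch, as follows; the names GradReverse, grad_reverse and the hyper-parameter alpha are assumptions made for the example:

    import torch


    class GradReverse(torch.autograd.Function):
        """Identity in the forward pass; negated, scaled gradient in the backward pass."""

        @staticmethod
        def forward(ctx, x, alpha):
            ctx.alpha = alpha
            return x.view_as(x)  # formula (1): the output equals the input

        @staticmethod
        def backward(ctx, grad_output):
            # formula (2): pass -alpha * (incoming gradient) back to the preceding layer
            return -ctx.alpha * grad_output, None


    def grad_reverse(x, alpha=1.0):
        return GradReverse.apply(x, alpha)

Placing such a layer between the feature extractor and the discrimination layers is what makes the single backward pass adversarial: the discrimination layers receive ordinary gradients while the feature extractor receives the negated ones.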
The second language discrimination layer 52 may be a shallow neural network, for example a two-layer DNN; the present application does not limit its specific network form. Its input is the characteristic feature output by the second gradient reversal layer 51, and its output is the dialect type to which the speech data belongs. Specifically, for the first speech frame, the input of the second language discrimination layer 52 is the characteristic feature of the first speech frame output by the second gradient reversal layer 51, and its output is a state representation of the dialect type to which the first speech frame belongs.
The attribute discrimination layer 53 may likewise be a shallow neural network, for example a two-layer DNN; the present application does not limit its specific network form. Its input is the characteristic feature output by the second gradient reversal layer 51, and its output is the dialect attribute category to which the speech data belongs. Specifically, for the first speech frame, the input of the attribute discrimination layer 53 is the characteristic feature of the first speech frame output by the second gradient reversal layer 51, and its output is a state representation of the dialect attribute category to which the first speech frame belongs.
As described above, the present application introduces adversarial training through the second gradient reversal layer 51 for two purposes. On the one hand, the second language discrimination layer 52 and the attribute discrimination layer 53 are trained to judge more accurately which dialect type and which dialect attribute category the features input into the dialect recognition model belong to. On the other hand, the gradients of the second language discrimination layer 52 and the attribute discrimination layer 53 are passed forward in reversed form through the second gradient reversal layer 51, training the second feature extractor 41 to generate features with less dialect-type and attribute-category discriminability; that is, the conditional probability distributions of the speech content characterized by the features extracted by the second feature extractor 41 are consistent across different dialect types, and the conditional probability distributions over the dialect attribute categories of the speech content are also consistent. Consistency of the conditional probability distribution over dialect attribute categories means that different dialects belong to the same attribute category; for example, Henan dialect and Northeastern dialect both belong to Northern Mandarin.
To make the feature distribution learned by the second language discrimination layer 52 relate to the conditional probability distribution of the dialect type of the speech content, a control gate is introduced into the second language discrimination layer 52 in the embodiment of the present application; through the control gate, the second language discrimination layer 52 learns the conditional probability distribution of the dialect types of different speech contents. In the embodiment of the present application, the input of the control gate is the output of the second classifier 42, and the specific implementation of the control gate may refer to that of the control gate in the first language discrimination layer 32, which is not repeated here.
To make the feature distribution learned by the attribute discrimination layer 53 relate to the conditional probability distribution over the dialect attribute categories of the speech content, a control gate is also introduced into the attribute discrimination layer 53 in the embodiment of the present application; through the control gate, the attribute discrimination layer 53 learns the conditional probability distribution over the dialect attribute categories of different speech contents. In the embodiment of the present application, the input of the control gate is the output of the second classifier 42, and the structure of the control gate in the attribute discrimination layer 53 is the same as that in the second language discrimination layer 52, as shown in formulas (3)-(4):
h̃_i = g(c_i) ⊙ h_i    (3)

g(c_i) = σ(V·c_i + b)    (4)
For ease of description, the following still takes the case in which the second classifier 42 uses phonemes as its modeling units as an example.
In the attribute discrimination layer 53, formulas (3)-(4) mean the following: for any layer of the attribute discrimination layer 53 (denoted the k-th layer for ease of description), the input of the k-th layer, h̃_i, is obtained from the output of the control gate and the feature output by the (k-1)-th layer; the input of the control gate is the vector output by the second classifier 42 corresponding to the feature output by the (k-1)-th layer.
Specifically, in the attribute discrimination layer 53, h_i is the feature output by the (k-1)-th layer of the attribute discrimination layer 53 corresponding to the i-th speech frame, and c_i is the one-hot vector output by the second classifier 42 corresponding to h_i, i.e. the phoneme vector corresponding to the i-th speech frame. For example, if the second classifier 42 uses 83 phonemes as its modeling units, then c_i is an 83-dimensional vector in which each dimension corresponds to one phoneme; if the phoneme corresponding to the i-th speech frame is a, then in the 83-dimensional vector the dimension corresponding to a is 1 and all other dimensions are 0. g(c_i) is the control gate, where σ is the activation function, V is the matrix weight and b is the bias weight. That is, the phoneme vector c_i yields the control gate after a matrix transformation, and the control gate fuses the feature of the (k-1)-th layer with its corresponding phoneme as the input of the k-th layer, so that the attribute discrimination layer 53 learns information related to the conditional probability distribution over the dialect attribute categories of the phoneme. Note that when k is 1, the (k-1)-th layer refers to the second gradient reversal layer 51.
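As a minimal sketch of this gated fusion (the names ControlGate, num_phonemes and hidden_dim, and the use of an element-wise product for the fusion, are assumptions made for illustration), formulas (3)-(4) could look as follows:

    import torch
    import torch.nn as nn


    class ControlGate(nn.Module):
        """Fuse a hidden feature h_i with the one-hot phoneme vector c_i via g(c_i) = sigma(V c_i + b)."""

        def __init__(self, num_phonemes: int, hidden_dim: int):
            super().__init__()
            self.proj = nn.Linear(num_phonemes, hidden_dim)  # matrix weight V and bias weight b

        def forward(self, h_i: torch.Tensor, c_i: torch.Tensor) -> torch.Tensor:
            gate = torch.sigmoid(self.proj(c_i))  # formula (4)
            return gate * h_i                     # formula (3): gated input fed to the k-th layer


    gate = ControlGate(num_phonemes=83, hidden_dim=256)
    c_i = torch.nn.functional.one_hot(torch.tensor([42]), num_classes=83).float()
    fused = gate(torch.randn(1, 256), c_i)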
The output layer of the attribute discrimination layer 53 has Q nodes, where Q = P*C, P is the number of dialect attribute categories and C is the total number of modeling units of the second classifier 42, e.g. the total number of phonemes. The Q nodes are divided into C groups; each group of nodes corresponds to one phoneme and characterizes the dialect-attribute discrimination of that phoneme, typically the probability that the phoneme belongs to each dialect attribute category. Each time the node parameters of the output layer of the attribute discrimination layer 53 are updated, only the parameters of the group of nodes corresponding to the recognition result of the dialect recognition feature input into the dialect recognition model are updated.
For example, suppose the model is trained on 7 dialect attribute categories in total (corresponding to 7 major dialect groups) and the pronunciation modeling units are 83 phonemes; then Q = 7*83 = 581 nodes, with 7 nodes per group, i.e. every 7 nodes correspond to one phoneme and characterize the probability that the phoneme belongs to each of the 7 dialect attribute categories. Each time the node parameters of the output layer of the attribute discrimination layer 53 are updated, the phoneme prediction output by the second classifier 42 for the dialect recognition feature of the i-th speech frame input into the dialect recognition model is determined, and only the parameters of the group of 7 nodes corresponding to that phoneme prediction are updated.
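A rough sketch of this grouped output layer (the shapes and names below are assumptions made only for illustration): the Q = P*C logits can be viewed as C groups of P nodes, and the group is selected by the phoneme predicted for the frame:

    import torch

    P, C = 7, 83                    # dialect attribute categories, phoneme modeling units
    logits = torch.randn(C * P)     # the Q = 581 output nodes of the attribute discrimination layer
    groups = logits.view(C, P)      # one group of P nodes per phoneme

    phoneme_idx = 42                # phoneme predicted by the second classifier for this frame
    attr_probs = torch.softmax(groups[phoneme_idx], dim=-1)  # probability of each dialect attribute category
    # During training, only the parameters feeding this group of 7 nodes would be updated for this frame.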
The following describes how the loss function of the dialect recognition model is set when the attribute discrimination layer 53 is introduced.
When the attribute discrimination layer 53 is introduced, one implementation of setting the loss function for the dialect recognition model in the embodiment of the present application may be: a loss function is set for each of the second classifier 42, the second language discrimination layer 52 and the attribute discrimination layer 53, and the loss function of the dialect recognition model is a weighted combination of the loss function of the second classifier 42, the loss function of the second language discrimination layer 52 and the loss function of the attribute discrimination layer 53.
The loss function of the second classifier 42 characterizes the difference between the speech content of the sample predicted by the second classifier 42 and the true speech content of the sample. The loss function of the second language discrimination layer 52 characterizes the difference between the dialect type of the sample predicted by the second language discrimination layer 52 and the true dialect type of the sample. The loss function of the attribute discrimination layer 53 characterizes the difference between the dialect attribute category of the sample predicted by the attribute discrimination layer 53 and the true dialect attribute category of the sample.
The loss function of the second classifier 42, the loss function of the second language discrimination layer 52 and the loss function of the attribute discrimination layer 53 may be the same or different. Likewise, when these three loss functions are weighted, the weight of the loss function of the second classifier 42, the weight of the loss function of the second language discrimination layer 52 and the weight of the loss function of the attribute discrimination layer 53 may be the same or different.
Optionally, the loss function of the second classifier 42, the loss function of the second language discrimination layer 52 and the loss function of the attribute discrimination layer 53 may each be a cross-entropy function.
Suppose the loss function of the dialect recognition model is denoted L, the loss function of the second classifier 42 is denoted L1, the loss function of the second language discrimination layer 52 is denoted L2, and the loss function of the attribute discrimination layer 53 is denoted L3; then:
L = a × L1 + b × L2 + c × L3.
Optionally, L = L1 + L2 + L3, i.e. a = b = c = 1.
During model training, the model parameters are updated by minimizing L, L1 and L2+L3. Minimizing L gives the dialect recognition model the ability to recognize multiple dialects; minimizing L1 gives the second classifier 42 stronger acoustic discrimination ability; and minimizing L2+L3 gives the second discriminator 43 stronger dialect discrimination ability. At the same time, because of the gradient reversal layer in the second discriminator 43, the features generated by the second feature extractor 41 acquire dialect confusability: the characteristic features generated by the second feature extractor 41 for the dialect features of different dialect types have consistent distributions, so that the second discriminator 43 cannot tell which dialect an input feature belongs to. During this adversarial training, the second discriminator 43 becomes stronger and stronger, which pushes the dialect confusability of the features generated by the second feature extractor 41 to improve until the second discriminator 43 cannot discriminate them; as the confusability of the features improves, the second discriminator 43 in turn further improves its discrimination ability in order to discriminate accurately. An equilibrium is eventually reached: when the features extracted by the second feature extractor 41 are good enough, the second discriminator 43 cannot discriminate them, and at that point the distributions of the features extracted by the second feature extractor 41 are nearly identical. Speech recognition therefore no longer needs to distinguish different dialects and can be performed directly, achieving multi-dialect recognition.
Furthermore, the dialect attribute category has a certain correlation with the dialect type, in the sense that dialect attribute categories and dialect types have a one-to-one or one-to-many relationship. For example, if the dialect group to which a dialect belongs is taken as the dialect attribute category, a dialect group and dialect types have a one-to-many relationship: Sichuan dialect belongs to Southwestern Mandarin, while Henan dialect and Northeastern dialect belong to Northern Mandarin. If a sample is judged to be of dialect type Sichuan dialect, its attribute-category judgment should therefore be Southwestern Mandarin; if it is not, the dialect-type judgment and the dialect-attribute judgment are inconsistent and need to be optimized. To optimize this error, when setting the loss function for the dialect recognition model the present application introduces a language-attribute consistency loss function, through which the consistency learning of the feature distributions is further reinforced. The language-attribute consistency loss L4 is defined as follows:
L4 = Σ_{i=1}^{I} D_KL(q_out_i ‖ q'_out_i)

where I denotes the total number of dialect features of the speech frames input into the dialect recognition model at one time, D_KL is the KL divergence (Kullback-Leibler divergence), q_out_i is the output of the attribute discrimination layer 53 for the feature of the i-th speech frame, and q'_out_i is the feature output obtained by converting the output of the second language discrimination layer 52 for the feature of the i-th speech frame. The second language discrimination layer 52 outputs a state representation characterizing the dialect type to which the i-th speech frame belongs, while the attribute discrimination layer 53 outputs a state representation characterizing the dialect attribute category to which the i-th speech frame belongs; therefore, when computing the KL divergence, the two need to be normalized. In the embodiment of the present application, normalization means converting the state representation of the dialect type of the i-th speech frame output by the second language discrimination layer 52 into a state representation of the attribute category to which the i-th speech frame belongs. This conversion may be performed according to a preset conversion rule.
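A minimal sketch of such a consistency term, assuming both outputs have already been mapped onto the same set of attribute categories (the names consistency_loss, attr_logits and lang_logits_converted are placeholders, not the patent's implementation):

    import torch.nn.functional as F


    def consistency_loss(attr_logits, lang_logits_converted):
        """Per-frame KL divergence between the attribute-layer output and the converted language-layer output."""
        log_q_attr = F.log_softmax(attr_logits, dim=-1)    # q_out_i as log-probabilities
        q_lang = F.softmax(lang_logits_converted, dim=-1)  # q'_out_i, already mapped onto attribute categories
        # F.kl_div takes log-probabilities first and probabilities second; 'batchmean' averages over frames
        return F.kl_div(log_q_attr, q_lang, reduction="batchmean")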
When the language-attribute consistency loss function is introduced, the loss function of the dialect recognition model is a weighted combination of the loss function of the second classifier 42, the loss function of the second language discrimination layer 52, the loss function of the attribute discrimination layer 53, and the language-attribute consistency loss function of the second language discrimination layer 52 and the attribute discrimination layer 53. It can be expressed as:
L = a × L1 + b × L2 + c × L3 + d × L4.
Optionally, L = L1 + L2 + L3 + L4, i.e. a = b = c = d = 1.
During model training, the model parameters are updated by minimizing L, L1 and L2+L3+L4. Minimizing L gives the dialect recognition model the ability to recognize multiple dialects; minimizing L1 gives the second classifier 42 stronger acoustic discrimination ability; and minimizing L2+L3+L4 gives the second discriminator 43 stronger dialect discrimination ability. At the same time, because of the gradient reversal layer in the second discriminator 43, the features generated by the second feature extractor 41 acquire dialect confusability: the characteristic features generated by the second feature extractor 41 for the dialect features of different dialect types have consistent distributions, so that the second discriminator 43 cannot tell which dialect an input feature belongs to. During this adversarial training, the second discriminator 43 becomes stronger and stronger, pushing the dialect confusability of the features generated by the second feature extractor 41 to improve until the second discriminator 43 cannot discriminate them; as the confusability improves, the second discriminator 43 further improves its discrimination ability in order to discriminate accurately. An equilibrium is eventually reached: when the features extracted by the second feature extractor 41 are good enough, the second discriminator 43 cannot discriminate them, at which point the feature distributions are nearly identical, speech recognition no longer needs to distinguish different dialects, and recognition can be performed directly, achieving multi-dialect recognition.
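As a schematic and simplified training step under the assumption of a single optimizer and equal weights a = b = c = d = 1 (reusing the grad_reverse and consistency_loss sketches above; train_step, lang_head, attr_head and convert_to_attr are placeholder names, and convert_to_attr stands in for the preset rule that maps dialect-type scores onto attribute categories):

    import torch.nn.functional as F


    def train_step(feature_extractor, classifier, lang_head, attr_head, optimizer,
                   feats, phoneme_targets, dialect_targets, attr_targets, alpha=1.0):
        optimizer.zero_grad()
        rep = feature_extractor(feats)                    # characteristic features
        phoneme_logits = classifier(rep)                  # second classifier 42
        rev = grad_reverse(rep, alpha)                    # second gradient reversal layer 51
        lang_logits = lang_head(rev, phoneme_logits)      # second language discrimination layer 52 (with control gate)
        attr_logits = attr_head(rev, phoneme_logits)      # attribute discrimination layer 53 (with control gate)

        L1 = F.cross_entropy(phoneme_logits, phoneme_targets)
        L2 = F.cross_entropy(lang_logits, dialect_targets)
        L3 = F.cross_entropy(attr_logits, attr_targets)
        L4 = consistency_loss(attr_logits, convert_to_attr(lang_logits))

        loss = L1 + L2 + L3 + L4                          # L with a = b = c = d = 1
        loss.backward()                                   # gradients reaching the extractor are negated by the GRL
        optimizer.step()
        return loss.item()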
Corresponding to the method embodiments, an embodiment of the present application also provides a multi-dialect speech recognition apparatus. A structural schematic diagram of the multi-dialect speech recognition apparatus provided by an embodiment of the present application is shown in Fig. 6, and the apparatus may include:
a receiving module 61, an extraction module 62 and a recognition module 63; wherein,
the receiving module 61 is configured to receive speech data;
the extraction module 62 is configured to extract a dialect recognition feature from the speech data;
the recognition module 63 is configured to input the dialect recognition feature into a pre-constructed dialect recognition model to obtain a recognition result of the speech data; the dialect recognition model is trained with training corpora labeled at least with speech content and the dialect type to which they belong.
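As a small illustrative sketch only, the three modules chain together at inference time roughly as follows; recognize, feature_frontend and dialect_model are placeholder names for whatever front end and pre-constructed model the apparatus actually uses:

    import torch


    def recognize(waveform, feature_frontend, dialect_model):
        """Chain the receiving, extraction and recognition modules at inference time."""
        feats = feature_frontend(waveform)   # extraction module 62: e.g. frame-level dialect recognition features
        with torch.no_grad():
            return dialect_model(feats)      # recognition module 63: pre-constructed dialect recognition model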
The multi-dialect speech recognition apparatus provided by the embodiments of the present application recognizes dialects through a pre-constructed dialect recognition model, where the dialect recognition model is trained with training corpora containing multiple dialects. During training, the model is not restricted to the speech content of the corpora; the dialect type to which each dialect belongs is also introduced, and the dialect recognition model is optimized in combination with that dialect type, enabling it to accurately recognize multiple dialects. The user therefore no longer needs to switch speech recognition modes, which simplifies operation and improves the accuracy and efficiency of multi-dialect recognition.
In an optional embodiment, the dialect recognition model is trained with training corpora labeled at least with speech content, the dialect type to which they belong and the dialect attribute category to which they belong.
In an optional embodiment, the dialect recognition model includes: a feature extractor, a classifier and a discriminator; wherein,
the feature extractor is configured to obtain the dialect recognition feature and output a characteristic feature, the characteristic feature being more discriminative than the dialect recognition feature;
the classifier is configured to obtain the characteristic feature and output the recognition result of the speech data;
the discriminator is configured to obtain the characteristic feature and output the dialect type to which the speech data belongs, or output the dialect type to which the speech data belongs and the dialect attribute category to which the speech data belongs.
In an optional embodiment, the discriminator includes: a gradient reversal layer and a language discrimination layer; or the discriminator includes: a gradient reversal layer, a language discrimination layer and an attribute discrimination layer; wherein,
the gradient reversal layer is configured to obtain the characteristic feature and output the characteristic feature;
the language discrimination layer is configured to obtain the characteristic feature output by the gradient reversal layer and output the dialect type to which the speech data belongs;
the attribute discrimination layer is configured to obtain the characteristic feature output by the gradient reversal layer and output the dialect attribute category to which the speech data belongs.
In an optional embodiment, the gradient reversal layer is configured, when the dialect recognition model is being trained, to negate the gradient of the language discrimination layer and pass it to the feature extractor; or the gradient reversal layer is configured, when the dialect recognition model is being trained, to negate the gradients of the language discrimination layer and the attribute discrimination layer and pass them to the feature extractor, so as to update the parameters of the feature extractor.
In an optional embodiment, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier and the loss function of the discriminator.
In an optional embodiment, if the discriminator includes a gradient reversal layer and a language discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier and the loss function of the language discrimination layer;
alternatively,
if the discriminator includes a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier, the loss function of the language discrimination layer and the loss function of the attribute discrimination layer.
In an optional embodiment, if the discriminator includes a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, the loss function of the dialect recognition model during training is a weighted combination of the loss function of the classifier, the loss function of the language discrimination layer, the loss function of the attribute discrimination layer, and the language-attribute consistency loss function of the language discrimination layer and the attribute discrimination layer.
In an optional embodiment, the language discrimination layer is a neural network containing a control gate; the number of layers of the neural network is greater than 1;
the input of each layer of the neural network is obtained from the output of the control gate and the feature output by the previous layer;
the input of the control gate is the vector output by the classifier corresponding to the feature output by the previous layer.
The multi-dialect speech recognition apparatus provided by the embodiments of the present application may be applied to multi-dialect recognition devices, such as PC terminals, smartphones, translators, robots, smart home appliances, remote controls, cloud platforms, servers and server clusters. Optionally, Fig. 7 shows a hardware block diagram of a multi-dialect recognition device. Referring to Fig. 7, the hardware configuration of the multi-dialect recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
In the embodiment of the present application, the number of processors 1, communication interfaces 2, memories 3 and communication buses 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, etc.;
the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one magnetic disk memory;
the memory stores a program, the processor may call the program stored in the memory, and the program is configured to:
receive speech data;
extract a dialect recognition feature from the speech data;
input the dialect recognition feature into a pre-constructed dialect recognition model to obtain a recognition result of the speech data; the dialect recognition model is trained with training corpora labeled at least with speech content and the dialect type to which they belong.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
An embodiment of the present application also provides a storage medium, which may store a program suitable for execution by a processor, the program being configured to:
receive speech data;
extract a dialect recognition feature from the speech data;
input the dialect recognition feature into a pre-constructed dialect recognition model to obtain a recognition result of the speech data; the dialect recognition model is trained with training corpora labeled at least with speech content and the dialect type to which they belong.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element preceded by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. Therefore, the application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A multi-dialect speech recognition method, characterized by comprising:
receiving speech data;
extracting a dialect recognition feature from the speech data;
inputting the dialect recognition feature into a pre-constructed dialect recognition model to obtain a recognition result of the speech data; wherein the dialect recognition model is trained with training corpora labeled at least with speech content and the dialect type to which they belong.
2. The method according to claim 1, characterized in that the dialect recognition model is trained with training corpora labeled at least with speech content, the dialect type to which they belong and the dialect attribute category to which they belong.
3. The method according to claim 1 or 2, characterized in that the dialect recognition model comprises: a feature extractor, a classifier and a discriminator; wherein,
the input of the feature extractor is the dialect recognition feature and its output is a characteristic feature, the characteristic feature being more discriminative than the dialect recognition feature;
the input of the classifier is the characteristic feature and its output is the recognition result of the speech data;
the input of the discriminator is the characteristic feature, and its output is the dialect type to which the speech data belongs, or its output is the dialect type to which the speech data belongs and the dialect attribute category to which the speech data belongs.
4. The method according to claim 3, characterized in that the discriminator comprises: a gradient reversal layer and a language discrimination layer; or the discriminator comprises: a gradient reversal layer, a language discrimination layer and an attribute discrimination layer; wherein,
the input of the gradient reversal layer is the characteristic feature and its output is the characteristic feature;
the input of the language discrimination layer is the characteristic feature output by the gradient reversal layer and its output is the dialect type to which the speech data belongs;
the input of the attribute discrimination layer is the characteristic feature output by the gradient reversal layer and its output is the dialect attribute category to which the speech data belongs.
5. The method according to claim 4, characterized in that, when the dialect recognition model is being trained,
the gradient reversal layer negates the gradient of the language discrimination layer and passes it to the feature extractor, or the gradient reversal layer negates the gradients of the language discrimination layer and the attribute discrimination layer and passes them to the feature extractor, so as to update the parameters of the feature extractor.
6. The method according to claim 4, characterized in that, when the dialect recognition model is being trained, the loss function of the dialect recognition model is composed of a weighted combination of the loss function of the classifier and the loss function of the discriminator.
7. The method according to claim 6, characterized in that, if the discriminator comprises a gradient reversal layer and a language discrimination layer, then when the dialect recognition model is being trained, the loss function of the dialect recognition model is composed of a weighted combination of the loss function of the classifier and the loss function of the language discrimination layer;
alternatively,
if the discriminator comprises a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, then when the dialect recognition model is being trained, the loss function of the dialect recognition model is composed of a weighted combination of the loss function of the classifier, the loss function of the language discrimination layer and the loss function of the attribute discrimination layer.
8. The method according to claim 6, characterized in that, if the discriminator comprises a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, then when the dialect recognition model is being trained, the loss function of the dialect recognition model is composed of a weighted combination of the loss function of the classifier, the loss function of the language discrimination layer, the loss function of the attribute discrimination layer, and the language-attribute consistency loss function of the language discrimination layer and the attribute discrimination layer.
9. The method according to any one of claims 4-8, characterized in that the language discrimination layer is a neural network containing a control gate; the number of layers of the neural network is greater than 1;
the input of each layer of the neural network is obtained from the output of the control gate and the feature output by the previous layer;
the input of the control gate is the vector output by the classifier corresponding to the feature output by the previous layer.
10. A multi-dialect speech recognition apparatus, characterized by comprising:
a receiving module, configured to receive speech data;
an extraction module, configured to extract a dialect recognition feature from the speech data;
a recognition module, configured to input the dialect recognition feature into a pre-constructed dialect recognition model to obtain a recognition result of the speech data; wherein the dialect recognition model is trained with training corpora labeled at least with speech content and the dialect type to which they belong.
11. A multi-dialect recognition device, characterized by comprising a memory and a processor;
the memory being configured to store a program;
the processor being configured to execute the program to implement the steps of the multi-dialect speech recognition method according to any one of claims 1-9.
12. A readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the multi-dialect speech recognition method according to any one of claims 1-9 are implemented.
CN201910852557.0A 2019-09-10 2019-09-10 Multi-party identification method, device, equipment and readable storage medium Active CN110517664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910852557.0A CN110517664B (en) 2019-09-10 2019-09-10 Multi-party identification method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910852557.0A CN110517664B (en) 2019-09-10 2019-09-10 Multi-party identification method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN110517664A true CN110517664A (en) 2019-11-29
CN110517664B CN110517664B (en) 2022-08-05

Family

ID=68632012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910852557.0A Active CN110517664B (en) 2019-09-10 2019-09-10 Multi-party identification method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN110517664B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105786A (en) * 2019-12-26 2020-05-05 苏州思必驰信息科技有限公司 Multi-sampling-rate voice recognition method, device, system and storage medium
CN111292727A (en) * 2020-02-03 2020-06-16 北京声智科技有限公司 Voice recognition method and electronic equipment
CN111369981A (en) * 2020-03-02 2020-07-03 北京远鉴信息技术有限公司 Dialect region identification method and device, electronic equipment and storage medium
CN111460214A (en) * 2020-04-02 2020-07-28 北京字节跳动网络技术有限公司 Classification model training method, audio classification method, device, medium and equipment
CN111653274A (en) * 2020-04-17 2020-09-11 北京声智科技有限公司 Method, device and storage medium for awakening word recognition
CN111798836A (en) * 2020-08-03 2020-10-20 上海茂声智能科技有限公司 Method, device, system, equipment and storage medium for automatically switching languages
CN111833844A (en) * 2020-07-28 2020-10-27 苏州思必驰信息科技有限公司 Training method and system of mixed model for speech recognition and language classification
CN112017630A (en) * 2020-08-19 2020-12-01 北京字节跳动网络技术有限公司 Language identification method and device, electronic equipment and storage medium
CN112908296A (en) * 2021-02-18 2021-06-04 上海工程技术大学 Dialect identification method
CN112951240A (en) * 2021-05-14 2021-06-11 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN113053367A (en) * 2021-04-16 2021-06-29 北京百度网讯科技有限公司 Speech recognition method, model training method and device for speech recognition
CN111460214B (en) * 2020-04-02 2024-04-19 北京字节跳动网络技术有限公司 Classification model training method, audio classification method, device, medium and equipment

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120109649A1 (en) * 2010-11-01 2012-05-03 General Motors Llc Speech dialect classification for automatic speech recognition
CN103578465A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Speech recognition method and electronic device
US20150221305A1 (en) * 2014-02-05 2015-08-06 Google Inc. Multiple speech locale-specific hotword classifiers for selection of a speech locale
CN105632501A (en) * 2015-12-30 2016-06-01 中国科学院自动化研究所 Deep-learning-technology-based automatic accent classification method and apparatus
US20160284344A1 (en) * 2013-12-19 2016-09-29 Baidu Online Network Technology (Beijing) Co., Ltd. Speech data recognition method, apparatus, and server for distinguishing regional accent
CN106537493A (en) * 2015-09-29 2017-03-22 深圳市全圣时代科技有限公司 Speech recognition system and method, client device and cloud server
CN106887226A (en) * 2017-04-07 2017-06-23 天津中科先进技术研究院有限公司 Speech recognition algorithm based on artificial intelligence recognition
US20180053500A1 (en) * 2016-08-22 2018-02-22 Google Inc. Multi-accent speech recognition
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN108510976A (en) * 2017-02-24 2018-09-07 芋头科技(杭州)有限公司 A kind of multilingual mixing voice recognition methods
CN108682420A (en) * 2018-05-14 2018-10-19 平安科技(深圳)有限公司 A kind of voice and video telephone accent recognition method and terminal device
US20180350343A1 (en) * 2017-05-31 2018-12-06 Lenovo (Singapore) Pte. Ltd. Provide output associated with a dialect
CN109979432A (en) * 2019-04-02 2019-07-05 科大讯飞股份有限公司 A kind of dialect translation method and device
CN110033760A (en) * 2019-04-15 2019-07-19 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN110033756A (en) * 2019-04-15 2019-07-19 北京达佳互联信息技术有限公司 Language Identification, device, electronic equipment and storage medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120109649A1 (en) * 2010-11-01 2012-05-03 General Motors Llc Speech dialect classification for automatic speech recognition
CN103578465A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Speech recognition method and electronic device
US20160284344A1 (en) * 2013-12-19 2016-09-29 Baidu Online Network Technology (Beijing) Co., Ltd. Speech data recognition method, apparatus, and server for distinguishing regional accent
US20150221305A1 (en) * 2014-02-05 2015-08-06 Google Inc. Multiple speech locale-specific hotword classifiers for selection of a speech locale
CN106537493A (en) * 2015-09-29 2017-03-22 深圳市全圣时代科技有限公司 Speech recognition system and method, client device and cloud server
CN105632501A (en) * 2015-12-30 2016-06-01 中国科学院自动化研究所 Deep-learning-technology-based automatic accent classification method and apparatus
US20180053500A1 (en) * 2016-08-22 2018-02-22 Google Inc. Multi-accent speech recognition
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN108510976A (en) * 2017-02-24 2018-09-07 芋头科技(杭州)有限公司 A kind of multilingual mixing voice recognition methods
CN106887226A (en) * 2017-04-07 2017-06-23 天津中科先进技术研究院有限公司 Speech recognition algorithm based on artificial intelligence recognition
US20180350343A1 (en) * 2017-05-31 2018-12-06 Lenovo (Singapore) Pte. Ltd. Provide output associated with a dialect
CN108682420A (en) * 2018-05-14 2018-10-19 平安科技(深圳)有限公司 A kind of voice and video telephone accent recognition method and terminal device
CN109979432A (en) * 2019-04-02 2019-07-05 科大讯飞股份有限公司 A kind of dialect translation method and device
CN110033760A (en) * 2019-04-15 2019-07-19 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN110033756A (en) * 2019-04-15 2019-07-19 北京达佳互联信息技术有限公司 Language Identification, device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHUNLEI ZHANG: "Semi-supervised Learning with Generative Adversarial Networks for Arabic Dialect Identification", 《 ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》 *
王慧勇: "Research on a Multi-dialect Accented Chinese Speech Recognition System Based on Neural Networks", 《China Masters' Theses Full-text Database》 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105786A (en) * 2019-12-26 2020-05-05 苏州思必驰信息科技有限公司 Multi-sampling-rate voice recognition method, device, system and storage medium
CN111292727B (en) * 2020-02-03 2023-03-24 北京声智科技有限公司 Voice recognition method and electronic equipment
CN111292727A (en) * 2020-02-03 2020-06-16 北京声智科技有限公司 Voice recognition method and electronic equipment
CN111369981A (en) * 2020-03-02 2020-07-03 北京远鉴信息技术有限公司 Dialect region identification method and device, electronic equipment and storage medium
CN111369981B (en) * 2020-03-02 2024-02-23 北京远鉴信息技术有限公司 Dialect region identification method and device, electronic equipment and storage medium
CN111460214A (en) * 2020-04-02 2020-07-28 北京字节跳动网络技术有限公司 Classification model training method, audio classification method, device, medium and equipment
CN111460214B (en) * 2020-04-02 2024-04-19 北京字节跳动网络技术有限公司 Classification model training method, audio classification method, device, medium and equipment
CN111653274A (en) * 2020-04-17 2020-09-11 北京声智科技有限公司 Method, device and storage medium for awakening word recognition
CN111653274B (en) * 2020-04-17 2023-08-04 北京声智科技有限公司 Wake-up word recognition method, device and storage medium
CN111833844A (en) * 2020-07-28 2020-10-27 苏州思必驰信息科技有限公司 Training method and system of mixed model for speech recognition and language classification
CN111798836B (en) * 2020-08-03 2023-12-05 上海茂声智能科技有限公司 Method, device, system, equipment and storage medium for automatically switching languages
CN111798836A (en) * 2020-08-03 2020-10-20 上海茂声智能科技有限公司 Method, device, system, equipment and storage medium for automatically switching languages
CN112017630B (en) * 2020-08-19 2022-04-01 北京字节跳动网络技术有限公司 Language identification method and device, electronic equipment and storage medium
CN112017630A (en) * 2020-08-19 2020-12-01 北京字节跳动网络技术有限公司 Language identification method and device, electronic equipment and storage medium
CN112908296A (en) * 2021-02-18 2021-06-04 上海工程技术大学 Dialect identification method
CN113053367A (en) * 2021-04-16 2021-06-29 北京百度网讯科技有限公司 Speech recognition method, model training method and device for speech recognition
CN113053367B (en) * 2021-04-16 2023-10-10 北京百度网讯科技有限公司 Speech recognition method, speech recognition model training method and device
CN112951240A (en) * 2021-05-14 2021-06-11 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110517664B (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN110517664A (en) Multi-party speech recognition methods, device, equipment and readable storage medium storing program for executing
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
Cai et al. A novel learnable dictionary encoding layer for end-to-end language identification
WO2018227780A1 (en) Speech recognition method and device, computer device and storage medium
CN108711421A (en) A kind of voice recognition acoustic model method for building up and device and electronic equipment
CN103578471B (en) Speech identifying method and its electronic installation
CN110473523A (en) A kind of audio recognition method, device, storage medium and terminal
CN108831445A (en) Sichuan dialect recognition methods, acoustic training model method, device and equipment
CN107492382A (en) Voiceprint extracting method and device based on neutral net
CN107818164A (en) A kind of intelligent answer method and its system
CN107704482A (en) Method, apparatus and program
CN107195296A (en) A kind of audio recognition method, device, terminal and system
CN109036391A (en) Audio recognition method, apparatus and system
CN106935239A (en) The construction method and device of a kind of pronunciation dictionary
CN104575497B (en) A kind of acoustic model method for building up and the tone decoding method based on the model
CN108899013A (en) Voice search method, device and speech recognition system
CN107146615A (en) Audio recognition method and system based on the secondary identification of Matching Model
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
CN110349597A (en) A kind of speech detection method and device
CN107437417A (en) Based on speech data Enhancement Method and device in Recognition with Recurrent Neural Network speech recognition
CN111694940A (en) User report generation method and terminal equipment
CN106875936A (en) Audio recognition method and device
CN111833845A (en) Multi-language speech recognition model training method, device, equipment and storage medium
CN109741735A (en) The acquisition methods and device of a kind of modeling method, acoustic model
CN107679225A (en) A kind of reply generation method based on keyword

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant