CN110517664B - Multi-dialect recognition method, device, equipment and readable storage medium - Google Patents

Multi-dialect recognition method, device, equipment and readable storage medium

Info

Publication number
CN110517664B
CN110517664B
Authority
CN
China
Prior art keywords
dialect
layer
output
feature
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910852557.0A
Other languages
Chinese (zh)
Other versions
CN110517664A (en)
Inventor
许丽
潘嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201910852557.0A
Publication of CN110517664A
Application granted
Publication of CN110517664B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/005 — Language recognition
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L15/08 — Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the application disclose a multi-dialect recognition method, apparatus, device, and readable storage medium, in which dialects are recognized by a pre-constructed dialect recognition model. The dialect recognition model is obtained by training with a corpus containing multiple dialects, and its training process is not limited to the voice content of the corpus: the dialect type to which each utterance belongs is also introduced, and the dialect recognition model is optimized in combination with that dialect type. As a result, the dialect recognition model can accurately recognize multiple dialects, the user no longer needs to switch speech recognition modes, user operation is simplified, and the accuracy and efficiency of multi-dialect recognition are improved.

Description

Multi-dialect recognition method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for multi-dialect speech recognition.
Background
At present, more and more entry points for artificial intelligence applications rely on speech recognition, for example translators that enable barrier-free communication between people speaking different languages, robot customer service that greatly reduces the need for human agents, speech input methods that free both hands, and smart homes (home appliances) that can be controlled more conveniently and naturally. All of these rely on speech recognition, so the accuracy of speech recognition is very important.
However, existing speech recognition schemes usually support only the recognition of Mandarin, and recognition accuracy drops sharply if the user speaks a dialect. Even when dialects are supported, the user is required to manually select the recognition mode corresponding to the dialect, i.e., the user must cooperate actively; if the user mixes Mandarin and dialect, it is difficult to switch modes proactively, and in a conversation among multiple speakers of different dialects, frequent switching is obviously inefficient and degrades the user experience.
Therefore, how to improve the accuracy and efficiency of dialect recognition has become an urgent technical problem to be solved.
Disclosure of Invention
In view of the foregoing, the present application provides a multi-dialect recognition method, apparatus, device and readable storage medium.
In order to achieve the above object, the following solutions are proposed:
a multi-party identification method, comprising:
receiving voice data;
extracting dialect recognition features from the voice data;
inputting the dialect recognition characteristics into a pre-constructed dialect recognition model to obtain a recognition result of the voice data; the dialect recognition model is obtained by training with training corpora at least marked with voice content and the dialect type.
In the above method, preferably, the dialect recognition model is obtained by training with a corpus labeled with at least the voice content, the dialect type, and the dialect attribute category.
In the above method, preferably, the dialect recognition model includes: a feature extractor, a classifier and a discriminator; wherein,
the input of the feature extractor is the dialect identification feature, and the output of the feature extractor is a characterization feature which is more distinctive than the dialect identification feature;
the input of the classifier is the characterization feature, and the output is the recognition result of the voice data;
and the input of the discriminator is the characterization feature, and the output is the dialect type to which the voice data belongs, or the output is the dialect type to which the voice data belongs and the dialect attribute category to which the voice data belongs.
In the above method, preferably, the discriminator includes: a gradient inversion layer and a language discrimination layer; alternatively, the discriminator may include: a gradient inversion layer, a language discrimination layer and an attribute discrimination layer; wherein,
the input of the gradient inversion layer is the characterization feature, and the output of the gradient inversion layer is the characterization feature;
the input of the language discrimination layer is the characterization feature output by the gradient inversion layer, and the output is the dialect type to which the voice data belongs;
and the input of the attribute discrimination layer is the characterization feature output by the gradient inversion layer, and the output is the dialect attribute category to which the voice data belongs.
In the above method, preferably, when the dialect recognition model is trained,
the gradient inversion layer inverts the gradient of the language discrimination layer and then transmits the inverted gradient to the feature extractor, or the gradient inversion layer inverts the gradients of the language discrimination layer and the attribute discrimination layer and then transmits the inverted gradients to the feature extractor, so as to update the parameters of the feature extractor.
In the above method, preferably, when the dialect recognition model is trained, the loss function of the dialect recognition model is formed by weighting the loss function of the classifier and the loss function of the discriminator.
In the above method, preferably, if the discriminator includes a gradient inversion layer and a language discrimination layer, when the dialect recognition model is trained, a loss function of the dialect recognition model is formed by weighting a loss function of the classifier and a loss function of the language discrimination layer;
or,
if the discriminator comprises a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, when the dialect recognition model is trained, the loss function of the dialect recognition model is formed by weighting the loss function of the classifier, the loss function of the language discrimination layer and the loss function of the attribute discrimination layer.
In the above method, preferably, if the discriminator includes a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, when the dialect recognition model is trained, the loss function of the dialect recognition model is formed by weighting the loss function of the classifier, the loss function of the language discrimination layer, the loss function of the attribute discrimination layer, and the language attribute consistency loss function of the language discrimination layer and the attribute discrimination layer.
In the above method, preferably, the language type discrimination layer is a neural network including a control gate; the number of layers of the neural network is more than 1;
the input of each layer of the neural network is obtained according to the output of the control gate and the characteristics of the output of the previous layer;
the input of the control gate is the vector output by the classifier corresponding to the feature output by the previous layer.
A multi-dialect recognition apparatus, comprising:
the receiving module is used for receiving voice data;
the extraction module is used for extracting dialect recognition characteristics from the voice data;
the recognition module is used for inputting the dialect recognition characteristics into a pre-constructed dialect recognition model to obtain a recognition result of the voice data; the dialect recognition model is obtained by training by using a training corpus at least marked with voice content and the dialect type.
Preferably, the dialect recognition model is obtained by training with a corpus labeled with at least voice content, a dialect type and a dialect attribute category.
In the above apparatus, preferably, the dialect recognition model includes: a feature extractor, a classifier and a discriminator; wherein,
the feature extractor is used for acquiring the dialect identification features and outputting characterization features which are more distinctive than the dialect identification features;
the classifier is used for acquiring the characterization features and outputting the recognition result of the voice data;
the discriminator is used for acquiring the characterization features and outputting the dialect type to which the voice data belongs, or outputting the dialect type to which the voice data belongs and the dialect attribute type to which the voice data belongs.
In the above apparatus, preferably, the discriminator includes: a gradient inversion layer and a language discrimination layer; alternatively, the discriminator may include: a gradient inversion layer, a language discrimination layer and an attribute discrimination layer; wherein,
the gradient inversion layer is used for acquiring the characterization feature and outputting the characterization feature;
the language discrimination layer is used for acquiring the characteristic features output by the gradient inversion layer and outputting the dialect type to which the voice data belongs;
and the attribute discrimination layer is used for acquiring the characterization features output by the gradient inversion layer and outputting the dialect attribute category to which the voice data belongs.
In the above apparatus, preferably, the gradient inversion layer is configured to invert the gradient of the language identification layer and transmit the inverted gradient to the feature extractor when the dialect recognition model is trained, or the gradient inversion layer is configured to invert the gradients of the language identification layer and the attribute identification layer and transmit the inverted gradients to the feature extractor when the dialect recognition model is trained, so as to update parameters of the feature extractor.
In the above apparatus, it is preferable that the loss function of the dialect recognition model during training is formed by weighting the loss function of the classifier and the loss function of the discriminator.
In the above apparatus, preferably, if the discriminator includes a gradient reversal layer and a language type discrimination layer, the loss function of the dialect recognition model during training is formed by weighting the loss function of the classifier and the loss function of the language type discrimination layer;
or,
if the discriminator comprises a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, the loss function of the dialect recognition model during training is formed by weighting the loss function of the classifier, the loss function of the language discrimination layer and the loss function of the attribute discrimination layer.
Preferably, in the above apparatus, if the discriminator includes a gradient inversion layer, a language discrimination layer, and an attribute discrimination layer, the loss function of the dialect recognition model during training is formed by weighting the loss function of the classifier, the loss function of the language discrimination layer, the loss function of the attribute discrimination layer, and the language attribute consistency loss function of the language discrimination layer and the attribute discrimination layer.
In the above apparatus, preferably, the language type discrimination layer is a neural network including a control gate; the number of layers of the neural network is more than 1;
the input of each layer of the neural network is obtained according to the characteristics of the output of the control gate and the output of the previous layer;
the input of the control gate is the vector output by the classifier corresponding to the feature output by the previous layer.
A multi-dialect recognition device, comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the multi-dialect recognition method described in any one of the above.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the multi-dialect recognition method described in any one of the above.
It can be seen from the foregoing technical solutions that, in the multi-dialect recognition method, apparatus, device, and readable storage medium provided in the embodiments of the application, dialects are recognized by a pre-constructed dialect recognition model. The dialect recognition model is trained with a corpus containing multiple dialects; the training process is not limited to the speech content of the corpus but also introduces the dialect type to which each utterance belongs and optimizes the model in combination with that dialect type. As a result, the dialect recognition model can accurately recognize multiple dialects, the user no longer needs to switch speech recognition modes, user operation is simplified, and the accuracy and efficiency of multi-dialect recognition are improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flow chart of one implementation of a multi-dialect recognition method disclosed in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a dialect recognition model disclosed in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a first discriminator according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a dialect recognition model disclosed in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a second discriminator according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a multi-dialect recognition apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of a hardware structure of the multi-dialect recognition device disclosed in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The inventor has found through research that, when performing dialect recognition, existing speech recognition schemes use independent dialect recognition models. For example, to recognize speech in a first dialect, a first dialect recognition model must be used, and to recognize speech in a second, different dialect, a second dialect recognition model must be used; the first dialect recognition model is trained with the training corpus of the first dialect, and the second dialect recognition model with the training corpus of the second dialect. Thus, to support recognition of speech in N dialects, N dialect recognition models need to be trained. This speech recognition scheme suffers from the following disadvantages:
1. Long development time and high cost: in the training stage of a dialect recognition model, a large amount of dialect audio data needs to be collected for each dialect, and the audio content needs to be transcribed manually. For dialects, audio data collection and manual transcription are difficult and expensive, so adding a new dialect recognition capability often requires a long development time and a high development cost.
2. Poor convenience for the user: when speech recognition is needed, the user has to switch the dialect recognition mode according to the dialect used by the speaker, i.e., the user must cooperate actively. If the speaker mixes Mandarin and dialect, it is difficult to switch modes proactively, and in a conversation among multiple speakers of different dialects, frequent switching is obviously inefficient and degrades the user experience.
In order to overcome, or at least partially overcome, the above deficiencies, the basic idea of the solution of the present application is to train a single dialect recognition model with a corpus containing multiple dialects, so that speech in multiple dialects can be recognized by one dialect recognition model. On the one hand, compared with training a separate recognition model for each dialect, the amount of corpus required for each dialect is smaller; on the other hand, in actual use, the user is spared from switching among multiple dialect modes, which improves convenience.
The following is a detailed description of the scheme of the application:
the multi-party identification method provided by the application can be applied to electronic equipment, and the electronic equipment can include but is not limited to any one of the following: smart phones, computers, translators, robots, smart homes (home appliances), remote controllers, and the like.
Referring to fig. 1, fig. 1 is a flowchart of an implementation of a multi-dialect recognition method according to an embodiment of the present application, and the implementation of the method may include:
step S11: voice data is received.
The voice data is the voice data to be recognized. It may be dialect voice data input by a user and received by the electronic device through a sound pickup device (such as a microphone or a microphone array), Mandarin voice data, or voice data in which a dialect and Mandarin are mixed.
Step S12: dialect recognition features are extracted from the voice data.
The dialect recognition feature may be an acoustic feature, typically a spectral feature of the voice data, such as a Mel-Frequency Cepstral Coefficient (MFCC) feature or an FBank feature.
When extracting the dialect recognition features, the voice data may be divided into several speech frames and the dialect recognition feature of each speech frame extracted.
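As an illustrative sketch only (the patent does not prescribe a specific toolkit), frame-level MFCC features of the kind described above could be extracted roughly as follows; the use of librosa and the frame length, hop size, and number of coefficients are assumptions for illustration.

```python
# Illustrative sketch only: per-frame MFCC extraction as one possible form of the
# dialect recognition features of step S12. The library, frame length, hop size and
# number of coefficients are assumptions, not values prescribed by the patent.
import librosa

def extract_dialect_features(wav_path, sr=16000, n_mfcc=13):
    # Load the received voice data (step S11) at a fixed sampling rate.
    audio, sr = librosa.load(wav_path, sr=sr)
    # Split into overlapping frames and compute MFCCs per frame:
    # 25 ms windows with a 10 ms hop are common assumptions for speech.
    mfcc = librosa.feature.mfcc(
        y=audio, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
    )
    # Shape (num_frames, n_mfcc): one dialect recognition feature per speech frame.
    return mfcc.T
```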
Step S13: inputting the dialect recognition characteristics into a pre-constructed dialect recognition model to obtain a recognition result of the voice data (namely the specific voice content of the voice data); the dialect recognition model is obtained by training with a training corpus at least marked with voice content and a dialect type (also called language for short).
Under the condition that the voice data is divided into a plurality of voice frames, the dialect recognition characteristics of each voice frame are input into a pre-constructed dialect recognition model to obtain the recognition result of each voice frame, and the recognition results of all the voice frames of the voice data form the recognition result of the voice data.
In China, dialects are of a wide variety; only a few are listed here as examples, such as the Sichuan, Henan, Fuzhou, Nanchang, Guangzhou and Changsha dialects. In the embodiment of the present application, the training corpus may include training samples of these dialects, training samples of more dialects, and certainly may also include training samples of all dialects. Which dialects' training samples the corpus includes can be determined according to the dialect types that actually need to be supported; for example, if the Sichuan, Guangzhou and Changsha dialects need to be supported, the training corpus needs to include voice data of the Sichuan, Guangzhou and Changsha dialects at the same time.
In the training process of the dialect recognition model, for each training sample, in addition to recognizing the voice content of the training sample, the dialect type to which the training sample belongs is also discriminated, and the dialect recognition model is then optimized based on the recognition result of the voice content and the discrimination result of the dialect type.
According to the multi-dialect recognition method, dialects are recognized by a pre-constructed dialect recognition model, where the dialect recognition model is obtained by training with a corpus containing multiple dialects. The training process is not limited to the voice content of the corpus: the dialect type to which each utterance belongs is also introduced, and the dialect recognition model is optimized in combination with that dialect type. As a result, the dialect recognition model can accurately recognize multiple dialects (including Mandarin), the user no longer needs to switch speech recognition modes, user operation is simplified, and the accuracy and efficiency of multi-dialect recognition are improved.
In addition, because a large amount of labeled data is not needed for each dialect in the training process of the dialect recognition model (compared with training a dialect recognition model dedicated to a single dialect, fewer samples per dialect are required), the difficulty of audio data collection and manual transcription is reduced and the cost is lowered, so that when the recognition capability for a new dialect is to be added, it can be added in a short time and at a low cost.
The following describes a specific implementation of the dialect recognition model provided in the embodiments of the present application.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a dialect recognition model according to an embodiment of the present application; as shown in fig. 2, the model may include:
a first feature extractor 21, a first classifier 22, and a first discriminator 23; wherein,
the input of the first feature extractor 21 is the dialect recognition feature of each speech frame extracted in step S12, and the output of the first feature extractor 21 is the corresponding characterization feature of each speech frame, which is a feature that is more distinctive than the dialect recognition feature. That is, the first feature extractor 21 is configured to extract, from the dialect recognition features, features that characterize the intrinsic characteristics of the input voice data (i.e., the voice data received in step S11); these are the high-level features used for dialect recognition. Specifically, for any speech frame (for convenience of description, denoted as the first speech frame), when the first feature extractor 21 receives the dialect recognition feature of the first speech frame, it extracts the characterization feature corresponding to the first speech frame from that dialect recognition feature, and the characterization feature corresponding to the first speech frame is a feature that characterizes the intrinsic characteristics of the first speech frame.
The first feature extractor 21 may be embodied in a deep Neural Network such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN).
The input of the first classifier 22 is the characterization feature output by the first feature extractor 21, and the output of the first classifier 22 is the recognition result of the voice data; that is, corresponding to the first speech frame, the first classifier 22 is used to determine the speech content of the first speech frame. Specifically, corresponding to the first speech frame, the input of the first classifier 22 is the characterization feature corresponding to the first speech frame, and the output of the first classifier 22 is the state representation of the speech content corresponding to the first speech frame.
The first classifier 22 may be a shallow neural network, for example, a two-layer DNN (Deep Neural Network). The present application does not specifically limit the specific form of the first classifier 22. The specific form of the output of the first classifier 22 may be any one of a word, a syllable, a phoneme, and a phoneme state (a phoneme state is a unit of finer granularity than a phoneme). The specific form is related to the modeling unit of the first classifier 22:
if the first classifier 22 models a word as a modeling unit, the output of the first classifier 22 is a state representation of the word, i.e. corresponds to the first speech frame, and the first classifier 22 is configured to determine which word the dialect identifying feature of the first speech frame of the input dialect identifying model represents.
If the first classifier 22 models the syllable as a modeling unit, the output of the first classifier 22 is a state representation of the syllable, i.e. corresponds to the first speech frame, and the first classifier 22 is used to determine which syllable is characterized by the dialect identifying characteristics of the first speech frame of the input dialect identifying model.
If the first classifier 22 models the phonemes as a modeling unit, the output of the first classifier 22 is a state representation of the phonemes, that is, corresponding to the first speech frame, and the first classifier 22 is configured to determine which phoneme is characterized by the dialect identifying feature of the first speech frame of the input dialect identifying model.
If the first classifier 22 models the phoneme state as a modeling unit, the output of the first classifier 22 is a state representation of the phoneme state, i.e. corresponding to the first speech frame, and the first classifier 22 is configured to determine which phoneme state the dialect identifying feature of the first speech frame of the input dialect identifying model characterizes.
The input of the first discriminator 23 is the characterization feature output by the first feature extractor 21, and the output of the first discriminator 23 is the dialect class to which the speech data belongs. Specifically, corresponding to the first speech frame, the input of the first discriminator 23 is the characterization feature corresponding to the first speech frame, and the output of the first discriminator 23 is the state representation of the dialect class corresponding to the first speech frame, that is, the first discriminator 23 is configured to determine which dialect class the dialect identification feature of the first speech frame of the input dialect identification model characterizes.
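To make the structure in fig. 2 concrete, the following PyTorch sketch mirrors the three components described above: a feature extractor, a classifier, and a discriminator head. It is only an illustrative sketch; the layer widths, the use of a GRU for the feature extractor, and two-layer DNNs for the other components are assumptions (the patent only requires a deep network such as a CNN or RNN for the extractor and allows shallow networks for the classifier and discriminator). The gradient inversion behaviour of the discriminator is sketched separately after fig. 3.

```python
# Illustrative sketch of the fig. 2 structure. Layer widths and the GRU feature
# extractor are assumptions; the patent only requires a deep network here (e.g. CNN
# or RNN) and allows shallow networks (e.g. two-layer DNN) for the other parts.
import torch
import torch.nn as nn

FEAT_DIM, HID_DIM = 13, 256    # assumed dialect-feature and hidden sizes
N_PHONES, N_DIALECTS = 83, 20  # assumed classifier units and dialect types

class DialectRecognitionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First feature extractor (21): maps per-frame dialect recognition
        # features to more distinctive characterization features.
        self.feature_extractor = nn.GRU(FEAT_DIM, HID_DIM, batch_first=True)
        # First classifier (22): shallow two-layer DNN over the characterization
        # features, predicting the speech content (here: phonemes).
        self.classifier = nn.Sequential(
            nn.Linear(HID_DIM, HID_DIM), nn.ReLU(), nn.Linear(HID_DIM, N_PHONES))
        # First discriminator (23): predicts the dialect type of each frame; its
        # gradient inversion layer is sketched after fig. 3 below.
        self.discriminator = nn.Sequential(
            nn.Linear(HID_DIM, HID_DIM), nn.ReLU(), nn.Linear(HID_DIM, N_DIALECTS))

    def forward(self, frames):                      # frames: (batch, T, FEAT_DIM)
        reps, _ = self.feature_extractor(frames)    # characterization features
        return self.classifier(reps), self.discriminator(reps)
```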
It should be noted that, in the embodiment of the present application, the first discriminator 23 is mainly used for performing optimization training on the dialect model in the training stage of the dialect recognition model, and therefore, in the process of performing speech recognition by using the trained dialect recognition model, the discrimination result output by the first discriminator 23 may or may not be output to the user. Alternatively, a viewing interface may be provided for the user, and when the user operates the viewing interface, the determination result output by the first determiner 23 is output to the user.
In this embodiment, a back-propagation algorithm is adopted in training the dialect recognition model; the algorithm consists of two processes, forward propagation of signals and backward propagation of errors. The forward propagation of the signal refers to the process in which the dialect recognition model receives the dialect recognition features of a sample and outputs the speech recognition result of the sample, the signal propagating from the first feature extractor 21 to the first classifier 22 and from the first feature extractor 21 to the first discriminator 23. The backward propagation of the error (characterized by the gradient) refers to the process of returning the error between the dialect type discrimination result of the sample output by the first discriminator 23 and the true dialect type of the sample to the input end of the dialect recognition model, the signal propagating from the first discriminator 23 to the first feature extractor 21.
The following describes a specific implementation manner of the first discriminator 23 for the optimization training of the dialect recognition model, in conjunction with a specific structure of the first discriminator 23.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a first discriminator 23 according to an embodiment of the present application, which may include:
a first gradient inversion layer 31 (for convenience of description, the first gradient inversion layer 31 is denoted by R) and a first language discrimination layer 32; wherein,
the first gradient inversion layer 31 is defined as follows:
R(z) = z (1)
∂R(z)/∂z = −αE (2)
Formula (1) is the forward-propagation formula of the first gradient inversion layer 31, where z is the input of the first gradient inversion layer 31, i.e., the characterization feature f output by the first feature extractor 21, and R(z) is the output of the gradient inversion layer; R() indicates that the input passes through the R layer without being processed. It can be seen that, in forward propagation, the output of the first gradient inversion layer 31 equals its input, i.e., the input feature is passed directly to the next layer (the first language discrimination layer 32) by the first gradient inversion layer 31 without any processing. Specifically, corresponding to the first speech frame, the input of the first gradient inversion layer 31 is the characterization feature of the first speech frame, and the output of the first gradient inversion layer 31 is still the characterization feature of the first speech frame.
Formula (2) is the back-propagation formula of the first gradient inversion layer 31, where ∂R(z)/∂z is the gradient of the first gradient inversion layer 31, E is the identity matrix, and α is a predetermined hyperparameter; it can be seen that the gradient of the first gradient inversion layer 31 is the product of the hyperparameter and a negative identity matrix.
According to the chain rule, the output gradient is equal to the input gradient multiplied by the layer's own gradient (formally, if h(x) = f(g(x)), then h′(x) = f′(g(x))·g′(x)). The output gradient of the first gradient inversion layer 31 therefore equals its input gradient (i.e., the gradient of the first language discrimination layer 32, which represents the output error of the first language discrimination layer 32) multiplied by −αE; because of the minus sign, this can be viewed as negating the input gradient before passing it to the previous layer (the first feature extractor 21). The first gradient inversion layer 31 leaves the input features unchanged in forward propagation and reverses the input gradient in backward propagation (i.e., multiplies it by −αE) so that the sign of the result is opposite to that of the input gradient, hence the name gradient inversion layer.
According to the gradient descent method, the model reaches the optimal solution fastest when the parameters are updated along the (non-inverted) gradient direction. In the embodiment of the present application, the first gradient inversion layer 31 reverses the gradient of the first language discrimination layer 32 before transmitting it to the first feature extractor 21, so that the update direction of the first feature extractor 21 is opposite to that of the first language discrimination layer 32: the training target of the first language discrimination layer 32 is to recognize the dialect type of a sample as accurately as possible, while the training target of the first feature extractor 21 is to make that dialect type as hard to recognize as possible. Adversarial training is therefore introduced through the first gradient inversion layer 31.
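A minimal sketch of the gradient inversion layer described above, written as a custom autograd function, is shown below; the name GradientReversal and the default value of α are assumptions for illustration, not names used in the patent.

```python
# Minimal sketch of the gradient inversion layer R: identity in the forward pass,
# multiplication of the incoming gradient by -alpha (i.e. -alpha*E) in the backward
# pass, which is what drives the adversarial training described above.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, z, alpha):
        ctx.alpha = alpha
        return z.view_as(z)          # formula (1): R(z) = z, no processing

    @staticmethod
    def backward(ctx, grad_output):
        # formula (2): pass -alpha * (gradient of the language discrimination
        # layer) back to the feature extractor.
        return -ctx.alpha * grad_output, None

def grad_reverse(z, alpha=1.0):
    return GradientReversal.apply(z, alpha)
```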
The first language identification layer 32 may be a shallow neural network, for example, a two-layer DNN network, and the application does not specifically limit the specific network form of the first language identification layer 32. The input of the first language discrimination layer 32 is the characterization feature output by the first gradient inversion layer 31, and the output is the dialect type to which the voice data belongs. Specifically, corresponding to the first speech frame, the input of the first language identification layer 32 is the characteristic feature of the first speech frame output by the first gradient inversion layer 31, and the output of the first language identification layer 32 is the state representation of the dialect type to which the first speech frame belongs.
As noted above, adversarial training is introduced through the first gradient inversion layer 31. Its objective is, on the one hand, to train the first language discrimination layer 32 to determine as accurately as possible which dialect the features input to the dialect recognition model belong to, and on the other hand, by forwarding the reversed gradient of the first language discrimination layer 32 through the first gradient inversion layer 31, to train the first feature extractor 21 to extract features that carry as little language-specific information as possible, i.e., to make the conditional probability distributions of the speech content represented by the extracted features consistent across different dialect types. Consistent conditional probability distributions across dialect types mean that the pronunciation of the same speech content is similar or identical in different dialects; for example, the feature distributions of the phoneme a in the Sichuan, Northeastern and Henan dialects are consistent, i.e., the phoneme a is pronounced similarly or identically in those dialects.
In order to make the feature distribution learned by the first language discrimination layer 32 relate to the conditional probability distribution of the dialect type given the voice content, in the embodiment of the present application a control gate is introduced into the first language discrimination layer 32; the control gate controls the first language discrimination layer 32 to learn the conditional probability distributions of the dialect types of different voice contents. In the embodiment of the present application, the input of the control gate is the output of the first classifier 22; for convenience of description, the first classifier 22 is described below taking a phoneme as the modeling unit.
For any layer of the first language discrimination layer 32 (for convenience of description, the k-th layer), the input of the k-th layer is obtained from the output of the control gate and the features output by the (k−1)-th layer; the input of the control gate is the vector output by the first classifier 22 corresponding to the features output by the (k−1)-th layer. This can be expressed as:
h̃_i = g(c_i) ⊙ h_i (3)
g(c_i) = σ(Vc_i + b) (4)
where h_i is the feature corresponding to the i-th speech frame output by the (k−1)-th layer of the first language discrimination layer 32, h̃_i is the corresponding input of the k-th layer, ⊙ denotes element-wise multiplication, and c_i is the one-hot vector output by the first classifier 22 corresponding to h_i, i.e., the phoneme vector corresponding to the i-th speech frame. For example, assuming that the first classifier 22 employs 83 phonemes as its modeling units, c_i is an 83-dimensional vector in which each dimension corresponds to one phoneme; if the phoneme corresponding to the i-th speech frame is a, the dimension corresponding to a is 1 and all other dimensions are 0. g(c_i) is the control gate, where σ is the activation function, V is the matrix weight, and b is the bias weight; that is, the phoneme vector c_i is transformed by the matrix to obtain the control gate, the features of the (k−1)-th layer are fused with the corresponding phoneme through the control gate, and the fused information is input into the k-th layer, so that the first language discrimination layer 32 learns information related to the conditional probability distribution of the dialect type given the phoneme. Note that if k = 1, the (k−1)-th layer is the first gradient inversion layer 31.
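As a rough sketch of formulas (3) and (4), one gated layer of the language discrimination layer could look as follows; the hidden size and the sigmoid activation are assumptions, and the element-wise product used to fuse the gate with the previous layer's output follows the reconstruction adopted for formula (3) above.

```python
# Rough sketch of the control gate of formulas (3) and (4): the one-hot phoneme
# vector c_i from the classifier is mapped through V and b to a gate g(c_i), which
# is fused element-wise with the previous layer's output h_i before the next layer
# of the language discrimination layer. Sizes and activations are assumptions.
import torch
import torch.nn as nn

class GatedDiscriminationLayer(nn.Module):
    def __init__(self, n_phones=83, hidden=256):
        super().__init__()
        self.gate = nn.Linear(n_phones, hidden)   # V and b of formula (4)
        self.layer = nn.Linear(hidden, hidden)    # k-th layer of the discriminator

    def forward(self, h_prev, phone_one_hot):
        g = torch.sigmoid(self.gate(phone_one_hot))   # g(c_i) = sigma(V c_i + b)
        fused = g * h_prev                            # formula (3): gate (*) h_i
        return torch.relu(self.layer(fused))
```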
The output layer of the first language discrimination layer 32 has M nodes, where M = N × C, N being the number of dialect types and C the total number of modeling units of the first classifier 22 (e.g., the total number of phonemes). The M nodes are divided into C groups; each group of nodes corresponds to one phoneme and represents the discrimination of that phoneme over the dialect types, generally the probability that the phoneme belongs to each dialect type. Each time the node parameters of the output layer of the first language discrimination layer 32 are updated, only the parameters of the group of nodes corresponding to the recognition result of the dialect recognition feature input to the dialect recognition model are updated.
For example, assuming the model is trained on 20 dialects and the speech modeling units are 83 phonemes, then M = 20 × 83 = 1660 nodes, where every group of 20 nodes corresponds to one phoneme and characterizes the probability that the phoneme belongs to each of the 20 dialects. Each time the node parameters of the output layer of the first language discrimination layer 32 are updated, the phoneme prediction output by the first classifier 22 for the dialect recognition feature of the i-th speech frame input to the dialect recognition model is determined, and only the parameters of the corresponding group of 20 nodes are updated.
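The grouped output layer described above (M = N × C nodes, updated one phoneme group at a time) might be sketched as follows; selecting only the group of N dialect nodes belonging to the classifier's predicted phoneme means that, when the loss is computed on that slice, only that group's output-layer parameters receive gradient. The sizes and the gather-based selection are illustrative assumptions.

```python
# Illustrative sketch of the grouped output layer: M = N_DIALECTS * N_PHONES nodes,
# viewed as N_PHONES groups of N_DIALECTS nodes. Only the group matching the
# classifier's phoneme prediction is used (and hence updated) for each frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_DIALECTS, N_PHONES, HID_DIM = 20, 83, 256   # assumed sizes

output_layer = nn.Linear(HID_DIM, N_DIALECTS * N_PHONES)

def dialect_logits_for_predicted_phoneme(hidden, phone_ids):
    # hidden:    (batch, HID_DIM) features from the previous discrimination layer
    # phone_ids: (batch,) phoneme indices predicted by the classifier
    logits = output_layer(hidden).view(-1, N_PHONES, N_DIALECTS)
    idx = phone_ids.view(-1, 1, 1).expand(-1, 1, N_DIALECTS)
    return logits.gather(1, idx).squeeze(1)    # (batch, N_DIALECTS)

# usage sketch:
# loss = F.cross_entropy(dialect_logits_for_predicted_phoneme(h, p), dialect_labels)
```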
During the model training process, the loss function is an indispensable part of the model. In the embodiment of the present application, the loss functions are provided for the first classifier 22 and the first discriminator 23, respectively, and the loss function of the dialect recognition model is formed by weighting the loss function of the first classifier 22 and the loss function of the first discriminator 23.
The loss function of the first classifier 22 is used to characterize the difference between the speech content of the sample predicted by the first classifier 22 to be output and the true speech content of the sample. The loss function of the first discriminator 23 is used to characterize the difference between the language class of the sample predicted and output by the first discriminator 23 and the real language class of the sample.
The loss function of the first classifier 22 and the loss function of the first discriminator 23 may be the same or different. When the loss function of the first classifier 22 and the loss function of the first discriminator 23 are weighted, the weight of the loss function of the first classifier 22 and the weight of the loss function of the first discriminator 23 may be the same or different.
Alternatively, the loss function of the first classifier 22 and the loss function of the first discriminator 23 may both be cross-entropy functions. Cross entropy is an important concept in information theory, mainly used to measure the difference between two probability distributions; when the two distributions are the same, the cross entropy reaches its minimum value. The cross-entropy function (denoted by L1) is explained below using the first classifier 22 as an example:
L1 = Σ_{i=1}^{I} L_y(Y(F(x_i)), ŷ_i)
where I is the total number of speech frames whose dialect recognition features are input to the dialect recognition model at one time (i.e., the dialect recognition model can process the dialect recognition features of I speech frames simultaneously), i denotes the i-th speech frame, F denotes the first feature extractor 21, F(x_i) is the output of the first feature extractor 21 for the dialect recognition feature of the i-th speech frame x_i, Y denotes the first classifier 22, Y(F(x_i)) is the output of the first classifier 22 for the dialect recognition feature of the i-th speech frame x_i, ŷ_i is the actual speech content corresponding to the dialect recognition feature of the i-th speech frame x_i, and L_y is the cross entropy, here the cross entropy of Y(F(x_i)) and ŷ_i.
By minimizing this loss function, i.e., minimizing the cross entropy between the output of the first classifier 22 and the true result, the model can be trained so that its output comes closer to the true result; that is, the recognition result of the model approaches the true result and the recognition rate increases.
In the embodiment of the present application, the prediction of the language type of a sample by the first discriminator 23 is realized by the first language discrimination layer 32; therefore, the loss function corresponding to the first discriminator 23 is the loss function corresponding to the first language discrimination layer 32.
Assuming that the loss function of the dialect recognition model is characterized by L, the loss function of the first classifier 22 is characterized by L1, and the loss function of the first language discrimination layer 32 is characterized by L2, then:
L = a × L1 + b × L2
Alternatively, L = L1 + L2, i.e., a = b = 1.
During model training, the model parameters are updated by minimizing L, L1 and L2. Minimizing L gives the dialect recognition model the ability to recognize multiple dialects; minimizing L1 gives the first classifier 22 stronger acoustic discrimination; minimizing L2 gives the first discriminator 23 stronger dialect discrimination, while, owing to the gradient inversion layer in the first discriminator 23, the features generated by the first feature extractor 21 become dialect-confusable. Dialect confusability means that, through the characterization features generated by the first feature extractor 21, the dialect features of different dialect types are uniformly distributed, so that the first discriminator 23 cannot tell which dialect the input features come from. In the adversarial training process, as the first discriminator 23 becomes stronger, it pushes the features generated by the first feature extractor 21 to become more and more dialect-confusable so that the first discriminator 23 cannot discriminate; as the features become more confusable, the first discriminator 23 in turn improves its discrimination ability in order to discriminate accurately. Eventually a balanced state is reached: when the features extracted by the first feature extractor 21 are good enough, the first discriminator 23 cannot discriminate, and the distributions of the features extracted by the first feature extractor 21 are essentially consistent, so that different dialects no longer need to be distinguished during speech recognition and speech recognition can be performed directly, achieving the effect of multi-dialect recognition.
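A minimal training-step sketch tying these pieces together is given below: the classifier loss L1 and the language discrimination loss L2 are both cross entropies, the total loss is L = a×L1 + b×L2, and the gradient reversal function grad_reverse (sketched after fig. 3) automatically flips the discriminator's gradient on its way back to the feature extractor. The weights a and b, the optimizer, and the model interface (as in the earlier architecture sketch) are assumptions.

```python
# Minimal sketch of one training step with the weighted loss L = a*L1 + b*L2.
# `model` is assumed to expose feature_extractor / classifier / discriminator as in
# the earlier sketch; grad_reverse (from the sketch after fig. 3) is applied in
# front of the discriminator, so minimizing L trains the discriminator normally
# while pushing the feature extractor toward dialect-confusable features.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, frames, phone_labels, dialect_labels, a=1.0, b=1.0):
    reps, _ = model.feature_extractor(frames)                  # characterization features
    phone_logits = model.classifier(reps)                      # first classifier output
    dialect_logits = model.discriminator(grad_reverse(reps))   # via gradient reversal

    l1 = F.cross_entropy(phone_logits.flatten(0, 1), phone_labels.flatten())
    l2 = F.cross_entropy(dialect_logits.flatten(0, 1), dialect_labels.flatten())
    loss = a * l1 + b * l2                                     # L = a*L1 + b*L2

    optimizer.zero_grad()
    loss.backward()        # reversed gradient reaches the feature extractor here
    optimizer.step()
    return loss.item()
```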
In the foregoing embodiments, the dialect recognition model is trained with a corpus labeled with the speech content and the dialect type. The inventor found, in the course of implementing the application, that if dialect attribute information is introduced during training of the dialect recognition model, the recognition effect of the model can be further improved. The dialect attribute information may specifically be the dialect attribute category to which the voice data belongs, such as the dialect group. Taking Chinese as an example, Chinese dialects can be divided into seven large groups: the Mandarin (official) dialects, Xiang, Gan, Wu, Min, Yue and Hakka. The Mandarin dialects can be further subdivided into Northern Mandarin (a collective term for Beijing Mandarin, Northeast Mandarin, Jiao-Liao Mandarin, Ji-Lu Mandarin, Central Plains Mandarin and Lan-Yin Mandarin), Southwest Mandarin and Jianghuai Mandarin.
Based on this, the dialect recognition model provided by the embodiment of the application can be obtained by training with a corpus labeled with at least voice content, a dialect type and a dialect attribute category.
That is, in the course of training the dialect recognition model, in addition to recognizing the voice content of the training sample, the dialect type to which the training sample belongs and the dialect attribute type to which the training sample belongs are respectively discriminated, and the dialect recognition model is optimally trained based on the recognition result of the voice content, the discrimination result of the dialect type, and the discrimination result of the dialect attribute type, thereby further improving the accuracy of the recognition result of the dialect recognition model.
Based on this, another schematic structural diagram of the dialect identification model provided in the embodiment of the present application is shown in fig. 4, and may include:
a second feature extractor 41, a second classifier 42, and a second discriminator 43; wherein,
the input of the second feature extractor 41 is the dialect identifying feature of each frame of speech frame extracted in step S12, and the output of the second feature extractor 41 is the corresponding characterizing feature of each frame of speech frame, which is a feature that is more distinctive than the dialect identifying feature. That is, the second feature extractor 41 is configured to extract, from the dialect recognition features, features that characterize the intrinsic characteristics of the input speech data (i.e., the speech data received in step S11), which are high-level features used for dialect recognition. Specifically, corresponding to the first speech frame, when the second feature extractor 41 receives the dialect identifying feature of the first speech frame, the characterizing feature corresponding to the first speech frame is extracted from the dialect identifying feature of the first speech frame, and the characterizing feature corresponding to the first speech frame is a feature characterizing the inherent characteristics of the first speech frame.
The second feature extractor 41 may be a CNN, or a deep neural network such as an RNN.
The input of the second classifier 42 is the characterizing feature output by the second feature extractor 41, the output of the second classifier 42 is the recognition result of the speech data, i.e. corresponding to the first speech frame, and the second classifier 42 is used for determining the speech content of the first speech frame. Specifically, corresponding to the first speech frame, the input of the second classifier 42 is the characterization feature corresponding to the first speech frame, and the output of the second classifier 42 is the state representation of the speech content corresponding to the first speech frame.
The second classifier 42 may be embodied in the form of a shallow neural network, for example, a two-layer DNN network. The present application does not specifically limit the specific form of the second classifier 42. The specific form of the output of the second classifier 42 may be any one of words, syllables, phonemes, and phoneme states. The specific form is related to the modeling unit of the second classifier 42, and the specific implementation of the modeling unit of the second classifier 42 can refer to the implementation of the modeling unit of the first classifier 22, which is not described in detail herein.
The input of the second discriminator 43 is the characterization feature output by the second feature extractor 41, and the output of the second discriminator 43 is the dialect class to which the voice data belongs and the dialect attribute class to which the voice data belongs. Specifically, corresponding to the first speech frame, the input of the second discriminator 43 is the characterization feature corresponding to the first speech frame, and the output of the second discriminator 43 is the state representation of the dialect class corresponding to the first speech frame and the state representation of the dialect attribute class to which the first speech frame belongs, that is, the second discriminator 43 is used to determine which dialect class the dialect identification feature of the first speech frame of the input dialect identification model characterizes, and which dialect attribute class the dialect identification feature of the first speech frame characterizes.
Similar to the first discriminator 23, in the embodiment of the present application, the second discriminator 43 is mainly used for performing optimization training on the dialect model in the training stage of the dialect recognition model, so that the discrimination result output by the second discriminator 43 may or may not be output to the user in the process of performing speech recognition by using the trained dialect recognition model. Alternatively, a viewing interface may be provided for the user, and when the user operates the viewing interface, the determination result output by the second determiner 43 is output to the user.
In this embodiment, a back-propagation algorithm is adopted in training the dialect recognition model; the algorithm consists of two processes, forward propagation of signals and backward propagation of errors. The forward propagation of the signal refers to the process in which the dialect recognition model receives the dialect recognition features of a sample and outputs the speech recognition result of the sample, the signal propagating from the second feature extractor 41 to the second classifier 42 and from the second feature extractor 41 to the second discriminator 43. The backward propagation of the error refers to the process of returning the error between the dialect type and dialect attribute category discrimination results of the sample output by the second discriminator 43 and the true dialect type and dialect attribute category of the sample to the input end of the dialect recognition model, the signal propagating from the second discriminator 43 to the second feature extractor 41.
The specific implementation of the second discriminator 43 for the optimization training of the dialect recognition model is described below with reference to the specific structure of the second discriminator 43.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a second discriminator 43 according to an embodiment of the present application, which may include:
a second gradient inversion layer 51 (for convenience of description, the second gradient inversion layer 51 is denoted by R), a second language discrimination layer 52, and an attribute discrimination layer 53; wherein,
the definition of the second gradient inversion layer 51 is the same as that of the first gradient inversion layer 31, that is:
R(z) = z (1)
∂R(z)/∂z = −αE (2)
Formula (1) is the forward-propagation formula of the second gradient inversion layer 51, where z is the input of the second gradient inversion layer 51, i.e., the characterization feature f output by the second feature extractor 41, and R(z) is the output of the second gradient inversion layer 51; R() indicates that the input passes through the R layer without being processed. It can be seen that, in forward propagation, the output of the second gradient inversion layer 51 equals its input, i.e., the input feature is passed directly to the next layers (the second language discrimination layer 52 and the attribute discrimination layer 53) by the second gradient inversion layer 51 without any processing. Specifically, corresponding to the first speech frame, the input of the second gradient inversion layer 51 is the characterization feature of the first speech frame, and the output of the second gradient inversion layer 51 is still the characterization feature of the first speech frame.
Equation (2) is the back propagation equation of the second gradient inversion layer 51, where ∂R(z)/∂z is the gradient of the second gradient inversion layer 51, E is an identity matrix, and α is a predetermined hyperparameter. It can be seen that the gradient of the second gradient inversion layer 51 is the product of the hyperparameter and a negative identity matrix.
According to the chain rule, an output gradient equals the input gradient multiplied by the layer's own gradient, so the output gradient of the second gradient inversion layer 51 equals its input gradient (i.e., the sum of the gradient of the second language discrimination layer 52 and the gradient of the attribute discrimination layer 53) multiplied by -αE. Because of the negative sign, this is equivalent to negating the input gradient before passing it to the previous layer (i.e., the second feature extractor 41). The second gradient inversion layer 51 does not process the input features in forward propagation, but inverts the input gradient in back propagation (i.e., multiplies it by -αE) so that the sign of the result is opposite to that of the input gradient; hence it is called a gradient inversion layer.
According to the gradient descent method, the model reaches the optimal solution fastest when the model parameters are updated along the gradient direction (i.e., the gradient is not inverted). In the embodiment of the present application, the second gradient inversion layer 51 inverts the gradients of the second language discrimination layer 52 and the attribute discrimination layer 53 before passing them to the second feature extractor 41, so the parameter update direction of the second feature extractor 41 is opposite to the gradient direction of the second language discrimination layer 52 and the attribute discrimination layer 53. That is, the training target of the second language discrimination layer 52 is to recognize the dialect category of the sample as accurately as possible, the training target of the attribute discrimination layer 53 is to recognize the dialect attribute category of the sample as accurately as possible, while the training target of the second feature extractor 41 is to generate features from which the dialect category and dialect attribute category of the sample cannot be accurately recognized. Adversarial training is thus introduced through the second gradient inversion layer 51.
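For illustration, a minimal sketch of a gradient inversion layer with the behavior of equations (1) and (2) is given below. It assumes a PyTorch-style autograd framework; the class and function names are illustrative and not taken from the patent.

import torch

class GradientReversal(torch.autograd.Function):
    # Identity in the forward pass; multiplies the incoming gradient by -alpha
    # in the backward pass, corresponding to equations (1) and (2).

    @staticmethod
    def forward(ctx, z, alpha):
        ctx.alpha = alpha
        return z.view_as(z)  # R(z) = z: the feature passes through unchanged

    @staticmethod
    def backward(ctx, grad_output):
        # Output gradient = input gradient multiplied by -alpha * E (sign flipped)
        return -ctx.alpha * grad_output, None

def grad_reverse(z, alpha=1.0):
    return GradientReversal.apply(z, alpha)

In such a sketch, the characterization feature from the second feature extractor 41 would be passed through grad_reverse before entering the discrimination layers, so that minimizing the discriminator losses pushes the feature extractor parameters in the opposite direction.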
The second language identification layer 52 may be a shallow neural network, for example, a two-layer DNN network, and the application does not specifically limit the specific network form of the second language identification layer 52. The input of the second language identification layer 52 is the characterization feature output by the second gradient inversion layer 51, and the output is the dialect type to which the voice data belongs. Specifically, corresponding to the first speech frame, the input of the second language identification layer 52 is the characteristic feature of the first speech frame output by the second gradient inversion layer 51, and the output of the second language identification layer 52 is the state representation of the dialect type to which the first speech frame belongs.
The attribute discrimination layer 53 may be a shallow neural network, for example, a two-layer DNN network, and the application does not specifically limit the specific network form of the attribute discrimination layer 53. The input of the attribute discrimination layer 53 is the characterization feature output by the second gradient inversion layer 51, and the output is the dialect attribute category to which the voice data belongs. Specifically, corresponding to the first speech frame, the input of the attribute discriminating layer 53 is the characteristic feature of the first speech frame output by the second gradient inversion layer 51, and the output of the attribute discriminating layer 53 is the state representation of the dialect attribute class to which the first speech frame belongs.
As can be seen from the foregoing, adversarial training is introduced through the second gradient inversion layer 51. The second language discrimination layer 52 and the attribute discrimination layer 53 are trained to determine, as accurately as possible, which dialect the features input to the dialect recognition model belong to and which dialect attribute category they belong to, while the second gradient inversion layer 51 inverts the gradients of the second language discrimination layer 52 and the attribute discrimination layer 53 so that the second feature extractor 41 is trained to generate features with no dialect distinctiveness and no attribute-category distinctiveness. That is, the conditional probability distributions of the dialect categories to which the speech contents represented by the features extracted by the second feature extractor 41 belong become consistent, and the conditional probability distributions of the dialect attribute categories to which those speech contents belong also become consistent. Consistent conditional probability distributions of the dialect attribute categories means that different dialects belong to the same attribute category; for example, the Henan dialect and the Northeast dialect both belong to Northern Mandarin.
In order to make the feature distribution learned by the second language identification layer 52 be related to the conditional probability distribution of the dialect type of the voice content, in the embodiment of the present application, a control gate is introduced into the second language identification layer 52, and the second language identification layer 52 learns the conditional probability distribution of the dialect type of different voice content through the control gate. In the embodiment of the present application, the input of the control gate is the output of the second classifier 42, and the specific implementation manner of the control gate can refer to the implementation manner of the control gate in the first language discrimination layer 32, which is not described in detail here.
In order to make the feature distribution learned by the attribute discrimination layer 53 be related to the conditional probability distribution of the dialect attribute category to which the voice content belongs, in the embodiment of the present application, a control gate is also introduced into the attribute discrimination layer 53, and the attribute discrimination layer 53 is controlled by the control gate to learn the conditional probability distribution of the dialect attribute category of different voice contents. In the embodiment of the present application, the input of the control gate is the output of the second classifier 42, and the structure of the control gate in the attribute discrimination layer 53 is the same as that of the control gate in the second language discrimination layer 52, which is specifically shown in formulas (3) to (4):
x_i = g(c_i) ⊙ h_i (3)
g(c_i) = σ(Vc_i + b) (4)
For convenience of explanation, the following description takes the case where the second classifier 42 uses phonemes as its modeling units as an example.
In the attribute discrimination layer 53, the above formulas (3) to (4) mean: the input x_i of any layer of the attribute discrimination layer 53 (for convenience of description, referred to as the k-th layer) is obtained from the output g(c_i) of the control gate and the feature h_i output by the (k-1)-th layer; the input of the control gate is the vector output by the second classifier 42 corresponding to the feature output by the (k-1)-th layer.
Specifically, in the attribute discrimination layer 53, h_i is the feature corresponding to the i-th speech frame output by the (k-1)-th layer of the attribute discrimination layer 53, and c_i is the one-hot vector output by the second classifier 42 corresponding to h_i, i.e., the phoneme vector corresponding to the i-th speech frame. For example, assuming that the second classifier 42 uses 83 phonemes as its modeling units, c_i is an 83-dimensional vector in which each dimension corresponds to one phoneme; if the phoneme corresponding to the i-th speech frame is a, then in this 83-dimensional vector the dimension corresponding to a is 1 and all other dimensions are 0. g(c_i) is the control gate, where σ is the activation function, V is the matrix weight, and b is the bias weight. Through the control gate, the feature of the (k-1)-th layer and its corresponding phoneme are fused together and input into the k-th layer, so that the attribute discrimination layer 53 learns information related to the conditional probability distribution of the dialect attribute category of that phoneme. If k = 1, the (k-1)-th layer is the second gradient inversion layer 51.
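As a concrete illustration of formulas (3) to (4), the following sketch shows one possible control-gate fusion, assuming the gate output is combined with the (k-1)-th layer feature by element-wise multiplication; the module name, dimensions and the element-wise combination are assumptions made only for illustration.

import torch
import torch.nn as nn

class ControlGate(nn.Module):
    # g(c_i) = sigma(V * c_i + b): maps the one-hot phoneme vector output by the
    # second classifier to a gate that modulates the previous layer's feature.
    def __init__(self, num_phonemes: int, feat_dim: int):
        super().__init__()
        self.linear = nn.Linear(num_phonemes, feat_dim)  # weight V and bias b
        self.sigma = nn.Sigmoid()                        # activation function

    def forward(self, h_prev: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        gate = self.sigma(self.linear(c))  # g(c_i)
        return gate * h_prev               # fused input to the k-th layer

For example, with 83 phonemes and a 256-dimensional hidden feature, ControlGate(83, 256)(h_prev, c) would produce the input of the k-th layer from the (k-1)-th layer feature h_prev and the phoneme one-hot vector c.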
The output layer of the attribute discrimination layer 53 has Q = P × C nodes, where P is the number of dialect attribute classes and C is the total number of modeling units of the second classifier 42, such as the total number of phonemes. The Q nodes are divided into C groups; each group of nodes corresponds to one phoneme and represents the judgment of the dialect attribute category of that phoneme, where the judgment generally refers to the probability that the phoneme belongs to each dialect attribute category. Each time the node parameters of the output layer of the attribute discrimination layer 53 are updated, only the parameters of the group of nodes corresponding to the recognition result of the dialect recognition feature input into the dialect recognition model are updated.
For example, assuming the model is trained on 7 dialect attribute classes (corresponding to 7 large dialect regions) and the speech modeling units are 83 phonemes, then Q = 7 × 83 = 581 nodes, where every group of 7 nodes corresponds to one phoneme and characterizes the probability that the phoneme belongs to each of the 7 dialect attribute classes. Each time the node parameters of the output layer of the attribute discrimination layer 53 are updated, the phoneme prediction result output by the second classifier 42 for the dialect recognition feature of the i-th speech frame input into the dialect recognition model is determined, and only the parameters of the group of 7 nodes corresponding to that phoneme prediction result are updated.
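The grouping of the output nodes and the selective update can be sketched as follows; the hidden dimension, the node ordering (consecutive groups of P nodes per phoneme) and the function names are assumptions made only for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

P, C = 7, 83                          # dialect attribute classes, phonemes
output_layer = nn.Linear(256, P * C)  # Q = P * C = 581 output nodes

def attribute_output_loss(hidden, phoneme_idx, attr_label):
    # hidden: [B, 256] features; phoneme_idx: [B] phonemes predicted by the
    # second classifier; attr_label: [B] true dialect attribute classes.
    logits = output_layer(hidden).view(-1, C, P)               # C groups of P nodes
    group = logits[torch.arange(hidden.size(0)), phoneme_idx]  # [B, P], the selected group
    # Only the selected group contributes to the loss, so only the parameters
    # of that group of nodes receive non-zero gradients and are updated.
    return F.cross_entropy(group, attr_label)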
The following explains a case of setting a loss function in the dialect recognition model in the case of introducing the attribute discrimination layer 53.
With the introduction of the attribute discrimination layer 53, one implementation of setting the loss function of the dialect recognition model in the embodiment of the present application may be as follows: loss functions are set for the second classifier 42, the second language discrimination layer 52 and the attribute discrimination layer 53, respectively, and the loss function of the dialect recognition model is formed by weighting the loss function of the second classifier 42, the loss function of the second language discrimination layer 52 and the loss function of the attribute discrimination layer 53.
The loss function of the second classifier 42 characterizes the difference between the speech content of the sample predicted by the second classifier 42 and the true speech content of the sample. The loss function of the second language discrimination layer 52 characterizes the difference between the dialect category of the sample predicted by the second language discrimination layer 52 and the true dialect category of the sample. The loss function of the attribute discrimination layer 53 characterizes the difference between the dialect attribute category of the sample predicted by the attribute discrimination layer 53 and the true dialect attribute category of the sample.
The loss function of the second classifier 42, the loss function of the second language discrimination layer 52 and the loss function of the attribute discrimination layer 53 may be the same or different. When weighting them, the weight of the loss function of the second classifier 42, the weight of the loss function of the second language discrimination layer 52 and the weight of the loss function of the attribute discrimination layer 53 may also be the same or different.
Optionally, the loss function of the second classifier 42, the loss function of the second language discrimination layer 52 and the loss function of the attribute discrimination layer 53 may all be cross-entropy functions.
Assuming that the loss function of the dialect recognition model is characterized by L, the loss function of the second classifier 42 is characterized by L1, the loss function of the second language discrimination layer 52 is characterized by L2, and the loss function of the attribute discrimination layer 53 is characterized by L3, then:
L=a×L1+b×L2+c×L3。
Alternatively, L = L1 + L2 + L3, i.e., a = b = c = 1.
During model training, the model parameters are updated by minimizing L, i.e., minimizing L1 and L2 + L3. Minimizing L gives the dialect recognition model the capability of recognizing multiple dialects; minimizing L1 gives the second classifier 42 stronger acoustic discrimination capability; minimizing L2 + L3 gives the second discriminator 43 stronger dialect discrimination capability; and, owing to the gradient inversion layer in the second discriminator 43, the features generated by the second feature extractor 41 acquire dialect confusability. Dialect confusability means that the characterization features generated by the second feature extractor 41 for different dialect categories have a consistent distribution, so the second discriminator 43 cannot tell which dialect an input feature belongs to. In this adversarial training process, as the second discriminator 43 becomes stronger, it pushes the features generated by the second feature extractor 41 to become more and more dialect-confusable, so that the second discriminator 43 cannot discriminate them; as the features become more dialect-confusable, the second discriminator 43 must further improve its discrimination capability in order to discriminate accurately. A balanced state is finally reached: when the features extracted by the second feature extractor 41 are good enough, the second discriminator 43 cannot discriminate them, and the distributions of the features extracted by the second feature extractor 41 are basically consistent. Therefore, dialects of different categories do not need to be distinguished during speech recognition, and speech recognition can be performed directly, achieving multi-dialect recognition.
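A minimal sketch of the weighted loss L = a×L1 + b×L2 + c×L3, assuming cross-entropy is used for each term as suggested above; the argument names and default weights are illustrative.

import torch.nn.functional as F

def dialect_model_loss(content_logits, content_labels,   # second classifier 42
                       dialect_logits, dialect_labels,   # second language discrimination layer 52
                       attr_logits, attr_labels,         # attribute discrimination layer 53
                       a=1.0, b=1.0, c=1.0):
    L1 = F.cross_entropy(content_logits, content_labels)
    L2 = F.cross_entropy(dialect_logits, dialect_labels)
    L3 = F.cross_entropy(attr_logits, attr_labels)
    return a * L1 + b * L2 + c * L3

Because the dialect and attribute logits are computed on features that have passed through the gradient inversion layer, minimizing this single loss trains the discrimination layers to classify dialects and, at the same time, trains the second feature extractor 41 to confuse them.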
In addition, the dialect attribute category has a certain correlation with the dialect category: the two have a one-to-one or one-to-many relationship. For example, if the region to which a dialect belongs is taken as the dialect attribute category, the region and the dialect category have a one-to-many relationship; the Sichuan dialect belongs to Southwestern Mandarin, while the Henan and Northeast dialects belong to Northern Mandarin. If a sample is judged to be the Sichuan dialect, the attribute-category judgment should be Southwestern Mandarin; if it is not, the dialect-category judgment and the dialect-attribute judgment are inconsistent and further optimization is needed. To optimize this error, a language attribute consistency loss function is introduced when setting the loss function of the dialect recognition model, and consistency learning of the feature distribution is further strengthened through this loss function. The language attribute consistency loss L4 is defined as follows:
L4 = Σ_{i=1}^{I} D_KL(q_out_i ∥ q'_out_i)
where I is the total number of speech-frame dialect features input into the dialect recognition model at one time, D_KL is the KL divergence (Kullback-Leibler divergence), q_out_i is the output of the attribute discrimination layer 53 for the i-th speech frame, and q'_out_i is obtained by converting the output of the second language discrimination layer 52 for the i-th speech frame. The output of the second language discrimination layer 52 is a state representation of the dialect category to which the i-th speech frame belongs, while the output of the attribute discrimination layer 53 is a state representation of the dialect attribute category to which the i-th speech frame belongs, so normalization is required before calculating the KL divergence. In the embodiment of the present application, normalization means converting the state representation output by the second language discrimination layer 52, which represents the dialect category of the i-th speech frame, into a state representation of the dialect attribute category of the i-th speech frame. This conversion can be performed according to preset conversion rules.
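A sketch of the language attribute consistency loss, assuming the preset conversion rule is expressed as a fixed dialect-to-attribute mapping matrix applied to the dialect posterior; the mapping matrix, tensor names and the use of softmax outputs are assumptions for illustration.

import torch
import torch.nn.functional as F

def consistency_loss(attr_logits, dialect_logits, dialect_to_attr):
    # attr_logits: [I, P] outputs of the attribute discrimination layer 53
    # dialect_logits: [I, D] outputs of the second language discrimination layer 52
    # dialect_to_attr: [D, P] preset 0/1 matrix mapping each dialect to its attribute class
    q_out = F.softmax(attr_logits, dim=-1)                          # q_out_i
    q_prime = F.softmax(dialect_logits, dim=-1) @ dialect_to_attr   # converted output q'_out_i
    eps = 1e-8
    # KL(q_out || q') summed over the I speech frames
    return torch.sum(q_out * (torch.log(q_out + eps) - torch.log(q_prime + eps)))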
When the language attribute consistency loss function is introduced, the loss function of the dialect recognition model is formed by weighting the loss function of the second classifier 42, the loss function of the second language discrimination layer 52, the loss function of the attribute discrimination layer 53, and the language attribute consistency loss function of the second language discrimination layer 52 and the attribute discrimination layer 53. It can be expressed by the formula:
L=a×L1+b×L2+c×L3+d×L4。
Alternatively, L = L1 + L2 + L3 + L4, i.e., a = b = c = d = 1.
During model training, the model parameters are updated by minimizing L, i.e., minimizing L1 and L2 + L3 + L4. Minimizing L gives the dialect recognition model the capability of recognizing multiple dialects; minimizing L1 gives the second classifier 42 stronger acoustic discrimination capability; minimizing L2 + L3 + L4 gives the second discriminator 43 stronger dialect discrimination capability; and, owing to the gradient inversion layer in the second discriminator 43, the features generated by the second feature extractor 41 acquire dialect confusability, meaning that the features generated by the second feature extractor 41 for different dialect categories have a consistent distribution, so the second discriminator 43 cannot tell which dialect an input feature belongs to. In this adversarial training process, as the second discriminator 43 becomes stronger, it pushes the features generated by the second feature extractor 41 to become more and more dialect-confusable; as the features become more dialect-confusable, the second discriminator 43 must further improve its discrimination capability in order to discriminate accurately. A balanced state is finally reached: when the features extracted by the second feature extractor 41 are good enough, the second discriminator 43 cannot discriminate them, and the distributions of the features extracted by the second feature extractor 41 are basically consistent. Therefore, dialects of different categories do not need to be distinguished during speech recognition, and speech recognition can be performed directly, achieving multi-dialect recognition.
Corresponding to the method embodiment, the present application further provides a multi-dialect recognition device. A schematic structural diagram of the multi-dialect recognition device provided in the embodiment of the present application is shown in fig. 6, and the device may include:
a receiving module 61, an extracting module 62 and an identifying module 63; wherein,
the receiving module 61 is used for receiving voice data;
the extraction module 62 is configured to extract dialect recognition features from the voice data;
the recognition module 63 is configured to input the dialect recognition features into a pre-constructed dialect recognition model to obtain a recognition result of the voice data; the dialect recognition model is obtained by training by using a training corpus at least marked with voice content and the dialect type.
The multi-dialect recognition device provided by the embodiment of the application carries out recognition of dialects through the pre-constructed dialect recognition model, wherein the dialect recognition model is obtained through training of training linguistic data comprising multiple dialects, the training process of the dialect recognition model is not limited to voice content of the linguistic data, the dialect type to which the dialect belongs is introduced, and the dialect recognition model is optimized by combining the dialect type to which the dialect belongs, so that the dialect recognition model can accurately recognize multiple dialects, a user does not need to switch voice recognition modes, user operation is simplified, and the accuracy and efficiency of multi-dialect recognition are improved.
In an alternative embodiment, the dialect recognition model is obtained by training with a corpus labeled with at least speech content, a dialect category and a dialect attribute category.
In an alternative embodiment, the dialect recognition model includes: a feature extractor, a classifier and a discriminator; wherein,
the feature extractor is used for acquiring the dialect identification features and outputting the characterization features, wherein the characterization features are more distinctive than the dialect identification features;
the classifier is used for acquiring the characterization features and outputting the recognition result of the voice data;
the discriminator is used for acquiring the characterization features and outputting the dialect type to which the voice data belongs, or outputting the dialect type to which the voice data belongs and the dialect attribute type to which the voice data belongs.
In an optional embodiment, the discriminator comprises: a gradient inversion layer and a language discrimination layer; alternatively, the discriminator may include: a gradient inversion layer, a language discrimination layer and an attribute discrimination layer; wherein,
the gradient inversion layer is used for acquiring the characterization feature and outputting the characterization feature;
the language discrimination layer is used for acquiring the characteristic features output by the gradient inversion layer and outputting the dialect type to which the voice data belongs;
and the attribute discrimination layer is used for acquiring the characterization features output by the gradient inversion layer and outputting the dialect attribute category to which the voice data belongs.
In an optional embodiment, the gradient inversion layer is configured to invert the gradient of the language identification layer and transmit the inverted gradient to the feature extractor when the dialect recognition model is trained, or the gradient inversion layer is configured to invert the gradients of the language identification layer and the attribute identification layer and transmit the inverted gradients to the feature extractor when the dialect recognition model is trained, so as to update parameters of the feature extractor.
In an alternative embodiment, the loss function of the dialect recognition model during training is formed by weighting the loss function of the classifier and the loss function of the discriminator.
In an optional embodiment, if the discriminator includes a gradient reversal layer and a language discrimination layer, the loss function of the dialect recognition model during training is formed by weighting the loss function of the classifier and the loss function of the language discrimination layer;
or,
if the discriminator comprises a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, the loss function of the dialect recognition model during training is formed by weighting the loss function of the classifier, the loss function of the language discrimination layer and the loss function of the attribute discrimination layer.
In an optional embodiment, if the discriminator includes a gradient reversal layer, a language discrimination layer, and an attribute discrimination layer, the loss function of the dialect recognition model during training is formed by weighting the loss function of the classifier, the loss function of the language discrimination layer, the loss function of the attribute discrimination layer, and the language attribute consistency loss function of the language discrimination layer and the attribute discrimination layer.
In an optional embodiment, the language discrimination layer is a neural network including a control gate; the number of layers of the neural network is more than 1;
the input of each layer of the neural network is obtained according to the output of the control gate and the characteristics of the output of the previous layer;
the input of the control gate is the vector output by the classifier corresponding to the feature output by the previous layer.
The multi-dialect recognition device provided by the embodiment of the application can be applied to multi-dialect recognition equipment, such as a PC terminal, a smart phone, a translator, a robot, a smart home appliance, a remote controller, a cloud platform, a server cluster and the like. Optionally, fig. 7 shows a block diagram of the hardware structure of the multi-dialect recognition equipment; referring to fig. 7, the hardware structure may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
receiving voice data;
extracting dialect recognition features from the voice data;
inputting the dialect recognition characteristics into a pre-constructed dialect recognition model to obtain a recognition result of the voice data; the dialect recognition model is obtained by training by using a training corpus at least marked with voice content and the dialect type.
Alternatively, the detailed function and the extended function of the program may be as described above.
An embodiment of the present application further provides a storage medium, where the storage medium may store a program adapted to be executed by a processor, where the program is configured to:
receiving voice data;
extracting dialect recognition features from the voice data;
inputting the dialect recognition characteristics into a pre-constructed dialect recognition model to obtain a recognition result of the voice data; the dialect recognition model is obtained by training by using a training corpus at least marked with voice content and the dialect type.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A multi-party identification method, comprising:
receiving voice data;
extracting dialect recognition features from the voice data;
inputting the dialect recognition characteristics into a preset dialect recognition model to obtain a recognition result of the voice data; the dialect recognition model is obtained by training a training corpus which comprises a plurality of dialects and is at least marked with voice content and the types of the dialects; wherein the dialect recognition model can recognize voice data of a plurality of dialects;
the dialect recognition model includes: a feature extractor, a classifier and a discriminator; wherein,
the input of the feature extractor is the dialect identification feature, and the output of the feature extractor is a characterization feature which is more distinctive than the dialect identification feature;
the input of the classifier is the characterization feature, and the output is the recognition result of the voice data;
and the input of the discriminator is the characterization feature, and the output is the dialect type to which the voice data belongs, or the output is the dialect type to which the voice data belongs and the dialect attribute category to which the voice data belongs.
2. The method of claim 1, wherein the dialect recognition model is trained using a corpus labeled with at least speech content, a dialect category, and a dialect attribute category.
3. The method of claim 1, wherein the discriminator comprises: a gradient inversion layer and a language discrimination layer; alternatively, the discriminator may include: a gradient inversion layer, a language discrimination layer and an attribute discrimination layer; wherein,
the input of the gradient inversion layer is the characteristic feature, and the output of the gradient inversion layer is the characteristic feature;
the input of the language discrimination layer is the characterization feature output by the gradient inversion layer, and the output is the dialect type to which the voice data belongs;
and the input of the attribute discrimination layer is the characterization feature output by the gradient inversion layer, and the output is the dialect attribute category to which the voice data belongs.
4. The method of claim 3, wherein, in training the dialect recognition model,
and the gradient inversion layer inverts the gradient of the language discrimination layer and then transmits the inverted gradient to the feature extractor, or the gradient inversion layer inverts the gradient of the language discrimination layer and the gradient of the attribute discrimination layer and then transmits the inverted gradient to the feature extractor so as to update the parameters of the feature extractor.
5. The method according to claim 3, wherein, when the dialect recognition model is trained, the loss function of the dialect recognition model is formed by weighting the loss function of the classifier and the loss function of the discriminator.
6. The method according to claim 5, wherein if the discriminator comprises a gradient inversion layer and a language discrimination layer, the loss function of the dialect recognition model is weighted by the loss function of the classifier and the loss function of the language discrimination layer when the dialect recognition model is trained;
or,
if the discriminator comprises a gradient reversal layer, a language discrimination layer and an attribute discrimination layer, when the dialect recognition model is trained, the loss function of the dialect recognition model is formed by weighting the loss function of the classifier, the loss function of the language discrimination layer and the loss function of the attribute discrimination layer.
7. The method according to claim 5, wherein if the discriminator comprises a gradient inversion layer, a language discrimination layer and an attribute discrimination layer, when training the dialect recognition model, the loss function of the dialect recognition model is weighted by the loss function of the classifier, the loss function of the language discrimination layer, the loss function of the attribute discrimination layer, and the language attribute consistency loss function of the language discrimination layer and the attribute discrimination layer.
8. The method according to any one of claims 3-7, wherein said language discrimination layer is a neural network comprising a control gate; the number of layers of the neural network is more than 1;
the input of each layer of the neural network is obtained according to the output of the control gate and the characteristics of the output of the previous layer;
the input of the control gate is the vector output by the classifier corresponding to the feature output by the previous layer.
9. A multi-party identification arrangement, comprising:
the receiving module is used for receiving voice data;
the extraction module is used for extracting dialect recognition characteristics from the voice data;
the recognition module is used for inputting the dialect recognition characteristics into a preset dialect recognition model to obtain a recognition result of the voice data; the dialect recognition model is obtained by training a training corpus which comprises a plurality of dialects and is at least marked with voice content and the types of the dialects; wherein the dialect recognition model can recognize voice data of a plurality of dialects;
the dialect recognition model includes: a feature extractor, a classifier and a discriminator; wherein,
the input of the feature extractor is the dialect identification feature, and the output of the feature extractor is a characterization feature which is more distinctive than the dialect identification feature;
the input of the classifier is the characterization feature, and the output is the recognition result of the voice data;
and the input of the discriminator is the characterization feature, and the output is the dialect type to which the voice data belongs, or the output is the dialect type to which the voice data belongs and the dialect attribute category to which the voice data belongs.
10. A multi-party identification device comprising a memory and a processor;
the memory is used for storing programs;
the processor, configured to execute the program, implementing the steps of the multi-party identification method according to any of claims 1-8.
11. A readable storage medium, having stored thereon a computer program, characterized in that the computer program, when executed by a processor, carries out the steps of the multi-party identification method according to any one of claims 1-8.
CN201910852557.0A 2019-09-10 2019-09-10 Multi-party identification method, device, equipment and readable storage medium Active CN110517664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910852557.0A CN110517664B (en) 2019-09-10 2019-09-10 Multi-party identification method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910852557.0A CN110517664B (en) 2019-09-10 2019-09-10 Multi-party identification method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN110517664A CN110517664A (en) 2019-11-29
CN110517664B true CN110517664B (en) 2022-08-05

Family

ID=68632012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910852557.0A Active CN110517664B (en) 2019-09-10 2019-09-10 Multi-party identification method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN110517664B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105786B (en) * 2019-12-26 2022-10-18 思必驰科技股份有限公司 Multi-sampling-rate voice recognition method, device, system and storage medium
CN111292727B (en) * 2020-02-03 2023-03-24 北京声智科技有限公司 Voice recognition method and electronic equipment
CN111369981B (en) * 2020-03-02 2024-02-23 北京远鉴信息技术有限公司 Dialect region identification method and device, electronic equipment and storage medium
CN111460214B (en) * 2020-04-02 2024-04-19 北京字节跳动网络技术有限公司 Classification model training method, audio classification method, device, medium and equipment
CN111653274B (en) * 2020-04-17 2023-08-04 北京声智科技有限公司 Wake-up word recognition method, device and storage medium
CN111833844A (en) * 2020-07-28 2020-10-27 苏州思必驰信息科技有限公司 Training method and system of mixed model for speech recognition and language classification
CN111798836B (en) * 2020-08-03 2023-12-05 上海茂声智能科技有限公司 Method, device, system, equipment and storage medium for automatically switching languages
CN112017630B (en) * 2020-08-19 2022-04-01 北京字节跳动网络技术有限公司 Language identification method and device, electronic equipment and storage medium
CN113593525B (en) * 2021-01-26 2024-08-06 腾讯科技(深圳)有限公司 Accent classification model training and accent classification method, apparatus and storage medium
CN112908296A (en) * 2021-02-18 2021-06-04 上海工程技术大学 Dialect identification method
CN113053367B (en) * 2021-04-16 2023-10-10 北京百度网讯科技有限公司 Speech recognition method, speech recognition model training method and device
CN112951240B (en) * 2021-05-14 2021-10-29 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN117995166A (en) * 2024-01-26 2024-05-07 长沙通诺信息科技有限责任公司 Natural language data analysis method and system based on voice recognition

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578465A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Speech recognition method and electronic device
CN105632501A (en) * 2015-12-30 2016-06-01 中国科学院自动化研究所 Deep-learning-technology-based automatic accent classification method and apparatus
CN106537493A (en) * 2015-09-29 2017-03-22 深圳市全圣时代科技有限公司 Speech recognition system and method, client device and cloud server
CN106887226A (en) * 2017-04-07 2017-06-23 天津中科先进技术研究院有限公司 Speech recognition algorithm based on artificial intelligence recognition
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN110033756A (en) * 2019-04-15 2019-07-19 北京达佳互联信息技术有限公司 Language Identification, device, electronic equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120109649A1 (en) * 2010-11-01 2012-05-03 General Motors Llc Speech dialect classification for automatic speech recognition
CN103680493A (en) * 2013-12-19 2014-03-26 百度在线网络技术(北京)有限公司 Voice data recognition method and device for distinguishing regional accents
US9589564B2 (en) * 2014-02-05 2017-03-07 Google Inc. Multiple speech locale-specific hotword classifiers for selection of a speech locale
US10431206B2 (en) * 2016-08-22 2019-10-01 Google Llc Multi-accent speech recognition
CN108510976B (en) * 2017-02-24 2021-03-19 芋头科技(杭州)有限公司 Multi-language mixed voice recognition method
US10943601B2 (en) * 2017-05-31 2021-03-09 Lenovo (Singapore) Pte. Ltd. Provide output associated with a dialect
CN108682420B (en) * 2018-05-14 2023-07-07 平安科技(深圳)有限公司 Audio and video call dialect recognition method and terminal equipment
CN109979432B (en) * 2019-04-02 2021-10-08 科大讯飞股份有限公司 Dialect translation method and device
CN110033760B (en) * 2019-04-15 2021-01-29 北京百度网讯科技有限公司 Modeling method, device and equipment for speech recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578465A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Speech recognition method and electronic device
CN106537493A (en) * 2015-09-29 2017-03-22 深圳市全圣时代科技有限公司 Speech recognition system and method, client device and cloud server
CN105632501A (en) * 2015-12-30 2016-06-01 中国科学院自动化研究所 Deep-learning-technology-based automatic accent classification method and apparatus
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN106887226A (en) * 2017-04-07 2017-06-23 天津中科先进技术研究院有限公司 Speech recognition algorithm based on artificial intelligence recognition
CN110033756A (en) * 2019-04-15 2019-07-19 北京达佳互联信息技术有限公司 Language Identification, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110517664A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110517664B (en) Multi-party identification method, device, equipment and readable storage medium
US10643602B2 (en) Adversarial teacher-student learning for unsupervised domain adaptation
JP7005099B2 (en) Voice keyword recognition methods, devices, computer-readable storage media, and computer devices
CN106098059B (en) Customizable voice awakening method and system
CN110111775B (en) Streaming voice recognition method, device, equipment and storage medium
WO2021143326A1 (en) Speech recognition method and apparatus, and device and storage medium
Su et al. On-line active reward learning for policy optimisation in spoken dialogue systems
CN110275939B (en) Method and device for determining conversation generation model, storage medium and electronic equipment
CN111081230B (en) Speech recognition method and device
CN108711421A (en) A kind of voice recognition acoustic model method for building up and device and electronic equipment
CN108711420A (en) Multilingual hybrid model foundation, data capture method and device, electronic equipment
JP2017228160A (en) Dialog act estimation method, dialog act estimation device, and program
JP2017097162A (en) Keyword detection device, keyword detection method and computer program for keyword detection
CN102270450A (en) System and method of multi model adaptation and voice recognition
US11574637B1 (en) Spoken language understanding models
CN111144124B (en) Training method of machine learning model, intention recognition method, and related device and equipment
CN114596844B (en) Training method of acoustic model, voice recognition method and related equipment
CN113314119B (en) Voice recognition intelligent household control method and device
CN111402861A (en) Voice recognition method, device, equipment and storage medium
JPWO2019198306A1 (en) Estimator, learning device, estimation method, learning method and program
CN111883121A (en) Awakening method and device and electronic equipment
JP7531164B2 (en) Speech analysis device, speech analysis method, and program
WO2020151017A1 (en) Scalable field human-machine dialogue system state tracking method and device
KR20210051523A (en) Dialogue system by automatic domain classfication
Liu et al. Learning salient features for speech emotion recognition using CNN

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant