CN111640419A - Language identification method, system, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111640419A
Authority
CN
China
Prior art keywords
language
voice file
model
gaussian mixture
sample
Prior art date
Legal status
Granted
Application number
CN202010456194.1A
Other languages
Chinese (zh)
Other versions
CN111640419B (en)
Inventor
柳林
方磊
方四安
Current Assignee
Hefei Ustc Iflytek Co ltd
Original Assignee
Hefei Ustc Iflytek Co ltd
Priority date
Filing date
Publication date
Application filed by Hefei Ustc Iflytek Co ltd
Priority to CN202010456194.1A
Publication of CN111640419A
Application granted
Publication of CN111640419B


Classifications

    • G10L 15/005: Language recognition (under G10L 15/00 Speech recognition)
    • G06F 18/251: Fusion techniques of input or preprocessed data (under G06F 18/00 Pattern recognition)
    • G06F 40/263: Language identification (under G06F 40/00 Handling natural language data)
    • G10L 15/063: Training (under G10L 15/06 Creation of reference templates; training of speech recognition systems)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides a language identification method, system, electronic device, and storage medium. Following an ensemble-learning approach, a preset Gaussian mixture model is obtained at the model-algorithm level by fusing a first-class-language Gaussian mixture model with a second-class-language Gaussian mixture model. Fitting the data distribution of all languages with this fused model better preserves the fit to the minority-language data distribution and effectively prevents the distribution characteristics of minority-language data from being averaged out, and thereby masked, by majority-language data. The mean supervector that the preset Gaussian mixture model determines for a voice file therefore better reflects the distribution characteristics of the languages in that file. As a result, whether the language contained in a voice file belongs to a majority class or a minority class, it can be identified accurately by the language identification model.

Description

Language identification method, system, electronic equipment and storage medium
Technical Field
The present invention relates to the field of information identification technologies, and in particular, to a language identification method, system, electronic device, and storage medium.
Background
Language identification, as a mature speech technology, has been widely applied in many fields, such as public safety and military reconnaissance, artificial-intelligence front-end systems, and emergency rescue.
Currently, mainstream language identification methods use a Gaussian Mixture Model (GMM) as the basic framework and introduce recognition techniques such as discriminative models, factor analysis, and deep learning; examples include the SDC-GSV, SDC-TV, BN-GSV, and BN-TV systems. When the models used in these methods are trained, the data of each language in the training samples is distributed in roughly equal proportions, or the data proportion of any minority language with little data is no less than 10%, which ensures that the trained model can identify languages accurately.
However, in some special scenarios the collected minority-language data cannot reach 10% of the total, let alone match the amount of data available for the majority languages. A model trained on such data, with minority-language and majority-language samples mixed together, produces recognition results biased toward the majority languages and cannot identify the minority languages accurately.
Disclosure of Invention
To overcome the above problems or at least partially solve the above problems, embodiments of the present invention provide a language identification method, system, electronic device, and storage medium.
In a first aspect, an embodiment of the present invention provides a language identification method, including:
the method comprises: obtaining posterior features, used for characterizing languages, that correspond to the voice file to be recognized; and determining a mean supervector corresponding to the voice file based on these posterior features and a preset Gaussian mixture model;
inputting the mean value super vector into a language identification model to obtain an identification result output by the language identification model;
the preset Gaussian mixture model is obtained by fusing a first language Gaussian mixture model and a second language Gaussian mixture model;
the language identification model is trained on mean supervectors of first-class-language voice file samples carrying language labels and mean supervectors of second-class-language voice file samples carrying language labels; the mean supervector of a first-class-language voice file sample is determined from the posterior features, used for characterizing languages, corresponding to that sample together with the preset Gaussian mixture model, and the mean supervector of a second-class-language voice file sample is determined in the same way from its posterior features and the preset Gaussian mixture model.
Preferably, the preset gaussian mixture model is obtained based on the fusion of a first-language gaussian mixture model and a second-language gaussian mixture model, and specifically includes:
determining weights respectively corresponding to the first language-like Gaussian mixture model and the second language-like Gaussian mixture model based on a preset balance coefficient;
fusing the first-class-language Gaussian mixture model and the second-class-language Gaussian mixture model based on their respective weights to obtain the preset Gaussian mixture model;
and determining the balance coefficient based on the number of samples respectively corresponding to the first language type voice file sample and the second language type voice file sample.
Preferably, the determining of the balance coefficient is based on the number of samples respectively corresponding to the first language-like voice file sample and the second language-like voice file sample, and specifically includes:
determining the number ratio of the samples corresponding to the first language voice file sample and the second language voice file sample respectively based on the number of the samples corresponding to the first language voice file sample and the second language voice file sample respectively;
and determining the balance coefficient based on the information entropy value of the sample number ratio corresponding to the first language type voice file sample and the information entropy value of the sample number ratio corresponding to the second language type voice file sample.
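As a hedged illustration of this step, the balance coefficient might be computed as follows; the patent does not give the exact formula, so the normalized per-class entropy terms below are an assumption:

```python
import math

def balance_coefficient(n_major: int, n_minor: int) -> float:
    """Hypothetical sketch of the entropy-based balance coefficient.

    The coefficient is derived from the information entropy values of the
    two classes' sample-count ratios; this normalization is an assumption,
    not the patent's exact formula.
    """
    total = n_major + n_minor
    p_major = n_major / total  # sample-count ratio of the first (majority) class
    p_minor = n_minor / total  # sample-count ratio of the second (minority) class
    h_major = -p_major * math.log2(p_major)  # entropy term of the majority ratio
    h_minor = -p_minor * math.log2(p_minor)  # entropy term of the minority ratio
    # The rarer class contributes the larger entropy term, so this ratio
    # exceeds 0.5 when the minority class is small, which later boosts the
    # minority GMM's share of weight in the fusion.
    return h_minor / (h_major + h_minor)
```

For 950 majority and 50 minority samples this yields a coefficient of roughly 0.75, tilting the fused model toward the minority class.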
Preferably, the training process of the language identification model specifically includes:
clustering the mean value super-vector of the first language voice file sample to determine a plurality of clustering centers; the number of the clustering centers is determined based on the number of the second language type voice file samples;
and replacing the mean value super vector of the first language voice file sample with the mean value super vector corresponding to each cluster center, and training the language identification model based on the mean value super vector corresponding to each cluster center with a language label and the mean value super vector of the second language voice file sample with a language label.
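The cluster-and-replace step above can be sketched with a minimal Lloyd's k-means (numpy only; the random initialization and iteration count are implementation choices not specified in the text):

```python
import numpy as np

def downsample_majority(major_svs, n_centres, n_iter=20, seed=0):
    """Cluster majority-class mean supervectors and return the centres.

    Per the method, the number of centres is tied to the minority-class
    sample count, so replacing the majority supervectors with the centres
    balances the training set seen by the language identification model.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(major_svs, dtype=float)
    # initialise centres with randomly chosen supervectors
    centres = x[rng.choice(len(x), n_centres, replace=False)]
    for _ in range(n_iter):
        # assign each supervector to its nearest centre
        d = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centre to the mean of its assigned supervectors
        for k in range(n_centres):
            if np.any(labels == k):
                centres[k] = x[labels == k].mean(axis=0)
    return centres
```

Each returned centre then carries the majority-class language label and stands in for its cluster during training.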
Preferably, the number of the cluster centers is the same as the number of the second-language voice file samples.
Preferably, the language identification model comprises a plurality of language identification submodels, and each language identification submodel corresponds to a language respectively; correspondingly, the inputting the mean value super vector into a language identification model to obtain an identification result output by the language identification model specifically includes:
respectively inputting the mean value super vector into each language identification submodel, and respectively obtaining an identification result output by each language identification submodel;
and obtaining the recognition result output by the language recognition model according to the recognition result output by each language recognition submodel.
Preferably, the obtaining of the recognition result output by the language recognition model according to the recognition result output by each language recognition submodel specifically includes:
and comparing the recognition result output by each language recognition submodel with a preset threshold value respectively, and determining the language corresponding to the language recognition submodel with the output recognition result larger than the preset threshold value as the language of the voice file.
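A minimal sketch of this decision rule; the language names, score values, and threshold below are purely illustrative:

```python
def detect_languages(scores, threshold=0.5):
    """Compare each submodel's output with a preset threshold.

    `scores` maps a language to the score its submodel output for one
    voice file; every language whose score exceeds the (assumed)
    threshold is reported as a language of the file.
    """
    return [lang for lang, s in scores.items() if s > threshold]

# Usage: only the third submodel's score exceeds the threshold.
result = detect_languages({"zh": 0.31, "en": 0.12, "ne": 0.87})
```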
Preferably, the obtaining of the posterior feature corresponding to the voice file to be recognized in the language and used for representing the language specifically includes:
inputting the voice file to be recognized into a posterior feature extraction model to obtain posterior features which are used for representing languages and correspond to the voice file to be recognized and output by the posterior feature extraction model;
the posterior feature extraction model is obtained by training based on a voice file sample group and taking a preset measurement criterion target function as a loss function; the set of voice file samples includes an anchor voice file sample, a positive example voice file sample, and a negative example voice file sample.
Preferably, the loss function is determined based on the similarity between the anchor voice file sample and the positive example voice file sample in the voice file sample group, and on the similarity between the anchor voice file sample and the negative example voice file sample.
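Since the exact metric criterion is not specified, the following sketch assumes a cosine-similarity triplet hinge loss, which matches the stated dependence on anchor-positive and anchor-negative similarities; the margin value is an assumption:

```python
import numpy as np

def triplet_metric_loss(anchor, positive, negative, margin=0.2):
    """Hedged sketch of the preset metric-criterion objective.

    Encourages the anchor-positive similarity to exceed the
    anchor-negative similarity by at least `margin`.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # hinge: zero loss once the similarity gap is at least `margin`
    return max(0.0, margin - cos(anchor, positive) + cos(anchor, negative))
```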
In a second aspect, an embodiment of the present invention provides a language identification system, including an obtaining module and a processing module, wherein:
the obtaining module is configured to obtain posterior features, used for characterizing languages, that correspond to the voice file to be recognized, and to determine a mean supervector corresponding to the voice file based on these posterior features and a preset Gaussian mixture model;
the processing module is used for inputting the mean value super vector into a language identification model to obtain an identification result output by the language identification model;
the preset Gaussian mixture model is obtained by fusing a first language Gaussian mixture model and a second language Gaussian mixture model;
the language identification model is trained on mean supervectors of first-class-language voice file samples carrying language labels and mean supervectors of second-class-language voice file samples carrying language labels; the mean supervector of a first-class-language voice file sample is determined from the posterior features, used for characterizing languages, corresponding to that sample together with the preset Gaussian mixture model, and the mean supervector of a second-class-language voice file sample is determined in the same way from its posterior features and the preset Gaussian mixture model.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the language identification method according to the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the language identification method according to the first aspect.
The language identification method, system, electronic device, and storage medium provided by the embodiments of the invention adopt an ensemble-learning approach: at the model-algorithm level, a first-class-language Gaussian mixture model and a second-class-language Gaussian mixture model are fused into a preset Gaussian mixture model, which is used to fit the data distribution of all languages. Compared with the prior-art practice of building a single Gaussian mixture model from majority-language and minority-language data mixed indiscriminately, this better preserves the fit to the minority-language data distribution and effectively prevents the distribution characteristics of minority-language data from being averaged out, and thereby masked, by majority-language data. The mean supervector determined for a voice file by the preset Gaussian mixture model therefore better reflects the distribution characteristics of the languages in the file. As a result, whether the language contained in a voice file belongs to a majority class or a minority class, it can be identified accurately by the language identification model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a language identification method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a method for determining a preset gaussian mixture model in a language identification method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart illustrating a method for determining a balance coefficient in a language identification method according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart illustrating a training method of a language identification model in the language identification method according to the embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a language identification process of a language identification model in the language identification method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a training process of models applied in a language identification method according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a language identification system according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, language recognition generally uses a speech recognition system that takes a Gaussian mixture model as the basic framework and combines recognition techniques such as discriminative models, factor analysis, and deep learning. Before such a system is applied, its models must be trained, and suitable training samples are needed so that the resulting models identify languages accurately. In some special scenarios, however, for example when the numbers of users of different languages are severely imbalanced, only training samples with greatly skewed language proportions can be obtained, e.g. samples in which minority-class languages account for less than 10% of the data. A model trained on such samples produces recognition results biased toward the majority languages, and its practical effectiveness drops sharply because the minority languages cannot be identified accurately. Since the minority-class languages are often precisely the target languages of interest, this loss of accuracy can seriously undermine the usability of the system and keep its recognition performance below a practical level. The embodiments of the present invention therefore provide a language identification method.
Fig. 1 is a schematic flow chart of a language identification method according to an embodiment of the present invention. As shown in fig. 1, the language identification method specifically includes:
s11, obtaining posterior features corresponding to the voice file to be recognized in the language and used for representing the language, and determining a mean value super vector corresponding to the voice file based on the posterior features corresponding to the voice file and used for representing the language and a preset Gaussian mixture model;
s12, inputting the mean value super vector into a language identification model to obtain an identification result output by the language identification model;
the preset Gaussian mixture model is obtained by fusing a first language Gaussian mixture model and a second language Gaussian mixture model;
the language identification model is trained on mean supervectors of first-class-language voice file samples carrying language labels and mean supervectors of second-class-language voice file samples carrying language labels; the mean supervector of a first-class-language voice file sample is determined from the posterior features, used for characterizing languages, corresponding to that sample together with the preset Gaussian mixture model, and the mean supervector of a second-class-language voice file sample is determined in the same way from its posterior features and the preset Gaussian mixture model.
Specifically, step S11 is executed first. The voice file to be recognized is a voice file containing the language to be recognized. The posterior features corresponding to this voice file may specifically be posterior probabilities used for characterizing different languages; that is, different languages can be distinguished through the posterior features. The posterior features may be extracted by a conventional posterior feature extraction network, which is not specifically limited in the embodiment of the present invention.
It should be noted that the voice file may include a plurality of frames, and the posterior features may be extracted in units of frames, that is, each frame of the voice file has a posterior feature, so that the posterior features corresponding to the voice file may be understood as a posterior feature matrix formed by the posterior features of all the frames included in the voice file. The dimension of the posterior feature of each frame of the voice file can be set according to the requirement, for example, the dimension can be set to 56, that is, the posterior feature of each frame of the voice file is represented by 56-dimensional data.
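As a concrete illustration of these shapes (a minimal sketch; the frame count is arbitrary and the 56-dimensional example comes from the text):

```python
import numpy as np

# A voice file with T frames yields a T x 56 posterior-feature matrix:
# one 56-dimensional posterior vector per frame (T = 300 is illustrative).
T = 300
frame_posteriors = np.random.rand(T, 56)
# Posterior probabilities within each frame should sum to 1.
frame_posteriors /= frame_posteriors.sum(axis=1, keepdims=True)
```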
The preset Gaussian mixture model in the embodiment of the invention is obtained by fusing a first-class-language Gaussian mixture model and a second-class-language Gaussian mixture model. The first-class-language Gaussian mixture model is the Gaussian mixture model corresponding to the first class of languages and may specifically be established from the posterior features corresponding to the first-class-language voice file samples. The second-class-language Gaussian mixture model is the Gaussian mixture model corresponding to the second class of languages and may specifically be established from the posterior features corresponding to the second-class-language voice file samples. The first class of languages may be languages with large amounts of data, i.e. majority-class languages, such as Chinese and English. The second class of languages may be languages with small amounts of data, i.e. minority-class languages, such as Nepali. Each class includes at least one language.
For example, the first-class-language voice file samples may be denoted d_maj = {x_1, x_2, ..., x_i, ..., x_n} and the second-class-language voice file samples d_min = {y_1, y_2, ..., y_j, ..., y_m}, where n is the number of first-class samples, m is the number of second-class samples, and n is far larger than m; x_i denotes the i-th first-class sample and y_j the j-th second-class sample. Together, d_maj and d_min constitute the unbalanced data set D = {d_maj, d_min}.
The first language gaussian mixture model specifically comprises a plurality of gaussian models, each gaussian model corresponds to a plurality of samples in the first language voice file samples, and is specifically established through posterior features corresponding to the plurality of samples. The mean value of the Gaussian model is the mean value of the posterior features corresponding to the samples, and the variance of the Gaussian model is the variance of the posterior features corresponding to the samples. Each Gaussian model in the first language Gaussian mixture model is weighted, and all Gaussian models in the first language Gaussian mixture model are balanced through the weights.
Similarly, the second-language gaussian mixture model specifically includes a plurality of gaussian models, each gaussian model corresponds to a plurality of samples in the second-language voice file samples, and is specifically established by posterior features corresponding to the plurality of samples. The mean value of the Gaussian model is the mean value of the posterior features corresponding to the samples, and the variance of the Gaussian model is the variance of the posterior features corresponding to the samples. And each Gaussian model in the second language Gaussian mixture model is provided with a weight, and all Gaussian models in the second language Gaussian mixture model are balanced through the weight.
The method for determining the preset gaussian mixture model specifically may be: adjusting the weight of each Gaussian model in the first language Gaussian mixture model and the weight of each Gaussian model in the second language Gaussian mixture model; and determining a preset Gaussian mixture model based on the first language Gaussian mixture model and the second language Gaussian mixture model after the weight is adjusted. The weight may be adjusted as needed, which is not specifically limited in the embodiment of the present invention. The preset gaussian mixture model is determined based on the first-language gaussian mixture model and the second-language gaussian mixture model after the weight is adjusted, specifically, the first-language gaussian mixture model and the second-language gaussian mixture model after the weight is adjusted are combined according to the weight to obtain the preset gaussian mixture model, and the obtained preset gaussian mixture model simultaneously contains all gaussian models in the first-language gaussian mixture model and all gaussian models in the second-language gaussian mixture model. It should be noted that, the preset gaussian mixture model is not specific to a specific language, is a gaussian mixture model unrelated to the specific language, and is a basic model for language identification.
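A minimal sketch of this weight-scaled fusion for diagonal-covariance GMMs represented as plain arrays; the balance coefficient `alpha` and the simple linear weight scaling are assumptions, and the covariance arrays would be pooled the same way as the means:

```python
import numpy as np

def fuse_gmms(w_maj, mu_maj, w_min, mu_min, alpha):
    """Fuse two GMMs at the model level into one preset GMM.

    The majority GMM's component weights are scaled by (1 - alpha) and
    the minority GMM's by alpha, then all components are pooled so the
    fused model contains every Gaussian from both class-wise models.
    """
    w = np.concatenate([(1.0 - alpha) * w_maj, alpha * w_min])
    mu = np.vstack([mu_maj, mu_min])   # pooled component means
    return w / w.sum(), mu             # renormalise weights to sum to 1
```

With a balance coefficient above 0.5, the minority components keep a disproportionately large share of the fused weight, which is the intended effect.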
The mean value super vector corresponding to the voice file is a vector used for representing the distribution characteristics of the languages contained in the voice file, and the mean value super vectors corresponding to different voice files are different. The method for determining the mean value super vector corresponding to the voice file may specifically be: updating a preset Gaussian mixture model based on the posterior characteristics corresponding to the voice file; and cascading the mean values of all the Gaussian models contained in the updated preset Gaussian mixture model to obtain the mean value super vector corresponding to the voice file. The preset gaussian mixture model may be updated by using a Maximum A Posteriori (MAP) algorithm, or may be implemented by using other algorithms, which is not specifically limited in the embodiment of the present invention.
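The update-then-concatenate step can be sketched as follows, assuming mean-only MAP adaptation with a relevance factor `r` (a common choice; as noted above, the text leaves the exact algorithm open):

```python
import numpy as np

def mean_supervector(features, weights, means, covars, r=16.0):
    """MAP-adapt the preset GMM's means to one voice file, then concatenate.

    `features` is the file's (T, D) posterior-feature matrix; `weights`,
    `means`, `covars` describe a C-component diagonal GMM. Returns the
    C*D mean supervector. The relevance factor `r` is an assumed
    hyperparameter of the MAP update.
    """
    # responsibility of each component for each frame (diagonal Gaussians)
    diff = features[:, None, :] - means[None, :, :]                # (T, C, D)
    log_p = -0.5 * np.sum(diff**2 / covars + np.log(2*np.pi*covars), axis=2)
    log_p += np.log(weights)
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma = p / p.sum(axis=1, keepdims=True)                       # (T, C)
    n = gamma.sum(axis=0)                                          # soft counts
    ex = gamma.T @ features / np.maximum(n, 1e-10)[:, None]        # data means
    a = (n / (n + r))[:, None]                  # MAP interpolation factor
    adapted = a * ex + (1.0 - a) * means        # shift means toward the data
    return adapted.reshape(-1)                  # concatenated C*D supervector
```

Components that the file's features barely touch keep their prior means, so the supervector encodes exactly how this file's language shifts the fused model.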
Then, step S12 is executed. The language identification model is used for identifying the languages to be identified in the voice file based on the mean value super vector corresponding to the voice file. The recognition result output by the language recognition model can indicate which language the language to be recognized contained in the voice file belongs to.
The language identification model is obtained by training an initial model on the mean supervectors of first-class-language voice file samples with language labels and the mean supervectors of second-class-language voice file samples with language labels. During training, posterior features are first extracted from the pre-collected first-class and second-class voice file samples with language labels. The preset Gaussian mixture model is then updated with the posterior features corresponding to each first-class-language sample, and the means of all Gaussian models contained in the updated preset Gaussian mixture model are concatenated to obtain that sample's mean supervector; the mean supervector of each second-class-language sample is obtained in the same way from its posterior features. The algorithm used for these updates may specifically be the MAP algorithm or another algorithm, which is not specifically limited in the embodiment of the present invention.
The language identification method provided by the embodiment of the invention adopts an ensemble-learning approach: at the model-algorithm level, a first-class-language Gaussian mixture model and a second-class-language Gaussian mixture model are fused into a preset Gaussian mixture model, which is used to fit the data distribution of all languages. Compared with the prior-art practice of building a single Gaussian mixture model from majority-language and minority-language data mixed indiscriminately, this better preserves the fit to the minority-language data distribution and effectively prevents the distribution characteristics of minority-language data from being averaged out, and thereby masked, by majority-language data. The mean supervector determined for a voice file by the preset Gaussian mixture model therefore better reflects the distribution characteristics of the languages in the file. As a result, whether the language contained in a voice file belongs to a majority class or a minority class, it can be identified accurately by the language identification model.
On the basis of the above embodiment, the first language gaussian mixture model may be specifically established by using an Expectation-Maximization (EM) algorithm based on posterior features corresponding to the first language voice file samples; the second-language Gaussian mixture model can be specifically established by adopting an EM algorithm based on the posterior features corresponding to the second-language voice file samples.
For example, the posterior features corresponding to all samples in the first-language voice file samples form a posterior feature set F_maj, and the first-language Gaussian mixture model f_maj is established from F_maj by the EM algorithm. Likewise, the posterior features corresponding to all samples in the second-language voice file samples form a posterior feature set F_min, and the second-language Gaussian mixture model f_min is established from F_min by the EM algorithm. The mixture order of f_maj and of f_min, i.e. the number of Gaussian models each contains, can be set as desired; for example, the mixture order of f_maj may be 192 and that of f_min may be 64.
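The EM step above can be sketched as follows. This is an illustrative sketch only: scikit-learn's `GaussianMixture` stands in for the EM fitting described in the embodiment, the data is synthetic, and the mixture orders are reduced from 192/64 so the example runs quickly.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Illustrative stand-ins for the posterior-feature sets F_maj and F_min:
# 56-dimensional features for majority- and minority-language samples.
F_maj = rng.normal(0.0, 1.0, size=(5000, 56))
F_min = rng.normal(0.5, 1.0, size=(800, 56))

# EM-fitted GMMs; mixture orders reduced here (8 and 4) for speed.
f_maj = GaussianMixture(n_components=8, covariance_type="diag",
                        random_state=0).fit(F_maj)
f_min = GaussianMixture(n_components=4, covariance_type="diag",
                        random_state=0).fit(F_min)

print(f_maj.means_.shape)   # (8, 56): one mean vector per Gaussian component
print(round(f_min.weights_.sum(), 6))  # component weights sum to 1
```

Each fitted model exposes its component weights, means and variances, which is all that the fusion step described below needs.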
On the basis of the foregoing embodiment, fig. 2 is a schematic flow chart of a method for determining a preset gaussian mixture model in the language identification method provided in the embodiment of the present invention. As shown in fig. 2, the preset gaussian mixture model is obtained based on the fusion of the first-language gaussian mixture model and the second-language gaussian mixture model, and specifically includes:
s21, determining weights respectively corresponding to the first language Gaussian mixture model and the second language Gaussian mixture model based on a preset balance coefficient;
s22, fusing the gaussian mixture models of the first and second languages based on the weights corresponding to the gaussian mixture models of the first and second languages, respectively, to obtain the preset gaussian mixture model;
wherein the balance coefficient is determined based on the numbers of samples respectively corresponding to the first-language voice file samples and the second-language voice file samples.
Specifically, when determining the preset Gaussian mixture model, step S21 is first executed. Before that, the balance coefficient may be determined according to the numbers of samples respectively corresponding to the first-language and second-language voice file samples, so as to adjust the weights respectively corresponding to the first-language and second-language Gaussian mixture models. The balance coefficient may be determined by measuring the scarcity of the minority language from these sample counts; for example, if the first-class language is the majority language and the second-class language is the minority language, the ratio of the number of samples in the second-language voice file samples to the total number of samples in the first-language and second-language voice file samples may be calculated and used as the scarcity of the second-class language. In the embodiment of the invention, the value of the balance coefficient is greater than 0 and less than 1.
When determining the weights corresponding to the first-language gaussian mixture model and the second-language gaussian mixture model respectively according to the balance coefficient, the balance coefficient may be specifically used as the coefficient of the second-language gaussian mixture model, and then the difference between 1 and the balance coefficient is used as the coefficient of the first-language gaussian mixture model. Further, the weight of each gaussian model in the gaussian mixture model after the weight is adjusted by the balance coefficient is a product of the coefficient of the gaussian mixture model and the weight of each gaussian model in the gaussian mixture model before the weight is adjusted. Therefore, the process of determining the weights corresponding to the first-language gaussian mixture model and the second-language gaussian mixture model respectively according to the balance coefficient can be regarded as a process of normalizing the weights of the gaussian models in the first-language gaussian mixture model and the second-language gaussian mixture model.
Then, step S22 is executed. Specifically, the first language gaussian mixture model and the second language gaussian mixture model are fused according to the weights respectively corresponding to the first language gaussian mixture model and the second language gaussian mixture model determined in step S21. The specific way of fusion may be according to weight combination, and the obtained preset gaussian mixture model simultaneously includes all gaussian models in the first-type language gaussian mixture model and all gaussian models in the second-type language gaussian mixture model. It should be noted that, in the fusion process, the mean and the variance of each gaussian model in the first-class language gaussian mixture model and each gaussian model in the second-class language gaussian mixture model are kept unchanged.
For example, the weights of the Gaussian models in the first-language Gaussian mixture model f_maj form a weight set w_maj = [w_maj^1, w_maj^2, …, w_maj^192], and the weights of the Gaussian models in the second-language Gaussian mixture model f_min form a weight set w_min = [w_min^1, w_min^2, …, w_min^64], where w_maj^1, w_maj^2, …, w_maj^192 are respectively the weights of the 1st, 2nd, …, 192nd Gaussian models of f_maj, and w_min^1, w_min^2, …, w_min^64 are respectively the weights of the 1st, 2nd, …, 64th Gaussian models of f_min. Given the preset balance coefficient α, the weight corresponding to the first-language Gaussian mixture model f_maj may be expressed as (1 - α)·w_maj, and the weight corresponding to the second-language Gaussian mixture model f_min may be expressed as α·w_min.

The weight set formed by the weights of the Gaussian models in the fused preset Gaussian mixture model f_r can then be represented as w_r = [(1 - α)·w_maj, α·w_min], a 256 × 1 vector.
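The weight fusion in the example above can be sketched in a few lines. This is a minimal sketch assuming uniform component weights for illustration; in the embodiment the weights come from the EM-fitted models, and the means and variances of all components are carried over unchanged.

```python
import numpy as np

def fuse_gmm_weights(w_maj, w_min, alpha):
    """Fuse the component weights of two GMMs with balance coefficient alpha:
    majority components are scaled by (1 - alpha), minority components by
    alpha, so the fused weight vector still sums to 1."""
    return np.concatenate([(1.0 - alpha) * np.asarray(w_maj),
                           alpha * np.asarray(w_min)])

# Illustrative uniform weights: 192 majority + 64 minority components.
w_maj = np.full(192, 1.0 / 192)
w_min = np.full(64, 1.0 / 64)
w_r = fuse_gmm_weights(w_maj, w_min, alpha=0.3)

print(w_r.shape)            # (256,): the 256*1 fused weight set w_r
print(round(w_r.sum(), 6))  # 1.0: fused weights remain normalized
```

Because each original weight set sums to 1, scaling by (1 - α) and α guarantees the fused set is again a valid mixture weight vector.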
The language identification method provided by the embodiment of the invention determines the weights respectively corresponding to the first-language and second-language Gaussian mixture models through a preset balance coefficient to obtain the preset Gaussian mixture model. Because the balance coefficient takes into account the scarcity of the minority language, as determined by the numbers of samples corresponding to the first-language and second-language voice file samples, the resulting preset Gaussian mixture model fits the minority-language data distribution more prominently and more effectively prevents the distribution characteristics of the minority-language data from being averaged out, and thus masked, by the majority-language data.
On the basis of the foregoing embodiment, fig. 3 is a schematic flow chart illustrating a method for determining the balance coefficient in the language identification method according to an embodiment of the present invention. As shown in fig. 3, the determining of the balance coefficient based on the numbers of samples respectively corresponding to the first-language voice file samples and the second-language voice file samples specifically includes:
s31, determining the ratio of the sample numbers respectively corresponding to the first language type voice file sample and the second language type voice file sample based on the sample numbers respectively corresponding to the first language type voice file sample and the second language type voice file sample;
s32, determining the balance coefficient based on the information entropy value of the sample number ratio corresponding to the first language type voice file sample and the information entropy value of the sample number ratio corresponding to the second language type voice file sample.
Specifically, when determining the balance coefficient, step S31 is first performed. The sample number ratio of the first-language voice file samples is the ratio of the number of samples in the first-language voice file samples to the total number of samples in the first-language and second-language voice file samples; the sample number ratio of the second-language voice file samples is the ratio of the number of samples in the second-language voice file samples to that same total. For example, if the first-language voice file samples number n and the second-language voice file samples number m, the sample number ratio of the first-language voice file samples may be represented as n/(n + m), and that of the second-language voice file samples may be represented as p = m/(n + m). Taking the sample number ratio p of the second-language voice file samples as the basis, the sample number ratio of the first-language voice file samples can be expressed as 1 - p.
Then, step S32 is executed. Based on the sample number ratio 1 - p of the first-language voice file samples and the sample number ratio p of the second-language voice file samples determined in step S31, the information entropy value of the sample number ratio corresponding to the first-language voice file samples may be represented as -(1 - p)·log(1 - p), and the information entropy value of the sample number ratio corresponding to the second-language voice file samples may be represented as -p·log(p).
When the balance coefficient is determined from -(1 - p)·log(1 - p) and -p·log(p), the proportion of the total information entropy contributed by the minority-language voice file samples can specifically be used. That is, when the second-class language is the minority language, the balance coefficient α may be expressed as:

α = [-p·log(p)] / [-(1 - p)·log(1 - p) - p·log(p)]
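The entropy-ratio computation described above can be sketched as a small function. This is a sketch assuming the entropy-proportion form of the balance coefficient; the sample counts are illustrative.

```python
import math

def balance_coefficient(n_majority, m_minority):
    """Balance coefficient alpha: the minority class's share of the total
    information entropy of the two sample-number ratios."""
    p = m_minority / (n_majority + m_minority)   # minority sample-number ratio
    h_min = -p * math.log(p)                     # entropy term for minority
    h_maj = -(1.0 - p) * math.log(1.0 - p)       # entropy term for majority
    return h_min / (h_maj + h_min)

alpha = balance_coefficient(9000, 1000)  # p = 0.1
print(0.0 < alpha < 1.0)  # True: alpha always lies in (0, 1)
```

A useful property of this form is that the scarcer the minority language, the closer α gets to 1, so the minority Gaussian mixture model receives relatively more weight in the fusion.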
The language identification method provided by the embodiment of the invention provides a specific method for determining the balance coefficient: the information entropy values of the sample number ratios respectively corresponding to the first-language and second-language voice file samples are calculated, and the balance coefficient is expressed as the proportion of the information entropy contributed by the minority-language voice file samples. The balance coefficient obtained in this way is better suited to adjusting the weights of the Gaussian models in the first-language and second-language Gaussian mixture models, so that the resulting preset Gaussian mixture model fits the minority-language data distribution more prominently and more effectively prevents the distribution characteristics of the minority-language data from being averaged out, and thus masked, by the majority-language data.
Because each sample in the first-language voice file samples corresponds to a mean supervector, the mean supervector of the first-language voice file samples is actually a set formed by the mean supervectors corresponding to all samples included in the first-language voice file samples. Similarly, each sample in the second-language voice file samples corresponds to a mean supervector, and the mean supervector of the second-language voice file samples is actually a set formed by the mean supervectors corresponding to all samples included in the second-language voice file samples.
For example, the mean supervector of the first-language voice file samples may be expressed as the set S_maj = {s_maj^1, s_maj^2, …, s_maj^i, …, s_maj^n}, and the mean supervector of the second-language voice file samples may be expressed as the set S_min = {s_min^1, s_min^2, …, s_min^j, …, s_min^m}, where s_maj^1, s_maj^2, …, s_maj^i, …, s_maj^n respectively represent the mean supervectors of the 1st, 2nd, …, i-th, …, n-th samples in the first-language voice file samples, and s_min^1, s_min^2, …, s_min^j, …, s_min^m respectively represent the mean supervectors of the 1st, 2nd, …, j-th, …, m-th samples in the second-language voice file samples. If the posterior feature corresponding to each sample is 56-dimensional, the mean supervector of each sample is 256 × 56-dimensional.
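The MAP mean-adaptation step that produces a mean supervector from one file's posterior features can be sketched as follows. This is a sketch of standard relevance-MAP adaptation of a diagonal-covariance GMM, not the patent's exact procedure; the relevance factor `tau`, the synthetic parameters, and the frame count are illustrative assumptions.

```python
import numpy as np

def map_mean_supervector(X, weights, means, variances, tau=16.0):
    """MAP-adapt only the means of a diagonal-covariance GMM to the posterior
    features X (frames x dims) of one voice file, then concatenate the
    adapted means into a mean supervector."""
    # Per-frame, per-component log-likelihoods of the diagonal Gaussians.
    diff = X[:, None, :] - means[None, :, :]                       # (T, C, D)
    log_p = -0.5 * np.sum(diff ** 2 / variances
                          + np.log(2 * np.pi * variances), axis=2) # (T, C)
    log_p += np.log(weights)
    # Posterior responsibilities gamma (softmax over components).
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Zeroth- and first-order sufficient statistics.
    n_c = gamma.sum(axis=0)                                        # (C,)
    f_c = gamma.T @ X                                              # (C, D)
    # MAP update of the means; weights and variances stay fixed.
    adapted = (tau * means + f_c) / (tau + n_c)[:, None]
    return adapted.reshape(-1)   # concatenated means: (C*D,) supervector

rng = np.random.default_rng(1)
C, D = 256, 56                   # 256 components, 56-dim posterior features
weights = np.full(C, 1.0 / C)
means = rng.normal(size=(C, D))
variances = np.ones((C, D))
sv = map_mean_supervector(rng.normal(size=(100, D)), weights, means, variances)
print(sv.shape)  # (14336,) = 256 * 56, matching the dimensions above
```

Components with little accumulated responsibility stay close to their prior means, which is what keeps the supervector stable for short files.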
Because the number of samples in the first-language voice file samples differs from the number of samples in the second-language voice file samples, often by orders of magnitude, the numbers of mean supervectors of the two classes are likewise not of the same order of magnitude. If the mean supervectors of the first-language and second-language voice file samples were used directly in subsequent training, the classification hyperplane of the trained language identification model would deviate toward the majority language, and the model would misidentify the minority language, which is detrimental to the wide application of the language identification method. Therefore, the embodiment of the invention improves the training method of the language identification model to avoid these problems.
Based on the foregoing embodiment, fig. 4 is a schematic flow chart of a training method of the language identification model in the language identification method provided in the embodiment of the present invention. As shown in fig. 4, the training process of the language identification model specifically includes:
s41, clustering the mean value super vector of the first language voice file sample to determine a plurality of clustering centers; the number of the clustering centers is determined based on the number of the second language type voice file samples;
s42, replacing the mean value super vector of the first kind of language voice file sample with the mean value super vector corresponding to each cluster center, and training the language identification model based on the mean value super vector corresponding to each cluster center with language label and the mean value super vector of the second kind of language voice file sample with language label.
Specifically, step S41 is performed first. The clustering process can be realized by k-means unsupervised clustering, mean-shift clustering, density-based clustering, expectation-maximization clustering based on a Gaussian mixture model, agglomerative hierarchical clustering, graph community detection, or other clustering methods. The criterion of the clustering process is to make the number of obtained cluster centers equal to the number of samples in the second-language voice file samples, where "equal" can be understood as being of the same order of magnitude; this greatly reduces the gap between the numbers of first-language and second-language samples in the training samples of the language identification model.
And then, executing step S42, and replacing the mean value super vector of the first language voice file sample with the mean value super vector corresponding to the cluster center determined in step S41. That is, the mean value supervectors of all samples belonging to a certain clustering center in the first language voice file samples are replaced by the mean value supervectors corresponding to the clustering center, and the mean value supervectors corresponding to the clustering center are used as samples related to the first language in the training samples of the language identification model. Then, still taking the mean value super vector of the second kind of language voice file sample as the sample related to the second kind of language in the training sample of the language identification model, and training the language identification model by taking the mean value super vector corresponding to each clustering center with the language label and the mean value super vector of the second kind of language voice file sample with the language label as a positive example and a negative example.
It should be noted that, because each cluster center corresponds to one mean supervector, the mean supervector corresponding to the cluster centers determined in step S41 is a set formed by the mean supervectors corresponding to all cluster centers. For example, if the number of cluster centers obtained by the k-means unsupervised clustering method is k, the mean supervectors corresponding to the cluster centers can be expressed as the set C = {c^1, c^2, …, c^i, …, c^k}, where c^1, c^2, …, c^i, …, c^k are respectively the mean supervectors corresponding to the 1st, 2nd, …, i-th, …, k-th cluster centers.
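The k-means reduction of the majority-class supervectors described above can be sketched as follows. This is an illustrative sketch using scikit-learn's `KMeans` on synthetic data; the sample counts and the reduced supervector dimensionality are assumptions for speed.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Illustrative majority-class mean supervectors (n samples) clustered down
# to k centers, with k chosen equal to the minority sample count m.
n, m, dim = 2000, 50, 128        # dim reduced from 256*56 for this sketch
S_maj = rng.normal(size=(n, dim))

k = m                            # one cluster center per minority sample
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(S_maj)
C_centers = km.cluster_centers_  # replaces the n majority supervectors

print(C_centers.shape)  # (50, 128): majority class reduced to k supervectors
```

The k center supervectors then stand in for the whole majority class when training the language identification model, balancing the two classes.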
The language identification method provided by the embodiment of the invention clusters the mean value super vector of the first language voice file sample on a data level, takes the mean value super vector corresponding to a clustering center as the mean value super vector of the first language voice file sample, and trains the language identification model by combining the mean value super vector of the second language voice file sample, so that the problem of abnormal deviation of the classification hyperplane of the language identification model caused by redundant information and abnormal information of the first language can be effectively reduced, and the identification accuracy of the trained language identification model to the second language is greatly improved.
On the basis of the above embodiment, the number of cluster centers may be made exactly the same as the number of samples in the second-language voice file samples, which completely eliminates the gap between the numbers of first-language and second-language samples in the training samples of the language identification model and further improves the accuracy of the language identification model in identifying the second-class language.
On the basis of the foregoing embodiment, the language identification model adopted in the embodiment of the present invention may be a language identification model implemented based on a Support Vector Machine (SVM) algorithm, that is, the language identification model is constructed based on an SVM model.
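A minimal sketch of such an SVM-based language identification model follows, using scikit-learn's `SVC` on synthetic supervectors; the class counts, dimensionality, and linear kernel are illustrative assumptions, not details from the embodiment.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Illustrative training set: 50 cluster-center supervectors for the majority
# language (label 0) and 50 minority-language supervectors (label 1).
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 64)),
               rng.normal(1.5, 1.0, size=(50, 64))])
y = np.array([0] * 50 + [1] * 50)

# Linear-kernel SVM as the language identification model.
svm = SVC(kernel="linear").fit(X, y)

# decision_function gives the signed score used as the recognition result.
score = svm.decision_function(rng.normal(1.5, 1.0, size=(1, 64)))
print(score.shape)  # (1,): one score for the single input supervector
```

The signed distance to the hyperplane serves naturally as the per-language score discussed in the following paragraphs.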
On the basis of the foregoing embodiment, the recognition result output by the language recognition model may be represented by a score indicating that the language to be recognized included in the voice file specifically belongs to a certain language, and the language corresponding to the highest score corresponds to the language recognized by the language recognition model. Where the score may also be replaced by a probability.
On the basis of the above embodiment, the language identification model includes a plurality of language identification submodels, and each language identification submodel corresponds to a language respectively. Fig. 5 is a schematic diagram illustrating a language identification process of a language identification model in the language identification method according to the embodiment of the present invention. As shown in fig. 5, the inputting the mean value super vector into the language identification model to obtain the identification result output by the language identification model specifically includes:
s51, inputting the mean value super vector into each language identification submodel respectively, and obtaining the identification result output by each language identification submodel respectively;
and S52, obtaining the recognition result output by the language recognition model according to the recognition result output by each language recognition sub-model.
Specifically, the language identification model may be composed of a plurality of language identification submodels, each corresponding to one language. Step S51 is executed first, the mean value super vector corresponding to the voice file is input into each language identification sub-model, and the identification result output by each language identification sub-model may be the score of the language to be identified contained in the voice file belonging to the language corresponding to the language identification sub-model.
Then, step S52 is executed to determine the recognition result output by the language recognition model according to the recognition result output by each language recognition submodel. The determining of the recognition result output by the language recognition model may specifically be comparing the recognition results output by the language recognition submodels with each other, and determining the language corresponding to the language recognition submodel corresponding to the largest recognition result as the language recognized by the language recognition model. In addition, other methods may also be used to determine the recognition result output by the language recognition model, which is not specifically limited in the embodiment of the present invention.
Before step S51, each language identification submodel may be obtained by pre-training. Specifically, in the unbalanced data set D formed by the first-language and second-language voice file samples, voice file samples belonging to different languages serve as positive and negative examples for one another, and an appropriate loss function is used to train the initial model. For example, taking the voice file samples belonging to language B in the unbalanced data set D as positive examples, and the voice file samples belonging to languages other than B in D as negative examples, the language identification submodel corresponding to language B can be trained.
According to the language identification method provided by the embodiment of the invention, the specific structure of the language identification model is limited to comprise a plurality of language identification submodels corresponding to a single language, an identification result is output through each language identification submodel, and then the identification result output by the language identification model is obtained through the identification result output by the language identification submodels, so that the language identification process is clearer, and the identification result output by the language identification model is more accurate.
On the basis of the foregoing embodiment, the obtaining, according to the recognition result output by each language recognition submodel, the recognition result output by the language recognition submodel specifically includes:
and comparing the recognition result output by each language recognition submodel with a preset threshold value respectively, and determining the language corresponding to the language recognition submodel with the output recognition result larger than the preset threshold value as the language of the voice file.
Specifically, in the embodiment of the present invention, a preset threshold may be given in advance, then a size relationship between the recognition result output by each language recognition submodel and the preset threshold is determined, and a language corresponding to the language recognition submodel corresponding to the recognition result larger than the preset threshold is used as the recognition result output by the language recognition model. The preset threshold value can be set as required, and it is ensured that only one language identification submodel outputs an identification result greater than the preset threshold value in each identification action.
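The threshold-based decision just described can be sketched as a small function. The language codes, the threshold value, and the highest-score fallback for ties or misses are illustrative assumptions; the embodiment assumes exactly one submodel exceeds the threshold per recognition action.

```python
def decide_language(scores, threshold=0.0):
    """Pick the language whose one-vs-rest submodel score exceeds the preset
    threshold; if several do (or none), fall back to the highest score.
    `scores` maps language name -> submodel recognition result."""
    above = {lang: s for lang, s in scores.items() if s > threshold}
    candidates = above if above else scores
    return max(candidates, key=candidates.get)

scores = {"zh": -0.8, "ug": 1.3, "kk": -0.2}
print(decide_language(scores))  # ug: the only submodel above the threshold
```

Setting the threshold appropriately is what guarantees the single-winner property the embodiment relies on.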
According to the language identification method provided by the embodiment of the invention, the identification result output by the language identification model is determined by combining the identification result output by the language identification submodel with the preset threshold value, so that the language identification process is clearer, and the identification result output by the language identification model is more accurate.
On the basis of the foregoing embodiment, the obtaining of the posterior feature used for representing the language and corresponding to the voice file to be recognized specifically includes:
inputting the voice file to be recognized into a posterior feature extraction model to obtain posterior features which are used for representing languages and correspond to the voice file to be recognized and output by the posterior feature extraction model;
the posterior feature extraction model is obtained by training based on a voice file sample group and taking a preset measurement criterion target function as a loss function; the set of voice file samples includes an anchor voice file sample, a positive example voice file sample, and a negative example voice file sample.
Specifically, when obtaining the posterior feature corresponding to the voice file, the voice file can be directly input into the posterior feature extraction model, which outputs the corresponding posterior feature. The posterior feature extraction model can specifically be constructed based on a Long Short-Term Memory (LSTM) model and includes at least a BN layer and a statistics pooling layer: the BN layer extracts the posterior feature of each frame in the voice file, i.e. outputs the posterior features corresponding to the voice file, and the statistics pooling layer averages the posterior features of all frames in the voice file, i.e. outputs the mean posterior feature corresponding to the voice file. The posterior feature corresponding to the voice file in the embodiment of the present invention is extracted from the BN layer of the posterior feature extraction model, and may therefore also be referred to as a posterior BN feature.
When training the posterior feature extraction model, a voice file sample group comprising an anchor voice file sample, a positive example voice file sample and a negative example voice file sample can specifically be used as the training sample, with a preset measurement-criterion objective function as the loss function. The anchor voice file sample is any given voice file sample; a positive example voice file sample is a voice file sample belonging to the same language as the anchor voice file sample; a negative example voice file sample is a voice file sample belonging to a different language from the anchor voice file sample. For example, suppose the voice file sample group contains e anchor voice file samples in total. The anchor voice file sample is denoted x_i^A (1 ≤ i ≤ e) and has language A; the positive example voice file sample is denoted x_i^P and also has language A; the negative example voice file sample is denoted x_i^N and has language B. The input of the posterior feature extraction model is the triplet (x_i^A, x_i^P, x_i^N). The loss function adopted is a preset measurement-criterion objective function, i.e. an objective function constructed through a measurement criterion; the measurement specifically applied may be a similarity measurement.
It should be noted that the anchor, positive example and negative example voice file samples in the voice file sample group adopted in the embodiment of the present invention are not restricted to the first-class and second-class languages; that is, compared with the unbalanced data set D formed by the first-language and second-language voice file samples, the corpus used here is a larger-capacity corpus.
The language identification method provided by the embodiment of the invention realizes the extraction of the posterior feature corresponding to the voice file through the posterior feature extraction model on the feature level. Due to the addition of the target function of the measurement criterion, the posterior features can more accurately represent different languages, and the language distinctiveness is stronger.
Based on the above embodiment, the loss function is determined based on the similarity between the anchor voice file sample and the positive example voice file sample in the voice file sample group and the similarity between the anchor voice file sample and the negative example voice file sample.
Specifically, the loss function may be determined by the similarity between the anchor voice file sample and the positive example voice file sample together with the similarity between the anchor voice file sample and the negative example voice file sample: the higher the similarity between the anchor and the positive example, the smaller the loss; the lower the similarity between the anchor and the negative example, the smaller the loss. Concretely, the loss can be taken as the similarity between the anchor and the negative example minus the similarity between the anchor and the positive example. Thus, maximizing the anchor-positive similarity while minimizing the anchor-negative similarity minimizes the loss function.
The similarity can be characterized by the distance between samples, which can be determined by calculating the Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, standardized Euclidean distance, Mahalanobis distance, Hamming distance, cosine of the included angle, Jaccard distance, correlation distance, and so on.
In the embodiment of the present invention, the loss function may specifically be expressed by the following formula (averaged over the e triplets):

loss = (1/e) · Σ_{i=1}^{e} [ sim(μ_i^A, μ_i^N) - sim(μ_i^A, μ_i^P) ]

where μ_i^A, μ_i^P and μ_i^N are the mean posterior features output by the statistics pooling layer for the anchor voice file sample x_i^A, the positive example voice file sample x_i^P and the negative example voice file sample x_i^N respectively; sim(μ_i^A, μ_i^P) is the similarity between the mean posterior features corresponding to the i-th anchor voice file sample x_i^A and the i-th positive example voice file sample x_i^P; and sim(μ_i^A, μ_i^N) is the similarity between the mean posterior features corresponding to the i-th anchor voice file sample x_i^A and the i-th negative example voice file sample x_i^N.

It should be noted that the values of sim(μ_i^A, μ_i^P) and sim(μ_i^A, μ_i^N) both lie in [-1, 1]; therefore, the value range of the loss function loss is [-2, 2].
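The similarity-based triplet loss described above can be sketched with cosine similarity. This is a sketch under the assumption of cosine similarity and batch averaging; the mean posterior features here are synthetic stand-ins for the statistics-pooling outputs.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_similarity_loss(mu_a, mu_p, mu_n):
    """Average over the batch of sim(anchor, negative) - sim(anchor, positive)
    using cosine similarity; each term lies in [-2, 2], so the loss does too."""
    terms = [cosine(a, n) - cosine(a, p)
             for a, p, n in zip(mu_a, mu_p, mu_n)]
    return sum(terms) / len(terms)

rng = np.random.default_rng(3)
anchors = rng.normal(size=(4, 16))
positives = anchors + 0.05 * rng.normal(size=(4, 16))   # same language: close
negatives = -anchors + 0.05 * rng.normal(size=(4, 16))  # other language: far

loss = triplet_similarity_loss(anchors, positives, negatives)
print(-2.0 <= loss <= 2.0)  # True
print(loss < 0)             # True: well-separated triplets drive the loss down
```

Minimizing this quantity pulls same-language mean posterior features together and pushes different-language ones apart, which is exactly the language discriminability the embodiment seeks.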
According to the language identification method provided by the embodiment of the invention, the loss function determined in this way enables the posterior features extracted by the posterior feature extraction model to characterize different languages more accurately and to discriminate between languages more strongly.
On the basis of the foregoing embodiment, fig. 6 is a schematic diagram illustrating a training flow of each model applied in the language identification method provided in the embodiment of the present invention. As shown in fig. 6, the training process of each model specifically includes:
1) training a posterior feature extraction model based on the voice file sample group;
2) inputting a first language voice file sample and a second language voice file sample in the unbalanced data set into a posterior feature extraction model to obtain posterior features output by the posterior feature extraction model;
3) respectively obtaining a first language Gaussian mixture model and a second language Gaussian mixture model based on the posterior features corresponding to the first language voice file sample and the posterior features corresponding to the second language voice file sample;
4) fusing the first language Gaussian mixture model and the second language Gaussian mixture model to obtain a preset Gaussian mixture model;
5) updating the preset Gaussian mixture model by adopting a MAP algorithm based on the posterior characteristics obtained in the step 2), and respectively extracting a mean value super-vector of the first language voice file sample and a mean value super-vector of the second language voice file sample;
6) clustering the mean value super-vectors of the first language voice file samples by adopting a k-means clustering algorithm to obtain the mean value super-vectors corresponding to the clustering centers;
7) training the SVM model based on the mean value super-vectors corresponding to the clustering centers and the mean value super-vectors of the second language voice file samples to obtain the language identification model.
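The data-balancing steps 6) and 7) above can be sketched as follows. This is an illustrative toy example, not the patent's implementation: random vectors stand in for the MAP-adapted mean value super-vectors, and scikit-learn's KMeans and SVC stand in for the clustering and SVM training:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for mean value super-vectors (in practice extracted via MAP
# adaptation of the preset GMM): 200 majority-language samples vs. 20
# minority-language samples.
major = rng.normal(0.0, 1.0, size=(200, 16))
minor = rng.normal(3.0, 1.0, size=(20, 16))

# Step 6): compress the majority class to as many k-means centroids as
# there are minority samples, eliminating the count imbalance entirely.
k = len(minor)
centroids = KMeans(n_clusters=k, n_init=10, random_state=0).fit(major).cluster_centers_

# Step 7): train the SVM language classifier on the balanced training set.
X = np.vstack([centroids, minor])
y = np.array([0] * k + [1] * len(minor))
svm = SVC(kernel="rbf").fit(X, y)
```

Replacing the majority class by its cluster centers, rather than randomly discarding samples, preserves the spread of the majority-language distribution while equalizing the class counts.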
In summary, through the above steps, the embodiment of the present invention addresses the problem of unbalanced language distribution at the feature characterization level, the model algorithm level and the data level: a metric criterion mechanism is introduced to improve the language characterization capability of the features; the majority-class and minority-class languages are modeled separately and fused into the preset Gaussian mixture model, through which the fitting of the data distribution is realized; and data balancing is performed by unsupervised clustering. The unbalanced data distribution is thereby balanced, so that the recognition accuracy for minority-class languages is significantly improved.
On the basis of the foregoing embodiment, fig. 7 is a schematic structural diagram of a language identification system provided in the embodiment of the present invention. As shown in fig. 7, the language identification system specifically includes: an acquisition module 71 and a processing module 72.
the obtaining module 71 is configured to obtain a posterior feature for representing a language corresponding to a voice file to be recognized in the language, and determine a mean value hyper-vector corresponding to the voice file based on the posterior feature for representing the language corresponding to the voice file and a preset gaussian mixture model;
the processing module 72 is configured to input the mean value super vector into a language identification model to obtain an identification result output by the language identification model;
the preset Gaussian mixture model is obtained by fusing a first language Gaussian mixture model and a second language Gaussian mixture model;
the language identification model is obtained based on the training of the mean value super vector of a first language voice file sample with a language label and the mean value super vector of a second language voice file sample with a language label, the mean value super vector of the first language voice file sample is based on the posterior feature which is used for representing the language and corresponds to the first language voice file sample and the preset Gaussian mixture model is determined, and the mean value super vector of the second language voice file sample is based on the posterior feature which is used for representing the language and corresponds to the second language voice file sample and the preset Gaussian mixture model is determined.
Specifically, the modules in the language identification system provided in the embodiment of the present invention correspond one-to-one to the steps in the above method embodiments, and achieve the same effects.
On the basis of the above embodiment, the language identification system further includes a preset Gaussian mixture model determination module, and the preset Gaussian mixture model determination module specifically includes a weight determination submodule and a fusion submodule.
the weight determining submodule is used for determining weights corresponding to the first language type Gaussian mixture model and the second language type Gaussian mixture model respectively based on a preset balance coefficient;
the fusion sub-module is used for fusing the first-language Gaussian mixture model and the second-language Gaussian mixture model based on the weights respectively corresponding to the first-language Gaussian mixture model and the second-language Gaussian mixture model to obtain the preset Gaussian mixture model;
and determining the balance coefficient based on the number of samples respectively corresponding to the first language type voice file sample and the second language type voice file sample.
On the basis of the above embodiment, the preset Gaussian mixture model determination module further includes a balance coefficient determination submodule. The balance coefficient determination submodule specifically includes: a proportion determining unit and a balance coefficient determining unit.
the proportion determining unit is used for determining the proportion of the number of samples respectively corresponding to the first language type voice file sample and the second language type voice file sample based on the number of samples respectively corresponding to the first language type voice file sample and the second language type voice file sample;
the balance coefficient determining unit is configured to determine the balance coefficient based on an information entropy value of a sample number ratio corresponding to the first language-type voice file sample and an information entropy value of a sample number ratio corresponding to the second language-type voice file sample.
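The patent does not spell out the exact entropy formula, so the sketch below is one plausible reading of the balance coefficient described above: each class's sample-count ratio p contributes an entropy term -p*log(p), and normalizing the two terms yields fusion weights that favor the minority language class (the function name is invented for illustration):

```python
import math

def entropy_balance_weights(n_first, n_second):
    """Fusion weights for the two per-language GMMs from sample counts.

    Illustrative assumption: each class's count ratio p contributes the
    entropy term -p*log(p); normalizing the two terms gives weights that
    sum to 1 and favor the minority class.
    """
    total = n_first + n_second
    p1, p2 = n_first / total, n_second / total
    h1, h2 = -p1 * math.log(p1), -p2 * math.log(p2)
    return h1 / (h1 + h2), h2 / (h1 + h2)

# 900 majority vs. 100 minority samples: the minority GMM gets the larger weight.
w_major, w_minor = entropy_balance_weights(900, 100)
```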
On the basis of the above embodiment, the language identification system further includes a language identification model training module. The language identification model training module specifically includes: a clustering processing sub-module and a training sub-module.
the clustering processing sub-module is used for clustering the mean value super-vector of the first language voice file sample to determine a plurality of clustering centers; the number of the clustering centers is determined based on the number of the second language type voice file samples;
and the training submodule is used for replacing the mean value super vector of the first language voice file sample with the mean value super vector corresponding to each cluster center, and training the language identification model based on the mean value super vector corresponding to each cluster center with a language label and the mean value super vector of the second language voice file sample with a language label.
On the basis of the above embodiment, the number of the clustering centers obtained by the clustering processing sub-module in the language identification model training module is the same as the number of the second language voice file samples, so that the difference between the number of first-language samples and the number of second-language samples in the training data of the language identification model can be completely eliminated, further improving the accuracy of the language identification model in identifying the second language.
On the basis of the above embodiment, the language identification model applied in the processing module includes a plurality of language identification submodels, and each language identification submodel corresponds to one language. The processing module specifically includes a first processing submodule and a second processing submodule.
the first processing submodule is used for respectively inputting the mean value super vector into each language identification submodel and respectively obtaining an identification result output by each language identification submodel;
and the second processing submodule is used for obtaining the recognition result output by the language recognition model according to the recognition result output by each language recognition submodel.
On the basis of the foregoing embodiment, the second processing submodule is specifically configured to:
and comparing the recognition result output by each language recognition submodel with a preset threshold value respectively, and determining the language corresponding to the language recognition submodel with the output recognition result larger than the preset threshold value as the language of the voice file.
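A minimal sketch of this thresholding decision follows; the 0.5 default threshold and the language names are illustrative placeholders, not values taken from the patent:

```python
def decide_languages(scores, threshold=0.5):
    """Return every language whose sub-model score exceeds the threshold.

    `scores` maps a language name to its recognition sub-model's output
    score; languages scoring above the threshold are taken as the
    languages of the voice file.
    """
    return [lang for lang, score in scores.items() if score > threshold]

result = decide_languages({"mandarin": 0.91, "uyghur": 0.34, "tibetan": 0.62})
# result == ["mandarin", "tibetan"]
```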
On the basis of the above embodiment, the obtaining module is specifically configured to:
inputting the voice file to be recognized into a posterior feature extraction model to obtain posterior features which are used for representing languages and correspond to the voice file to be recognized and output by the posterior feature extraction model;
the posterior feature extraction model is obtained by training based on a voice file sample group and taking a preset measurement criterion target function as a loss function; the set of voice file samples includes an anchor voice file sample, a positive example voice file sample, and a negative example voice file sample.
On the basis of the above embodiment, the obtaining module specifically includes a loss function determining submodule.
the loss function determination submodule is specifically configured to determine the loss function based on a similarity between the anchor voice file sample and the positive example voice file sample in the voice file sample group and a similarity between the anchor voice file sample and the negative example voice file sample.
As shown in fig. 8, on the basis of the above embodiment, an embodiment of the present invention provides an electronic device, including: a processor (processor)81, a memory (memory)82, a communication interface (Communications Interface)83, and a communication bus 84.
the processor 81, the memory 82 and the communication interface 83 complete communication with each other through the communication bus 84. The memory 82 stores program instructions executable by the processor 81, and the processor 81 is configured to call the program instructions in the memory 82 to perform the language identification method provided in each of the above embodiments of the method.
It should be noted that, when being implemented specifically, the electronic device in this embodiment may be a server, a PC, or other devices, as long as the structure includes the processor 81, the communication interface 83, the memory 82, and the communication bus 84 shown in fig. 8, where the processor 81, the communication interface 83, and the memory 82 complete mutual communication through the communication bus 84, and the processor 81 can call the logic instruction in the memory 82 to execute the above method. The embodiment does not limit the specific implementation form of the electronic device.
The logic instructions in memory 82 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone article of manufacture. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Further, an embodiment of the present invention discloses a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions, when the program instructions are executed by a computer, the computer can execute the language identification method provided by the above-mentioned method embodiments.
On the basis of the foregoing embodiments, the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented to execute the language identification method provided in the foregoing embodiments when executed by a processor.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A language identification method, comprising:
the method comprises the steps of obtaining posterior features which are used for representing languages and correspond to a voice file to be recognized, and determining a mean value super-vector which corresponds to the voice file based on the posterior features which are used for representing languages and correspond to the voice file and a preset Gaussian mixture model;
inputting the mean value super vector into a language identification model to obtain an identification result output by the language identification model;
the preset Gaussian mixture model is obtained by fusing a first language Gaussian mixture model and a second language Gaussian mixture model;
the language identification model is obtained based on the training of the mean value super vector of a first language voice file sample with a language label and the mean value super vector of a second language voice file sample with a language label, the mean value super vector of the first language voice file sample is based on the posterior feature which is used for representing the language and corresponds to the first language voice file sample and the preset Gaussian mixture model is determined, and the mean value super vector of the second language voice file sample is based on the posterior feature which is used for representing the language and corresponds to the second language voice file sample and the preset Gaussian mixture model is determined.
2. The language identification method according to claim 1, wherein the preset gaussian mixture model is obtained based on a first language gaussian mixture model and a second language gaussian mixture model, and specifically comprises:
determining weights respectively corresponding to the first language Gaussian mixture model and the second language Gaussian mixture model based on a preset balance coefficient;
fusing the first language Gaussian mixture model and the second language Gaussian mixture model based on the weights respectively corresponding to the first language Gaussian mixture model and the second language Gaussian mixture model to obtain the preset Gaussian mixture model;
and determining the balance coefficient based on the number of samples respectively corresponding to the first language type voice file sample and the second language type voice file sample.
3. The language identification method according to claim 2, wherein the balance coefficient is determined based on the number of samples respectively corresponding to the first-language speech file sample and the second-language speech file sample, and specifically comprises:
determining the number ratio of the samples corresponding to the first language voice file sample and the second language voice file sample respectively based on the number of the samples corresponding to the first language voice file sample and the second language voice file sample respectively;
and determining the balance coefficient based on the information entropy value of the sample number ratio corresponding to the first language type voice file sample and the information entropy value of the sample number ratio corresponding to the second language type voice file sample.
4. The language identification method according to claim 1, wherein the training process of the language identification model specifically comprises:
clustering the mean value super-vector of the first language voice file sample to determine a plurality of clustering centers; the number of the clustering centers is determined based on the number of the second language type voice file samples;
and replacing the mean value super vector of the first language voice file sample with the mean value super vector corresponding to each cluster center, and training the language identification model based on the mean value super vector corresponding to each cluster center with a language label and the mean value super vector of the second language voice file sample with a language label.
5. The language identification method as claimed in claim 4 wherein the number of clustering centers is the same as the number of second-language speech file samples.
6. The language identification method as claimed in any of claims 1 to 5 wherein said language identification model comprises a plurality of language identification submodels, and each of said language identification submodels corresponds to a language; correspondingly, the inputting the mean value super vector into a language identification model to obtain an identification result output by the language identification model specifically includes:
respectively inputting the mean value super vector into each language identification submodel, and respectively obtaining an identification result output by each language identification submodel;
and obtaining the recognition result output by the language recognition model according to the recognition result output by each language recognition submodel.
7. The language identification method according to claim 6, wherein said obtaining the recognition result outputted by said language identification model according to the recognition result outputted by each said language identification submodel specifically comprises:
and comparing the recognition result output by each language recognition submodel with a preset threshold value respectively, and determining the language corresponding to the language recognition submodel with the output recognition result larger than the preset threshold value as the language of the voice file.
8. The language identification method according to any one of claims 1 to 5, wherein the obtaining of the posterior feature for representing the language corresponding to the voice file to be recognized specifically comprises:
inputting the voice file to be recognized into a posterior feature extraction model to obtain posterior features which are used for representing languages and correspond to the voice file to be recognized and output by the posterior feature extraction model;
the posterior feature extraction model is obtained by training based on a voice file sample group and taking a preset measurement criterion target function as a loss function; the set of voice file samples includes an anchor voice file sample, a positive example voice file sample, and a negative example voice file sample.
9. The language identification method of claim 8, wherein the loss function is determined based on a similarity between the anchor voice file sample and the positive example voice file sample in the voice file sample group and a similarity between the anchor voice file sample and the negative example voice file sample.
10. A language identification system, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring the posterior feature which is used for representing languages and corresponds to a voice file to be recognized, and determining the mean value super-vector which corresponds to the voice file based on the posterior feature which is used for representing languages and corresponds to the voice file and a preset Gaussian mixture model;
the processing module is used for inputting the mean value super vector into a language identification model to obtain an identification result output by the language identification model;
the preset Gaussian mixture model is obtained by fusing a first language Gaussian mixture model and a second language Gaussian mixture model;
the language identification model is obtained based on the training of the mean value super vector of a first language voice file sample with a language label and the mean value super vector of a second language voice file sample with a language label, the mean value super vector of the first language voice file sample is based on the posterior feature which is used for representing the language and corresponds to the first language voice file sample and the preset Gaussian mixture model is determined, and the mean value super vector of the second language voice file sample is based on the posterior feature which is used for representing the language and corresponds to the second language voice file sample and the preset Gaussian mixture model is determined.
11. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the language identification method according to any of claims 1-9 when executing the program.
12. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the language identification method according to any one of claims 1 to 9.
CN202010456194.1A 2020-05-26 2020-05-26 Language identification method, system, electronic equipment and storage medium Active CN111640419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010456194.1A CN111640419B (en) 2020-05-26 2020-05-26 Language identification method, system, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111640419A true CN111640419A (en) 2020-09-08
CN111640419B CN111640419B (en) 2023-04-07

Family

ID=72332804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010456194.1A Active CN111640419B (en) 2020-05-26 2020-05-26 Language identification method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111640419B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170193987A1 (en) * 2015-12-30 2017-07-06 Le Holdings (Beijing) Co., Ltd. Speech recognition method and device
CN107993663A (en) * 2017-09-11 2018-05-04 北京航空航天大学 A kind of method for recognizing sound-groove based on Android
CN108648747A (en) * 2018-03-21 2018-10-12 清华大学 Language recognition system
CN111091809A (en) * 2019-10-31 2020-05-01 国家计算机网络与信息安全管理中心 Regional accent recognition method and device based on depth feature fusion


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liang Chunyan et al., "Gaussian Supervector-Support Vector Machine Discriminative Language Identification System", Computer Engineering and Applications *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160795A (en) * 2021-04-28 2021-07-23 平安科技(深圳)有限公司 Language feature extraction model training method, device, equipment and storage medium
CN113160795B (en) * 2021-04-28 2024-03-05 平安科技(深圳)有限公司 Language feature extraction model training method, device, equipment and storage medium
CN113282718A (en) * 2021-07-26 2021-08-20 北京快鱼电子股份公司 Language identification method and system based on self-adaptive center anchor
CN113282718B (en) * 2021-07-26 2021-12-10 北京快鱼电子股份公司 Language identification method and system based on self-adaptive center anchor
CN114462397A (en) * 2022-01-20 2022-05-10 连连(杭州)信息技术有限公司 Language identification model training method, language identification method and device and electronic equipment
CN114462397B (en) * 2022-01-20 2023-09-22 连连(杭州)信息技术有限公司 Language identification model training method, language identification method, device and electronic equipment

Also Published As

Publication number Publication date
CN111640419B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111640419B (en) Language identification method, system, electronic equipment and storage medium
CN109816092B (en) Deep neural network training method and device, electronic equipment and storage medium
CN108846422B (en) Account number association method and system across social networks
CN109583332B (en) Face recognition method, face recognition system, medium, and electronic device
CN112270196B (en) Entity relationship identification method and device and electronic equipment
WO2021051598A1 (en) Text sentiment analysis model training method, apparatus and device, and readable storage medium
WO2015165372A1 (en) Method and apparatus for classifying object based on social networking service, and storage medium
CN110110610B (en) Event detection method for short video
CN111666761A (en) Fine-grained emotion analysis model training method and device
CN112767386B (en) Image aesthetic quality evaluation method and system based on theme feature and score distribution
CN111062440B (en) Sample selection method, device, equipment and storage medium
CN110348516B (en) Data processing method, data processing device, storage medium and electronic equipment
CN108073567B (en) Feature word extraction processing method, system and server
CN113192028A (en) Quality evaluation method and device for face image, electronic equipment and storage medium
CN110717817A (en) Pre-loan approval method and device, electronic equipment and computer-readable storage medium
CN113724700B (en) Language identification and language identification model training method and device
CN112132239B (en) Training method, device, equipment and storage medium
CN111930885B (en) Text topic extraction method and device and computer equipment
CN114818900A (en) Semi-supervised feature extraction method and user credit risk assessment method
CN113591004A (en) Game tag generation method and device, storage medium and electronic equipment
CN113569957A (en) Object type identification method and device of business object and storage medium
CN113989597B (en) Vehicle weight recognition method and device, electronic equipment and storage medium
CN116386108B (en) Fairness face recognition method based on instance consistency
CN115905548B (en) Water army recognition method, device, electronic equipment and storage medium
CN114492565A (en) Portrait clustering method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant