CN116206622A - Training method for a generative adversarial network, dialect conversion method and apparatus, and electronic device - Google Patents

Training method for a generative adversarial network, dialect conversion method and apparatus, and electronic device

Info

Publication number
CN116206622A
Authority
CN
China
Prior art keywords
dialect
audio feature
audio
conversion model
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310499126.7A
Other languages
Chinese (zh)
Other versions
CN116206622B (en)
Inventor
钟雨崎
艾国
杨作兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bianfeng Information Technology Co ltd
Original Assignee
Beijing Bianfeng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bianfeng Information Technology Co ltd filed Critical Beijing Bianfeng Information Technology Co ltd
Priority to CN202310499126.7A priority Critical patent/CN116206622B/en
Publication of CN116206622A publication Critical patent/CN116206622A/en
Application granted granted Critical
Publication of CN116206622B publication Critical patent/CN116206622B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use

Abstract

Embodiments of the present invention provide a training method for a generative adversarial network, a dialect conversion method and apparatus, and an electronic device. The generative adversarial network includes a discriminator and a dialect conversion model serving as a generator. The method comprises: determining a first audio feature belonging to a first dialect; and inputting the first audio feature into the generative adversarial network to perform alternating iterative training of the dialect conversion model and the discriminator until the quality of a second audio feature output by the dialect conversion model reaches a predetermined condition, wherein the second audio feature belongs to a second dialect. The alternating iterative training comprises: inputting the first audio feature into the dialect conversion model to generate the second audio feature; inputting the second audio feature into the dialect conversion model to generate a third audio feature belonging to the first dialect; and inputting the second audio feature into the discriminator to obtain a first discrimination result indicating whether the second audio feature belongs to the second dialect. A high-accuracy dialect conversion model is thereby produced, which facilitates numerous applications such as data enhancement and dialect conversion.

Description

Training method for a generative adversarial network, dialect conversion method and apparatus, and electronic device
Technical Field
Embodiments of the present invention relate to the technical field of natural language processing, and in particular to a training method for a generative adversarial network (Generative Adversarial Network, GAN), a dialect conversion method and apparatus, and an electronic device.
Background
A generative adversarial network generally includes a generator and a discriminator. The generator is used to generate samples, and the discriminator is used to judge whether a sample generated by the generator is real. The generator tries to confuse the discriminator as much as possible, while the discriminator in turn tries to distinguish the samples generated by the generator from real samples as accurately as possible.
Current generative adversarial networks are commonly used in the field of image processing and have not been applied to dialect conversion.
Disclosure of Invention
Embodiments of the present invention provide a training method for a generative adversarial network, a dialect conversion method and apparatus, and an electronic device.
The technical solution of the embodiments of the present invention is as follows:
A training method for a generative adversarial network, the generative adversarial network including a discriminator and a dialect conversion model serving as a generator, the method comprising:
determining a first audio feature belonging to a first dialect;
inputting the first audio feature into the generative adversarial network to perform alternating iterative training of the dialect conversion model and the discriminator until the quality of a second audio feature output by the dialect conversion model reaches a predetermined condition, wherein the second audio feature belongs to a second dialect; wherein the alternating iterative training comprises: inputting the first audio feature into the dialect conversion model to generate the second audio feature; inputting the second audio feature into the dialect conversion model to generate a third audio feature belonging to the first dialect; and inputting the second audio feature into the discriminator to obtain a first discrimination result indicating whether the second audio feature belongs to the second dialect.
In one embodiment, the alternating iterative training comprises:
fixing model parameters of the discriminator;
executing a training process of the dialect conversion model, wherein the training process of the dialect conversion model comprises: determining a difference between the third audio feature and the first audio feature; determining a loss function value of the dialect conversion model based on the difference and the first discrimination result; and adjusting model parameters of the dialect conversion model so that the loss function value of the dialect conversion model is lower than a first threshold.
In one embodiment, the determining the loss function value of the dialect conversion model based on the difference and the first discrimination result includes:
determining a first sub-loss function value, wherein the larger the difference, the larger the first sub-loss function value;
determining a second sub-loss function value, wherein the larger the probability, indicated by the first discrimination result, that the second audio feature belongs to the second dialect, the smaller the second sub-loss function value;
and determining a weighted combination of the first sub-loss function value and the second sub-loss function value as the loss function value of the dialect conversion model.
In one embodiment, the alternating iterative training comprises:
fixing model parameters of the dialect conversion model;
executing a training process of the discriminator, wherein the training process of the discriminator comprises: inputting a fourth audio feature labeled as the second dialect into the discriminator to obtain a second discrimination result indicating whether the fourth audio feature belongs to the second dialect; determining a loss function value of the discriminator based on the first discrimination result and the second discrimination result; and adjusting model parameters of the discriminator so that the loss function value of the discriminator is lower than a second threshold.
In one embodiment, the determining the loss function value of the discriminator based on the first and second discrimination results includes:
determining a third sub-loss function value, wherein the larger the probability, indicated by the first discrimination result, that the second audio feature belongs to the second dialect, the larger the third sub-loss function value;
determining a fourth sub-loss function value, wherein the larger the probability, indicated by the second discrimination result, that the fourth audio feature belongs to the second dialect, the smaller the fourth sub-loss function value;
and determining a weighted combination of the third sub-loss function value and the fourth sub-loss function value as the loss function value of the discriminator.
In one embodiment, the quality of the second audio feature output by the dialect conversion model reaching a predetermined condition comprises:
converting the second audio feature into an audio file; playing the audio file; and, when playback feedback on the audio file indicates that the quality of the audio file is acceptable, determining that the quality of the second audio feature reaches the predetermined condition; or
determining a parallel corpus, the parallel corpus comprising a fifth audio feature labeled as the first dialect and a sixth audio feature labeled as the second dialect, wherein the fifth audio feature is aligned with the sixth audio feature; inputting the parallel corpus into the dialect conversion model to obtain a seventh audio feature belonging to the second dialect and an eighth audio feature belonging to the first dialect, wherein the seventh audio feature is generated based on the fifth audio feature, and the eighth audio feature is generated based on the sixth audio feature; and, when the seventh audio feature is aligned with the sixth audio feature and the eighth audio feature is aligned with the fifth audio feature, determining that the quality of the second audio feature output by the dialect conversion model reaches the predetermined condition.
A dialect conversion method, comprising:
obtaining a trained dialect conversion model, the dialect conversion model being trained according to the training method for a generative adversarial network as described above;
inputting an audio feature to be converted that belongs to a first dialect into the dialect conversion model to obtain a converted audio feature belonging to a second dialect; and/or
inputting an audio feature to be converted that belongs to the second dialect into the dialect conversion model to obtain a converted audio feature belonging to the first dialect.
A training apparatus for a generative adversarial network, the generative adversarial network including a discriminator and a dialect conversion model serving as a generator, the apparatus comprising:
a determining module for determining a first audio feature belonging to a first dialect;
the training module is used for inputting the first audio features into the generated countermeasure network so as to perform alternate iterative training on the dialect conversion model and the discriminator until the quality of second audio features output by the dialect conversion model reaches a preset condition, wherein the second audio features belong to a second dialect; wherein the alternating iterative training comprises: inputting the first audio feature into the dialect conversion model to generate the second audio feature; inputting the second audio feature into the dialect conversion model to generate a third audio feature that belongs to the first dialect; and inputting the second audio feature into the discriminator to obtain a first discrimination result of whether the second audio feature belongs to a second dialect.
A dialect conversion apparatus comprising:
an obtaining module, configured to obtain a trained dialect conversion model, wherein the dialect conversion model is trained according to the training method for a generative adversarial network described above;
an input module, configured to input an audio feature to be converted that belongs to a first dialect into the dialect conversion model to obtain a converted audio feature belonging to a second dialect; and/or input an audio feature to be converted that belongs to the second dialect into the dialect conversion model to obtain a converted audio feature belonging to the first dialect.
An electronic device, comprising:
a memory;
a processor;
wherein the memory stores an application executable by the processor, the application being configured to cause the processor to perform the training method for a generative adversarial network or the dialect conversion method as described above.
According to the above technical solution, the dialect conversion model and the discriminator are trained alternately and iteratively until the quality of the second audio feature output by the dialect conversion model reaches the predetermined condition. The adversarial process between the dialect conversion model and the discriminator can produce a high-accuracy dialect conversion model, which facilitates numerous applications such as data enhancement and dialect conversion.
Drawings
Fig. 1 is an exemplary block diagram of a generative adversarial network according to an embodiment of the present invention.
FIG. 2 is an exemplary flow chart of a training process for a generative adversarial network in accordance with an embodiment of the present invention.
FIG. 3 is an exemplary schematic diagram of a training process for a dialect conversion model in accordance with an embodiment of the present invention.
Fig. 4 is an exemplary schematic diagram of a training process of a discriminator according to an embodiment of the present invention.
Fig. 5 is an exemplary block diagram of a dialect conversion model in accordance with an embodiment of the present invention.
Fig. 6 is an exemplary block diagram of a downsampling unit according to an embodiment of the present invention.
Fig. 7 is an exemplary block diagram of an up-sampling unit according to an embodiment of the present invention.
Fig. 8 is an exemplary schematic diagram of a discriminator according to an embodiment of the invention.
Fig. 9 is an exemplary flowchart of a dialect conversion method according to an embodiment of the present invention.
Fig. 10 is an exemplary block diagram of a training apparatus for a generative adversarial network according to an embodiment of the present invention.
Fig. 11 is an exemplary configuration diagram of a dialect converting apparatus according to an embodiment of the present invention.
Fig. 12 is an exemplary structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
For simplicity and clarity of description, the following sets forth aspects of the invention by describing several exemplary embodiments. Numerous details in the embodiments are provided solely to aid in understanding the invention. It will be apparent, however, that the embodiments of the invention may be practiced without these specific details. Some embodiments are only outlined rather than described in detail, in order to avoid unnecessarily obscuring aspects of the present invention. Hereinafter, "comprising" means "including but not limited to", and "according to ..." means "based on at least ..., but not limited to only ...". Unless otherwise specified, the terms "a" or "an" do not limit the quantity and may refer to one, more than one, or at least one component.
Dialects are local varieties within a language system, as distinguished from the standard variety. Dialects can be divided into regional dialects and social dialects. Regional dialects are language variants formed by regional differences; they are branches of the national language in different regions and reflect how the language has developed across regions. Social dialects are variants formed within the same region by members of society who differ in occupation, class, age, gender, culture, and so on. For example, the Sichuan dialect and the Tianjin dialect can be regarded as dialects relative to the standard language (Mandarin). For ease of description, in the embodiments of the present invention the standard language, such as Mandarin Chinese, is also treated as a dialect.
Dialect conversion tasks arise in many applications. For example, when a dialect recognition model with good accuracy needs to be trained, a large amount of labeled voice training data in the target dialect is generally required, so there is a need to convert existing labeled voice data in one dialect (e.g., Mandarin) into voice data in the target dialect (e.g., the Sichuan dialect). As an example: assuming that a labeled Mandarin training dataset exists, if it can be quickly converted into Sichuan-dialect pronunciation, this is equivalent to providing a training dataset for the Sichuan dialect by way of data enhancement, which helps train an artificial intelligence model capable of automatically recognizing the Sichuan dialect using various mature network structures. Alternatively, based on converting the Mandarin training dataset into Sichuan-dialect pronunciation, an artificial intelligence model capable of automatically converting speech into the Sichuan dialect can be trained directly.
In the embodiments of the invention, a dialect conversion model with high accuracy is trained based on a generative adversarial network. The dialect conversion model can be applied to data enhancement of dialect data, and can also be directly applied to various applications such as dialect conversion.
Fig. 1 is an exemplary block diagram of a generative adversarial network according to an embodiment of the present invention. The generative adversarial network includes a discriminator and a dialect conversion model serving as a generator. The dialect conversion model is used to convert a first audio feature of a first dialect into a second audio feature (called a false sample) of a second dialect, and the discriminator is used to judge whether the second audio feature output by the dialect conversion model belongs to the second dialect. The first and second dialects typically belong to the same language. The generator needs to confuse the discriminator as much as possible by generating false samples that are as realistic as possible, while the discriminator tries to distinguish the samples generated by the generator (i.e., false samples) from real samples (audio features labeled as the second dialect) as accurately as possible.
Based on the generative adversarial network shown in fig. 1, fig. 2 is an exemplary flowchart of a training process for the generative adversarial network according to an embodiment of the present invention. As shown in fig. 2, the method includes:
step 201: a first audio feature belonging to a first dialect is determined.
Here, the first audio feature is an audio feature extracted from voice data of the first dialect. For example, the first audio feature may be a time-domain feature or a frequency-domain feature extracted from the speech data of the first dialect. The voice data of the first dialect may be unlabeled voice data. Preferably, the first audio feature is an FBank (log Mel filterbank) frequency-domain feature.
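The following is a minimal illustrative sketch (not part of the original publication) of how such FBank features might be extracted with torchaudio; the file path, 16 kHz audio, and 80 Mel bins are assumptions, since the publication does not fix them.

```python
# Minimal sketch of FBank feature extraction, assuming torchaudio is available.
# The file path and the choice of 80 Mel bins are illustrative assumptions.
import torchaudio
import torchaudio.compliance.kaldi as kaldi

waveform, sample_rate = torchaudio.load("first_dialect_utterance.wav")  # (channels, time)

# Kaldi-style log Mel filterbank (FBank) features: one 80-dim vector per 10 ms frame.
first_audio_feature = kaldi.fbank(
    waveform,
    num_mel_bins=80,
    frame_length=25.0,   # ms
    frame_shift=10.0,    # ms
    sample_frequency=sample_rate,
)
print(first_audio_feature.shape)  # (num_frames, 80)
```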
Step 202: inputting the first audio feature into the generative adversarial network, and performing alternating iterative training of the dialect conversion model and the discriminator until the quality of a second audio feature output by the dialect conversion model reaches a predetermined condition, wherein the second audio feature belongs to a second dialect; wherein the alternating iterative training comprises: inputting the first audio feature into the dialect conversion model to generate the second audio feature; inputting the second audio feature into the dialect conversion model to generate a third audio feature belonging to the first dialect; and inputting the second audio feature into the discriminator to obtain a first discrimination result indicating whether the second audio feature belongs to the second dialect.
In the alternating iterative training of the generative adversarial network, the dialect conversion model and the discriminator are trained alternately to realize the adversarial process between them. Specifically: when training the dialect conversion model, the model parameters of the discriminator are fixed; when training the discriminator, the model parameters of the dialect conversion model are fixed. When the quality of the second audio feature output by the dialect conversion model reaches the predetermined condition, the alternating iterative training is completed. Alternating iterative training prevents the dialect conversion model from becoming too strong relative to the discriminator, and likewise prevents the discriminator from becoming too strong relative to the dialect conversion model. In fact, if the performance gap between the dialect conversion model and the discriminator is too large, the performance of the entire generative adversarial network is weakened.
Alternating iterative training involves two training processes that are performed alternately: (1) the training process of the dialect conversion model; (2) the training process of the discriminator. When training starts, either process may be executed first: the training process of the dialect conversion model may be executed first and then the training process of the discriminator, or vice versa.
In one embodiment, the alternating iterative training includes: fixing the model parameters of the discriminator; and performing a training process of the dialect conversion model, wherein the training process of the dialect conversion model comprises: determining a difference between the third audio feature and the first audio feature; determining a loss function value of the dialect conversion model based on the difference and the first discrimination result; and adjusting the model parameters of the dialect conversion model so that the loss function value of the dialect conversion model is lower than a first threshold.
The difference between the third audio feature and the first audio feature characterizes the loss of the dialect conversion model when converting the second dialect back into the first dialect. The first discrimination result characterizes the loss of the dialect conversion model when converting the first dialect into the second dialect. It can be seen that the loss function value of the dialect conversion model, determined jointly from these two losses, ensures good network performance of the dialect conversion model.
In one embodiment, determining the loss function value of the dialect conversion model based on the difference and the first discrimination result includes: determining a first sub-loss function value, wherein the larger the difference, the larger the first sub-loss function value; determining a second sub-loss function value, wherein the larger the probability, indicated by the first discrimination result, that the second audio feature belongs to the second dialect, the smaller the second sub-loss function value; and determining a weighted combination of the first sub-loss function value and the second sub-loss function value as the loss function value of the dialect conversion model. It can be seen that the larger the difference, the lower the quality of the conversion of the second dialect into the first dialect by the dialect conversion model, and therefore the larger the first sub-loss function value; and the larger the probability, indicated by the first discrimination result, that the second audio feature belongs to the second dialect, the higher the quality of the conversion of the first dialect into the second dialect by the dialect conversion model, and therefore the smaller the second sub-loss function value. The weighted combination of the first sub-loss function value and the second sub-loss function value well represents the loss of the dialect conversion model for conversion between the first dialect and the second dialect in both directions.
FIG. 3 is an exemplary schematic diagram of a training process for a dialect conversion model in accordance with an embodiment of the present invention. The model parameters of the discriminator remain fixed during each training step of the dialect conversion model.
The first audio feature is input to a dialect conversion model. The dialect conversion model generates second audio features of the second dialect based on the first audio features. The discriminator discriminates whether the second audio feature belongs to the second dialect, and obtains a first discrimination result (usually characterized as a probability that the second audio feature belongs to the second dialect). The second audio feature is input to the dialect conversion model. The dialect conversion model generates third audio features of the first dialect based on the second audio features.
It can be seen that the dialect conversion model has two generation processes:
(1) Generation process 1: the input is the first audio feature; the output is the second audio feature;
(2) Generation process 2: the input is the second audio feature; the output is the third audio feature.
Generation process 1 and generation process 2 each have a respective loss function value. For generation process 1, the distance between the predicted value output by the discriminator for the second audio feature and the value 1.0 (i.e., a probability of one hundred percent that it belongs to the second dialect) is calculated to obtain the loss function value L1 of generation process 1; for generation process 2, the difference between the third audio feature and the first audio feature (e.g., a mean square error) is calculated to obtain the loss function value L2 of generation process 2.
Therefore, the loss function value L_g of the training process of the dialect conversion model is: L_g = λ1*L1 + λ2*L2, where λ1 and λ2 are respective preset weights. Based on the loss function value L_g of the dialect conversion model, back propagation is performed to adjust the model parameters of the dialect conversion model, thereby completing a single training step of the dialect conversion model.
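A minimal PyTorch sketch of the loss L_g = λ1*L1 + λ2*L2 described above is given below. The use of mean-squared-error distances and unit weights are assumptions; the publication only requires that L1 grow as the discriminator score for the second audio feature moves away from 1.0, and that L2 grow with the difference between the third and first audio features.

```python
# Minimal sketch of the generator-side loss L_g = λ1*L1 + λ2*L2 described above.
# MSE distances and λ1 = λ2 = 1.0 are illustrative assumptions.
import torch
import torch.nn.functional as F

def generator_loss(disc_score_fake: torch.Tensor,   # discriminator output for the second audio feature
                   third_feature: torch.Tensor,     # feature cycled back to the first dialect
                   first_feature: torch.Tensor,     # original first-dialect feature
                   lambda1: float = 1.0,
                   lambda2: float = 1.0) -> torch.Tensor:
    # L1: distance between the predicted probability and 1.0 (generation process 1).
    l1 = F.mse_loss(disc_score_fake, torch.ones_like(disc_score_fake))
    # L2: difference between the third and first audio features (generation process 2).
    l2 = F.mse_loss(third_feature, first_feature)
    return lambda1 * l1 + lambda2 * l2
```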
In one embodiment, the alternating iterative training includes: fixing the model parameters of the dialect conversion model; and executing a training process of the discriminator, wherein the training process of the discriminator comprises: inputting a fourth audio feature labeled as the second dialect (i.e., a true sample) into the discriminator to obtain a second discrimination result indicating whether the fourth audio feature belongs to the second dialect; determining a loss function value of the discriminator based on the first discrimination result and the second discrimination result; and adjusting the model parameters of the discriminator so that the loss function value of the discriminator is below a second threshold.
In one embodiment, determining the loss function value of the discriminator based on the first and second discrimination results includes: determining a third sub-loss function value, wherein the larger the probability, indicated by the first discrimination result, that the second audio feature belongs to the second dialect, the larger the third sub-loss function value; determining a fourth sub-loss function value, wherein the larger the probability, indicated by the second discrimination result, that the fourth audio feature belongs to the second dialect, the smaller the fourth sub-loss function value; and determining a weighted combination of the third sub-loss function value and the fourth sub-loss function value as the loss function value of the discriminator.
It can be seen that the larger the probability, indicated by the first discrimination result, that the second audio feature belongs to the second dialect, the lower the discriminator's ability to recognize false samples; and the larger the probability, indicated by the second discrimination result, that the fourth audio feature belongs to the second dialect, the higher the discriminator's ability to recognize true samples. The weighted combination of the third sub-loss function value and the fourth sub-loss function value well represents the discrimination loss of the discriminator for true and false samples.
Fig. 4 is an exemplary schematic diagram of a training process of the discriminator according to an embodiment of the present invention. The model parameters of the dialect conversion model remain fixed during each training step of the discriminator.
The fourth audio feature is input into the discriminator. The discriminator judges whether the fourth audio feature belongs to the second dialect to obtain the second discrimination result (typically characterized as the probability that the fourth audio feature belongs to the second dialect).
For the discriminator, it is desirable that:
(1) The predicted value obtained after the fourth audio feature passes through the discriminator approaches 1.0. That is, it is desirable that the second discrimination result can accurately recognize that the fourth audio feature is a true sample.
(2) The predicted value obtained after the second audio feature passes through the discriminator approaches 0.0. That is, it is desirable that the first discrimination result be able to accurately recognize that the second audio feature is a false sample.
The first discrimination result and the second discrimination result have respective loss function values. The distance between the predicted value of the second audio feature and the value 0.0 (i.e., the probability that the second audio feature belongs to the second dialect is zero) is calculated to obtain a loss function value L3 of the first discrimination result. The distance between the predicted value of the fourth audio feature and the value 1.0 (i.e., the probability that the fourth audio feature belongs to the second dialect is one hundred percent) is calculated to obtain a loss function value L4 of the second discrimination result.
Thus, the loss function value L_d of the training process of the discriminator is: L_d = λ3*L3 + λ4*L4, where λ3 and λ4 are respective preset weights. Based on the loss function value L_d of the discriminator, back propagation is performed to adjust the model parameters of the discriminator, thereby completing a single training step of the discriminator.
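A corresponding minimal sketch of the discriminator-side loss L_d = λ3*L3 + λ4*L4 follows; again, MSE distances and unit weights are assumptions, and the publication only requires L3 to grow as the false-sample score moves away from 0.0 and L4 to grow as the true-sample score moves away from 1.0.

```python
# Minimal sketch of the discriminator-side loss L_d = λ3*L3 + λ4*L4 described above.
# MSE distances and λ3 = λ4 = 1.0 are illustrative assumptions.
import torch
import torch.nn.functional as F

def discriminator_loss(disc_score_fake: torch.Tensor,  # score for the second audio feature (false sample)
                       disc_score_real: torch.Tensor,  # score for the fourth audio feature (true sample)
                       lambda3: float = 1.0,
                       lambda4: float = 1.0) -> torch.Tensor:
    # L3: distance between the false-sample score and 0.0 (first discrimination result).
    l3 = F.mse_loss(disc_score_fake, torch.zeros_like(disc_score_fake))
    # L4: distance between the true-sample score and 1.0 (second discrimination result).
    l4 = F.mse_loss(disc_score_real, torch.ones_like(disc_score_real))
    return lambda3 * l3 + lambda4 * l4
```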
Based on the processes shown in figs. 3 and 4, alternating iterative training of the dialect conversion model and the discriminator is performed. When the quality of the second audio feature output by the dialect conversion model reaches the predetermined condition, the alternating iterative training can be completed.
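Purely as an illustration of the alternation itself, the sketch below freezes one network while the other is updated in a single alternating step; the module and optimizer objects are placeholders, and the simple MSE-based losses follow the L_g and L_d formulas above with unit weights rather than the publication's exact implementation.

```python
# Minimal sketch of one step of the alternating iterative training described above.
# "converter" and "discriminator" are assumed to be torch.nn.Module instances
# (e.g., the U-Net style converter and VGG-style discriminator sketched later).
import torch
import torch.nn.functional as F

def train_step(converter, discriminator, opt_g, opt_d, first_feat, fourth_feat):
    # --- train the dialect conversion model with the discriminator parameters fixed ---
    for p in discriminator.parameters():
        p.requires_grad_(False)
    second_feat = converter(first_feat)            # first dialect -> second dialect
    third_feat = converter(second_feat)            # second dialect -> first dialect
    score_fake = discriminator(second_feat)        # first discrimination result
    loss_g = F.mse_loss(score_fake, torch.ones_like(score_fake)) \
             + F.mse_loss(third_feat, first_feat)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # --- train the discriminator with the conversion-model parameters fixed (detached) ---
    for p in discriminator.parameters():
        p.requires_grad_(True)
    score_fake = discriminator(converter(first_feat).detach())
    score_real = discriminator(fourth_feat)        # second discrimination result
    loss_d = F.mse_loss(score_fake, torch.zeros_like(score_fake)) \
             + F.mse_loss(score_real, torch.ones_like(score_real))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    return loss_g.item(), loss_d.item()
```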
In one embodiment, the quality of the second audio feature output by the dialect conversion model reaching the predetermined condition comprises: converting the second audio feature into an audio file; playing the audio file; and, when playback feedback on the audio file indicates that the quality of the audio file is acceptable, determining that the quality of the second audio feature reaches the predetermined condition. For example, the playback feedback on the audio file may be implemented as manual feedback given after a user proficient in the second dialect listens to the audio file. When the user confirms that the quality of the audio file is acceptable, it is determined that the quality of the second audio feature reaches the predetermined condition, and the alternating iterative training of the dialect conversion model and the discriminator is completed. When the user confirms that the quality of the audio file is unacceptable, it is determined that the quality of the second audio feature does not reach the predetermined condition, and the alternating iterative training of the dialect conversion model and the discriminator continues.
In one embodiment, the quality of the second audio feature output by the dialect conversion model reaching the predetermined condition comprises: determining a parallel corpus comprising a fifth audio feature labeled as the first dialect and a sixth audio feature labeled as the second dialect, wherein the fifth audio feature is aligned with the sixth audio feature; inputting the parallel corpus into the dialect conversion model to obtain a seventh audio feature belonging to the second dialect and an eighth audio feature belonging to the first dialect, wherein the seventh audio feature is generated based on the fifth audio feature and the eighth audio feature is generated based on the sixth audio feature; and, when the seventh audio feature is aligned with the sixth audio feature and the eighth audio feature is aligned with the fifth audio feature, determining that the quality of the second audio feature output by the dialect conversion model reaches the predetermined condition.
It can be seen that, based on the parallel corpus, whether the quality of the second audio feature reaches the predetermined condition can be determined automatically, without intervention from a user who has mastered the second dialect.
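The publication does not define the alignment test numerically. A minimal sketch, assuming that the parallel features have matching frame counts and that "aligned" is taken to mean a mean squared distance below a threshold, could look like this:

```python
# Minimal sketch of the parallel-corpus quality check. Treating "aligned" as an MSE
# below a threshold is an assumption; the publication only states that the converted
# features should align with their parallel counterparts.
import torch
import torch.nn.functional as F

def quality_reached(converter, fifth_feat, sixth_feat, threshold: float = 0.1) -> bool:
    with torch.no_grad():
        seventh_feat = converter(fifth_feat)   # first dialect -> second dialect
        eighth_feat = converter(sixth_feat)    # second dialect -> first dialect
        forward_err = F.mse_loss(seventh_feat, sixth_feat)
        backward_err = F.mse_loss(eighth_feat, fifth_feat)
    return forward_err.item() < threshold and backward_err.item() < threshold
```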
The specific structure of the dialect conversion model is exemplarily described below. Fig. 5 is an exemplary block diagram of a dialect conversion model in accordance with an embodiment of the present invention. In fig. 5, the dialect conversion model is built with a U-NET network structure. The dialect conversion model includes an encoder and a decoder. The encoder includes a first downsampling, a second downsampling, and a third downsampling. The decoder includes a first upsampling, a second upsampling, and a third upsampling. The first downsampling result is added to the second upsampling result and then fed into the third upsampling. The second downsampling result is added to the first upsampling result and then fed into the second upsampling.
Fig. 6 is an exemplary block diagram of a downsampling unit according to an embodiment of the present invention. As can be seen from fig. 6, the downsampling unit includes a first convolutional neural network (CNN), an activation function (e.g., LeakyReLU), and a second CNN, connected in sequence. The downsampling unit has a residual structure, that is, the output of the first CNN is added to the output of the second CNN and then output.
Fig. 7 is an exemplary block diagram of an upsampling unit according to an embodiment of the present invention. As can be seen from fig. 7, the upsampling unit includes a one-layer transposed convolutional neural network (CNN-T), an activation function (e.g., LeakyReLU), and a one-layer CNN. The upsampling unit has a residual structure, that is, the output of the CNN-T is added to the output of the CNN and then output.
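A minimal sketch of a U-NET style converter built from such residual downsampling and upsampling units is given below. The channel widths, kernel sizes, and the use of 1-D convolutions over the frame axis are assumptions; only the overall topology (three residual downsampling units, three residual upsampling units, and the two skip additions of fig. 5) follows the description above.

```python
# Minimal sketch of the U-NET style dialect conversion model of figs. 5-7.
# Channel widths, kernel sizes and strides are illustrative assumptions.
import torch
import torch.nn as nn

class DownUnit(nn.Module):
    """First CNN -> LeakyReLU -> second CNN; the first CNN output is added to the second CNN output."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.cnn1 = nn.Conv1d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.act = nn.LeakyReLU(0.2)
        self.cnn2 = nn.Conv1d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        h = self.cnn1(x)
        return h + self.cnn2(self.act(h))   # residual structure

class UpUnit(nn.Module):
    """CNN-T -> LeakyReLU -> CNN; the CNN-T output is added to the CNN output."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.cnn_t = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.act = nn.LeakyReLU(0.2)
        self.cnn = nn.Conv1d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        h = self.cnn_t(x)
        return h + self.cnn(self.act(h))    # residual structure

class DialectConverter(nn.Module):
    """Encoder of three downsampling units, decoder of three upsampling units, with the skip additions of fig. 5."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.down1, self.down2, self.down3 = DownUnit(n_mels, 128), DownUnit(128, 256), DownUnit(256, 512)
        self.up1, self.up2, self.up3 = UpUnit(512, 256), UpUnit(256, 128), UpUnit(128, n_mels)

    def forward(self, x):                    # x: (batch, n_mels, frames), frames divisible by 8
        d1 = self.down1(x)
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        u1 = self.up1(d3)
        u2 = self.up2(d2 + u1)               # second downsampling result added to first upsampling result
        return self.up3(d1 + u2)             # first downsampling result added to second upsampling result
```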
Fig. 8 is an exemplary schematic diagram of a discriminator according to an embodiment of the invention. The discriminator may be constructed based on the classical image classification model VGG16. The discriminator consists of 3 blocks connected in series, a fully connected layer, and a scorer. Each block contains 3 CNNs and a pooling layer. The scorer has only one class to identify (i.e., the second dialect), and its predicted value reflects the distance of the current sample from the target.
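A minimal sketch of such a VGG16-inspired discriminator follows; the channel widths, the adaptive pooling used to obtain a fixed-size vector, and the sigmoid output are assumptions that the publication does not specify.

```python
# Minimal sketch of the discriminator of fig. 8: three serial blocks of three
# convolutions plus pooling, a fully connected layer, and a single-output scorer.
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
        nn.MaxPool1d(2),                       # pooling layer closing the block
    )

class DialectDiscriminator(nn.Module):
    def __init__(self, n_mels=80):
        super().__init__()
        self.blocks = nn.Sequential(vgg_block(n_mels, 128), vgg_block(128, 256), vgg_block(256, 512))
        self.pool = nn.AdaptiveAvgPool1d(1)    # fixed-length summary regardless of frame count
        self.fc = nn.Linear(512, 256)
        self.scorer = nn.Sequential(nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, x):                      # x: (batch, n_mels, frames)
        h = self.pool(self.blocks(x)).squeeze(-1)
        return self.scorer(self.fc(h))         # predicted probability of the second dialect
```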
While the above exemplarily describes typical structures of the dialect conversion model and the discriminator, those skilled in the art will recognize that this description is merely exemplary and is not intended to limit the scope of embodiments of the present invention.
Based on the dialect conversion model obtained through the training of fig. 2, a training dataset of the second dialect can be provided by way of data enhancement, which facilitates training an artificial intelligence model capable of automatically recognizing the second dialect using various mature network structures. The dialect conversion model can also be used directly as an artificial intelligence model for automatically converting speech into the second dialect.
Fig. 9 is an exemplary flowchart of a dialect conversion method according to an embodiment of the present invention. As shown in fig. 9, the method includes:
step 301: a trained dialect conversion model is obtained, the dialect conversion model being trained according to the training method for generating the countermeasure network as described above.
Step 302: inputting an audio feature to be converted that belongs to the first dialect into the dialect conversion model to obtain a converted audio feature belonging to the second dialect; and/or inputting an audio feature to be converted that belongs to the second dialect into the dialect conversion model to obtain a converted audio feature belonging to the first dialect.
In one embodiment, the audio feature to be converted belonging to the first dialect may be implemented as an audio feature of labeled speech data of the first dialect. The method further comprises: based on the converted audio features of the second dialect, training an artificial intelligence model that automatically recognizes the second dialect.
In one embodiment, the audio feature to be converted belonging to the first dialect may be implemented as an audio feature of real-time speech data acquired from a speaker, and the converted audio feature is an audio feature of real-time speech data of the second dialect converted for the acquired real-time speech data. The method further comprises the steps of: the converted audio features are converted into real-time speech data of the second dialect.
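As a usage illustration only, the sketch below runs a trained converter over FBank features of a first-dialect utterance. The class name DialectConverter and the feature layout follow the assumptions of the earlier sketches, the checkpoint path is hypothetical, and turning the converted features back into audible speech (e.g., with a vocoder) is not detailed by the publication.

```python
# Illustrative inference sketch reusing the hypothetical DialectConverter and FBank
# extraction from the earlier sketches; the checkpoint path is an assumption.
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi

converter = DialectConverter(n_mels=80)                        # class defined in the sketch above
converter.load_state_dict(torch.load("dialect_converter.pt"))
converter.eval()

waveform, sr = torchaudio.load("first_dialect_utterance.wav")
feats = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sr)  # (frames, 80)
frames = feats.shape[0] - feats.shape[0] % 8                   # crop so the U-Net strides divide evenly
feats = feats[:frames].t().unsqueeze(0)                        # (1, 80, frames) as the converter expects

with torch.no_grad():
    converted = converter(feats)                               # converted features of the second dialect
# Converting "converted" back to a waveform (e.g., via a vocoder) is outside the publication's scope here.
```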
Fig. 10 is an exemplary block diagram of a training apparatus for a generative adversarial network according to an embodiment of the present invention. The generative adversarial network includes a discriminator and a dialect conversion model serving as a generator. The training apparatus 400 includes:
a determining module 401, configured to determine a first audio feature belonging to a first dialect; and a training module 402, configured to input the first audio feature into the generative adversarial network and perform alternating iterative training of the dialect conversion model and the discriminator until the quality of a second audio feature output by the dialect conversion model reaches a predetermined condition, wherein the second audio feature belongs to a second dialect; wherein the alternating iterative training comprises: inputting the first audio feature into the dialect conversion model to generate the second audio feature; inputting the second audio feature into the dialect conversion model to generate a third audio feature belonging to the first dialect; and inputting the second audio feature into the discriminator to obtain a first discrimination result indicating whether the second audio feature belongs to the second dialect.
In one embodiment, the training module 402 is configured to fix the model parameters of the discriminator; and to perform a training process of the dialect conversion model, wherein the training process of the dialect conversion model comprises: determining a difference between the third audio feature and the first audio feature; determining a loss function value of the dialect conversion model based on the difference and the first discrimination result; and adjusting the model parameters of the dialect conversion model so that the loss function value of the dialect conversion model is lower than a first threshold.
In one embodiment, the training module 402 is configured to determine a first sub-loss function value, wherein the first sub-loss function value is greater when the difference is greater; determining a second sub-loss function value, wherein the second sub-loss function value is smaller the greater the probability that the first discrimination result characterizes the second audio feature as belonging to the second dialect; and determining a weighted operation result of the first sub-loss function value and the second sub-loss function value as a loss function value of the dialect conversion model.
In one embodiment, the training module 402 is configured to fix the model parameters of the dialect conversion model; and to execute a training process of the discriminator, wherein the training process of the discriminator comprises: inputting the fourth audio feature labeled as the second dialect into the discriminator to obtain a second discrimination result indicating whether the fourth audio feature belongs to the second dialect; determining a loss function value of the discriminator based on the first discrimination result and the second discrimination result; and adjusting the model parameters of the discriminator so that the loss function value of the discriminator is below a second threshold.
In one embodiment, the training module 402 is configured to determine a third sub-loss function value, where the third sub-loss function value is greater when the first discrimination result characterizes a greater probability that the second audio feature belongs to the second dialect; determining a fourth sub-loss function value, wherein the fourth sub-loss function value is smaller the greater the probability that the second discrimination result characterizes the fourth audio feature as belonging to the second dialect; and determining a weighted operation result of the third sub-loss function value and the fourth sub-loss function value as a loss function value of the discriminator.
In one implementation, the training module 402 is configured to convert the second audio feature into an audio file; playing the audio file; and when the playing feedback of the audio file represents that the quality of the audio file is qualified, determining that the quality of the second audio feature reaches a preset condition.
In one implementation, the training module 402 is configured to determine a parallel corpus comprising fifth audio features labeled as a first dialect and sixth audio features labeled as a second dialect, wherein the fifth audio features are aligned with the sixth audio features; inputting the parallel corpus into a dialect conversion model to obtain seventh audio features belonging to a second dialect and eighth audio features belonging to the first dialect, wherein the seventh audio features are generated based on the fifth audio features, and the eighth audio features are generated based on the sixth audio features; when the seventh audio feature is aligned with the sixth audio feature and the eighth audio feature is aligned with the fifth audio feature, it is determined that the quality of the second audio feature output by the dialect conversion model meets a predetermined condition.
Fig. 11 is an exemplary configuration diagram of a dialect conversion apparatus according to an embodiment of the present invention. The dialect conversion apparatus 500 includes:
an obtaining module 501, configured to obtain a trained dialect conversion model, wherein the dialect conversion model is trained according to the training method for a generative adversarial network described above; and an input module 502, configured to input an audio feature to be converted that belongs to a first dialect into the dialect conversion model to obtain a converted audio feature belonging to a second dialect; and/or input an audio feature to be converted that belongs to the second dialect into the dialect conversion model to obtain a converted audio feature belonging to the first dialect.
Fig. 12 is an exemplary structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 600 includes: a processor 601 and a memory 602. Processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 601 may be implemented in at least one hardware form of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 601 may also include a main processor and a coprocessor, the main processor being a processor for processing data in an awake state, also referred to as a central processing unit (Central Processing Unit, CPU); a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 601 may integrate a graphics processing unit (Graphics Processing Unit, GPU) responsible for rendering and drawing the content to be displayed on the display screen. In some implementations, the processor 601 may also include an AI processor for handling computing operations related to machine learning. For example, the AI processor may be implemented as a neural network processor.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices.
In some embodiments, a non-transitory computer-readable storage medium in memory 602 is used to store at least one instruction for execution by processor 601 to implement the training method for a generative adversarial network or the dialect conversion method provided by the various embodiments of the present disclosure. In some embodiments, the electronic device 600 may further optionally include: a peripheral interface 603 and at least one peripheral device. The processor 601, memory 602, and peripheral interface 603 may be connected by a bus or signal line. Each peripheral device may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral devices include: at least one of radio frequency circuitry 604, a touch display 605, a camera assembly 606, audio circuitry 607, a positioning assembly 608, and a power supply 609. The peripheral interface 603 may be used to connect at least one Input/Output (I/O) related peripheral device to the processor 601 and memory 602. In some implementations, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, memory 602, and peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 604 is configured to receive and transmit Radio Frequency (RF) signals, also known as electromagnetic signals. The radio frequency circuit 604 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 604 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuit 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or wireless fidelity (Wireless Fidelity, wi-Fi) networks. In some implementations, the radio frequency circuitry 804 may also include circuitry related to near field wireless communication (Near Field Communication, NFC), which is not limited by the present disclosure.
The display screen 605 is used to display a User Interface (UI). The UI may include graphics, text, icons, video, and any combination thereof. When the display 605 is a touch display, the display 605 also has the ability to collect touch signals at or above the surface of the display 605. The touch signal may be input as a control signal to the processor 601 for processing. At this point, the display 605 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, the display 605 may be one, disposed on the front panel of the electronic device 600; in other embodiments, the display screen 605 may be at least two, respectively disposed on different surfaces of the electronic device 600 or in a folded design; in some implementations, the display 605 may be a flexible display disposed on a curved surface or a folded surface of the electronic device 600. Even more, the display 605 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display 605 may be made of a liquid crystal display (Liquid Crystal Display, LCD), an Organic Light-Emitting Diode (OLED), or the like.
The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, the at least two rear cameras are any one of a main camera, a depth camera, a wide-angle camera and a tele camera, so as to realize that the main camera and the depth camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize a panoramic shooting and Virtual Reality (VR) shooting function or other fusion shooting functions. In some implementations, the camera assembly 606 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The dual-color temperature flash lamp refers to a combination of a warm light flash lamp and a cold light flash lamp, and can be used for light compensation under different color temperatures.
The audio circuit 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing, or inputting the electric signals to the radio frequency circuit 604 for voice communication. For purposes of stereo acquisition or noise reduction, the microphone may be multiple and separately disposed at different locations of the electronic device 600. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some implementations, the audio circuit 607 can also include a headphone jack. The location component 608 is used to locate the current geographic location of the electronic device 800 to enable navigation or location-based services (Location Based Service, LBS). The positioning component 608 may be a positioning component based on the U.S. global positioning system (Global Positioning System, GPS), the beidou system of china, the grainer system of russia, or the galileo system of the european union. The power supply 609 is used to power the various components in the electronic device 600. The power source 609 may be alternating current, direct current, disposable battery or rechargeable battery. When the power source 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging.
It will be appreciated by those skilled in the art that the foregoing structure is not limiting of the electronic device 600 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components. It should be noted that not all the steps and modules in the above processes and the structure diagrams are necessary, and some steps or modules may be omitted according to actual needs. The execution sequence of the steps is not fixed and can be adjusted as required. The division of the modules is merely for convenience of description and the division of functions adopted in the embodiments, and in actual implementation, one module may be implemented by a plurality of modules, and functions of a plurality of modules may be implemented by the same module, and the modules may be located in the same device or different devices. The hardware modules in the various embodiments may be implemented mechanically or electronically. For example, a hardware module may include specially designed permanent circuits or logic devices (e.g., special purpose processors such as FPGAs or ASICs) for performing certain operations. A hardware module may also include programmable logic devices or circuits (e.g., including a general purpose processor or other programmable processor) temporarily configured by software for performing particular operations. As regards implementation of the hardware modules in a mechanical manner, either by dedicated permanent circuits or by circuits that are temporarily configured (e.g. by software), this may be determined by cost and time considerations.
The present invention also provides a machine-readable storage medium storing instructions for causing a machine to perform a method as herein described. Specifically, a system or apparatus provided with a storage medium on which a software program code realizing the functions of any of the above embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus may be caused to read out and execute the program code stored in the storage medium. Further, some or all of the actual operations may be performed by an operating system or the like operating on a computer based on instructions of the program code. The program code read out from the storage medium may also be written into a memory provided in an expansion board inserted into a computer or into a memory provided in an expansion unit connected to the computer, and then, based on instructions of the program code, a CPU or the like mounted on the expansion board or the expansion unit may be caused to perform part or all of actual operations, thereby realizing the functions of any of the above embodiments. Storage medium implementations for providing program code include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs, DVD+RWs), magnetic tapes, non-volatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or cloud by a communications network.
The foregoing is merely a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A training method for a generative adversarial network, the generative adversarial network including a discriminator and a dialect conversion model serving as a generator, the method comprising:
determining a first audio feature belonging to a first dialect;
inputting the first audio feature into the generative adversarial network to perform alternating iterative training of the dialect conversion model and the discriminator until the quality of a second audio feature output by the dialect conversion model reaches a predetermined condition, wherein the second audio feature belongs to a second dialect; wherein the alternating iterative training comprises: inputting the first audio feature into the dialect conversion model to generate the second audio feature; inputting the second audio feature into the dialect conversion model to generate a third audio feature belonging to the first dialect; and inputting the second audio feature into the discriminator to obtain a first discrimination result indicating whether the second audio feature belongs to the second dialect.
2. The method of claim 1, wherein the alternating iterative training comprises:
fixing model parameters of the discriminator;
executing a training process of the dialect conversion model, wherein the training process of the dialect conversion model comprises: determining a difference between the third audio feature and the first audio feature; determining a loss function value of the dialect conversion model based on the difference and the first discrimination result; and adjusting model parameters of the dialect conversion model so that the loss function value of the dialect conversion model is lower than a first threshold.
3. The method of claim 2, wherein determining the loss function value of the dialect conversion model based on the difference and the first discrimination result comprises:
determining a first sub-loss function value, wherein the first sub-loss function value is greater when the difference is greater;
determining a second sub-loss function value, wherein the second sub-loss function value is smaller when the probability, indicated by the first discrimination result, that the second audio feature belongs to the second dialect is greater;
and determining a weighted combination of the first sub-loss function value and the second sub-loss function value as the loss function value of the dialect conversion model.
4. The method of claim 1, wherein the alternating iterative training comprises:
fixing model parameters of the dialect conversion model;
executing a training process of the discriminator, wherein the training process of the discriminator comprises: inputting a fourth audio feature labeled as the second dialect into the discriminator to obtain a second discrimination result of whether the fourth audio feature belongs to the second dialect; determining a loss function value of the discriminator based on the first discrimination result and the second discrimination result; and adjusting model parameters of the discriminator so that the loss function value of the discriminator is lower than a second threshold.
5. The method of claim 4, wherein determining the loss function value of the discriminator based on the first discrimination result and the second discrimination result comprises:
determining a third sub-loss function value, wherein the third sub-loss function value is greater when the probability, indicated by the first discrimination result, that the second audio feature belongs to the second dialect is greater;
determining a fourth sub-loss function value, wherein the fourth sub-loss function value is smaller when the probability, indicated by the second discrimination result, that the fourth audio feature belongs to the second dialect is greater;
and determining a weighted combination of the third sub-loss function value and the fourth sub-loss function value as the loss function value of the discriminator.
6. The method of any one of claims 1-5, wherein determining that the quality of the second audio feature output by the dialect conversion model reaches the predetermined condition comprises:
converting the second audio feature into an audio file; playing the audio file; and when playback feedback on the audio file indicates that the quality of the audio file is qualified, determining that the quality of the second audio feature reaches the predetermined condition; or
determining a parallel corpus, the parallel corpus comprising fifth audio features labeled as the first dialect and sixth audio features labeled as the second dialect, wherein the fifth audio features are aligned with the sixth audio features; inputting the parallel corpus into the dialect conversion model to obtain seventh audio features belonging to the second dialect and eighth audio features belonging to the first dialect, wherein the seventh audio features are generated based on the fifth audio features and the eighth audio features are generated based on the sixth audio features; and when the seventh audio features are aligned with the sixth audio features and the eighth audio features are aligned with the fifth audio features, determining that the quality of the second audio feature output by the dialect conversion model reaches the predetermined condition.
7. A method of dialect conversion comprising:
obtaining a trained dialect conversion model, wherein the dialect conversion model is trained according to the training method for a generated countermeasure network of any one of claims 1-6;
inputting the audio features to be converted belonging to the first dialect into the dialect conversion model to obtain converted audio features belonging to the second dialect; and/or
inputting the audio features to be converted belonging to the second dialect into the dialect conversion model to obtain converted audio features belonging to the first dialect.
8. A training apparatus for a generated countermeasure network, the generated countermeasure network comprising a discriminator and a dialect conversion model serving as a generator, the apparatus comprising:
a determining module for determining a first audio feature belonging to a first dialect;
a training module for inputting the first audio feature into the generated countermeasure network to perform alternating iterative training on the dialect conversion model and the discriminator until the quality of a second audio feature output by the dialect conversion model reaches a predetermined condition, wherein the second audio feature belongs to a second dialect; wherein the alternating iterative training comprises: inputting the first audio feature into the dialect conversion model to generate the second audio feature; inputting the second audio feature into the dialect conversion model to generate a third audio feature that belongs to the first dialect; and inputting the second audio feature into the discriminator to obtain a first discrimination result of whether the second audio feature belongs to the second dialect.
9. A dialect conversion apparatus, characterized by comprising:
an acquisition module for acquiring a trained dialect conversion model, wherein the dialect conversion model is trained according to the training method for a generated countermeasure network of any one of claims 1 to 6;
an input module for inputting audio features to be converted belonging to the first dialect into the dialect conversion model to obtain converted audio features belonging to the second dialect; and/or inputting audio features to be converted belonging to the second dialect into the dialect conversion model to obtain converted audio features belonging to the first dialect.
10. An electronic device, comprising:
a memory;
a processor;
wherein the memory stores an application executable by the processor for causing the processor to perform the training method for a generated countermeasure network as claimed in any one of claims 1 to 6 or the dialect conversion method as claimed in claim 7.
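As a concrete illustration of the alternating iterative training recited in claims 1-5, the following is a minimal PyTorch-style sketch. The network architectures, feature dimensions, optimizer settings, loss weights (lambda_rec, lambda_adv) and the dummy data are illustrative assumptions and are not specified by the claims; only the overall structure, i.e. a cycle-reconstruction term plus an adversarial term for the generator, and real/generated terms for the discriminator, with the other model's parameters fixed during each step, follows the claim language.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DialectConverter(nn.Module):
    """Generator: maps an audio-feature sequence of one dialect to the other dialect."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x):              # x: (batch, frames, feat_dim)
        return self.net(x)

class Discriminator(nn.Module):
    """Outputs the probability that an audio-feature sequence belongs to the second dialect."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):              # returns (batch, 1) probabilities
        return torch.sigmoid(self.net(x).mean(dim=1))

G, D = DialectConverter(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
lambda_rec, lambda_adv = 10.0, 1.0     # assumed weights for the weighted combinations

def converter_step(first_feat):
    """Claims 2-3: update the dialect conversion model with the discriminator fixed."""
    for p in D.parameters():
        p.requires_grad_(False)        # fix discriminator parameters
    second_feat = G(first_feat)        # first dialect -> second dialect
    third_feat = G(second_feat)        # second dialect -> back to the first dialect
    rec_loss = F.l1_loss(third_feat, first_feat)   # first sub-loss: grows with the difference
    d_fake = D(second_feat)            # first discrimination result
    adv_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))  # second sub-loss: shrinks as D is fooled
    loss_g = lambda_rec * rec_loss + lambda_adv * adv_loss
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    for p in D.parameters():
        p.requires_grad_(True)
    return loss_g.item(), second_feat.detach()

def discriminator_step(second_feat, fourth_feat):
    """Claims 4-5: update the discriminator with the converter fixed (second_feat is detached)."""
    d_fake = D(second_feat)            # generated feature should be judged as not the second dialect
    d_real = D(fourth_feat)            # feature labeled as the second dialect should be judged as such
    loss_fake = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))  # third sub-loss
    loss_real = F.binary_cross_entropy(d_real, torch.ones_like(d_real))   # fourth sub-loss
    loss_d = 0.5 * loss_fake + 0.5 * loss_real       # weighted combination
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_d.item()

# One round of alternating iterative training on dummy data.
first_feat = torch.randn(4, 200, 80)   # features belonging to the first dialect
fourth_feat = torch.randn(4, 200, 80)  # features labeled as the second dialect
g_loss, fake = converter_step(first_feat)
d_loss = discriminator_step(fake, fourth_feat)
```

In practice the two steps would be repeated alternately over batches until the loss values fall below the first and second thresholds and the quality condition of claim 6 is satisfied.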
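One possible reading of the parallel-corpus check in claim 6 is a frame-level similarity test between the converted features and the time-aligned reference features. In the sketch below, the use of mean absolute error and the tolerance value `tol` are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def quality_reaches_condition(model, fifth_feat, sixth_feat, tol=0.1):
    """Sketch of the alignment check in claim 6.

    fifth_feat : audio features labeled as the first dialect
    sixth_feat : time-aligned audio features labeled as the second dialect
    tol        : assumed tolerance for treating two feature sequences as aligned
    """
    with torch.no_grad():
        seventh_feat = model(fifth_feat)   # expected to align with the second-dialect reference
        eighth_feat = model(sixth_feat)    # expected to align with the first-dialect reference
    forward_ok = F.l1_loss(seventh_feat, sixth_feat).item() < tol
    backward_ok = F.l1_loss(eighth_feat, fifth_feat).item() < tol
    return forward_ok and backward_ok
```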
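For the dialect conversion of claims 7 and 9, inference is a single forward pass in either direction. The snippet below assumes the trained converter `G` and the feature shapes from the training sketch above; the dummy tensors stand in for real extracted audio features.

```python
import torch

# Dummy audio features to be converted (batch, frames, feat_dim); shapes are assumptions.
feat_first = torch.randn(1, 200, 80)    # features belonging to the first dialect
feat_second = torch.randn(1, 200, 80)   # features belonging to the second dialect

with torch.no_grad():
    converted_to_second = G(feat_first)   # first dialect -> second dialect
    converted_to_first = G(feat_second)   # second dialect -> first dialect
```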
CN202310499126.7A 2023-05-06 2023-05-06 Training and dialect conversion method and device for generating countermeasure network and electronic equipment Active CN116206622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310499126.7A CN116206622B (en) 2023-05-06 2023-05-06 Training and dialect conversion method and device for generating countermeasure network and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310499126.7A CN116206622B (en) 2023-05-06 2023-05-06 Training and dialect conversion method and device for generating countermeasure network and electronic equipment

Publications (2)

Publication Number Publication Date
CN116206622A true CN116206622A (en) 2023-06-02
CN116206622B CN116206622B (en) 2023-09-08

Family

ID=86517698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310499126.7A Active CN116206622B (en) 2023-05-06 2023-05-06 Training and dialect conversion method and device for generating countermeasure network and electronic equipment

Country Status (1)

Country Link
CN (1) CN116206622B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796253A (en) * 2019-11-01 2020-02-14 中国联合网络通信集团有限公司 Training method and device for generating countermeasure network
US20210200965A1 (en) * 2019-12-30 2021-07-01 Tmrw Foundation Ip S. À R.L. Cross-lingual voice conversion system and method
CN113129914A (en) * 2019-12-30 2021-07-16 明日基金知识产权有限公司 Cross-language speech conversion system and method
CN111444967A (en) * 2020-03-30 2020-07-24 腾讯科技(深圳)有限公司 Training method, generation method, device, equipment and medium for generating confrontation network
CN114512112A (en) * 2022-01-26 2022-05-17 达闼科技(北京)有限公司 Training method and device of speech synthesis model, electronic equipment and storage medium
CN114882897A (en) * 2022-05-13 2022-08-09 平安科技(深圳)有限公司 Training of voice conversion model, voice conversion method, device and related equipment

Also Published As

Publication number Publication date
CN116206622B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN108615526B (en) Method, device, terminal and storage medium for detecting keywords in voice signal
CN111933112B (en) Awakening voice determination method, device, equipment and medium
CN111564152B (en) Voice conversion method and device, electronic equipment and storage medium
CN110556127B (en) Method, device, equipment and medium for detecting voice recognition result
CN110164421B (en) Voice decoding method, device and storage medium
CN111883117B (en) Voice wake-up method and device
CN110162604B (en) Statement generation method, device, equipment and storage medium
CN110931000B (en) Method and device for speech recognition
CN111105788B (en) Sensitive word score detection method and device, electronic equipment and storage medium
CN111414736A (en) Story generation model training method, device, equipment and storage medium
CN111324699A (en) Semantic matching method and device, electronic equipment and storage medium
CN115471662B (en) Training method, recognition method, device and storage medium for semantic segmentation model
CN111739517A (en) Speech recognition method, speech recognition device, computer equipment and medium
CN112667844A (en) Method, device, equipment and storage medium for retrieving audio
CN111325220B (en) Image generation method, device, equipment and storage medium
CN114333774A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN110837557B (en) Abstract generation method, device, equipment and medium
CN110502126B (en) Input method and electronic equipment
CN116206622B (en) Training and dialect conversion method and device for generating countermeasure network and electronic equipment
CN111428079A (en) Text content processing method and device, computer equipment and storage medium
CN113838479B (en) Word pronunciation evaluation method, server and system
CN115658857A (en) Intelligent dialogue method, device, equipment and storage medium
CN114328815A (en) Text mapping model processing method and device, computer equipment and storage medium
CN112750425B (en) Speech recognition method, device, computer equipment and computer readable storage medium
CN114154520A (en) Training method of machine translation model, machine translation method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant