WO2021179717A1 - Speech recognition front-end processing method and apparatus, and terminal device - Google Patents

Speech recognition front-end processing method and apparatus, and terminal device

Info

Publication number
WO2021179717A1
Authority
WO
WIPO (PCT)
Prior art keywords: voice, feature parameter, distribution, speech, feature
Application number
PCT/CN2020/135511
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
贾雪丽
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021179717A1 publication Critical patent/WO2021179717A1/en


Classifications

    • G PHYSICS > G10 MUSICAL INSTRUMENTS; ACOUSTICS > G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the extracted parameters being spectral information of each sub-band

Definitions

  • This application belongs to the technical field of speech recognition, and in particular relates to a front-end processing method, device and terminal equipment for speech recognition.
  • Automatic Speech Recognition (ASR) converts the vocabulary content of human speech into computer-readable input; it differs from speaker recognition and speaker verification. With the development and application of deep learning technology, automatic speech recognition has improved significantly and is widely used in many fields of daily life.
  • However, the inventor realizes that when a speech signal contains a small amount of noise or undergoes subtle changes, such as the natural disturbances in human speech caused by psychological or physiological factors (including expressive speech signals of different emotions such as laughter, excitement, and frustration, or speech signals with incidental squeaking and breathing sounds produced by different voice qualities), the performance of automatic speech recognition is affected and degraded.
  • In view of this, the embodiments of the present application provide a front-end processing method, apparatus, and terminal device for speech recognition, to solve the problem that natural disturbances in human speech caused by psychological or physiological factors affect and degrade the performance of automatic speech recognition.
  • an embodiment of the present application provides a front-end processing method for speech recognition, including:
  • acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data; performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice; inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
  • an embodiment of the present application provides a front-end processing device for speech recognition, including:
  • the acquiring unit is configured to acquire an original voice signal, and preprocess the original voice signal according to a preset format to obtain source voice data;
  • a feature extraction unit configured to perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;
  • a data processing unit configured to input the first voice feature parameter into a voice conversion model, and output a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data;
  • the synthesis unit is configured to synthesize the target voice data according to the second voice feature parameter, and use the target voice data as the input of a voice recognition model to perform voice recognition.
  • In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:
  • acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data; performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice; inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
  • In a fourth aspect, the embodiments of the present application provide a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program that, when executed by a processor, implements:
  • acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data; performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice; inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
  • the embodiments of the present application provide a computer program product that, when the computer program product runs on a terminal device, causes the terminal device to execute the front-end processing method for speech recognition according to any one of the above-mentioned first aspects.
  • Compared with the prior art, the embodiments of this application have the following beneficial effects. Through the embodiments of this application, an original voice signal is obtained and preprocessed according to a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain a first voice feature parameter, an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input into a voice conversion model, which outputs a converted second voice feature parameter, the feature parameter of the target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a voice recognition model to perform voice recognition.
  • Before voice recognition is performed, the original voice signal is preprocessed and its voice feature parameters are converted. The voice conversion filters out natural disturbances in the original voice data, converting the feature parameters of disturbed source voice data into the feature parameters of undisturbed natural voice data, and synthesizes the corresponding undisturbed voice data as the input of voice recognition. Visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data realizes the non-parallel conversion of voice data and improves the robustness and accuracy of voice recognition.
  • FIG. 1 is a schematic diagram of an application scenario system provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a front-end processing method for speech recognition provided by an embodiment of the present application
  • FIG. 3 is a schematic flowchart of an iterative training method for an adversarial network model provided by another embodiment of the present application
  • FIG. 4 is a schematic diagram of the network structure of the adversarial network model provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a front-end processing device for speech recognition provided by an embodiment of the present application
  • Fig. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • The term "if" can be construed as "when", "once", "in response to determining", or "in response to detecting", depending on the context.
  • Similarly, the phrase "if determined" or "if [the described condition or event] is detected" can be interpreted as "once determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]", depending on the context.
  • The front-end processing method for speech recognition provided by the embodiments of this application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, and personal digital assistants (PDAs).
  • FIG. 1 is a schematic diagram of an application scenario system provided by an embodiment of the present application.
  • The front-end processing method for speech recognition provided by the embodiment of the present application can be applied to mobile terminals or fixed devices, such as the smart phone 101, the laptop 102, and the desktop computer 103; the embodiment of the application does not impose any restrictions on the specific types of terminal devices.
  • The terminal device interacts with the server 104 in a wired or wireless manner. The voice assistant of the terminal device obtains an external voice signal and performs front-end processing on it to filter out certain interference factors, converting the disturbed voice signal into a natural voice signal with no disturbance or minimal disturbance. The processed signal is then transmitted to the server by wired or wireless means; the server performs speech recognition, natural language processing, and related business processing and feeds the results back to the terminal device, which executes corresponding actions according to the business processing information. Voice assistants such as Siri, Google Assistant, and Amazon Alexa are applications of the front-end processing method in automatic speech recognition (ASR) systems.
  • Wireless methods include the Internet, WiFi networks, or mobile networks. Mobile networks can include existing 2G (such as the Global System for Mobile Communication, GSM), 3G (such as the Universal Mobile Telecommunications System, UMTS), 4G (such as FDD LTE and TDD LTE), 4.5G, and 5G networks.
  • FIG. 2 shows a schematic flowchart of the front-end processing method of speech recognition provided by the present application, and the front-end processing method of speech recognition includes:
  • Step S201 Obtain an original voice signal, and preprocess the original voice signal according to a preset format to obtain source voice data.
  • The execution body of this embodiment may be a terminal device with a voice recognition function, which implements front-end processing of voice signals for application scenarios where voice recognition is performed. That is, before semantic recognition is performed on the voice, front-end processing is applied to the disturbed or noisy voice signal to obtain normal, noise-free voice data, which is then used as the input of the voice recognition system to improve the accuracy and robustness of voice recognition.
  • the original voice signal may be a voice signal with disturbance or noise, such as a voice signal with natural interference caused by psychology or physiology.
  • For example, the disturbed voice signals may include voice signals expressing different emotions such as laughter, excitement, or depression, or voice signals with squeaking and breathing sounds produced by different voice qualities.
  • Obtaining the original voice signal and preprocessing the original voice signal according to a preset format to obtain the source voice data includes:
  • A1. Perform filtering processing on the original voice signal;
  • A2. Periodically sample the filtered voice signal to obtain voice sampling data at a preset frequency; for example, the original speech signal is filtered and sampled at a frequency of 16 kHz;
  • A3. Perform windowing and framing processing on the voice sampling data to obtain the source voice data.
  • The voice sampling data is windowed. Since the voice signal is strongly time-varying in the time domain, it is divided into short-time segments of fixed duration whose characteristics remain approximately unchanged within that duration. The fixed duration is typically between 10 and 30 milliseconds and is realized by windowing, for example by multiplying the voice signal by a window function 20 milliseconds long; the spectral characteristics of the windowed voice signal are stable within the duration of the window (20 milliseconds).
  • The voice signal is then divided into frames. To ensure the continuity and reliability of the dynamically changing information in the voice signal, an overlap is set between two adjacent frames to maintain a smooth transition between frames.
  • Endpoint detection is performed on the voice signal to mark and determine the starting point and ending point of each frame, reducing the impact of bursts or discontinuities on voice signal analysis.
  • The acquired voice data frames are used as the source voice data to be analyzed, as illustrated in the sketch below.
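  • A minimal sketch of the preprocessing in steps A1-A3 follows. The patent publishes no reference code; the 16 kHz rate, 20 ms window, and overlapping frames are taken from the text, while the Butterworth low-pass filter and the Hamming window are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def preprocess(signal, fs=16000, win_ms=20, hop_ms=5):
    """Filter, window, and frame a speech signal (steps A1-A3).

    The 16 kHz rate, 20 ms window, and 5 ms hop follow the patent text;
    the 8th-order Butterworth low-pass filter and the Hamming window are
    assumptions the patent does not specify.
    """
    # A1: low-pass filtering (assumed design, normalized cutoff near Nyquist)
    b, a = butter(8, 0.95)
    filtered = lfilter(b, a, signal)

    # A2: the signal is assumed to already be sampled at fs = 16 kHz
    win = int(fs * win_ms / 1000)   # 320 samples per 20 ms window
    hop = int(fs * hop_ms / 1000)   # 80-sample hop, so adjacent frames overlap

    # A3: split into overlapping frames and window each one
    n_frames = max(0, 1 + (len(filtered) - win) // hop)
    frames = np.stack([filtered[i * hop:i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)  # short-time spectra are stable per frame
```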
  • The original voice signal may also be a normal, noise-free voice signal. In the front-end processing part of the voice recognition system, front-end processing of normal, undisturbed voice will not affect the subsequent recognition of the voice signal.
  • Step S202 Perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice.
  • the first voice feature parameter is an acoustic feature parameter describing the timbre of the voice extracted based on the voice data frame, such as a frequency spectrum parameter; the first voice feature parameter also includes a parameter for characterizing prosodic features of the voice, For example, the pitch frequency parameter.
  • performing voice feature extraction on the source voice data to obtain the first voice feature parameter of the source voice data includes:
  • Specifically, the first voice feature parameters are extracted every 5 milliseconds and include Mel spectrum feature parameters extracted with a Mel filter bank (MFB), logarithmic fundamental frequency (log F0) feature parameters, and aperiodic component (APs) features.
  • The Mel spectrum feature parameters and the aperiodic component (APs) features are each 24-dimensional voice feature parameters.
  • For the Mel spectrum feature parameters, feature extraction is performed every 5 milliseconds within the 20-millisecond voice data window of each frame. The time-domain signal of each frame of source voice data is padded to a sequence the same length as the window width, a discrete Fourier transform is applied to obtain the linear spectrum of each frame, and the linear spectrum is passed through the Mel frequency filter bank to obtain the Mel spectrum. The Mel filter bank generally includes 24 triangular band-pass filters; it smooths the acquired spectral features, effectively emphasizes the low-frequency information of the voice data, highlights useful information, and shields the interference of noise.
  • For the logarithmic fundamental frequency feature parameters, after windowing each preprocessed frame of source voice data, the cepstrum of the frame is calculated and a pitch-search length range is set. The maximum of the cepstrum within this range is queried: if the maximum is greater than the window threshold, the pitch frequency of voiced speech is calculated from it, and the logarithm of the pitch frequency is taken to reflect the characteristics of the voice data; if the maximum of the cepstrum is less than or equal to the window threshold, the frame of source voice data is silence or unvoiced. A sketch of this cepstral pitch decision follows.
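  • A minimal numpy sketch of the cepstral decision just described; the 50-400 Hz search range and the peak threshold value are illustrative assumptions not fixed by the patent.

```python
import numpy as np

def cepstral_log_f0(frame, fs=16000, f0_min=50.0, f0_max=400.0, threshold=0.08):
    """Return log F0 for a windowed frame, or None for silence/unvoiced.

    Implements the patent's cepstral decision: find the cepstral peak in a
    pitch-search range and compare it against a threshold. The search range
    (50-400 Hz) and the threshold value are assumptions for illustration.
    """
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)

    # A pitch period of P samples appears at quefrency P; search fs/f0_max..fs/f0_min
    lo, hi = int(fs / f0_max), min(int(fs / f0_min), len(cepstrum) - 1)
    peak_idx = lo + int(np.argmax(cepstrum[lo:hi]))

    if cepstrum[peak_idx] <= threshold:
        return None                 # silence or unvoiced frame
    f0 = fs / peak_idx              # voiced: pitch frequency from the cepstral peak
    return np.log(f0)
```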
  • For the aperiodic component feature parameters, an inverse Fourier transform is performed on the windowed signal of the source voice data to obtain the time-domain characteristics of the aperiodic components, and the frequency-domain characteristics of the aperiodic components are determined from the minimum phase of the windowed signal and the spectral features of the source voice data (see the combined extraction sketch below).
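  • Taken together, one way to obtain the three feature streams (a 24-band Mel spectrum at a 5 ms hop, log F0, and aperiodicity) is with off-the-shelf analysis libraries. The sketch below uses librosa and the WORLD vocoder wrapper pyworld as stand-ins for the patent's own extraction pipeline, which is an implementation assumption.

```python
import numpy as np
import librosa
import pyworld

def extract_features(x, fs=16000):
    """Extract the three feature streams described in the patent.

    librosa/pyworld are stand-ins for the patent's own extractors;
    the 24 Mel bands, 20 ms window, and 5 ms hop follow the text.
    """
    x = x.astype(np.float64)

    # 24-band Mel spectrum, 20 ms window (320 samples), 5 ms hop (80 samples)
    mel = librosa.feature.melspectrogram(
        y=x.astype(np.float32), sr=fs,
        n_fft=512, win_length=320, hop_length=80, n_mels=24)

    # Fundamental frequency and aperiodicity via WORLD (5 ms frame period)
    f0, t = pyworld.dio(x, fs, frame_period=5.0)
    f0 = pyworld.stonemask(x, f0, t, fs)     # refine the F0 estimates
    ap = pyworld.d4c(x, f0, t, fs)           # aperiodic components

    # log F0 on voiced frames; unvoiced/silent frames are flagged with 0
    log_f0 = np.where(f0 > 0, np.log(np.maximum(f0, 1e-10)), 0.0)
    return mel, log_f0, ap
```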
  • Step S203 Input the first voice feature parameter into the voice conversion model, and output the second voice feature parameter after the conversion.
  • the second voice feature parameter is the feature parameter of the target voice data.
  • The voice conversion model is a model obtained by training on a sample training data set using a cycle-consistent adversarial network model (CycleGAN).
  • The first voice feature parameter extracted from the source voice data is input into the voice conversion model, and after voice conversion the second voice feature parameter is output. The second voice feature parameter is the voice feature parameter most similar to an actual normal voice feature parameter, that is, the feature parameter of the target voice data, where the target voice data is voice data with minimal or no disturbance.
  • FIG. 3 is a schematic flowchart of a method for iteratively training the adversarial network model provided by another embodiment of the present application; the training of the voice conversion model includes the following steps:
  • Step S301 Obtain a random sample and an actual sample in the speech sample training data set, and extract the random sample feature parameter distribution of the random sample and the actual sample feature parameter distribution of the actual sample respectively;
  • For example, two spontaneous speech data sets, the AMI meeting corpus and the Buckeye corpus of conversational speech, are used to analyze the impact of natural disturbance. From the two speech data sets, voice data from 40 female speakers and 30 male speakers is obtained and used as the voice sample training data set: 210 utterances in total, covering each gender and each type (normal speech, laughing speech, and squeaky speech). Of these 210 utterances, 150 are used for training and 60 for testing, and the duration of each utterance is 1-2 seconds; this data is used to train the voice conversion model.
  • The cycle-consistent adversarial network model includes a generator and a discriminator. Specifically, random samples and actual samples are obtained from the voice sample training data set, the random sample feature parameters and the actual sample feature parameters are extracted, and the random sample feature parameter distribution is used as the input of the generator.
  • Step S302: Perform iterative training on the to-be-trained adversarial network model according to the random sample feature parameter distribution and the actual sample feature parameter distribution;
  • The cycle-consistent adversarial network model includes a generator and a discriminator; according to the random sample feature parameters, the generator generates a pseudo-sample feature parameter distribution similar to the actual sample feature parameter distribution.
  • The pseudo-sample feature parameter distribution is input into the discriminator, which distinguishes it from the actual sample feature parameter distribution.
  • Step S303: Calculate the output error of the adversarial network model in the iterative training process according to a preset loss function.
  • The adversarial network model uses a preset loss function to calculate the error during iterative training, and uses the error as the target training value of the adversarial network model.
  • Step S304: When the error is less than or equal to the preset error threshold, stop training to obtain the voice conversion model.
  • When the trained adversarial network model meets the conversion conditions, training is stopped to obtain the voice conversion model. Through the voice conversion model, disturbed speech feature parameters are converted into actual normal speech feature parameters, completing the conversion of non-parallel speech. A sketch of the training loop follows.
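  • A hedged PyTorch sketch of the alternating update loop with the threshold-based stop of steps S302-S304; the optimizer choice, learning rate, and threshold value are illustrative assumptions, and loss_fn is the preset loss of the form given later in this section.

```python
import torch

def train(G_xy, G_yx, D_x, D_y, loader, loss_fn, err_threshold=0.05, max_epochs=200):
    """Alternate discriminator/generator updates; stop when the monitored
    error (the preset-loss value) falls to or below the threshold (step S304).

    Adam with lr=2e-4, the threshold, and the epoch cap are assumptions.
    """
    g_opt = torch.optim.Adam(list(G_xy.parameters()) + list(G_yx.parameters()), lr=2e-4)
    d_opt = torch.optim.Adam(list(D_x.parameters()) + list(D_y.parameters()), lr=2e-4)
    for epoch in range(max_epochs):
        for x, y in loader:   # x: disturbed-speech features, y: normal-speech features
            # Discriminator step: real -> 1, generated -> 0 (least-squares form assumed)
            d_opt.zero_grad()
            d_loss = ((D_y(y) - 1) ** 2).mean() + (D_y(G_xy(x).detach()) ** 2).mean() \
                   + ((D_x(x) - 1) ** 2).mean() + (D_x(G_yx(y).detach()) ** 2).mean()
            d_loss.backward()
            d_opt.step()
            # Generator step: minimize the preset loss (adversarial + cycle + identity)
            g_opt.zero_grad()
            g_loss = loss_fn(G_xy, G_yx, D_y, D_x, x, y)
            g_loss.backward()
            g_opt.step()
        if g_loss.item() <= err_threshold:   # step S304: error at/below threshold
            return
```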
  • Performing iterative training on the to-be-trained adversarial network according to the random sample feature parameter distribution and the actual sample feature parameter distribution includes:
  • The conversion from disturbed voice features to normal voice features is modeled as follows. Voice feature parameters are extracted from the acquired random samples, and the extracted feature parameter distribution (x ∈ X) is input to the generator, which generates a pseudo-sample feature parameter distribution G_X→Y(x). Through the first adversarial loss function L_adv(G_X→Y(x), Y), the distance between the pseudo-sample feature parameter distribution G_X→Y(x) and the actual sample feature parameter distribution (y ∈ Y) is calculated; that is, the conversion from disturbed speech to normal speech is realized.
  • The discriminator distinguishes the generated pseudo-sample features from the actual sample features to obtain the discrimination result G_Y→X(y), and the second adversarial loss function L_adv(G_Y→X(y), X) calculates the distance between the discrimination result and the random sample features.
  • As shown in FIG. 4, the adversarial network model includes a generator G and a discriminator D.
  • The generator G generates a pseudo-sample feature parameter distribution G(x); the pseudo-sample feature parameter distribution and the actual sample feature distribution are input to the discriminator D, which discriminates between them to obtain a discrimination result. The discrimination result is then fed back to the generator G or the discriminator D to train the adversarial network model cyclically.
  • In this embodiment, the generator and discriminator networks in the voice conversion model are each composed of convolution blocks.
  • The generator network consists of 9 convolution blocks, including a stride-1 convolution block, a stride-2 convolution block, 5 residual blocks, a 1/2-stride convolution block, and a stride-1 convolution block;
  • All convolutional layers are one-dimensional; gated linear units serve as the activation function of the convolutional layers and have achieved state-of-the-art performance in language and speech modeling.
  • The discriminator network is composed of four two-dimensional convolution blocks, with gated linear units as the activation function of all convolution blocks; for the discriminator network, a 6×6 PatchGAN is used to classify each 6×6 patch as real or fake. A hedged architecture sketch follows.
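  • The PyTorch sketch below illustrates the stated generator layout (9 blocks: stride-1, stride-2, 5 residual, 1/2-stride, stride-1; 1-D convolutions; GLU activations). Kernel sizes and channel widths are illustrative assumptions, and the 1/2-stride block is read here as a transposed convolution; the patent fixes only the block layout and the GLU.

```python
import torch
import torch.nn as nn

class GLUConv1d(nn.Module):
    """1-D convolution gated by a GLU (the patent's stated activation)."""
    def __init__(self, c_in, c_out, k=5, stride=1, transpose=False):
        super().__init__()
        if transpose:  # "1/2-stride" block, read as a stride-2 transposed conv
            self.conv = nn.ConvTranspose1d(c_in, 2 * c_out, k, stride=2,
                                           padding=k // 2, output_padding=1)
        else:
            self.conv = nn.Conv1d(c_in, 2 * c_out, k, stride=stride, padding=k // 2)
        self.glu = nn.GLU(dim=1)  # halves 2*c_out channels back to c_out

    def forward(self, x):
        return self.glu(self.conv(x))

class ResBlock(nn.Module):
    def __init__(self, c, k=3):
        super().__init__()
        self.block = nn.Sequential(GLUConv1d(c, c, k),
                                   nn.Conv1d(c, c, k, padding=k // 2))
    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    """9 blocks as listed in the patent; widths (128/256) are assumptions."""
    def __init__(self, feat_dim=24, width=128):
        super().__init__()
        self.net = nn.Sequential(
            GLUConv1d(feat_dim, width, k=15, stride=1),        # stride-1 block
            GLUConv1d(width, 2 * width, k=5, stride=2),        # stride-2 downsample
            *[ResBlock(2 * width) for _ in range(5)],          # 5 residual blocks
            GLUConv1d(2 * width, width, k=5, transpose=True),  # 1/2-stride upsample
            nn.Conv1d(width, feat_dim, 15, padding=7),         # stride-1 output block
        )
    def forward(self, x):  # x: (batch, feat_dim, frames)
        return self.net(x)
```

The discriminator would analogously stack four 2-D GLU-gated convolution blocks over the feature map and emit one real/fake score per 6×6 patch.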
  • Calculating the error output by the adversarial network model during the iterative training process according to the preset loss function includes:
  • According to the first adversarial loss function and the second adversarial loss function, the cycle-consistency loss function and the identity-mapping loss function of the adversarial network model are obtained. Here the first adversarial loss function L_adv(G_X→Y(x), Y) calculates the distance between the pseudo-sample feature parameter distribution and the actual sample feature parameter distribution, and the second adversarial loss function L_adv(G_Y→X(y), X) calculates the distance between the discrimination-result feature distribution and the random sample feature distribution;
  • The cycle-consistency loss is L_cyc = E_x[||G_Y→X(G_X→Y(x)) − x||_1] + E_y[||G_X→Y(G_Y→X(y)) − y||_1], and the identity-mapping loss is L_id = E_x[||G_Y→X(x) − x||_1] + E_y[||G_X→Y(y) − y||_1];
  • The preset loss function of the adversarial network model is then obtained as L = L_adv(G_X→Y(x), Y) + L_adv(G_Y→X(y), X) + λ_cyc L_cyc + λ_id L_id, where λ_cyc and λ_id are hyperparameters controlling the weights of the cycle-consistency loss function and the identity-mapping loss function.
  • The adversarial network model outputs the error calculated by the preset loss function, and the error is used as the target training value.
  • The error is taken as the target training value; when the value of the full loss function is minimized, training is completed and the voice conversion model is obtained. A sketch of this objective follows.
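  • A minimal PyTorch sketch of the full objective under stated assumptions: a least-squares adversarial term is used for concreteness (the patent does not fix the adversarial loss form), and λ_cyc = 10, λ_id = 5 are illustrative values.

```python
import torch
import torch.nn.functional as F

def cyclegan_vc_loss(G_xy, G_yx, D_y, D_x, x, y, lam_cyc=10.0, lam_id=5.0):
    """Generator objective L = L_adv + L_adv + lam_cyc*L_cyc + lam_id*L_id.

    Least-squares adversarial terms and the lambda values are assumptions;
    the patent fixes only the overall form of the loss.
    """
    fake_y = G_xy(x)   # disturbed -> normal
    fake_x = G_yx(y)   # normal -> disturbed

    # Adversarial losses: generators try to make the discriminators output 1
    pred_y, pred_x = D_y(fake_y), D_x(fake_x)
    l_adv = F.mse_loss(pred_y, torch.ones_like(pred_y)) + \
            F.mse_loss(pred_x, torch.ones_like(pred_x))

    # Cycle consistency: X -> Y -> X should reconstruct x (and vice versa)
    l_cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)

    # Identity mapping: feeding a target-domain sample should change nothing
    l_id = F.l1_loss(G_yx(x), x) + F.l1_loss(G_xy(y), y)

    return l_adv + lam_cyc * l_cyc + lam_id * l_id
```

Training alternates between minimizing this generator objective and updating the discriminators; per step S304, training stops once the monitored error falls to or below the preset threshold.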
  • Step S204 Synthesize the target voice data according to the second voice feature parameter, and use the target voice data as an input of a voice recognition model to perform voice recognition.
  • synthesizing the target voice data according to the second voice feature parameter includes:
  • According to the second voice feature parameter, waveform concatenation and the time-domain pitch-synchronous overlap-add (TD-PSOLA) algorithm are used to synthesize target voice data with no disturbance or with minimal disturbance features.
  • Specifically, the target voice data is synthesized according to the second voice feature parameter: for example, based on the second voice feature parameter, waveform concatenation and a time-domain pitch-synchronous overlap-add algorithm are used to synthesize a voice signal containing the target feature parameters, as in the sketch below.
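  • The patent describes waveform concatenation with TD-PSOLA; as an illustrative stand-in that works directly with converted WORLD-style features from the earlier extraction sketch, the following uses pyworld's synthesizer. This substitution, and the (unshown) inversion from Mel-domain features back to a WORLD spectral envelope, are assumptions rather than the patent's own synthesis routine.

```python
import numpy as np
import pyworld

def synthesize(log_f0, sp, ap, fs=16000, frame_period=5.0):
    """Resynthesize a waveform from converted features (illustrative stand-in).

    pyworld.synthesize replaces the patent's waveform-concatenation/TD-PSOLA
    step; sp is a WORLD spectral envelope assumed to have been recovered
    from the converted Mel-domain features.
    """
    f0 = np.where(log_f0 > 0, np.exp(log_f0), 0.0)  # undo the log; unvoiced stays 0
    return pyworld.synthesize(f0, sp, ap, fs, frame_period)
```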
  • The synthesized voice data is used as the input of the speech recognition model to perform speech recognition. Specifically, in practical application, a given speech recognition system is tested with and without the front-end processing method proposed in this application, using laughing speech (speech disturbed by emotion) and squeaky speech (speech disturbed by voice quality) respectively.
  • Performance is evaluated by word error rate (WER) and sentence error rate (SER); lower WER and SER values indicate better performance, as computed in the sketch below.
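  • For reference, a minimal sketch of the two metrics (standard definitions, not code from the patent):

```python
def wer(ref_words, hyp_words):
    """Word error rate: word-level edit distance divided by reference length."""
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / max(n, 1)

def ser(ref_sentences, hyp_sentences):
    """Sentence error rate: fraction of sentences with any recognition error."""
    errors = sum(r != h for r, h in zip(ref_sentences, hyp_sentences))
    return errors / max(len(ref_sentences), 1)
```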
  • the ASR performance shown in Table 1 is affected by the strength of the language model used by each ASR system.
  • A deep speech model that converts speech into English character sequences is also tested.
  • The character error rate (CER) performance of this model with and without the front-end voice conversion model is tested.
  • The model is trained on 1000 hours of LibriSpeech data, and no language model is used for decoding. As can be seen from Table 2, front-end processing through the voice conversion model reduces the character error rate (CER) of the deep speech model.
  • The voice conversion model of this embodiment can capture the distribution of Mel filter bank outputs of normal and laughter-disturbed speech, and can convert laughter-disturbed speech into equivalent normal speech.
  • Through the embodiments of the present application, the original voice signal is obtained and preprocessed in a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain the first voice feature parameter of the source voice data, an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input to the voice conversion model, which outputs the converted second voice feature parameter, the feature parameter of the target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a voice recognition model to perform voice recognition.
  • Before voice recognition, the original voice signal is preprocessed and its voice feature parameters are converted. The voice conversion filters out the natural interference in the original voice data, converting the feature parameters of disturbed source voice data into the feature parameters of undisturbed natural voice data, and synthesizes the corresponding undisturbed voice data as the input of voice recognition.
  • Visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data realizes the non-parallel conversion of voice data and improves the robustness and accuracy of voice recognition.
  • FIG. 5 shows a structural block diagram of the front-end processing device for speech recognition provided in an embodiment of the present application; for ease of description, only the parts relevant to the embodiment are shown.
  • the device includes:
  • the obtaining unit 51 is configured to obtain an original voice signal, and preprocess the original voice signal according to a preset format to obtain source voice data;
  • the feature extraction unit 52 is configured to perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;
  • the data processing unit 53 is configured to input the first voice feature parameter into a voice conversion model, and output a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data;
  • the synthesis unit 54 is configured to synthesize the target voice data according to the second voice characteristic parameter, and use the target voice data as an input of a voice recognition model to perform voice recognition.
  • the acquiring unit includes:
  • the filtering module is used to perform filtering processing on the original speech signal
  • the sampling module is used to periodically sample the filtered voice signal to obtain voice sampling data with a preset frequency
  • the processing module is used to perform windowing and framing processing on the voice sample data to obtain the source voice data.
  • The feature extraction unit is further configured to extract the Mel spectrum feature parameters, logarithmic fundamental frequency feature parameters, and aperiodic component feature parameters of the source voice data through a Mel filter bank, to obtain the parameter distributions corresponding to the Mel spectrum feature parameters, the logarithmic fundamental frequency feature parameters, and the aperiodic component feature parameters of the source voice data.
  • the front-end processing device for speech recognition further includes:
  • the sample data acquisition unit is configured to acquire a random sample and an actual sample in the speech sample training data set, and extract the random sample feature parameter distribution of the random sample and the actual sample feature parameter distribution of the actual sample respectively;
  • the model training unit is configured to perform iterative training on the to-be-trained adversarial network model according to the random sample feature parameter distribution and the actual sample feature parameter distribution;
  • an error calculation unit, configured to calculate the error output by the adversarial network model in the iterative training process according to a preset loss function;
  • the model generating unit is used to stop training when the error is less than or equal to the preset error threshold to obtain the voice conversion model.
  • model training unit includes:
  • a generator network module, used to input the random sample feature parameter distribution to the generator network of the to-be-trained adversarial network model and generate a pseudo-sample feature parameter distribution corresponding to the actual sample feature parameter distribution;
  • a discriminator network module, used to discriminate the pseudo-sample feature parameter distribution from the actual sample feature parameter distribution through the discriminator network of the to-be-trained adversarial network model, to obtain the discrimination result feature distribution;
  • a cyclic training module, used to input the discrimination result feature distribution to the generator network again to regenerate the pseudo-sample feature parameter distribution corresponding to the actual sample feature parameter distribution, and to discriminate the regenerated pseudo-sample feature parameter distribution from the actual sample feature parameter distribution again through the discriminator network, obtaining the discrimination result feature distribution;
  • an iterative training module, configured to cyclically and iteratively train the to-be-trained adversarial network model according to the random sample feature parameter distribution, the actual sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination result feature distribution.
  • the error calculation unit includes:
  • a first calculation module, used to obtain the cycle-consistency loss function and the identity-mapping loss function of the adversarial network model according to the first adversarial loss function and the second adversarial loss function, where the first adversarial loss function is a loss function calculating the distance between the pseudo-sample feature parameter distribution and the actual sample feature parameter distribution, and the second adversarial loss function is a loss function calculating the distance between the discrimination result feature distribution and the random sample feature distribution;
  • a second calculation module, configured to obtain the preset loss function of the adversarial network model according to the cycle-consistency loss function and the identity-mapping loss function;
  • a target training value calculation module, configured to output the error calculated by the preset loss function of the adversarial network model and use the error as the target training value.
  • The synthesis unit is further configured to use waveform concatenation and the time-domain pitch-synchronous overlap-add algorithm, according to the second voice feature parameter, to synthesize target voice data with no disturbance or with minimal disturbance features.
  • Through the embodiments of the present application, the original voice signal is obtained and preprocessed according to a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain the first voice feature parameter of the source voice data, an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input to the voice conversion model, which outputs the converted second voice feature parameter, the feature parameter of the target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a voice recognition model to perform voice recognition.
  • Before voice recognition, the original voice signal is preprocessed and its voice feature parameters are converted. The voice conversion filters out the natural interference in the original voice data, converting the feature parameters of disturbed source voice data into the feature parameters of undisturbed natural voice data, and synthesizes the corresponding undisturbed voice data as the input of voice recognition.
  • Visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data realizes the non-parallel conversion of voice data and improves the robustness and accuracy of voice recognition.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be realized.
  • The embodiments of the present application provide a computer program product; when the computer program product runs on a mobile terminal, the steps in the foregoing method embodiments can be realized.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer program can be stored in a computer-readable storage medium. When executed by the processor, the steps of the foregoing method embodiments can be implemented.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a mobile hard disk, a floppy disk, or a CD-ROM.
  • In some jurisdictions, according to legislation and patent practice, computer-readable media cannot include electrical carrier signals and telecommunications signals.
  • FIG. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the application.
  • The terminal device 6 of this embodiment includes: at least one processor 60 (only one is shown in FIG. 6), a memory 61, and a computer program 62 stored in the memory 61 and executable on the at least one processor 60. When the processor 60 executes the computer program 62, the steps in any of the foregoing embodiments of the front-end processing method for speech recognition are implemented.
  • the terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the terminal device may include, but is not limited to, a processor 60 and a memory 61.
  • Those skilled in the art can understand that FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on the terminal device 6; it may include more or fewer components than shown in the figure, a combination of certain components, or different components, and may, for example, also include input and output devices, network access devices, and so on.
  • The so-called processor 60 may be a central processing unit (CPU); the processor 60 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 61 may be an internal storage unit of the terminal device 6 in some embodiments, such as a hard disk or a memory of the terminal device 6. In other embodiments, the memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk equipped on the terminal device 6, a smart media card (SMC), a secure digital (Secure Digital, SD) card, Flash Card, etc. Further, the memory 61 may also include both an internal storage unit of the terminal device 6 and an external storage device.
  • the memory 61 is used to store an operating system, an application program, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 61 can also be used to temporarily store data that has been output or will be output.
  • the disclosed apparatus/network equipment and method may be implemented in other ways.
  • the device/network device embodiments described above are only illustrative.
  • the division of the modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed are a speech recognition front-end processing method and apparatus, and a terminal device. The method comprises: acquiring an original speech signal, and preprocessing the original speech signal according to a preset format to obtain source speech data (S201); performing speech feature extraction on the source speech data to obtain a first speech feature parameter of the source speech data, wherein the first speech feature parameter is an acoustic feature parameter describing the timbre and prosody of speech (S202); inputting the first speech feature parameter into a speech conversion model, and outputting a second speech feature parameter after conversion, wherein the second speech feature parameter is a feature parameter of target speech data (S203); and synthesising the target speech data according to the second speech feature parameter, and taking the target speech data as an input of a speech recognition model for performing speech recognition (S204). Source speech data with a first speech feature parameter is converted into speech data with a second speech feature parameter, and non-parallel conversion of speech data is realised, thereby improving the robustness and accuracy of speech recognition.

Description

[Corrected under Rule 91, 07.01.2021] Speech recognition front-end processing method and apparatus, and terminal device

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 11, 2020, with application number 202010165112.8 and entitled "Speech recognition front-end processing method and apparatus, and terminal device", the entire contents of which are incorporated herein by reference.
Technical Field

This application belongs to the technical field of speech recognition, and in particular relates to a front-end processing method, apparatus, and terminal device for speech recognition.
Background

Automatic Speech Recognition (ASR) converts the vocabulary content of human speech into computer-readable input; it differs from speaker recognition and speaker verification. With the development and application of deep learning technology, automatic speech recognition has improved significantly and is widely used in many fields of daily life.

However, the inventor realizes that when a speech signal contains a small amount of noise or undergoes subtle changes, such as the natural disturbances in human speech caused by psychological or physiological factors (including expressive speech signals of different emotions such as laughter, excitement, and frustration, or speech signals with incidental squeaking and breathing sounds produced by different voice qualities), the performance of automatic speech recognition is affected and degraded.
Technical Problem

In view of this, the embodiments of the present application provide a front-end processing method, apparatus, and terminal device for speech recognition, to solve the problem that natural disturbances in human speech caused by psychological or physiological factors affect and degrade the performance of automatic speech recognition.
Technical Solution

In a first aspect, an embodiment of the present application provides a front-end processing method for speech recognition, including:

acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data;

performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;

inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and

synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
In a second aspect, an embodiment of the present application provides a front-end processing device for speech recognition, including:

an acquiring unit, configured to acquire an original voice signal and preprocess the original voice signal according to a preset format to obtain source voice data;

a feature extraction unit, configured to perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;

a data processing unit, configured to input the first voice feature parameter into a voice conversion model and output a converted second voice feature parameter, where the second voice feature parameter is a feature parameter of target voice data; and

a synthesis unit, configured to synthesize the target voice data according to the second voice feature parameter and use the target voice data as the input of a voice recognition model to perform voice recognition.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:

acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data;

performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;

inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and

synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
In a fourth aspect, the embodiments of the present application provide a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program that, when executed by a processor, implements:

acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data;

performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;

inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and

synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
In a fifth aspect, the embodiments of the present application provide a computer program product that, when run on a terminal device, causes the terminal device to execute the front-end processing method for speech recognition according to any one of the above first aspects.
Beneficial Effects

Compared with the prior art, the embodiments of this application have the following beneficial effects. Through the embodiments of this application, an original voice signal is obtained and preprocessed according to a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain a first voice feature parameter, an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input into a voice conversion model, which outputs a converted second voice feature parameter, the feature parameter of the target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a voice recognition model to perform voice recognition. Before voice recognition, the original voice signal is preprocessed and its voice feature parameters are converted; the voice conversion filters out natural disturbances in the original voice data, converts the feature parameters of disturbed source voice data into the feature parameters of undisturbed natural voice data, and synthesizes the corresponding undisturbed voice data as the input of voice recognition. Visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data realizes non-parallel conversion of voice data and improves the robustness and accuracy of voice recognition.
Description of the drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
FIG. 1 is a schematic diagram of an application scenario system provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a front-end processing method for speech recognition provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of an iterative training method for an adversarial network model provided by another embodiment of the present application;
FIG. 4 is a schematic diagram of the network structure of an adversarial network model provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a front-end processing apparatus for speech recognition provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
Embodiments of the present invention
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so that unnecessary details do not obscure the description of the present application.
It should be understood that, when used in the specification and the appended claims of the present application, the term "comprising" indicates the presence of the described features, wholes, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the term "and/or" used in the specification and the appended claims of the present application refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in the specification and the appended claims of the present application, the term "if" may be construed, depending on the context, as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be construed, depending on the context, to mean "once it is determined" or "in response to determining" or "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
In addition, in the description of the specification and the appended claims of the present application, the terms "first", "second", "third", and the like are used only to distinguish the descriptions and cannot be understood as indicating or implying relative importance.
Reference to "one embodiment" or "some embodiments" in the specification of the present application means that one or more embodiments of the present application include a specific feature, structure, or characteristic described in combination with that embodiment. Therefore, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "including", "comprising", "having", and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
The front-end processing method for speech recognition provided by the embodiments of the present application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, and personal digital assistants (PDAs); the embodiments of the present application do not impose any restriction on the specific type of terminal device.
Referring to FIG. 1, which is a schematic diagram of an application scenario system provided by an embodiment of the present application, the front-end processing method for speech recognition provided by the embodiment of the present application can be applied to mobile terminals or fixed devices, such as a smartphone 101, a notebook computer 102, or a desktop computer 103; the embodiment of the present application does not impose any restriction on the specific type of terminal device. The terminal device exchanges data with the server 104 in a wired or wireless manner. The voice assistant of the terminal device acquires an external voice signal and performs front-end processing on it, filtering out interference factors in the voice signal and converting the disturbed voice signal into a natural voice signal with no disturbance or minimal disturbance; the processed signal is then transmitted to the server in a wired or wireless manner. The server performs speech recognition, natural language processing, and related business processing, and feeds the results back to the terminal device, which executes corresponding actions according to the business processing information. Voice assistants such as Siri, Google Assistant, and Amazon Alexa apply this front-end processing method for speech recognition within an automatic speech recognition (ASR) system. Wireless methods include the Internet, WiFi networks, and mobile networks, where mobile networks may include existing 2G (e.g., Global System for Mobile Communication, GSM), 3G (e.g., Universal Mobile Telecommunications System, UMTS), 4G (e.g., FDD LTE, TDD LTE), as well as 4.5G and 5G.
FIG. 2 shows a schematic flowchart of the front-end processing method for speech recognition provided by the present application. The front-end processing method for speech recognition includes the following steps.
Step S201: acquire an original voice signal, and preprocess the original voice signal in a preset format to obtain source voice data.
In a possible implementation, the execution body of this embodiment may be a terminal device with a speech recognition function, which implements front-end processing of voice signals in speech recognition application scenarios. That is, before semantic recognition is performed on the voice, front-end processing is performed on a voice signal carrying disturbance or noise to obtain normal, noise-free voice data, which is then used as the input of the speech recognition system, improving the accuracy and robustness of speech recognition.
The original voice signal may be a voice signal with disturbance or noise, for example, a voice signal with natural interference of psychological or physiological origin. Specifically, this may include voice signals expressing different emotions such as laughter, excitement, and frustration, or voice signals with incidental creaky or breathy sounds produced by different voice qualities.
In one embodiment, acquiring the original voice signal and preprocessing the original voice signal in a preset format to obtain the source voice data includes:
A1. Filtering the original voice signal.
A2. Periodically sampling the filtered voice signal to obtain voice sampling data at a preset frequency.
In a possible implementation, the original voice signal is filtered and sampled at a frequency of 16 kHz.
A3. Performing windowing and framing on the voice sampling data to obtain the source voice data.
In a possible implementation, the voice sampling data is windowed. Since the voice signal is strongly time-varying in the time domain, it is divided into short segments of fixed length, under the assumption that the characteristics of one short frame remain unchanged within a fixed time. This fixed time may be a period between 10 and 30 milliseconds and is realized by windowing, for example, multiplying the voice signal by a window function 20 milliseconds long; the spectral characteristics of the windowed voice signal are then stationary within the duration of the window (20 milliseconds).
In addition, after the voice data is windowed, the voice signal is divided into frames. To ensure the continuity and reliability of the dynamically changing information in the voice signal, an overlap is set between two adjacent frames so that the voice signal transitions smoothly from frame to frame. After framing, endpoint detection is performed on the voice signal to mark and determine the start point and end point of each frame, reducing the influence of burst pulses or speech interruptions on the analysis of the voice signal. Finally, the acquired voice data frames are used as the source voice data to be analyzed.
It should be noted that the original voice signal may also be a normal, noise-free voice signal; as the front-end processing part of the speech recognition system, front-end processing of an acquired undisturbed normal voice does not affect the subsequent recognition of the voice signal.
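As an illustration of step S201, the following is a minimal preprocessing sketch in Python (not part of the original disclosure); the filter order, cutoff margin, and hop length are assumptions chosen to match the 16 kHz sampling rate and 20 ms window described above.

```python
import numpy as np
from scipy.signal import butter, lfilter

def preprocess(signal, in_rate, target_rate=16000,
               win_ms=20.0, hop_ms=10.0):
    """Filter, resample, window, and frame a raw voice signal (sketch)."""
    # Low-pass filter below the Nyquist frequency of the target rate
    # (4th-order Butterworth; the order is an illustrative assumption).
    b, a = butter(4, (target_rate / 2 - 100) / (in_rate / 2), btype="low")
    filtered = lfilter(b, a, signal)

    # Periodic sampling at the preset 16 kHz frequency.
    idx = np.arange(0, len(filtered), in_rate / target_rate)
    sampled = np.interp(idx, np.arange(len(filtered)), filtered)

    # Windowing and framing: 20 ms frames with overlap between adjacent
    # frames, so the frame-to-frame transition stays smooth.
    win = int(target_rate * win_ms / 1000)
    hop = int(target_rate * hop_ms / 1000)
    window = np.hamming(win)
    frames = [sampled[s:s + win] * window
              for s in range(0, len(sampled) - win + 1, hop)]
    return np.array(frames)  # source voice data: one row per frame
```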
Step S202: perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, the first voice feature parameter being an acoustic feature parameter describing the timbre and prosody of the voice.
In a possible implementation, the first voice feature parameter is an acoustic feature parameter describing the timbre of the voice, extracted from the voice data frames, such as a spectral parameter; the first voice feature parameter also includes parameters characterizing the prosodic features of the voice, such as a pitch frequency parameter.
In one embodiment, performing voice feature extraction on the source voice data to obtain the first voice feature parameter of the source voice data includes:
B1. Extracting Mel spectrum feature parameters, logarithmic fundamental frequency feature parameters, and aperiodic component feature parameters of the source voice data through a Mel filter bank.
B2. Obtaining the parameter distributions corresponding to the Mel spectrum feature parameters, logarithmic fundamental frequency feature parameters, and aperiodic component feature parameters of the source voice data.
In a possible implementation, within each 20-millisecond frame of voice data, the first voice feature parameters are extracted every 5 milliseconds, including Mel spectrum feature parameters extracted with a Mel filter bank (MFB), logarithmic fundamental frequency (log F0) feature parameters, and aperiodic component (AP) features. The Mel spectrum feature parameters and the aperiodic component (AP) features are each 24-dimensional voice feature parameters.
For the Mel spectrum feature parameters, features are extracted every 5 milliseconds within each 20-millisecond window of voice data. The time-domain signal of each frame of source voice data is recorded and padded to a sequence whose length matches the window width; a discrete Fourier transform of this sequence yields the linear spectrum of each frame of voice data, and passing the linear spectrum through the Mel-frequency filter bank yields the Mel spectrum. The Mel filter bank generally includes 24 triangular band-pass filters, which smooth the obtained spectral features, effectively emphasize the low-frequency information of the voice data, highlight useful information, and shield against noise interference.
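A minimal sketch of this Mel filter bank extraction, assuming the librosa library and the 16 kHz / 20 ms / 5 ms / 24-filter settings given above (the function and parameter names belong to librosa, not to the original disclosure):

```python
import librosa
import numpy as np

def mel_features(signal, sr=16000):
    """24-dimensional Mel filter bank output, 20 ms window, 5 ms hop."""
    win = int(0.020 * sr)   # 20 ms analysis window
    hop = int(0.005 * sr)   # features extracted every 5 ms
    mel = librosa.feature.melspectrogram(
        y=signal.astype(np.float32), sr=sr,
        n_fft=win, win_length=win, hop_length=hop,
        n_mels=24)          # 24 triangular band-pass filters
    return np.log(mel + 1e-10).T  # frames x 24 log-Mel features
```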
For the logarithmic fundamental frequency (log F0) feature parameters: when a person produces voiced sounds, the airflow through the glottis causes the vocal cords to vibrate in a relaxation-oscillation manner, producing a quasi-periodic pulsed airflow; this airflow excites the vocal tract to produce voiced sounds, and the frequency of this vocal cord vibration is the pitch frequency. Specifically, after windowing each preprocessed frame of source voice data, the cepstrum of that frame is computed, a length range for the pitch search is set, and the maximum of the cepstrum of the frame within that range is found. If the maximum is greater than the window threshold, the pitch frequency of the voiced sound is calculated from the maximum, and the logarithm of the pitch frequency is taken to reflect the characteristics of the voice data; if the maximum of the cepstrum is less than or equal to the window threshold, the frame of source voice data is silence or unvoiced.
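The cepstral pitch search described here can be sketched as follows (a minimal illustration; the 50-400 Hz search range and the voicing threshold are assumptions, not values from the original disclosure):

```python
import numpy as np

def cepstral_log_f0(frame, sr=16000, fmin=50.0, fmax=400.0, thresh=0.1):
    """Cepstrum-based pitch detection for one windowed frame (sketch)."""
    spectrum = np.fft.rfft(frame, n=2 * len(frame))
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-10))
    # Pitch search range expressed in quefrency (lag) samples.
    lo, hi = int(sr / fmax), int(sr / fmin)
    peak = np.argmax(cepstrum[lo:hi]) + lo
    if cepstrum[peak] <= thresh:
        return None                # silence or unvoiced frame
    return np.log(sr / peak)       # log F0 of the voiced frame
```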
For the aperiodic component feature parameters, an inverse Fourier transform is performed on the windowed signal of the source voice data to obtain the time-domain characteristics of the aperiodic components, and the frequency-domain characteristics of the aperiodic components are determined from the windowed signal of the source voice data and the minimum phase of the spectral features.
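In practice, log F0 and aperiodicity are often extracted together with a vocoder analysis library; the following is a sketch using the pyworld package (a tooling assumption of this edit, not a tool named in the original disclosure), with the 5 ms frame period from above:

```python
import numpy as np
import pyworld as pw

def world_features(signal, sr=16000):
    """F0, spectral envelope, and aperiodicity via WORLD analysis (sketch)."""
    x = signal.astype(np.float64)            # pyworld requires float64
    f0, t = pw.dio(x, sr, frame_period=5.0)  # coarse F0, 5 ms hop
    f0 = pw.stonemask(x, f0, t, sr)          # F0 refinement
    sp = pw.cheaptrick(x, f0, t, sr)         # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, sr)                # aperiodic components
    log_f0 = np.where(f0 > 0, np.log(f0), 0.0)
    return log_f0, sp, ap
```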
Step S203: input the first voice feature parameter into a voice conversion model, and output a second voice feature parameter after conversion, the second voice feature parameter being a feature parameter of the target voice data.
In a possible implementation, the voice conversion model is a model obtained by training a cycle-consistent adversarial network model on a sample training data set. The first voice feature parameter extracted from the source voice data is input into the voice conversion model, and after voice conversion the second voice feature parameter is output. The second voice feature parameter is the voice feature parameter most similar to the feature parameter of actual normal voice, that is, the feature parameter of the target voice data, where the target voice data is voice data with minimal or no disturbance.
In one embodiment, as shown in FIG. 3, a schematic flowchart of an iterative training method for an adversarial network model provided by another embodiment of the present application, the training of the voice conversion model includes the following steps.
Step S301: obtain random samples and actual samples from a voice sample training data set, and extract the random-sample feature parameter distribution of the random samples and the actual-sample feature parameter distribution of the actual samples, respectively.
In a possible implementation, two spontaneous speech data sets, such as the AMI meeting corpus and the Buckeye corpus of conversational speech, are used to analyze the influence of natural disturbances. From the two data sets, voice data from 40 female speakers and 30 male speakers is obtained. This voice data is used as the voice sample training data set, totaling 210 utterances, covering each gender and each type (normal speech, laughing speech, and creaky speech). Of these 210 utterances, 150 are used for training and 60 for testing; each utterance lasts 1-2 seconds. This set is used to train the voice conversion model.
Specifically, during the training of the voice conversion model, the cycle-consistent adversarial network model includes a generator and a discriminator. Random samples and actual samples are obtained from the voice sample training data set, the random-sample feature parameters and the actual-sample feature parameters are extracted, and the distribution of the random-sample feature parameters is used as the input of the generator.
Step S302: iteratively train the adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution.
In a possible implementation, the cycle-consistent adversarial network model includes a generator and a discriminator. From the random-sample feature parameters, the generator generates a pseudo-sample feature parameter distribution similar to the actual-sample feature parameter distribution. The pseudo-sample feature parameter distribution is input into the discriminator, which distinguishes the pseudo-sample distribution from the actual-sample feature parameter distribution.
Step S303: calculate the error output by the adversarial network model during iterative training according to a preset loss function.
In a possible implementation, the adversarial network model uses a preset loss function to calculate the error during iterative training, and the error is used as the target training value of the adversarial network model.
Step S304: when the error is less than or equal to a preset error threshold, stop training to obtain the voice conversion model.
In a possible implementation, when the error is less than or equal to the preset error threshold, the trained adversarial network model satisfies the conversion conditions and training stops, yielding the voice conversion model. Through the voice conversion model, disturbed voice feature parameters are converted into actual normal voice feature parameters, completing the conversion of non-parallel voice.
In one embodiment, iteratively training the adversarial network to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution includes the following, with a training-loop sketch after FIG. 4 below:
C1. Inputting the random-sample feature parameter distribution into the generator network of the adversarial network model to be trained, and generating a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution.
Specifically, the conversion from disturbed voice features to normal voice features is modeled with a cycle-consistent adversarial network model. Voice feature parameters are extracted from the obtained random samples, and the distribution of the extracted voice feature parameters (x ∈ X) is input into the generator, which generates the pseudo-sample feature parameter distribution G_X→Y(x). The first adversarial loss function L_adv(G_X→Y(x), Y) computes the distance between the pseudo-sample feature parameter distribution G_X→Y(x) and the actual-sample feature parameter distribution (y ∈ Y), realizing the conversion from disturbed voice to normal voice.
C2. Using the discriminator network of the adversarial network model to be trained to discriminate the pseudo-sample feature parameter distribution from the actual-sample feature parameter distribution, obtaining a discrimination-result feature distribution.
Specifically, the discriminator distinguishes the generated pseudo-sample features from the actual-sample features to obtain the discriminated result G_Y→X(y), and the second adversarial loss function L_adv(G_Y→X(y), X) computes the distance between the discrimination result and the random-sample features.
C3. Inputting the discrimination-result feature distribution into the generator network again to generate once more a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution, and using the discriminator network to discriminate again between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, obtaining a discrimination-result feature distribution.
C4. Cyclically and iteratively training the adversarial network model to be trained according to the random-sample feature parameter distribution, the actual-sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination-result feature distribution.
As shown in FIG. 4, a schematic diagram of the network structure of the adversarial network model provided by an embodiment of the present application, the adversarial network model includes a generator G and a discriminator D. The generator G generates the pseudo-sample feature parameter distribution G(x); the pseudo-sample feature parameter distribution and the feature distribution of the actual samples are input into the discriminator, which performs the discrimination and obtains a discrimination result; the discrimination result is then fed back to the generator G or the discriminator D, so that the adversarial network model is trained cyclically.
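The following is a minimal sketch of one such training cycle in PyTorch (an illustrative assumption; the original disclosure does not name a framework). G_xy, G_yx, D_x, and D_y denote the two generators and two discriminators of a CycleGAN-style model; the loss weights are illustrative values:

```python
import torch
import torch.nn.functional as F

def train_step(G_xy, G_yx, D_x, D_y, x, y, opt_g, opt_d,
               lam_cyc=10.0, lam_id=5.0):
    """One cyclic training iteration (sketch); x: disturbed, y: normal."""
    # --- Generator update: adversarial + cycle + identity terms ---
    fake_y, fake_x = G_xy(x), G_yx(y)
    loss_adv = F.mse_loss(D_y(fake_y), torch.ones_like(D_y(fake_y))) \
             + F.mse_loss(D_x(fake_x), torch.ones_like(D_x(fake_x)))
    loss_cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)
    loss_id = F.l1_loss(G_yx(x), x) + F.l1_loss(G_xy(y), y)
    loss_g = loss_adv + lam_cyc * loss_cyc + lam_id * loss_id
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # --- Discriminator update: real samples vs. generated pseudo-samples ---
    loss_d = F.mse_loss(D_y(y), torch.ones_like(D_y(y))) \
           + F.mse_loss(D_y(fake_y.detach()),
                        torch.zeros_like(D_y(fake_y.detach()))) \
           + F.mse_loss(D_x(x), torch.ones_like(D_x(x))) \
           + F.mse_loss(D_x(fake_x.detach()),
                        torch.zeros_like(D_x(fake_x.detach())))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()
```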
The generator and discriminator networks of the voice conversion model are each composed of convolution blocks. The generator network consists of 9 convolution blocks: one stride-1 convolution block, one stride-2 convolution block, 5 residual blocks, one 1/2-stride convolution block, and one stride-1 convolution block. To preserve the temporal structure, all convolutional layers are one-dimensional. Gated linear units serve as the activation function of the convolutional layers and have achieved state-of-the-art performance in language and speech modeling. The discriminator network consists of four two-dimensional convolution blocks, with gated linear units as the activation function of all convolution blocks; for the discriminator network, a 6×6 patch GAN is used to classify each 6×6 patch as real or fake.
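A minimal PyTorch sketch of this architecture follows; the channel counts, kernel sizes, and up/downsampling details are assumptions for illustration, and only the block layout and GLU activations follow the description above:

```python
import torch
import torch.nn as nn

class GLUConv1d(nn.Module):
    """1-D convolution followed by a gated linear unit activation."""
    def __init__(self, c_in, c_out, k, stride=1):
        super().__init__()
        # Twice the output channels: half of them gate the GLU.
        self.conv = nn.Conv1d(c_in, 2 * c_out, k, stride, padding=k // 2)
    def forward(self, x):
        return nn.functional.glu(self.conv(x), dim=1)

class ResBlock1d(nn.Module):
    def __init__(self, c, k=3):
        super().__init__()
        self.body = GLUConv1d(c, c, k)
    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """9 blocks: stride-1, stride-2 (down), 5 residual, 1/2-stride (up), stride-1."""
    def __init__(self, feat_dim=24, c=128):
        super().__init__()
        self.down = nn.Sequential(GLUConv1d(feat_dim, c, 15, stride=1),
                                  GLUConv1d(c, 2 * c, 5, stride=2))
        self.res = nn.Sequential(*[ResBlock1d(2 * c) for _ in range(5)])
        self.up = nn.ConvTranspose1d(2 * c, c, 5, stride=2,
                                     padding=2, output_padding=1)
        self.out = nn.Conv1d(c, feat_dim, 15, stride=1, padding=7)
    def forward(self, x):            # x: (batch, feat_dim, frames)
        return self.out(self.up(self.res(self.down(x))))

class Discriminator(nn.Module):
    """Four 2-D convolution blocks with GLU, ending in a PatchGAN score map."""
    def __init__(self, c=64):
        super().__init__()
        def block(ci, co, stride):
            return nn.Sequential(
                nn.Conv2d(ci, 2 * co, 3, stride, padding=1), nn.GLU(dim=1))
        self.net = nn.Sequential(block(1, c, 2), block(c, 2 * c, 2),
                                 block(2 * c, 4 * c, 2),
                                 nn.Conv2d(4 * c, 1, 3, 1, padding=1))
    def forward(self, x):            # x: (batch, 1, feat_dim, frames)
        return self.net(x)           # patch-wise real/fake scores
```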
In one embodiment, calculating the error output by the adversarial network model during iterative training according to the preset loss function includes:
D1. Deriving the cycle-consistency loss function and the identity-mapping loss function of the adversarial network model from the first adversarial loss function and the second adversarial loss function, where the first adversarial loss function is the loss function computing the distance between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, and the second adversarial loss function is the loss function computing the distance between the discrimination-result feature distribution and the random-sample feature distribution.
In a possible implementation, the first adversarial loss function L_adv(G_X→Y(x), Y) computes the distance between the pseudo-sample feature parameter distribution G_X→Y(x) and the actual-sample feature parameter distribution (y ∈ Y), and the second adversarial loss function L_adv(G_Y→X(y), X) computes the distance between the discrimination result and the random-sample features. From the first adversarial loss function L_adv(G_X→Y(x), Y) and the second adversarial loss function L_adv(G_Y→X(y), X), the cycle-consistency loss function is obtained as L_cyc = E_x||G_Y→X(G_X→Y(x)) − x||_1 + E_y||G_X→Y(G_Y→X(y)) − y||_1, and the identity-mapping loss function as L_id = E_x||G_Y→X(x) − x||_1 + E_y||G_X→Y(y) − y||_1. The cycle-consistency loss function L_cyc preserves the contextual information in the voice features during computation, and the identity-mapping loss function L_id preserves the important voice information of the voice data during the conversion process.
D2. Obtaining the preset loss function of the adversarial network model from the cycle-consistency loss function and the identity-mapping loss function.
In a possible implementation, from the cycle-consistency loss function and the identity-mapping loss function, the preset loss function of the adversarial network model is obtained as L = L_adv(G_X→Y(x), Y) + L_adv(G_Y→X(y), X) + λ_cyc·L_cyc + λ_id·L_id, where λ_cyc and λ_id are hyperparameters that control the relative importance of the cycle-consistency and identity-mapping loss functions within the preset loss function.
D3. The adversarial network model outputs the error calculated by the preset loss function, and the error is used as the target training value.
In a possible implementation, the error is used as the target training value; when the value of the complete loss function is minimized, the training of the voice conversion model is completed, yielding the voice conversion model.
Step S204: synthesize the target voice data according to the second voice feature parameter, and use the target voice data as the input of a speech recognition model for speech recognition.
In one embodiment, synthesizing the target voice data according to the second voice feature parameter includes:
synthesizing, according to the second voice feature parameter, target voice data with no disturbance or with minimal disturbance features, using waveform concatenation and a time-domain pitch-synchronous overlap-add algorithm.
In a possible implementation, the target voice data is synthesized according to the second voice feature parameter; for example, based on the second voice feature parameter, waveform concatenation is used and a time-domain pitch-synchronous overlap-add (TD-PSOLA) algorithm is applied to synthesize a voice signal containing the target feature parameters.
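A minimal overlap-add sketch of this synthesis step follows; this is an illustration of plain windowed overlap-add, not the full TD-PSOLA algorithm of the disclosure, which additionally aligns the windows to pitch marks:

```python
import numpy as np

def overlap_add(frames, hop):
    """Reassemble windowed frames into a waveform by overlap-add (sketch)."""
    win = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + win)
    norm = np.zeros_like(out)
    window = np.hanning(win)
    for i, frame in enumerate(frames):
        s = i * hop
        out[s:s + win] += frame * window
        norm[s:s + win] += window ** 2   # track window overlap energy
    return out / np.maximum(norm, 1e-8)  # normalize for perfect overlap
```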
Further, the synthesized voice data is used as the input of the speech recognition model for speech recognition. Specifically, in actual application, a concrete speech recognition system was tested with laughing voice (voice disturbed by emotion) and creaky voice (voice disturbed by voice quality), both with and without the front-end processing method proposed in the present application. Performance is evaluated by word error rate (WER) and sentence error rate (SER); lower WER and SER values indicate better performance. As can be seen from the experimental test data in Table 1 below, modeling with spectral features and aperiodic components (i.e., MFB + AP) performs better in the proposed front-end than modeling MFB alone.
[Table 1: WER and SER of the tested ASR systems with and without the proposed front-end; rendered as an image in the original publication]
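For reference, WER is the word-level Levenshtein edit distance between the recognized and reference transcripts divided by the reference length; SER is the fraction of sentences containing any error, and the CER discussed below is the same computation at character level. A minimal sketch:

```python
def word_error_rate(ref, hyp):
    """Levenshtein distance over words divided by reference length (sketch)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```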
The ASR performance shown in Table 1 is affected by the strength of the language model used by each ASR system. To examine ASR performance without the influence of a language model, a Deep Speech model that converts speech into English character sequences was tested; Table 2 shows the character error rate (CER) performance with and without the front-end voice conversion model. The model was trained on 1,000 hours of LibriSpeech data, and no language model was used for decoding. As can be seen from Table 2, front-end processing with the voice conversion model reduces the character error rate (CER) of the Deep Speech model.
[Table 2: CER of the Deep Speech model with and without the front-end voice conversion model; rendered as an image in the original publication]
In addition, two-dimensional t-SNE projections of the Mel filter bank features were produced for normal voice, laughter-disturbed voice, and laughter-disturbed voice converted into normal voice by the front-end processing method of this embodiment based on the adversarial network model (CycleGAN). It can be seen that the filter bank output features of normal voice and of the converted voice are very similar, and differ significantly from the filter bank output features of laughing voice. The voice conversion model of this embodiment can therefore capture the distributions of the Mel filter bank outputs of normal and laughter-disturbed voice, and can convert laughter-disturbed voice into equivalent normal voice.
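The visualization described here can be reproduced with a sketch like the following (assuming scikit-learn and matplotlib, which are not named in the original disclosure):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_mfb_tsne(normal, laughter, converted):
    """2-D t-SNE projection of Mel filter bank features (sketch).

    Each argument is an array of shape (frames, 24) of MFB outputs."""
    feats = np.vstack([normal, laughter, converted])
    emb = TSNE(n_components=2, random_state=0).fit_transform(feats)
    splits = np.cumsum([len(normal), len(laughter)])
    for part, label in zip(np.split(emb, splits),
                           ["normal", "laughter", "converted"]):
        plt.scatter(part[:, 0], part[:, 1], s=4, label=label)
    plt.legend()
    plt.show()
```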
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Through this embodiment, an original voice signal is acquired and preprocessed in a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain a first voice feature parameter of the source voice data, the first voice feature parameter being an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input into a voice conversion model, which outputs a second voice feature parameter after conversion, the second voice feature parameter being a feature parameter of target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a speech recognition model for speech recognition. Before speech recognition is performed, the original voice signal is preprocessed and its voice feature parameters are converted. This voice conversion filters out natural disturbances in the original voice data: the feature parameters of source voice data carrying disturbance features are converted into the feature parameters of undisturbed natural voice data, and the corresponding undisturbed voice data is synthesized as the input for speech recognition. By visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data, non-parallel conversion of voice data is achieved, improving the robustness and accuracy of speech recognition.
Corresponding to the front-end processing method for speech recognition described in the foregoing embodiments, FIG. 5 shows a structural block diagram of the front-end processing apparatus for speech recognition provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 5, the apparatus includes:
an acquisition unit 51, configured to acquire an original voice signal and preprocess the original voice signal in a preset format to obtain source voice data;
a feature extraction unit 52, configured to perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, the first voice feature parameter being an acoustic feature parameter describing the timbre and prosody of the voice;
a data processing unit 53, configured to input the first voice feature parameter into a voice conversion model and output a second voice feature parameter after conversion, the second voice feature parameter being a feature parameter of target voice data; and
a synthesis unit 54, configured to synthesize the target voice data according to the second voice feature parameter and use the target voice data as the input of a speech recognition model for speech recognition.
Optionally, the acquisition unit includes:
a filtering module, configured to filter the original voice signal;
a sampling module, configured to periodically sample the filtered voice signal to obtain voice sampling data at a preset frequency; and
a processing module, configured to perform windowing and framing on the voice sampling data to obtain the source voice data.
Optionally, the feature extraction unit is further configured to extract Mel spectrum feature parameters, logarithmic fundamental frequency feature parameters, and aperiodic component feature parameters of the source voice data through a Mel filter bank, and to obtain the parameter distributions corresponding to the Mel spectrum feature parameters, logarithmic fundamental frequency feature parameters, and aperiodic component feature parameters of the source voice data.
Optionally, the front-end processing apparatus for speech recognition further includes:
a sample data acquisition unit, configured to obtain random samples and actual samples from a voice sample training data set, and to extract the random-sample feature parameter distribution of the random samples and the actual-sample feature parameter distribution of the actual samples, respectively;
a model training unit, configured to iteratively train the adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution;
an error calculation unit, configured to calculate the error output by the adversarial network model during iterative training according to a preset loss function; and
a model generation unit, configured to stop training when the error is less than or equal to a preset error threshold, obtaining the voice conversion model.
Optionally, the model training unit includes:
a generator network, configured to receive the random-sample feature parameter distribution input into the generator network of the adversarial network model to be trained and to generate a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution;
a discriminator network, configured to discriminate, through the discriminator network of the adversarial network model to be trained, the pseudo-sample feature parameter distribution from the actual-sample feature parameter distribution, obtaining a discrimination-result feature distribution;
a cyclic training module, configured to input the discrimination-result feature distribution into the generator network again, generate once more a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution, and discriminate again, through the discriminator network, between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, obtaining a discrimination-result feature distribution; and
an iterative training module, configured to cyclically and iteratively train the adversarial network model to be trained according to the random-sample feature parameter distribution, the actual-sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination-result feature distribution.
Optionally, the error calculation unit includes:
a first calculation module, configured to derive the cycle-consistency loss function and the identity-mapping loss function of the adversarial network model from the first adversarial loss function and the second adversarial loss function, where the first adversarial loss function is the loss function computing the distance between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, and the second adversarial loss function is the loss function computing the distance between the discrimination-result feature distribution and the random-sample feature distribution;
a second calculation module, configured to obtain the preset loss function of the adversarial network model from the cycle-consistency loss function and the identity-mapping loss function; and
a target training value calculation module, configured for the adversarial network model to output the error calculated by the preset loss function and to use the error as the target training value.
Optionally, the synthesis unit is further configured to synthesize, according to the second voice feature parameter, target voice data with no disturbance or with minimal disturbance features, using waveform concatenation and a time-domain pitch-synchronous overlap-add algorithm.
Through this embodiment, an original voice signal is acquired and preprocessed in a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain a first voice feature parameter of the source voice data, the first voice feature parameter being an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input into a voice conversion model, which outputs a second voice feature parameter after conversion, the second voice feature parameter being a feature parameter of target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a speech recognition model for speech recognition. Before speech recognition is performed, the original voice signal is preprocessed and its voice feature parameters are converted, so that natural disturbances in the original voice data are filtered out: the feature parameters of source voice data carrying disturbance features are converted into the feature parameters of undisturbed natural voice data, and the corresponding undisturbed voice data is synthesized as the input for speech recognition. By visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data, non-parallel conversion of voice data is achieved, improving the robustness and accuracy of speech recognition.
It should be noted that, since the information exchange and execution processes between the foregoing apparatuses/units are based on the same concept as the method embodiments of the present application, reference may be made to the method embodiment section for their specific functions and technical effects, which are not repeated here.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the foregoing functional units and modules is used as an example. In practical applications, the foregoing functions can be allocated to different functional units and modules as needed, that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from one another, and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the foregoing system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
An embodiment of the present application also provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in each of the foregoing method embodiments are implemented.
An embodiment of the present application provides a computer program product which, when run on a mobile terminal, causes the mobile terminal to implement the steps in each of the foregoing method embodiments when executed.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the foregoing embodiments of the present application can be accomplished by instructing relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, can implement the steps of each of the foregoing method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electric carrier signals and telecommunications signals.
FIG. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 6, the terminal device 6 of this embodiment includes: at least one processor 60 (only one is shown in FIG. 6), a memory 61, and a computer program 62 stored in the memory 61 and executable on the at least one processor 60. When the processor 60 executes the computer program 62, the steps in any of the foregoing embodiments of the front-end processing method for speech recognition are implemented.
The terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art can understand that FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on the terminal device 6, which may include more or fewer components than shown in the figure, a combination of certain components, or different components; for example, it may also include input and output devices, network access devices, and the like.
The so-called processor 60 may be a central processing unit (CPU), and the processor 60 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
In some embodiments, the memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or memory of the terminal device 6. In other embodiments, the memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the terminal device 6. Further, the memory 61 may include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is used to store an operating system, application programs, a boot loader, data, and other programs, such as the program code of the computer program. The memory 61 may also be used to temporarily store data that has been output or will be output.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts not detailed or described in one embodiment, reference may be made to the related descriptions of other embodiments.
A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are only illustrative; for example, the division of the modules or units is only a logical functional division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
The foregoing embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of the technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and should all be included within the protection scope of the present application.

Claims (20)

  1. A front-end processing method for speech recognition, comprising:
    acquiring an original speech signal, and preprocessing the original speech signal according to a preset format to obtain source speech data;
    performing speech feature extraction on the source speech data to obtain a first speech feature parameter of the source speech data, the first speech feature parameter being an acoustic feature parameter describing the timbre and prosody of the speech;
    inputting the first speech feature parameter into a speech conversion model, and outputting, after conversion, a second speech feature parameter, the second speech feature parameter being a feature parameter of target speech data;
    synthesizing the target speech data according to the second speech feature parameter, and using the target speech data as an input of a speech recognition model to perform speech recognition.
  2. The front-end processing method for speech recognition according to claim 1, wherein acquiring the original speech signal and preprocessing the original speech signal according to the preset format to obtain the source speech data comprises:
    filtering the original speech signal;
    periodically sampling the filtered speech signal to obtain speech sample data at a preset frequency;
    performing windowing and framing on the speech sample data to obtain the source speech data.
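For illustration, a minimal Python sketch of the preprocessing chain of claim 2 follows. The concrete choices (a first-order pre-emphasis filter as the filtering step, a 16 kHz preset sampling frequency, 25 ms Hamming windows with a 10 ms hop) are the editor's assumptions; the claim itself fixes none of them.

```python
import numpy as np
from math import gcd
from scipy import signal

def preprocess(raw: np.ndarray, in_rate: int, target_rate: int = 16000,
               frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Filter -> periodically sample at a preset frequency -> window and frame."""
    # Filtering: a first-order pre-emphasis filter (one common choice).
    emphasized = signal.lfilter([1.0, -0.97], [1.0], raw)
    # Periodic sampling at the preset frequency via polyphase resampling.
    g = gcd(target_rate, in_rate)
    resampled = signal.resample_poly(emphasized, target_rate // g, in_rate // g)
    # Windowing and framing (assumes at least one full frame of audio).
    frame_len = int(target_rate * frame_ms / 1000)
    hop_len = int(target_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(resampled) - frame_len) // hop_len
    return np.stack([resampled[i * hop_len:i * hop_len + frame_len] * window
                     for i in range(n_frames)])  # shape: (n_frames, frame_len)
```

Paired with the overlap-add sketch after claim 7, this framing is invertible up to the usual window normalization.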
  3. The front-end processing method for speech recognition according to claim 1, wherein performing speech feature extraction on the source speech data to obtain the first speech feature parameter of the source speech data comprises:
    extracting a Mel-spectrum feature parameter, a logarithmic fundamental-frequency feature parameter, and an aperiodic-component feature parameter of the source speech data through a Mel filter bank;
    obtaining the parameter distributions corresponding to the Mel-spectrum feature parameter, the logarithmic fundamental-frequency feature parameter, and the aperiodic-component feature parameter of the source speech data.
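A sketch of the feature extraction of claim 3, assuming WORLD-style analysis through the `pyworld` package and an 80-band Mel filter bank from `librosa`. The packages, the band count, and the log floor are the editor's assumptions, not identifiers from the application.

```python
import numpy as np
import librosa
import pyworld as pw  # pip install pyworld

def extract_features(wave: np.ndarray, rate: int = 16000):
    """Return the three feature parameters named in claim 3."""
    x = wave.astype(np.float64)
    f0, t = pw.harvest(x, rate)         # fundamental-frequency track
    sp = pw.cheaptrick(x, f0, t, rate)  # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, rate)         # aperiodic-component feature
    # Mel-spectrum feature: project the envelope through a Mel filter bank.
    mel_fb = librosa.filters.mel(sr=rate, n_fft=(sp.shape[1] - 1) * 2, n_mels=80)
    mel_spectrum = np.log(np.maximum(sp @ mel_fb.T, 1e-10))
    # Logarithmic fundamental-frequency feature (unvoiced frames left at 0).
    log_f0 = np.where(f0 > 0, np.log(f0), 0.0)
    return mel_spectrum, log_f0, ap
```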
  4. The front-end processing method for speech recognition according to claim 1, wherein the step of training the speech conversion model comprises:
    acquiring random samples and actual samples from a speech-sample training data set, and extracting a random-sample feature parameter distribution of the random samples and an actual-sample feature parameter distribution of the actual samples, respectively;
    iteratively training an adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution;
    calculating, according to a preset loss function, the error output by the adversarial network model during the iterative training;
    stopping the training when the error is less than or equal to a preset error threshold, to obtain the speech conversion model.
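A minimal PyTorch sketch of the training loop of claim 4: iterate over actual-sample batches, draw random-sample distributions, and stop once the error falls to the preset threshold. The toy fully connected networks, the binary cross-entropy stand-in for the preset loss function, and all hyperparameters are the editor's assumptions.

```python
import torch
import torch.nn as nn

def train_speech_conversion(real_batches, feat_dim: int,
                            error_threshold: float = 0.05) -> nn.Module:
    """Iteratively train a small GAN and stop at the preset error threshold."""
    gen = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
    disc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
    g_opt = torch.optim.Adam(gen.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(disc.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    for real in real_batches:            # actual-sample feature distributions
        noise = torch.randn_like(real)   # random-sample feature distributions
        fake = gen(noise)                # pseudo-sample feature distributions

        # Discriminator step: separate actual samples from pseudo samples.
        d_loss = (bce(disc(real), torch.ones(len(real), 1)) +
                  bce(disc(fake.detach()), torch.zeros(len(real), 1)))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator step: this error plays the role of the preset-loss output.
        g_loss = bce(disc(fake), torch.ones(len(real), 1))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()

        if g_loss.item() <= error_threshold:  # stop condition of claim 4
            break
    return gen  # the trained speech conversion model
```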
  5. The front-end processing method for speech recognition according to claim 4, wherein iteratively training the adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution comprises:
    inputting the random-sample feature parameter distribution into a generator network of the adversarial network model to be trained, to generate a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution;
    discriminating between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through a discriminator network of the adversarial network model to be trained, to obtain a discrimination-result feature distribution;
    inputting the discrimination-result feature distribution into the generator network again to generate another pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution, and discriminating again between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through the discriminator network, to obtain a new discrimination-result feature distribution;
    performing cyclic iterative training on the adversarial network model to be trained according to the random-sample feature parameter distribution, the actual-sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination-result feature distribution.
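Feeding the discrimination-result feature distribution back into the generator, as claim 5 describes, admits more than one reading. The sketch below takes one plausible interpretation, adding the discriminator's first-pass score onto the generator input before a second pass; the feedback rule and all names are the editor's assumptions.

```python
import torch
import torch.nn as nn

def cyclic_training_step(gen: nn.Module, disc: nn.Module,
                         random_feats: torch.Tensor):
    """One round trip of the cyclic scheme sketched in claim 5."""
    fake_1 = gen(random_feats)   # pseudo-sample distribution, first pass
    score_1 = disc(fake_1)       # discrimination-result distribution

    # Feed the discrimination result back into the generator: here the
    # per-sample score is broadcast onto the original random input.
    fake_2 = gen(random_feats + score_1)
    score_2 = disc(fake_2)       # discrimination result, second pass
    return fake_1, fake_2, score_1, score_2
```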
  6. The front-end processing method for speech recognition according to claim 5, wherein calculating, according to the preset loss function, the error output by the adversarial network model during the iterative training comprises:
    deriving a cycle-consistency loss function and an identity-mapping loss function of the adversarial network model from a first adversarial loss function and a second adversarial loss function, wherein the first adversarial loss function is a loss function measuring the distance between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, and the second adversarial loss function is a loss function measuring the distance between the discrimination-result feature distribution and the random-sample feature distribution;
    obtaining the preset loss function of the adversarial network model according to the cycle-consistency loss function and the identity-mapping loss function;
    outputting, by the adversarial network model, the error calculated by the preset loss function, and using the error as a target training value.
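The cycle-consistency and identity-mapping terms of claim 6 suggest a CycleGAN-style formulation. In the sketch below, the forward and inverse generators `g_xy`/`g_yx`, the discriminator `d_y`, the least-squares adversarial terms, and the lambda weights are assumptions; the claim names only the loss components, and in practice the generator and discriminator terms would be optimised separately rather than summed.

```python
import torch
import torch.nn as nn

l1, mse = nn.L1Loss(), nn.MSELoss()

def preset_loss(g_xy, g_yx, d_y, x, y, lam_cyc: float = 10.0, lam_id: float = 5.0):
    """Combined preset loss of claim 6 in a CycleGAN-style form."""
    fake_y = g_xy(x)
    # Generator-side adversarial term: pull the pseudo-sample distribution
    # toward the actual-sample distribution (least-squares GAN form).
    score_fake = d_y(fake_y)
    adv_gen = mse(score_fake, torch.ones_like(score_fake))
    # Discriminator-side adversarial term on actual and detached pseudo samples.
    score_real, score_fake_d = d_y(y), d_y(fake_y.detach())
    adv_disc = (mse(score_real, torch.ones_like(score_real)) +
                mse(score_fake_d, torch.zeros_like(score_fake_d)))
    # Cycle-consistency: mapping forward and back should recover the input.
    cyc = l1(g_yx(fake_y), x)
    # Identity mapping: a target-domain input should pass through unchanged.
    ident = l1(g_xy(y), y)
    return adv_gen + adv_disc + lam_cyc * cyc + lam_id * ident
```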
  7. The front-end processing method for speech recognition according to claim 1, wherein synthesizing the target speech data according to the second speech feature parameter comprises:
    synthesizing, according to the second speech feature parameter, target speech data that is free of perturbation or has minimal perturbation features, using waveform concatenation and a time-domain pitch-synchronous overlap-add (TD-PSOLA) algorithm.
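Claim 7 names waveform concatenation together with TD-PSOLA. The sketch below shows only the overlap-add core with a fixed hop; genuine TD-PSOLA additionally places each windowed segment at pitch-synchronous instants, which is omitted here. It inverts the framing of the preprocessing sketch after claim 2.

```python
import numpy as np

def overlap_add_synthesis(frames: np.ndarray, hop_len: int) -> np.ndarray:
    """Weighted overlap-add reconstruction of a framed signal."""
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop_len + frame_len)
    norm = np.zeros_like(out)
    window = np.hamming(frame_len)
    for i, frame in enumerate(frames):
        start = i * hop_len
        out[start:start + frame_len] += frame * window  # synthesis window
        norm[start:start + frame_len] += window ** 2    # analysis x synthesis
    return out / np.maximum(norm, 1e-8)  # undo the squared-window weighting
```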
  8. A front-end processing apparatus for speech recognition, comprising:
    an acquisition unit configured to acquire an original speech signal and preprocess the original speech signal according to a preset format to obtain source speech data;
    a feature extraction unit configured to perform speech feature extraction on the source speech data to obtain a first speech feature parameter of the source speech data, the first speech feature parameter being an acoustic feature parameter describing the timbre and prosody of the speech;
    a data processing unit configured to input the first speech feature parameter into a speech conversion model and output, after conversion, a second speech feature parameter, the second speech feature parameter being a feature parameter of target speech data;
    a synthesis unit configured to synthesize the target speech data according to the second speech feature parameter and use the target speech data as an input of a speech recognition model to perform speech recognition.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
    acquiring an original speech signal, and preprocessing the original speech signal according to a preset format to obtain source speech data;
    performing speech feature extraction on the source speech data to obtain a first speech feature parameter of the source speech data, the first speech feature parameter being an acoustic feature parameter describing the timbre and prosody of the speech;
    inputting the first speech feature parameter into a speech conversion model, and outputting, after conversion, a second speech feature parameter, the second speech feature parameter being a feature parameter of target speech data;
    synthesizing the target speech data according to the second speech feature parameter, and using the target speech data as an input of a speech recognition model to perform speech recognition.
  10. The terminal device according to claim 9, wherein the processor, when executing the computer program, further implements:
    filtering the original speech signal;
    periodically sampling the filtered speech signal to obtain speech sample data at a preset frequency;
    performing windowing and framing on the speech sample data to obtain the source speech data.
  11. The terminal device according to claim 9, wherein the processor, when executing the computer program, further implements:
    extracting a Mel-spectrum feature parameter, a logarithmic fundamental-frequency feature parameter, and an aperiodic-component feature parameter of the source speech data through a Mel filter bank;
    obtaining the parameter distributions corresponding to the Mel-spectrum feature parameter, the logarithmic fundamental-frequency feature parameter, and the aperiodic-component feature parameter of the source speech data.
  12. The terminal device according to claim 9, wherein the processor, when executing the computer program, further implements:
    acquiring random samples and actual samples from a speech-sample training data set, and extracting a random-sample feature parameter distribution of the random samples and an actual-sample feature parameter distribution of the actual samples, respectively;
    iteratively training an adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution;
    calculating, according to a preset loss function, the error output by the adversarial network model during the iterative training;
    stopping the training when the error is less than or equal to a preset error threshold, to obtain the speech conversion model.
  13. The terminal device according to claim 12, wherein the processor, when executing the computer program, further implements:
    inputting the random-sample feature parameter distribution into a generator network of the adversarial network model to be trained, to generate a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution;
    discriminating between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through a discriminator network of the adversarial network model to be trained, to obtain a discrimination-result feature distribution;
    inputting the discrimination-result feature distribution into the generator network again to generate another pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution, and discriminating again between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through the discriminator network, to obtain a new discrimination-result feature distribution;
    performing cyclic iterative training on the adversarial network model to be trained according to the random-sample feature parameter distribution, the actual-sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination-result feature distribution.
  14. The terminal device according to claim 13, wherein the processor, when executing the computer program, further implements:
    deriving a cycle-consistency loss function and an identity-mapping loss function of the adversarial network model from a first adversarial loss function and a second adversarial loss function, wherein the first adversarial loss function is a loss function measuring the distance between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, and the second adversarial loss function is a loss function measuring the distance between the discrimination-result feature distribution and the random-sample feature distribution;
    obtaining the preset loss function of the adversarial network model according to the cycle-consistency loss function and the identity-mapping loss function;
    outputting, by the adversarial network model, the error calculated by the preset loss function, and using the error as a target training value.
  15. The terminal device according to claim 9, wherein the processor, when executing the computer program, further implements:
    synthesizing, according to the second speech feature parameter, target speech data that is free of perturbation or has minimal perturbation features, using waveform concatenation and a time-domain pitch-synchronous overlap-add (TD-PSOLA) algorithm.
  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements:
    acquiring an original speech signal, and preprocessing the original speech signal according to a preset format to obtain source speech data;
    performing speech feature extraction on the source speech data to obtain a first speech feature parameter of the source speech data, the first speech feature parameter being an acoustic feature parameter describing the timbre and prosody of the speech;
    inputting the first speech feature parameter into a speech conversion model, and outputting, after conversion, a second speech feature parameter, the second speech feature parameter being a feature parameter of target speech data;
    synthesizing the target speech data according to the second speech feature parameter, and using the target speech data as an input of a speech recognition model to perform speech recognition.
  17. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, further implements:
    filtering the original speech signal;
    periodically sampling the filtered speech signal to obtain speech sample data at a preset frequency;
    performing windowing and framing on the speech sample data to obtain the source speech data.
  18. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, further implements:
    extracting a Mel-spectrum feature parameter, a logarithmic fundamental-frequency feature parameter, and an aperiodic-component feature parameter of the source speech data through a Mel filter bank;
    obtaining the parameter distributions corresponding to the Mel-spectrum feature parameter, the logarithmic fundamental-frequency feature parameter, and the aperiodic-component feature parameter of the source speech data.
  19. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, further implements:
    acquiring random samples and actual samples from a speech-sample training data set, and extracting a random-sample feature parameter distribution of the random samples and an actual-sample feature parameter distribution of the actual samples, respectively;
    iteratively training an adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution;
    calculating, according to a preset loss function, the error output by the adversarial network model during the iterative training;
    stopping the training when the error is less than or equal to a preset error threshold, to obtain the speech conversion model.
  20. The computer-readable storage medium according to claim 19, wherein the computer program, when executed by the processor, further implements:
    inputting the random-sample feature parameter distribution into a generator network of the adversarial network model to be trained, to generate a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution;
    discriminating between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through a discriminator network of the adversarial network model to be trained, to obtain a discrimination-result feature distribution;
    inputting the discrimination-result feature distribution into the generator network again to generate another pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution, and discriminating again between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through the discriminator network, to obtain a new discrimination-result feature distribution;
    performing cyclic iterative training on the adversarial network model to be trained according to the random-sample feature parameter distribution, the actual-sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination-result feature distribution.
PCT/CN2020/135511 2020-03-11 2020-12-11 Speech recognition front-end processing method and apparatus, and terminal device WO2021179717A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010165112.8 2020-03-11
CN202010165112.8A CN111445900A (en) 2020-03-11 2020-03-11 Front-end processing method and device for voice recognition and terminal equipment

Publications (1)

Publication Number Publication Date
WO2021179717A1 (en)

Family

ID=71650573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/135511 WO2021179717A1 (en) 2020-03-11 2020-12-11 Speech recognition front-end processing method and apparatus, and terminal device

Country Status (2)

Country Link
CN (1) CN111445900A (en)
WO (1) WO2021179717A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445900A (en) * 2020-03-11 2020-07-24 平安科技(深圳)有限公司 Front-end processing method and device for voice recognition and terminal equipment
CN112397057A (en) * 2020-12-01 2021-02-23 平安科技(深圳)有限公司 Voice processing method, device, equipment and medium based on generation countermeasure network
CN112652318B (en) * 2020-12-21 2024-03-29 北京捷通华声科技股份有限公司 Tone color conversion method and device and electronic equipment
CN112767927A (en) * 2020-12-29 2021-05-07 平安科技(深圳)有限公司 Method, device, terminal and storage medium for extracting voice features
CN113555026B (en) * 2021-07-23 2024-04-19 平安科技(深圳)有限公司 Voice conversion method, device, electronic equipment and medium
CN115064177A (en) * 2022-06-14 2022-09-16 中国第一汽车股份有限公司 Voice conversion method, apparatus, device and medium based on voiceprint encoder

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170133005A1 (en) * 2015-11-10 2017-05-11 Paul Wendell Mason Method and apparatus for using a vocal sample to customize text to speech applications
CN106782504A (en) * 2016-12-29 2017-05-31 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN107346659A (en) * 2017-06-05 2017-11-14 百度在线网络技术(北京)有限公司 Audio recognition method, device and terminal based on artificial intelligence
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109979436A (en) * 2019-04-12 2019-07-05 南京工程学院 A kind of BP neural network speech recognition system and method based on frequency spectrum adaptive method
CN111445900A (en) * 2020-03-11 2020-07-24 平安科技(深圳)有限公司 Front-end processing method and device for voice recognition and terminal equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882867A (en) * 2022-04-13 2022-08-09 天津大学 Deep network waveform synthesis method and device based on filter bank frequency discrimination
CN114882867B (en) * 2022-04-13 2024-05-28 天津大学 Depth network waveform synthesis method and device based on filter bank frequency discrimination
CN115620748A (en) * 2022-12-06 2023-01-17 北京远鉴信息技术有限公司 Comprehensive training method and device for speech synthesis and false discrimination evaluation

Also Published As

Publication number Publication date
CN111445900A (en) 2020-07-24

Similar Documents

Publication Publication Date Title
WO2021179717A1 (en) Speech recognition front-end processing method and apparatus, and terminal device
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
US20200321008A1 (en) Voiceprint recognition method and device based on memory bottleneck feature
CN106486131B (en) A kind of method and device of speech de-noising
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
WO2020224217A1 (en) Speech processing method and apparatus, computer device, and storage medium
CN110880329B (en) Audio identification method and equipment and storage medium
JP2019522810A (en) Neural network based voiceprint information extraction method and apparatus
EP2363852A1 (en) Computer-based method and system of assessing intelligibility of speech represented by a speech signal
Vyas A Gaussian mixture model based speech recognition system using Matlab
WO2021042537A1 (en) Voice recognition authentication method and system
CN109036437A (en) Accents recognition method, apparatus, computer installation and computer readable storage medium
CN113658583B (en) Ear voice conversion method, system and device based on generation countermeasure network
WO2020073509A1 (en) Neural network-based speech recognition method, terminal device, and medium
US11810546B2 (en) Sample generation method and apparatus
Müller et al. Contextual invariant-integration features for improved speaker-independent speech recognition
CN113646833A (en) Voice confrontation sample detection method, device, equipment and computer readable storage medium
CN113782032B (en) Voiceprint recognition method and related device
CN114927126A (en) Scheme output method, device and equipment based on semantic analysis and storage medium
CN113409771B (en) Detection method for forged audio frequency, detection system and storage medium thereof
CN113112992A (en) Voice recognition method and device, storage medium and server
CN110838294B (en) Voice verification method and device, computer equipment and storage medium
CN116665649A (en) Synthetic voice detection method based on prosody characteristics
CN114582373A (en) Method and device for recognizing user emotion in man-machine conversation
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20923715

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20923715

Country of ref document: EP

Kind code of ref document: A1