CN111341304A - Method, device and equipment for training speech characteristics of speaker based on GAN - Google Patents
- Publication number
- CN111341304A CN111341304A CN202010130403.3A CN202010130403A CN111341304A CN 111341304 A CN111341304 A CN 111341304A CN 202010130403 A CN202010130403 A CN 202010130403A CN 111341304 A CN111341304 A CN 111341304A
- Authority
- CN
- China
- Prior art keywords
- voice
- data
- speaker
- gan
- denoising
- Prior art date
- Legal status (assumption by Google Patents; not a legal conclusion)
- Pending
Images
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The application discloses a GAN-based speaker voice feature training method, device and equipment. After conventional denoising is performed on speaker voice data, feature extraction is performed on the obtained first denoised voice data; the resulting first voice feature data is input into a generator of a preset GAN network; the first denoised voice data is denoised a second time using a mask value to obtain second denoised voice data; and the second denoised voice data is used for voice feature training and recognition. This effectively improves the accuracy of speaker voice recognition and solves the technical problem that existing voice recognition methods have low recognition accuracy.
Description
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method, an apparatus, and a device for training speech characteristics of a speaker based on GAN.
Background
Voice recognition is an important means of identifying a speaker. Existing speaker voiceprint identification acquires speaker voice data, performs voice feature extraction after denoising the data, and then performs voice recognition through a preset voice recognition model. However, the recognition accuracy of this approach is not high, so further improving the accuracy of speaker voice recognition remains a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
The application provides a method, a device and equipment for training the speech characteristics of a speaker based on GAN, which are used for solving the technical problem that the recognition accuracy of the existing speech recognition mode is not high.
In view of the above, a first aspect of the present application provides a method for training speech features of a speaker based on GAN, including:
acquiring voice data of a speaker through a recording device;
carrying out conventional denoising processing on the speaker voice data to obtain first denoised voice data;
performing feature extraction on the first de-noised voice data to obtain first voice feature data;
inputting the first voice feature data into a generator of a preset GAN network, and outputting an ideal mask value of second voice feature data corresponding to the first voice feature data, wherein the ideal mask value is the ratio of the second voice feature data to the first voice feature data;
determining second de-noised voice data of the speaker voice according to the ideal mask value;
and inputting the second denoised voice data into a preset training network for voice feature training.
Optionally, the performing conventional denoising processing on the speaker voice data to obtain first denoised voice data includes:
performing deep recurrent neural network based voice denoising on the speaker voice data to obtain the first denoised voice data.
Optionally, the performing feature extraction on the first denoising voice data to obtain first voice feature data includes:
and performing MFCC feature extraction on the first de-noised voice data to obtain first voice feature data.
Optionally, after the feature extraction is performed on the first de-noised speech data to obtain first speech feature data, before the inputting the first speech feature data into a generator of a preset GAN network and outputting an ideal mask value of second speech feature data corresponding to the first speech feature data, the method further includes:
calculating a mean-variance normalization value of the first voice feature data;
correspondingly, the inputting the first voice feature data into a generator of a preset GAN network and outputting an ideal mask value of second voice feature data corresponding to the first voice feature data includes:
inputting the mean-variance normalization value of the first voice feature data into the generator of the preset GAN network, and outputting the ideal mask value of the second voice feature data corresponding to the first voice feature data.
Optionally, the inputting the first voice feature data into a generator of a preset GAN network, and outputting an ideal mask value of second voice feature data corresponding to the first voice feature data, may further include:
and training and testing the initial GAN network until the initial GAN network converges to obtain the preset GAN network.
A second aspect of the present application provides a GAN-based device for training speech characteristics of a speaker, comprising:
the acquisition unit is used for acquiring the voice data of the speaker through the recording equipment;
the first denoising unit is used for carrying out conventional denoising processing on the speaker voice data to obtain first denoising voice data;
the feature extraction unit is used for performing feature extraction on the first denoising voice data to obtain first voice feature data;
a mask unit, configured to input the first voice feature data into a generator of a preset GAN network, and output an ideal mask value of second voice feature data corresponding to the first voice feature data, where the ideal mask value is a ratio of the second voice feature data to the first voice feature data;
the second denoising unit is used for determining second denoising voice data of the speaker voice according to the ideal mask value;
and the first training unit is used for inputting the second denoising voice data into a preset training network for voice characteristic training.
Optionally, the feature extraction unit is specifically configured to:
and performing MFCC feature extraction on the first de-noised voice data to obtain first voice feature data.
Optionally, the device further comprises:
the second training unit is used for training and testing the initial GAN network until the initial GAN network converges to obtain the preset GAN network;
the normalization unit is used for calculating a mean-variance normalization value of the first voice feature data;
correspondingly, the mask unit is specifically configured to:
and inputting the mean-variance normalization value of the first voice feature data into the generator of the preset GAN network, and outputting the ideal mask value of the second voice feature data corresponding to the first voice feature data.
In a third aspect, the present application provides a GAN-based speaker speech feature training device, the device including a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute any of the GAN-based speaker speech feature training methods of the first aspect according to instructions in the program code.
According to the technical scheme, the embodiment of the application has the following advantages:
the application provides a speaker voice feature training method based on GAN, comprising the following steps: acquiring voice data of a speaker through a recording device; carrying out conventional denoising processing on speaker voice data to obtain first denoising voice data; performing feature extraction on the first de-noised voice data to obtain first voice feature data; inputting the first voice characteristic data into a generator of a preset GAN network, and outputting an ideal mask value of second voice characteristic data corresponding to the first voice characteristic data, wherein the ideal mask value is the ratio of the second voice characteristic data to the first voice characteristic data; determining second de-noised voice data of the speaker voice according to the ideal mask value; and inputting the second denoising voice data into a preset training network for voice characteristic training. According to the method and the device, after the conventional denoising processing is carried out on the speaker voice data, the obtained first denoising voice data Jining feature is extracted, the obtained first voice feature data is input into a generator of a preset GAN network, the first denoising voice data is denoised for the second time by utilizing a mask value to obtain second denoising voice data, the second denoising voice data is utilized to carry out voice feature training and recognition, the accuracy of speaker voice recognition is effectively improved, and the technical problem that the recognition accuracy of the existing voice recognition mode is not high is solved.
Drawings
FIG. 1 is a schematic flow chart illustrating a method for training speech features of a speaker based on GAN according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a GAN-based speaker speech feature training apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
To facilitate understanding, referring to fig. 1, the present application provides an embodiment of a GAN-based speaker speech feature training method, including:
Step 101, acquiring voice data of a speaker through a recording device.

It should be noted that, in this embodiment, speaker voice data needs to be acquired first. The speaker voice data may be collected by a recording device, or existing speaker voice data may be obtained from the network by means of a web crawler.
Step 102, carrying out conventional denoising processing on the speaker voice data to obtain first denoised voice data.
It should be noted that after the speaker voice data is obtained, it is subjected to conventional denoising. A voice denoising method based on a deep recurrent neural network is preferably selected, yielding the first denoised voice data.
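The patent does not specify the recurrent denoiser's architecture. As an illustrative sketch (all weight matrices and shapes here are hypothetical, not from the patent), a single recurrent layer can predict a per-frequency gain for each frame of a magnitude spectrogram and apply it:

```python
import numpy as np

def rnn_denoise(noisy_mag, Wx, Wh, Wo, bh, bo):
    """Toy single-layer RNN denoiser: for each time frame of a
    (frames, bins) magnitude spectrogram, predict a per-bin gain
    in (0, 1) from the recurrent state and apply it."""
    h = np.zeros(Wh.shape[0])
    denoised = []
    for x in noisy_mag:                               # iterate over time frames
        h = np.tanh(Wx @ x + Wh @ h + bh)             # update recurrent state
        gain = 1.0 / (1.0 + np.exp(-(Wo @ h + bo)))   # sigmoid gain per bin
        denoised.append(gain * x)                     # attenuate each bin
    return np.asarray(denoised)
```

In practice the weights would be learned from paired noisy/clean recordings; here they only illustrate the data flow of frame-by-frame recurrent denoising.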
Step 103, performing feature extraction on the first denoised voice data to obtain first voice feature data.
It should be noted that the feature extraction performed on the first denoised speech data may be MFCC feature extraction or PLP feature extraction.
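MFCC extraction is a standard pipeline (framing and windowing, power spectrum, mel filterbank, log, DCT). A compact from-scratch sketch follows; the parameter values (frame length, hop, number of mel bands and coefficients) are common defaults, not values specified by the patent:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC extraction: returns a (frames, n_ceps) matrix."""
    # frame the signal and apply a Hann window
    frames = np.array([signal[i:i + n_fft] * np.hanning(n_fft)
                       for i in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / n_fft

    # triangular mel filterbank
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    logmel = np.log(power @ fb.T + 1e-10)
    # DCT-II decorrelates the log-mel energies into cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T
```

Production systems would normally use a tested library implementation rather than this sketch.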
Step 104, inputting the first voice feature data into a generator of a preset GAN network, and outputting an ideal mask value of the second voice feature data corresponding to the first voice feature data, wherein the ideal mask value is the ratio of the second voice feature data to the first voice feature data.
Step 105, determining second denoised voice data of the speaker voice according to the ideal mask value.
It should be noted that before the first voice feature data is input into the generator of the preset GAN network, an initial GAN network needs to be trained and tested to obtain the preset GAN network. For the first voice feature data, the mean and variance of each dimension can be calculated and each dimension normalized accordingly, forming a mean-variance normalization value for each dimension of the first voice feature data, so that valuable speech is effectively retained and noise is suppressed. The mean-variance normalization value of the first voice feature data is input into the generator of the preset GAN network, which denoises the first voice feature data accordingly, generates the ideal mask value of the second voice feature data corresponding to the first voice feature data, and outputs it. Because the ideal mask value is the ratio of the second voice feature data to the first voice feature data, the second voice feature data is calculated from the ideal mask value and the first voice feature data, and the inverse transform of the feature extraction is then applied to the second voice feature data to obtain the second denoised voice data.
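The two numeric operations in this step — per-dimension mean-variance normalization, and recovering the second voice feature data from the ideal ratio mask — can be sketched as follows. This is a minimal illustration; the patent defines the mask only as the ratio of second to first feature data and does not fix further formulas:

```python
import numpy as np

def mean_var_normalize(feats):
    """Normalize each dimension of a (frames, dims) feature matrix
    to zero mean and unit variance."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + 1e-8   # guard against zero variance
    return (feats - mu) / sigma

def second_features_from_mask(first_feats, ideal_mask):
    """The ideal mask is defined as second / first, so the denoised
    (second) features are recovered by elementwise multiplication."""
    return ideal_mask * first_feats
```

Inverting the feature extraction on the result would then yield the second denoised voice data.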
Step 106, inputting the second denoised voice data into a preset training network for voice feature training.
It should be noted that the second denoised voice data is input into the preset training network for voice feature training, and the trained voice features are used for speaker voice recognition, which can effectively improve the accuracy of speaker recognition.
According to the GAN-based speaker voice feature training method provided by this embodiment, after conventional denoising is performed on the speaker voice data, feature extraction is performed on the obtained first denoised voice data; the resulting first voice feature data is input into the generator of a preset GAN network; the first denoised voice data is denoised a second time using the mask value to obtain second denoised voice data; and the second denoised voice data is used for voice feature training and recognition, effectively improving the accuracy of speaker voice recognition and solving the technical problem that existing voice recognition methods have low recognition accuracy.
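The description above trains and tests an initial GAN until it converges, without specifying the generator or discriminator internals. The alternating-update-until-no-improvement control flow can be sketched generically; the callables `d_step`, `g_step`, and `val_loss` are hypothetical stand-ins for the unspecified update and evaluation routines:

```python
def train_gan_until_converged(d_step, g_step, val_loss,
                              patience=5, max_epochs=200, tol=1e-4):
    """Alternate discriminator and generator updates, stopping once the
    validation loss has not improved by at least `tol` for `patience`
    consecutive epochs (a simple convergence test)."""
    best, wait = float("inf"), 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        d_step()            # update discriminator on real vs. generated masks
        g_step()            # update generator to fool the discriminator
        loss = val_loss()   # evaluate on held-out noisy/clean pairs
        if best - loss > tol:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                break       # converged: no recent improvement
    return epoch, best
```

The resulting generator would then serve as the "preset GAN network" into which the normalized first voice feature data is fed.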
For ease of understanding, referring to fig. 2, an embodiment of a GAN-based speaker phonetic feature training apparatus is provided, comprising:
the acquisition unit is used for acquiring the voice data of the speaker through the recording equipment;
the first denoising unit is used for carrying out conventional denoising processing on the speaker voice data to obtain first denoising voice data;
the feature extraction unit is used for performing feature extraction on the first de-noised voice data to obtain first voice feature data;
the mask unit is used for inputting the first voice characteristic data into a generator of a preset GAN network and outputting an ideal mask value of the second voice characteristic data corresponding to the first voice characteristic data, wherein the ideal mask value is the ratio of the second voice characteristic data to the first voice characteristic data;
the second denoising unit is used for determining second denoising voice data of the speaker voice according to the ideal mask value;
and the first training unit is used for inputting the second denoising voice data into a preset training network for voice characteristic training.
Further, the first denoising unit is specifically configured to:
and performing deep recurrent neural network based voice denoising on the speaker voice data to obtain the first denoised voice data.
Further, the feature extraction unit is specifically configured to:
and performing MFCC feature extraction on the first de-noised voice data to obtain first voice feature data.
Further, the device further comprises:
the second training unit is used for training and testing the initial GAN network until the initial GAN network converges to obtain a preset GAN network;
the normalization unit is used for calculating a mean-variance normalization value of the first voice feature data;
correspondingly, the mask unit is specifically configured to:
and inputting the mean-variance normalization value of the first voice feature data into the generator of the preset GAN network, and outputting the ideal mask value of the second voice feature data corresponding to the first voice feature data.
The application further provides an embodiment of a GAN-based speaker speech feature training device, wherein the device comprises a processor and a memory:
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is used for executing the GAN-based speaker voice feature training method in the embodiment of the GAN-based speaker voice feature training method according to instructions in the program code.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer system (which may be a personal computer, a server, or a network system) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (10)
1. A method for training speech features of a speaker based on GAN, comprising:
acquiring voice data of a speaker through a recording device;
carrying out conventional denoising processing on the speaker voice data to obtain first denoising voice data;
performing feature extraction on the first de-noised voice data to obtain first voice feature data;
inputting the first voice feature data into a generator of a preset GAN network, and outputting an ideal mask value of second voice feature data corresponding to the first voice feature data, wherein the ideal mask value is the ratio of the second voice feature data to the first voice feature data;
determining second de-noised voice data of the speaker voice according to the ideal mask value;
and inputting the second denoising voice data into a preset training network for voice characteristic training.
2. The GAN-based speaker voice feature training method as claimed in claim 1, wherein the performing a conventional denoising process on the speaker voice data to obtain a first denoised voice data comprises:
and performing deep recurrent neural network based voice denoising on the speaker voice data to obtain the first denoised voice data.
3. The GAN-based speaker voice feature training method as claimed in claim 2, wherein the performing feature extraction on the first de-noised voice data to obtain first voice feature data comprises:
and performing MFCC feature extraction on the first de-noised voice data to obtain first voice feature data.
4. The GAN-based speaker voice feature training method as claimed in claim 3, wherein after the feature extraction is performed on the first de-noised voice data to obtain first voice feature data, before the inputting the first voice feature data into a generator of a preset GAN network and outputting an ideal mask value of second voice feature data corresponding to the first voice feature data, the method further comprises:
calculating a mean-variance normalization value of the first voice feature data;
correspondingly, the inputting the first voice feature data into a generator of a preset GAN network and outputting an ideal mask value of second voice feature data corresponding to the first voice feature data includes:
and inputting the mean-variance normalization value of the first voice feature data into the generator of the preset GAN network, and outputting the ideal mask value of the second voice feature data corresponding to the first voice feature data.
5. The GAN-based speaker voice feature training method as claimed in claim 1, wherein the inputting the first voice feature data into a generator of a preset GAN network and outputting the ideal mask value of the second voice feature data corresponding to the first voice feature data further comprises:
and training and testing the initial GAN network until the initial GAN network converges to obtain the preset GAN network.
6. A GAN-based speaker speech feature training device, comprising:
the acquisition unit is used for acquiring the voice data of the speaker through the recording equipment;
the first denoising unit is used for carrying out conventional denoising processing on the speaker voice data to obtain first denoising voice data;
the feature extraction unit is used for performing feature extraction on the first denoising voice data to obtain first voice feature data;
a mask unit, configured to input the first voice feature data into a generator of a preset GAN network, and output an ideal mask value of second voice feature data corresponding to the first voice feature data, where the ideal mask value is a ratio of the second voice feature data to the first voice feature data;
the second denoising unit is used for determining second denoising voice data of the speaker voice according to the ideal mask value;
and the first training unit is used for inputting the second denoising voice data into a preset training network for voice characteristic training.
7. The GAN-based speaker voice feature training device as claimed in claim 6, wherein the first denoising unit is specifically configured to:
and performing deep recurrent neural network based voice denoising on the speaker voice data to obtain the first denoised voice data.
8. The GAN-based speaker voice feature training device as claimed in claim 7, wherein the feature extraction unit is specifically configured to:
and performing MFCC feature extraction on the first de-noised voice data to obtain first voice feature data.
9. The GAN-based speaker voice feature training device as claimed in claim 8, further comprising:
the second training unit is used for training and testing the initial GAN network until the initial GAN network converges to obtain the preset GAN network;
the normalization unit is used for calculating a mean-variance normalization value of the first voice feature data;
correspondingly, the mask unit is specifically configured to:
and inputting the mean-variance normalization value of the first voice feature data into the generator of the preset GAN network, and outputting the ideal mask value of the second voice feature data corresponding to the first voice feature data.
10. A GAN-based speaker speech feature training device, characterized in that the device comprises a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the GAN-based speaker speech feature training method according to any one of claims 1-5 according to instructions in the program code.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010130403.3A CN111341304A (en) | 2020-02-28 | 2020-02-28 | Method, device and equipment for training speech characteristics of speaker based on GAN |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111341304A true CN111341304A (en) | 2020-06-26 |
Family
ID=71187170
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010130403.3A Pending CN111341304A (en) | 2020-02-28 | 2020-02-28 | Method, device and equipment for training speech characteristics of speaker based on GAN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111341304A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112700786A (en) * | 2020-12-29 | 2021-04-23 | 西安讯飞超脑信息科技有限公司 | Voice enhancement method, device, electronic equipment and storage medium |
2020-02-28: Application filed in China (CN202010130403.3A); status: Pending
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107039036A (en) * | 2017-02-17 | 2017-08-11 | 南京邮电大学 | High-quality speaker recognition method based on an auto-encoding deep belief network |
CN107910011A (en) * | 2017-12-28 | 2018-04-13 | 科大讯飞股份有限公司 | Voice denoising method, device, server and storage medium |
CN109256139A (en) * | 2018-07-26 | 2019-01-22 | 广东工业大学 | Speaker recognition method based on triplet loss |
WO2020029906A1 (en) * | 2018-08-09 | 2020-02-13 | 腾讯科技(深圳)有限公司 | Multi-person voice separation method and apparatus |
CN108986835A (en) * | 2018-08-28 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Speech denoising method, apparatus, device and medium based on an improved GAN |
CN109147810A (en) * | 2018-09-30 | 2019-01-04 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and computer storage medium for building a speech enhancement network |
CN109410974A (en) * | 2018-10-23 | 2019-03-01 | 百度在线网络技术(北京)有限公司 | Speech enhancement method, device, equipment and storage medium |
CN109119093A (en) * | 2018-10-30 | 2019-01-01 | Oppo广东移动通信有限公司 | Voice denoising method, device, storage medium and mobile terminal |
CN109326302A (en) * | 2018-11-14 | 2019-02-12 | 桂林电子科技大学 | Speech enhancement method based on voiceprint comparison and generative adversarial networks |
CN109785852A (en) * | 2018-12-14 | 2019-05-21 | 厦门快商通信息技术有限公司 | Method and system for enhancing a speaker's voice |
CN110164425A (en) * | 2019-05-29 | 2019-08-23 | 北京声智科技有限公司 | Noise reduction method, device and equipment |
CN110223429A (en) * | 2019-06-19 | 2019-09-10 | 上海应用技术大学 | Voice access control system |
CN110503974A (en) * | 2019-08-29 | 2019-11-26 | 泰康保险集团股份有限公司 | Adversarial speech recognition method, device, equipment and computer-readable storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112700786A (en) * | 2020-12-29 | 2021-04-23 | 西安讯飞超脑信息科技有限公司 | Voice enhancement method, device, electronic equipment and storage medium |
CN112700786B (en) * | 2020-12-29 | 2024-03-12 | 西安讯飞超脑信息科技有限公司 | Speech enhancement method, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
CN110457432B (en) | Interview scoring method, device, equipment and storage medium | |
WO2021128741A1 (en) | Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium | |
US20160111112A1 (en) | Speaker change detection device and speaker change detection method | |
CN110544469B (en) | Training method and device of voice recognition model, storage medium and electronic device | |
CN111312286A (en) | Age identification method, device, equipment and computer-readable storage medium | |
CN110556126A (en) | Voice recognition method and device and computer equipment | |
CN112382300A (en) | Voiceprint identification method, model training method, device, equipment and storage medium | |
CN111108552A (en) | Voiceprint identity identification method and related device | |
CN106782503A (en) | Automatic speech recognition method based on physiologic information in phonation | |
CN110634490A (en) | Voiceprint identification method, device and equipment | |
CN111108554A (en) | Voiceprint recognition method based on voice noise reduction and related device | |
CN113112992B (en) | Voice recognition method and device, storage medium and server | |
CN111133508A (en) | Method and device for selecting comparison phonemes | |
CN113409771B (en) | Forged audio detection method, detection system and storage medium | |
CN111341304A (en) | Method, device and equipment for training speech characteristics of speaker based on GAN | |
CN107993666B (en) | Speech recognition method, speech recognition device, computer equipment and readable storage medium | |
CN113178204B (en) | Low-power single-channel noise reduction method, device and storage medium | |
CN112786058B (en) | Voiceprint model training method, device, equipment and storage medium | |
CN111462736B (en) | Image generation method and device based on voice and electronic equipment | |
CN111341321A (en) | Matlab-based spectrogram generating and displaying method and device | |
CN112489678A (en) | Scene recognition method and device based on channel characteristics | |
CN111149154B (en) | Voiceprint recognition method, device, equipment and storage medium | |
CN113782033B (en) | Voiceprint recognition method, device, equipment and storage medium | |
CN112634942B (en) | Method for identifying originality of mobile phone recording, storage medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 2020-06-26 |