CN112581929B - Voice privacy density masking signal generation method and system based on generation countermeasure network - Google Patents

Voice privacy density masking signal generation method and system based on generation countermeasure network

Info

Publication number
CN112581929B
CN112581929B (application CN202011450095.9A)
Authority
CN
China
Prior art keywords
signal
voice
generating
generator
target
Prior art date
Legal status
Active
Application number
CN202011450095.9A
Other languages
Chinese (zh)
Other versions
CN112581929A (en)
Inventor
李晔
冯涛
张鹏
李姝
汪付强
Current Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202011450095.9A priority Critical patent/CN112581929B/en
Publication of CN112581929A publication Critical patent/CN112581929A/en
Application granted granted Critical
Publication of CN112581929B publication Critical patent/CN112581929B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10KSOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a method and a system for generating a voice privacy masking signal based on a generative adversarial network, comprising the following steps: generating a random noise signal; and inputting the random noise signal into a trained generative adversarial network, whose trained generator produces a masking signal for protecting voice privacy. The masking signal has characteristics similar to the speech of a speaker in the conference room and high naturalness, while its content carries no useful meaning for an eavesdropper, so the eavesdropper is effectively disturbed. The invention not only solves the problems that common masking signals have low masking efficiency and may negatively affect the speakers, but also saves manpower and material resources and adapts well to different environments.

Description

Voice privacy density masking signal generation method and system based on generation countermeasure network
Technical Field
The present application relates to the field of speech signal processing technologies, and in particular, to a method and a system for generating a voice privacy masking signal based on a generative adversarial network.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
At this stage, many countries and companies have suffered leaks because they paid insufficient attention to conference room security. Conference room confidentiality is of great significance to national security, commercial security, and related areas, and protection against acoustic leakage is an important part of that work. If a company's business secrets are eavesdropped, the company may lose a bid or even collapse, and national interests may be harmed as well.
At present, sound masking technology is mainly used to protect the sound information in a confidential conference room; the main masking signals include white noise, sub-noise, and speech-like signals. Compared with a noise masking signal, a speech-like signal shares similar characteristics with the speech signal, is more confusing to a listener, and achieves a better masking effect.
At present, methods for generating speech-like masking signals mainly generate random text and then synthesize a speech-like signal with speech synthesis technology. This approach involves a large workload and consumes considerable manpower and material resources to gather statistics on the probabilities of characters, words, segments, and so on; meanwhile, the speech-like signals produced by existing methods have low naturalness and cannot track the characteristics of the speaker.
Disclosure of Invention
In order to overcome the defects of the prior art, the application provides a voice privacy masking signal generation method and system based on a generative adversarial network (GAN). The masking signal has characteristics similar to the speech of a speaker in the conference room and high naturalness, while its content carries no useful meaning for an eavesdropper, so the eavesdropper is effectively disturbed. The invention not only solves the problems that common masking signals have low masking efficiency and may negatively affect the speakers, but also saves manpower and material resources and adapts well to different environments.
In a first aspect, the present application provides a method for generating a voice privacy masking signal based on a generative adversarial network.
The voice privacy masking signal generation method based on a generative adversarial network comprises the following steps:
generating a random noise signal;
and inputting the random noise signal into a trained generative adversarial network, whose trained generator generates a masking signal for protecting voice privacy.
In a second aspect, the present application provides a system for generating a voice privacy masking signal based on a generative adversarial network.
The voice privacy masking signal generation system based on a generative adversarial network comprises:
a generation module configured to generate a random noise signal;
an output module configured to input the random noise signal into the trained generative adversarial network, whose trained generator generates a masking signal for protecting voice privacy.
In a third aspect, the present application further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the beneficial effects of this application are:
the method and the device fully consider the requirement of sound masking in the conference room, abandon the prior method of generating signals by similar voices, introduce a neural network and utilize the strong learning capacity of the neural network and the game idea of generating a confrontation network. The method enables the generation of a more disruptive masking signal that is of no practical significance.
Advantages of additional aspects of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application, and the description of the exemplary embodiments and illustrations of the application are intended to explain the application and are not intended to limit the application.
Fig. 1 is a block diagram of the GAN-based method for generating a masking signal for Chinese speech privacy.
Fig. 2 is a flow chart of the training stage of the GAN-based method for generating a masking signal for Chinese speech privacy.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular is intended to include the plural unless the context clearly dictates otherwise, and furthermore, it should be understood that the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Sound information can leak from a secure conference room in two main ways: active leakage and unintentional leakage.
Active leakage refers to leakage caused by eavesdropping equipment installed inside the conference room.
Unintentional leakage refers to sound that leaks during a conference in the form of air-conducted sound, structure-borne (solid-conducted) sound, and the like, and is heard by unauthorized people.
Specifically, the paths of unintentional sound leakage mainly include doors, windows, walls, various ducts, and so on.
The method proposed in the present application is mainly directed at the unintentional leakage of sound signals.
At present, sound masking technology is mostly adopted to protect against the unintentional leakage of sound signals. Specifically, an interference source is arranged at the positions and paths where sound may leak, and an interference signal is generated to mask the useful speech signal, thereby protecting against sound leakage.
Example one
The embodiment provides a voice privacy masking signal generation method based on a generative adversarial network.
As shown in fig. 1 and fig. 2, the method for generating a voice privacy masking signal based on a generative adversarial network includes:
s101: generating a random noise signal;
s102: inputting the random noise signal into the trained generative adversarial network, whose trained generator generates a masking signal for protecting voice privacy.
For example, after a masking signal for protecting voice privacy is produced inside the conference room, the signal obtained by an eavesdropper outside the conference room through the wall medium is a speech-like signal whose content is of no substantive value to the eavesdropper.
Illustratively, generating the random noise signal refers to generating a segment of random noise in the program through a NumPy (np) routine.
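As a minimal sketch of this step (the exact NumPy call is not given in the original, so the uniform distribution, segment length and batch size below are assumptions taken from the framing described later in this description):

```python
import numpy as np

def generate_random_noise(num_segments: int = 150, segment_len: int = 16384) -> np.ndarray:
    """Return uniformly distributed noise in [-1, 1], one row per 16384-sample segment."""
    return np.random.uniform(-1.0, 1.0, size=(num_segments, segment_len)).astype(np.float32)

noise = generate_random_noise()
print(noise.shape)  # (150, 16384)
```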
As one or more embodiments, the generative adversarial network includes:
a generator and a discriminator connected to each other.
Illustratively, the network configuration may be selected according to the size of the data set:
for small data sets, the generator and discriminator of the generative adversarial network (GAN) may be fully convolutional networks;
for large data sets, the generator and discriminator of the GAN may be convolutional neural networks.
As one or more embodiments, the step of training the generative adversarial network includes:
s102a1: constructing a training set; the training set is a library of target speech-like signals;
s102a2: inputting the random noise signal into the generator to obtain a speech-like signal generated by the generator;
s102a3: inputting the generated speech-like signal and the speech-like signals of the target library into the discriminator at the same time; the discriminator outputs the probability that the generated signal is a target speech-like signal, and the game between the generator and the discriminator improves the generator's ability to approximate the target speech-like signals, finally yielding the trained generative adversarial network.
Illustratively, the target speech-like signals are selected from the speech signals in the THCHS30 data set.
As one or more embodiments, as shown in fig. 2, the detailed steps of training the generative adversarial network include:
s102b1: initializing the generator to obtain an initialized generator;
s102b2: initializing the discriminator to obtain an initialized discriminator;
s102b3: optimizing the weights of the generator and the discriminator;
s102b4: repeating step s102b3 and judging whether the set number of iterations has been reached; if so, stopping training to obtain the trained generative adversarial network; if not, continuing training.
As one or more embodiments, after S101 (generating a random noise signal) and before S102 (inputting the random noise signal into the trained generative adversarial network, whose trained generator generates the masking signal for protecting voice privacy), the method further comprises:
s101-2: the speech signals in the data set are pre-processed.
Further, preprocessing the target speech signals specifically comprises the following steps (see the sketch after this list):
s101-21: performing pre-emphasis on the target speech signals;
s101-22: performing data normalization on the pre-emphasized signals.
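The following is an illustrative sketch of these two preprocessing steps; the pre-emphasis coefficient 0.95 follows the value quoted later in this description, the int16 scaling matches the -32767..32767 data range described for the data set, and the function names are our own:

```python
import numpy as np

def pre_emphasis(x: np.ndarray, coeff: float = 0.95) -> np.ndarray:
    """Pre-emphasis filter y[n] = x[n] - coeff * x[n-1]; boosts high-frequency content."""
    return np.append(x[0], x[1:] - coeff * x[:-1])

def normalize_int16(x: np.ndarray) -> np.ndarray:
    """Scale 16-bit PCM samples (range -32767..32767) into [-1, 1]."""
    return x.astype(np.float32) / 32767.0
```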
Further, S102b1 (initializing the generator to obtain an initialized generator) specifically comprises:
s102b11: taking out the random noise separately and adjusting its dimensions;
s102b12: determining the size, stride and padding mode of the two-dimensional convolution kernels, adjusting the dimensions after each two-dimensional convolution, and applying an activation function to the convolution result of each layer;
concatenating (splicing) the two-dimensional convolution result with Gaussian noise of the same size;
performing two-dimensional deconvolution on the concatenated result, where the result of each deconvolution layer uses an activation function;
s102b13: applying an activation function to the output of the last layer to obtain the generated speech-like signal.
Illustratively, in S102b11 the dimensions of the generated random noise are adjusted to four dimensions of size [150, 16384, 1, 1].
Illustratively, in S102b12 the batch size is set to 150 in this example; according to the number of channels of each layer of the convolutional neural network, the convolution kernel size is set to [31, 1, number of input channels, number of output channels], the stride is set to [1, 2, 1, 1], and the padding mode is SAME.
The sizes of the layers differ and are adjusted according to the convolution result of each neural network layer in the program. A convolution kernel size is four-dimensional: [kernel height, kernel width, number of input channels, number of output channels]. In both the convolution and deconvolution stages the kernel height is 31 and the kernel width is 1; the number of input channels equals the number of output channels of the previous layer. The numbers of output channels per layer are [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024] in the encoder stage and [1024, 512, 256, 128, 64, 64, 32, 1] in the decoder stage; the stride of each layer is [1, 2, 1, 1], and the padding mode is SAME.
Illustratively, in S102b13 a tanh activation function is used, formulated as tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
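The generator structure described in S102b11–S102b13 can be sketched roughly as follows. This is a hedged reconstruction, not the patented implementation: tf.keras layers stand in for the underlying TensorFlow operations, the (31, 1) kernels, (2, 1) strides and SAME padding follow the values above, the encoder channel list follows the numbers quoted above, and the decoder is assumed to mirror the encoder so that the output length equals the 16384-sample input (the decoder list printed above may be incomplete):

```python
import tensorflow as tf

ENCODER_CHANNELS = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]
# Assumption: mirror the encoder so that 11 stride-2 upsampling layers restore
# the original 16384-sample length; the final layer has a single output channel.
DECODER_CHANNELS = [512, 256, 256, 128, 128, 64, 64, 32, 32, 16, 1]

def build_generator(segment_len: int = 16384) -> tf.keras.Model:
    noise_in = tf.keras.Input(shape=(segment_len, 1, 1), name="random_noise")

    # Encoder: stacked strided (31, 1) convolutions with SAME padding.
    x = noise_in
    for ch in ENCODER_CHANNELS:
        x = tf.keras.layers.Conv2D(ch, (31, 1), strides=(2, 1), padding="same")(x)
        x = tf.keras.layers.LeakyReLU()(x)

    # "Splice" (concatenate) the encoded representation with Gaussian noise
    # of the same size, as in step S102b12.
    x = tf.keras.layers.Lambda(
        lambda t: tf.concat([t, tf.random.normal(tf.shape(t))], axis=-1))(x)

    # Decoder: transposed (deconvolution) layers, activation after each layer.
    for i, ch in enumerate(DECODER_CHANNELS):
        x = tf.keras.layers.Conv2DTranspose(ch, (31, 1), strides=(2, 1), padding="same")(x)
        if i < len(DECODER_CHANNELS) - 1:
            x = tf.keras.layers.LeakyReLU()(x)

    # S102b13: tanh on the last layer yields the speech-like signal in [-1, 1].
    out = tf.keras.layers.Activation("tanh", name="speech_like")(x)
    return tf.keras.Model(noise_in, out)
```

The use of LeakyReLU for the intermediate activations is likewise an assumption; the description only states that an activation function is applied per layer.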
Further, S102b2 (initializing the discriminator to obtain an initialized discriminator) specifically comprises:
s102b21: defining a target speech-like signal as the initial w sequence;
s102b22: creating a Gaussian noise sequence with the same dimensions and size as the initial w sequence, and adding it to the initial w sequence to obtain the first w sequence;
s102b23: adjusting the dimensions of the first w sequence; determining the convolution kernel size, stride, padding mode and so on of the two-dimensional convolution layers; performing virtual batch normalization on the w sequence after each two-dimensional convolution and applying an activation function to the batch-normalization result; after eleven two-dimensional convolutions, the second w sequence is obtained;
s102b24: performing a one-dimensional convolution on the second w sequence and feeding the result into the fully connected layer, which outputs the probability that the input is real data (a value close to 1 indicates real data).
Illustratively, in S102b22 a Gaussian noise sequence with the same dimensions and size as the w sequence is created and added to the w sequence to obtain a new w sequence; the mean of the Gaussian noise is zero and its variance is 0.5.
Illustratively, in S102b23 the parameters are the same as in the generator initialization stage, and the purpose of virtual batch normalization is to accelerate the convergence of the model.
The stride is [1, 2, 1, 1] and the padding mode is SAME. As in the encoder of the generator, the kernel height is 31 and the kernel width is 1 in both convolution and deconvolution; the number of input channels equals the number of output channels of the previous layer, and the numbers of output channels of the neural network layers are:
[16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024].
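A rough sketch of this discriminator is given below, under stated assumptions: ordinary batch normalization stands in for virtual batch normalization, a 1×1 convolution stands in for the one-dimensional convolution before the fully connected layer, and LeakyReLU and the sigmoid output are our own choices; only the (31, 1) kernels, (2, 1) strides, SAME padding, channel widths and the added Gaussian noise (variance 0.5) come from the description above:

```python
import tensorflow as tf

DISC_CHANNELS = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]

def build_discriminator(segment_len: int = 16384) -> tf.keras.Model:
    w_in = tf.keras.Input(shape=(segment_len, 1, 1), name="w_sequence")

    # S102b22: add zero-mean Gaussian noise with variance 0.5 (stddev ~0.707);
    # the Keras layer applies it in training mode only.
    x = tf.keras.layers.GaussianNoise(stddev=0.7071)(w_in)

    # S102b23: eleven strided (31, 1) convolutions with normalization + activation.
    for ch in DISC_CHANNELS:
        x = tf.keras.layers.Conv2D(ch, (31, 1), strides=(2, 1), padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)  # stand-in for virtual batch norm
        x = tf.keras.layers.LeakyReLU()(x)

    # S102b24: a final convolution, then a fully connected layer that outputs the
    # probability that the input is a target (real) speech-like signal.
    x = tf.keras.layers.Conv2D(1, (1, 1), padding="same")(x)
    x = tf.keras.layers.Flatten()(x)
    prob = tf.keras.layers.Dense(1, activation="sigmoid", name="p_target")(x)
    return tf.keras.Model(w_in, prob)
```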
Further, S102b3 (optimizing the weights of the generator and the discriminator) specifically comprises:
s102b31: the discriminator takes the speech in the data set as real data; by performing the operations of the discriminator initialization stage, it outputs the probability that the input is "real", which represents real data.
The discriminator also takes the speech-like data generated by the generator as fake data; by performing the operations of the initialization stage, it outputs the probability that the input is "fake", which represents fake data.
The loss value of the discriminator's loss function is then calculated;
s102b32: according to the loss value of the generator, updating the convolution and deconvolution kernel weights of the generator initialization as well as the γ and β values of batch normalization.
According to the loss value of the discriminator, updating the convolution and deconvolution kernel weights of the discriminator initialization as well as the γ and β values of virtual batch normalization.
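The loss formula itself is only given as an image in the original document, so the following training-step sketch assumes a standard GAN binary cross-entropy loss together with the RMSProp optimizer named in S102a12; the batch-norm γ and β values are updated automatically because they belong to each model's trainable variables. `generator` and `discriminator` refer to the sketches above:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.RMSprop(learning_rate=1e-4)  # learning rate is an assumption
d_opt = tf.keras.optimizers.RMSprop(learning_rate=1e-4)

@tf.function
def train_step(generator, discriminator, real_speech, noise):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_speech = generator(noise, training=True)
        p_real = discriminator(real_speech, training=True)
        p_fake = discriminator(fake_speech, training=True)
        # Discriminator: push real data toward "true" (1), generated data toward "false" (0).
        d_loss = bce(tf.ones_like(p_real), p_real) + bce(tf.zeros_like(p_fake), p_fake)
        # Generator: try to make the discriminator label generated speech as "true".
        g_loss = bce(tf.ones_like(p_fake), p_fake)
    # Update kernel weights and batch-norm gamma/beta of both networks.
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss
```

This step is repeated for the set number of iterations, as in S102b4.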
Further, S102a1 (constructing the training set, the training set being a library of target speech-like signals) comprises the following specific steps:
s102a11: integrating the THCHS30 data set into tfrecords files, and labeling the target speech data in the files as the wav class;
s102a12: determining the optimizer of the generative adversarial network, and reading the target speech from the tfrecords file;
s102a13: changing the amplitude of the target speech, and applying pre-emphasis in the range 0.9-1 to the target speech;
s102a14: putting the target speech into a queue and, in each iteration, taking out the required batch of target speech together with the random noise generated by the program.
It will be appreciated that the optimizer acts to update and calculate the network parameters that affect the model training and model output to approximate or reach optimal values.
In step S102a11, the data type in the tfrecords file is int, with values ranging from -32767 to 32767. The sampling rate of the input data set is 16 kHz, so each example is set to 16384 samples; this length is not a limitation and can be adjusted according to the data sampling rate.
Illustratively, in the step S102a12, the optimizer is determined to be RMSProp.
It should be understood that the amplitude range of the random noise and the clean speech is scaled to [-1, 1] to prevent problems such as gradient explosion, and a pre-emphasis coefficient of 0.95 is used so that the high-frequency characteristics of the speech are better represented.
Illustratively, in step S102a14 each batch consists of 150 segments of 16384 sample points.
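An illustrative tf.data input pipeline for these steps is sketched below; the feature key "wav", the int64 storage type and the shuffle buffer are assumptions (the exact tfrecords schema is not given), while the 16384-sample examples, the [-1, 1] scaling, the 0.95 pre-emphasis and the batch size of 150 follow the description above:

```python
import tensorflow as tf

def parse_example(serialized):
    # Assumption: each example stores one 16384-sample segment under the key "wav".
    features = {"wav": tf.io.FixedLenFeature([16384], tf.int64)}
    parsed = tf.io.parse_single_example(serialized, features)
    wav = tf.cast(parsed["wav"], tf.float32) / 32767.0                 # scale to [-1, 1]
    wav = tf.concat([wav[:1], wav[1:] - 0.95 * wav[:-1]], axis=0)      # 0.95 pre-emphasis
    return tf.reshape(wav, (16384, 1, 1))

def make_dataset(tfrecord_path: str, batch_size: int = 150) -> tf.data.Dataset:
    return (tf.data.TFRecordDataset(tfrecord_path)
            .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(1000)
            .batch(batch_size, drop_remainder=True)
            .prefetch(tf.data.AUTOTUNE))
```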
Random noise z is input into the trained generator, which produces a speech-like signal through the following process (a code sketch follows these steps):
1. Read the random noise file and judge whether the sampling rate is 16 kHz. If yes, go to the next step; if not, end.
2. Load the weights of the trained model.
3. Convert the amplitude of the read data to the range -1 to 1.
The input random noise is 16-bit, i.e. its amplitude lies in the range -32767 to 32767, so dividing the random noise by 32767 changes the amplitude range to -1 to 1.
4. Determine the data size with a Python instruction.
5. Feed the data to the generator in segments of 16384 samples and save the generated result.
Each input consists of 16384 samples, which corresponds to approximately one second at a sampling rate of 16 kHz.
6. Write the saved data into a wav file.
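A hedged end-to-end sketch of this inference procedure is given below; reading and writing wav files with scipy.io.wavfile, the file names and the int16 output conversion are our own choices, the trained weights (step 2) are assumed to be loaded into `generator` beforehand, and the 16 kHz check, the division by 32767 and the 16384-sample segmentation follow the steps above:

```python
import numpy as np
from scipy.io import wavfile

def generate_masking_signal(generator, noise_path="noise.wav", out_path="masking.wav"):
    rate, noise = wavfile.read(noise_path)
    if rate != 16000:                                   # step 1: require 16 kHz input
        raise ValueError("expected a 16 kHz noise file")
    noise = noise.astype(np.float32) / 32767.0          # step 3: scale amplitude to [-1, 1]
    segments = []
    for start in range(0, len(noise) - 16383, 16384):   # step 5: ~1 s chunks of 16384 samples
        chunk = noise[start:start + 16384].reshape(1, 16384, 1, 1)
        segments.append(generator.predict(chunk).reshape(-1))
    masking = np.concatenate(segments)
    wavfile.write(out_path, 16000, (masking * 32767).astype(np.int16))  # step 6: save wav
```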
The method is based on a generative adversarial network. A segment of random noise defined by the program is converted by the generator's multilayer convolutional neural network into an interfering (speech-like) output. The inputs of the discriminator are the speech-like signal produced by the generator and target speech signals formed from data in a known data set; the discriminator judges, through its multilayer convolutional neural network, the probability that its input is a target speech signal. The mutual game between the generator and the discriminator improves the ability of the generated speech-like signal to approximate the target signal. The masking signal obtained in this way has higher naturalness and is smoother than traditional masking signals, which further increases the confusion caused by the interfering speech and therefore the confidentiality and security of the conference room.
According to the method, with the generative adversarial network technique, an input clean speech signal is converted into a speech-like output by the generator's multilayer convolutional neural network; the inputs of the discriminator are the interfering speech produced by the generator and the target signal, and the discriminator judges, through its multilayer convolutional neural network, the probability that the input is the target signal. The mutual game between the generator and the discriminator improves the generator's ability to approximate the target signal. It should be noted that the generative adversarial network described in the present application is not limited to the convolutional neural network of the example; it may also use a fully convolutional neural network, a recurrent neural network, and the like.
Example two
The embodiment provides a voice privacy masking signal generation system based on a generative adversarial network.
The voice privacy masking signal generation system based on a generative adversarial network comprises:
a generation module configured to generate a random noise signal;
an output module configured to input the random noise signal into the trained generative adversarial network, whose trained generator generates a masking signal for protecting voice privacy.
It should be noted here that the generation module and the output module correspond to steps S101 to S102 in the first embodiment; the modules share the examples and application scenarios of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the modules described above, as parts of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
Example three
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (7)

1. A method for generating a voice privacy masking signal based on a generative adversarial network, characterized by comprising the following steps:
generating a random noise signal;
inputting the random noise signal into a trained generative adversarial network, a trained generator of the generative adversarial network generating a masking signal for protecting voice privacy, wherein the masking signal is a speech-like signal;
wherein the step of training the generative adversarial network comprises:
constructing a training set, the training set being a library of target speech-like signals;
inputting the random noise signal into the generator to obtain a speech-like signal generated by the generator;
inputting the generated speech-like signal and the speech-like signals of the target library into the discriminator at the same time, the discriminator outputting the probability that the generated speech-like signal is a target speech-like signal, the game between the generator and the discriminator improving the ability of the generated speech-like signal to approximate the target speech-like signals, and finally obtaining the trained generative adversarial network.
2. The method for generating a voice privacy masking signal based on a generative adversarial network according to claim 1, wherein the step of training the generative adversarial network comprises:
initializing a generator to obtain an initialized generator;
initializing the discriminator to obtain an initialized discriminator;
optimizing the weights of the generator and the discriminator;
repeating the step of optimizing the weights of the generator and the discriminator, and judging whether the set number of iterations has been reached; if so, stopping training to obtain the trained generative adversarial network; if not, continuing training;
wherein initializing the generator to obtain the initialized generator specifically comprises:
taking out the random noise separately and adjusting its dimensions;
determining the size, stride and padding mode of the two-dimensional convolution kernels, adjusting the dimensions after each two-dimensional convolution, and applying an activation function to the convolution result of each layer; concatenating the two-dimensional convolution result with Gaussian noise of the same size; performing two-dimensional deconvolution on the concatenated result, the result of each deconvolution layer using an activation function;
applying an activation function to the output of the last layer to obtain the generated speech-like signal;
wherein initializing the discriminator to obtain the initialized discriminator specifically comprises:
defining a target speech-like signal as an initial w sequence;
creating a Gaussian noise sequence with the same dimensions and size as the initial w sequence, and adding it to the initial w sequence to obtain a first w sequence;
adjusting the dimensions of the first w sequence; determining the convolution kernel size, stride and padding mode of the two-dimensional convolution layers, performing two-dimensional convolution on the first w sequence, performing virtual batch normalization on the w sequence generated after the convolution, applying an activation function to the batch-normalization result, and obtaining a second w sequence after eleven two-dimensional convolutions;
performing a one-dimensional convolution on the second w sequence and feeding the result into the fully connected layer to output the probability that the input is real data, the probability value being close to 1 for real data.
3. The method for generating a voice privacy masking signal based on a generative adversarial network according to claim 1, wherein, after the step of generating a random noise signal and before the step of inputting the random noise signal into the trained generative adversarial network, whose trained generator generates the masking signal for protecting voice privacy, the method further comprises:
preprocessing a target speech signal;
wherein preprocessing the target speech signal specifically comprises:
performing pre-emphasis on the target speech signal;
performing data normalization on the pre-emphasized signal.
4. The method for generating a voice privacy masking signal based on a generative adversarial network according to claim 1, wherein constructing the training set, the training set being a library of target speech-like signals, specifically comprises:
integrating the THCHS30 data set into tfrecords files, and labeling the target speech data in the files as the wav class;
determining the optimizer of the generative adversarial network, and reading the target speech from the tfrecords file;
changing the amplitude of the target speech, and applying pre-emphasis in the range 0.9-1 to the target speech;
putting the target speech into a queue, and taking out the required target speech and the random noise generated by the program each time.
5. A system for generating a voice privacy masking signal based on a generative adversarial network, characterized by comprising:
a generation module configured to generate a random noise signal;
an output module configured to input the random noise signal into a trained generative adversarial network, a trained generator of the generative adversarial network generating a masking signal for protecting voice privacy, wherein the masking signal is a speech-like signal;
wherein the step of training the generative adversarial network comprises:
constructing a training set, the training set being a library of target speech-like signals;
inputting the random noise signal into the generator to obtain a speech-like signal generated by the generator;
inputting the generated speech-like signal and the speech-like signals of the target library into the discriminator at the same time, the discriminator outputting the probability that the generated speech-like signal is a target speech-like signal, the game between the generator and the discriminator improving the ability of the generated speech-like signal to approximate the target speech-like signals, and finally obtaining the trained generative adversarial network.
6. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-4.
7. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 4.
CN202011450095.9A 2020-12-11 2020-12-11 Voice privacy density masking signal generation method and system based on generation countermeasure network Active CN112581929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011450095.9A CN112581929B (en) 2020-12-11 2020-12-11 Voice privacy density masking signal generation method and system based on generation countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011450095.9A CN112581929B (en) 2020-12-11 2020-12-11 Voice privacy density masking signal generation method and system based on generation countermeasure network

Publications (2)

Publication Number Publication Date
CN112581929A CN112581929A (en) 2021-03-30
CN112581929B (en) 2022-06-03

Family

ID=75131264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011450095.9A Active CN112581929B (en) 2020-12-11 2020-12-11 Voice privacy density masking signal generation method and system based on generation countermeasure network

Country Status (1)

Country Link
CN (1) CN112581929B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
CN110808057A (en) * 2019-10-31 2020-02-18 南昌航空大学 Voice enhancement method for generating confrontation network based on constraint naive
WO2020232180A1 (en) * 2019-05-14 2020-11-19 Dolby Laboratories Licensing Corporation Method and apparatus for speech source separation based on a convolutional neural network

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102014214052A1 (en) * 2014-07-18 2016-01-21 Bayerische Motoren Werke Aktiengesellschaft Virtual masking methods
CN107945811B (en) * 2017-10-23 2021-06-01 北京大学 Frequency band expansion-oriented generation type confrontation network training method and audio encoding and decoding method
CN108681774B (en) * 2018-05-11 2021-05-14 电子科技大学 Human body target tracking method based on generation of confrontation network negative sample enhancement
US20200075035A1 (en) * 2018-08-30 2020-03-05 Thomas Garth, III Sound Shield Device
CN109492416B (en) * 2019-01-07 2022-02-11 南京信息工程大学 Big data image protection method and system based on safe area
US20200242771A1 (en) * 2019-01-25 2020-07-30 Nvidia Corporation Semantic image synthesis for generating substantially photorealistic images using neural networks
US12001950B2 (en) * 2019-03-12 2024-06-04 International Business Machines Corporation Generative adversarial network based audio restoration
CN109859737A (en) * 2019-03-28 2019-06-07 深圳市升弘创新科技有限公司 Communication encryption method, system and computer readable storage medium
CN111261147B (en) * 2020-01-20 2022-10-11 浙江工业大学 Music embedding attack defense method for voice recognition system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
WO2020232180A1 (en) * 2019-05-14 2020-11-19 Dolby Laboratories Licensing Corporation Method and apparatus for speech source separation based on a convolutional neural network
CN110808057A (en) * 2019-10-31 2020-02-18 南昌航空大学 Voice enhancement method for generating confrontation network based on constraint naive

Also Published As

Publication number Publication date
CN112581929A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
Alzantot et al. Did you hear that? adversarial examples against automatic speech recognition
GB2567703A (en) Secure voice biometric authentication
US9589560B1 (en) Estimating false rejection rate in a detection system
CN102938254A (en) Voice signal enhancement system and method
US20210264906A1 (en) Intent recognition model creation from randomized intent vector proximities
US20070083361A1 (en) Method and apparatus for disturbing the radiated voice signal by attenuation and masking
US11545136B2 (en) System and method using parameterized speech synthesis to train acoustic models
US20230056680A1 (en) Integrating dialog history into end-to-end spoken language understanding systems
US20210343292A1 (en) Techniques for converting natural speech to programming code
US11335329B2 (en) Method and system for generating synthetic multi-conditioned data sets for robust automatic speech recognition
CN112581929B (en) Voice privacy density masking signal generation method and system based on generation countermeasure network
WO2022134351A1 (en) Noise reduction method and system for monophonic speech, and device and readable storage medium
ES2928736T3 (en) Low-level features compensated per channel for speaker recognition
CN114664318A (en) Voice enhancement method and system based on generation countermeasure network
Ma et al. Measuring dependence for permutation alignment in convolutive blind source separation
Cheng et al. Uniap: Protecting speech privacy with non-targeted universal adversarial perturbations
US20200219496A1 (en) Methods and systems for managing voice response systems based on signals from external devices
KR20220133993A (en) Learnable rate control of speech synthesis
Yu et al. Masker: Adaptive mobile security enhancement against automatic speech recognition in eavesdropping
US10418024B1 (en) Systems and methods of speech generation for target user given limited data
Xu et al. HAMPER: high-performance adaptive mobile security enhancement against malicious speech and image recognition
US20240135954A1 (en) Learning method for integrated noise echo cancellation system using multi-channel based cross-tower network
US20210342530A1 (en) Framework for Managing Natural Language Processing Tools
Zuo et al. Speaker-Specific Utterance Ensemble based Transfer Attack on Speaker Identification.
Wu et al. Catch Me If You Can: Blackbox Adversarial Attacks on Automatic Speech Recognition using Frequency Masking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant