CN112581929B - Voice privacy density masking signal generation method and system based on generation countermeasure network - Google Patents

Voice privacy density masking signal generation method and system based on generation countermeasure network

Info

Publication number
CN112581929B
CN112581929B (application CN202011450095.9A)
Authority
CN
China
Prior art keywords
signal
voice
generating
generator
target
Prior art date
Legal status
Active
Application number
CN202011450095.9A
Other languages
Chinese (zh)
Other versions
CN112581929A (en)
Inventor
李晔
冯涛
张鹏
李姝
汪付强
Current Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202011450095.9A priority Critical patent/CN112581929B/en
Publication of CN112581929A publication Critical patent/CN112581929A/en
Application granted granted Critical
Publication of CN112581929B publication Critical patent/CN112581929B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10KSOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a method and a system for generating a voice privacy masking signal based on a generative adversarial network, comprising the following steps: generating a random noise signal; and inputting the random noise signal into a trained generative adversarial network, whose trained generator produces a masking signal for protecting voice privacy. The masking signal has characteristics similar to the speech of a speaker in the conference room and high naturalness, while its content carries no useful meaning for an eavesdropper, so the eavesdropper is effectively disturbed. The invention not only solves the problems that common masking signals have low masking efficiency and may negatively affect the speakers, but also saves manpower and material resources and adapts well to different environments.

Description

Voice privacy density masking signal generation method and system based on generation countermeasure network
Technical Field
The present application relates to the field of speech signal processing technologies, and in particular, to a method and a system for generating a voice privacy masking signal based on a generative adversarial network.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
At this stage, many countries and companies have suffered leaks because they paid insufficient attention to conference room security. Conference room confidentiality is of great significance to national security, commercial security, and related areas, and protection against acoustic leakage is an important part of that work. If a company's business secrets are eavesdropped, the company may lose a bid or even collapse, and national interests may be harmed as well.
At present, sound masking technology is mainly used to protect the sound information in a confidential conference room; the main masking signals include white noise, sub-noise, and speech-like signals. Compared with a noise masking signal, a speech-like signal shares similar characteristics with the speech signal, is more confusing to a listener, and achieves a better masking effect.
At present, methods for generating speech-like masking signals mainly generate random text and then synthesize a speech-like signal with speech synthesis technology. This approach involves a large workload and consumes considerable manpower and material resources to gather statistics on the probabilities of characters, words, segments, and so on; meanwhile, the speech-like signals produced by existing methods have low naturalness and cannot track the characteristics of the speaker.
Disclosure of Invention
In order to overcome the defects of the prior art, the application provides a voice privacy masking signal generation method and system based on a generative adversarial network (GAN). The masking signal has characteristics similar to the speech of a speaker in the conference room and high naturalness, while its content carries no useful meaning for an eavesdropper, so the eavesdropper is effectively disturbed. The invention not only solves the problems that common masking signals have low masking efficiency and may negatively affect the speakers, but also saves manpower and material resources and adapts well to different environments.
In a first aspect, the present application provides a method for generating a voice privacy masking signal based on a generative adversarial network.
The voice privacy masking signal generation method based on a generative adversarial network comprises the following steps:
generating a random noise signal;
and inputting the random noise signal into a trained generative adversarial network, whose trained generator generates a masking signal for protecting voice privacy.
In a second aspect, the present application provides a system for generating a voice privacy masking signal based on a generative adversarial network.
The voice privacy masking signal generation system based on a generative adversarial network comprises:
a generation module configured to generate a random noise signal;
an output module configured to input the random noise signal into the trained generative adversarial network, whose trained generator generates a masking signal for protecting voice privacy.
In a third aspect, the present application further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the beneficial effects of this application are:
the method and the device fully consider the requirement of sound masking in the conference room, abandon the prior method of generating signals by similar voices, introduce a neural network and utilize the strong learning capacity of the neural network and the game idea of generating a confrontation network. The method enables the generation of a more disruptive masking signal that is of no practical significance.
Advantages of additional aspects of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application, and the description of the exemplary embodiments and illustrations of the application are intended to explain the application and are not intended to limit the application.
Fig. 1 is a block diagram of the GAN-based method for generating a masking signal for Chinese speech privacy.
Fig. 2 is a flow chart of the training stage of the GAN-based method for generating a masking signal for Chinese speech privacy.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular is intended to include the plural unless the context clearly dictates otherwise, and furthermore, it should be understood that the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Sound information can leak from a secure conference room in two main ways: active leakage and unintentional leakage.
Active leakage refers to leakage caused by eavesdropping equipment installed inside the conference room.
Unintentional leakage refers to sound that leaks during a conference in the form of air-conducted sound, structure-borne (solid-conducted) sound, and the like, and is heard by unauthorized people.
Specifically, the paths of unintentional sound leakage mainly include doors, windows, walls, various ducts, and so on.
The method proposed in the present application is mainly directed at the unintentional leakage of sound signals.
At present, sound masking technology is mostly adopted to protect against the unintentional leakage of sound signals. Specifically, an interference source is arranged at the positions and paths where sound may leak, and an interference signal is generated to mask the useful speech signal, thereby protecting against sound leakage.
Example one
The embodiment provides a voice privacy masking signal generation method based on a generative adversarial network.
As shown in fig. 1 and fig. 2, the method for generating a voice privacy masking signal based on a generative adversarial network includes:
s101: generating a random noise signal;
s102: inputting the random noise signal into the trained generative adversarial network, whose trained generator generates a masking signal for protecting voice privacy.
For example, after a masking signal for protecting voice privacy is produced inside the conference room, the signal obtained by an eavesdropper outside the conference room through the wall medium is a speech-like signal whose content is of no substantive value to the eavesdropper.
Illustratively, generating the random noise signal refers to generating a segment of random noise in the program through a NumPy (np) routine.
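As a minimal sketch of this step (the exact NumPy call is not given in the original, so the uniform distribution, segment length and batch size below are assumptions taken from the framing described later in this description):

```python
import numpy as np

def generate_random_noise(num_segments: int = 150, segment_len: int = 16384) -> np.ndarray:
    """Return uniformly distributed noise in [-1, 1], one row per 16384-sample segment."""
    return np.random.uniform(-1.0, 1.0, size=(num_segments, segment_len)).astype(np.float32)

noise = generate_random_noise()
print(noise.shape)  # (150, 16384)
```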
As one or more embodiments, the generative adversarial network includes:
a generator and a discriminator connected to each other.
Illustratively, the network configuration may be selected according to the size of the data set:
for small data sets, the generator and discriminator of the generative adversarial network (GAN) may be fully convolutional networks;
for large data sets, the generator and discriminator of the GAN may be convolutional neural networks.
As one or more embodiments, the step of training the generative adversarial network includes:
s102a1: constructing a training set; the training set is a library of target speech-like signals;
s102a2: inputting the random noise signal into the generator to obtain a speech-like signal generated by the generator;
s102a3: inputting the generated speech-like signal and the speech-like signals of the target library into the discriminator at the same time; the discriminator outputs the probability that the generated signal is a target speech-like signal, and the game between the generator and the discriminator improves the generator's ability to approximate the target speech-like signals, finally yielding the trained generative adversarial network.
Illustratively, the target speech-like signals are selected from the speech signals in the THCHS30 data set.
As one or more embodiments, as shown in fig. 2, the detailed steps of training the generative adversarial network include:
s102b1: initializing the generator to obtain an initialized generator;
s102b2: initializing the discriminator to obtain an initialized discriminator;
s102b3: optimizing the weights of the generator and the discriminator;
s102b4: repeating step s102b3 and judging whether the set number of iterations has been reached; if so, stopping training to obtain the trained generative adversarial network; if not, continuing training.
As one or more embodiments, after S101 (generating a random noise signal) and before S102 (inputting the random noise signal into the trained generative adversarial network, whose trained generator generates the masking signal for protecting voice privacy), the method further comprises:
s101-2: the speech signals in the data set are pre-processed.
Further, preprocessing the target speech signals specifically comprises the following steps (see the sketch after this list):
s101-21: performing pre-emphasis on the target speech signals;
s101-22: performing data normalization on the pre-emphasized signals.
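The following is an illustrative sketch of these two preprocessing steps; the pre-emphasis coefficient 0.95 follows the value quoted later in this description, the int16 scaling matches the -32767..32767 data range described for the data set, and the function names are our own:

```python
import numpy as np

def pre_emphasis(x: np.ndarray, coeff: float = 0.95) -> np.ndarray:
    """Pre-emphasis filter y[n] = x[n] - coeff * x[n-1]; boosts high-frequency content."""
    return np.append(x[0], x[1:] - coeff * x[:-1])

def normalize_int16(x: np.ndarray) -> np.ndarray:
    """Scale 16-bit PCM samples (range -32767..32767) into [-1, 1]."""
    return x.astype(np.float32) / 32767.0
```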
Further, S102b1 (initializing the generator to obtain an initialized generator) specifically comprises:
s102b11: taking out the random noise separately and adjusting its dimensions;
s102b12: determining the size, stride and padding mode of the two-dimensional convolution kernels, adjusting the dimensions after each two-dimensional convolution, and applying an activation function to the convolution result of each layer;
concatenating (splicing) the two-dimensional convolution result with Gaussian noise of the same size;
performing two-dimensional deconvolution on the concatenated result, where the result of each deconvolution layer uses an activation function;
s102b13: applying an activation function to the output of the last layer to obtain the generated speech-like signal.
Illustratively, in S102b11 the dimensions of the generated random noise are adjusted to four dimensions of size [150, 16384, 1, 1].
Illustratively, in S102b12 the batch size is set to 150 in this example; according to the number of channels of each layer of the convolutional neural network, the convolution kernel size is set to [31, 1, number of input channels, number of output channels], the stride is set to [1, 2, 1, 1], and the padding mode is SAME.
The sizes of the layers differ and are adjusted according to the convolution result of each neural network layer in the program. A convolution kernel size is four-dimensional: [kernel height, kernel width, number of input channels, number of output channels]. In both the convolution and deconvolution stages the kernel height is 31 and the kernel width is 1; the number of input channels equals the number of output channels of the previous layer. The numbers of output channels per layer are [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024] in the encoder stage and [1024, 512, 256, 128, 64, 64, 32, 1] in the decoder stage; the stride of each layer is [1, 2, 1, 1], and the padding mode is SAME.
Illustratively, in S102b13 a tanh activation function is used, formulated as tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
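The generator structure described in S102b11–S102b13 can be sketched roughly as follows. This is a hedged reconstruction, not the patented implementation: tf.keras layers stand in for the underlying TensorFlow operations, the (31, 1) kernels, (2, 1) strides and SAME padding follow the values above, the encoder channel list follows the numbers quoted above, and the decoder is assumed to mirror the encoder so that the output length equals the 16384-sample input (the decoder list printed above may be incomplete):

```python
import tensorflow as tf

ENCODER_CHANNELS = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]
# Assumption: mirror the encoder so that 11 stride-2 upsampling layers restore
# the original 16384-sample length; the final layer has a single output channel.
DECODER_CHANNELS = [512, 256, 256, 128, 128, 64, 64, 32, 32, 16, 1]

def build_generator(segment_len: int = 16384) -> tf.keras.Model:
    noise_in = tf.keras.Input(shape=(segment_len, 1, 1), name="random_noise")

    # Encoder: stacked strided (31, 1) convolutions with SAME padding.
    x = noise_in
    for ch in ENCODER_CHANNELS:
        x = tf.keras.layers.Conv2D(ch, (31, 1), strides=(2, 1), padding="same")(x)
        x = tf.keras.layers.LeakyReLU()(x)

    # "Splice" (concatenate) the encoded representation with Gaussian noise
    # of the same size, as in step S102b12.
    x = tf.keras.layers.Lambda(
        lambda t: tf.concat([t, tf.random.normal(tf.shape(t))], axis=-1))(x)

    # Decoder: transposed (deconvolution) layers, activation after each layer.
    for i, ch in enumerate(DECODER_CHANNELS):
        x = tf.keras.layers.Conv2DTranspose(ch, (31, 1), strides=(2, 1), padding="same")(x)
        if i < len(DECODER_CHANNELS) - 1:
            x = tf.keras.layers.LeakyReLU()(x)

    # S102b13: tanh on the last layer yields the speech-like signal in [-1, 1].
    out = tf.keras.layers.Activation("tanh", name="speech_like")(x)
    return tf.keras.Model(noise_in, out)
```

The use of LeakyReLU for the intermediate activations is likewise an assumption; the description only states that an activation function is applied per layer.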
Further, S102b2 (initializing the discriminator to obtain an initialized discriminator) specifically comprises:
s102b21: defining a target speech-like signal as the initial w sequence;
s102b22: creating a Gaussian noise sequence with the same dimensions and size as the initial w sequence, and adding it to the initial w sequence to obtain the first w sequence;
s102b23: adjusting the dimensions of the first w sequence; determining the convolution kernel size, stride, padding mode and so on of the two-dimensional convolution layers; performing virtual batch normalization on the w sequence after each two-dimensional convolution and applying an activation function to the batch-normalization result; after eleven two-dimensional convolutions, the second w sequence is obtained;
s102b24: performing a one-dimensional convolution on the second w sequence and feeding the result into the fully connected layer, which outputs the probability that the input is real data (a value close to 1 indicates real data).
Illustratively, in S102b22 a Gaussian noise sequence with the same dimensions and size as the w sequence is created and added to the w sequence to obtain a new w sequence; the mean of the Gaussian noise is zero and its variance is 0.5.
Illustratively, in S102b23 the parameters are the same as in the generator initialization stage, and the purpose of virtual batch normalization is to accelerate the convergence of the model.
The stride is [1, 2, 1, 1] and the padding mode is SAME. As in the encoder of the generator, the kernel height is 31 and the kernel width is 1 in both convolution and deconvolution; the number of input channels equals the number of output channels of the previous layer, and the numbers of output channels of the neural network layers are:
[16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024].
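A rough sketch of this discriminator is given below, under stated assumptions: ordinary batch normalization stands in for virtual batch normalization, a 1×1 convolution stands in for the one-dimensional convolution before the fully connected layer, and LeakyReLU and the sigmoid output are our own choices; only the (31, 1) kernels, (2, 1) strides, SAME padding, channel widths and the added Gaussian noise (variance 0.5) come from the description above:

```python
import tensorflow as tf

DISC_CHANNELS = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]

def build_discriminator(segment_len: int = 16384) -> tf.keras.Model:
    w_in = tf.keras.Input(shape=(segment_len, 1, 1), name="w_sequence")

    # S102b22: add zero-mean Gaussian noise with variance 0.5 (stddev ~0.707);
    # the Keras layer applies it in training mode only.
    x = tf.keras.layers.GaussianNoise(stddev=0.7071)(w_in)

    # S102b23: eleven strided (31, 1) convolutions with normalization + activation.
    for ch in DISC_CHANNELS:
        x = tf.keras.layers.Conv2D(ch, (31, 1), strides=(2, 1), padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)  # stand-in for virtual batch norm
        x = tf.keras.layers.LeakyReLU()(x)

    # S102b24: a final convolution, then a fully connected layer that outputs the
    # probability that the input is a target (real) speech-like signal.
    x = tf.keras.layers.Conv2D(1, (1, 1), padding="same")(x)
    x = tf.keras.layers.Flatten()(x)
    prob = tf.keras.layers.Dense(1, activation="sigmoid", name="p_target")(x)
    return tf.keras.Model(w_in, prob)
```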
Further, S102b3 (optimizing the weights of the generator and the discriminator) specifically comprises:
s102b31: the discriminator takes the speech in the data set as real data; by performing the operations of the discriminator initialization stage, it outputs the probability that the input is "real", which represents real data.
The discriminator also takes the speech-like data generated by the generator as fake data; by performing the operations of the initialization stage, it outputs the probability that the input is "fake", which represents fake data.
The loss value of the discriminator's loss function is then calculated;
s102b32: according to the loss value of the generator, updating the convolution and deconvolution kernel weights of the generator initialization as well as the γ and β values of batch normalization.
According to the loss value of the discriminator, updating the convolution and deconvolution kernel weights of the discriminator initialization as well as the γ and β values of virtual batch normalization.
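The loss formula itself is only given as an image in the original document, so the following training-step sketch assumes a standard GAN binary cross-entropy loss together with the RMSProp optimizer named in S102a12; the batch-norm γ and β values are updated automatically because they belong to each model's trainable variables. `generator` and `discriminator` refer to the sketches above:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.RMSprop(learning_rate=1e-4)  # learning rate is an assumption
d_opt = tf.keras.optimizers.RMSprop(learning_rate=1e-4)

@tf.function
def train_step(generator, discriminator, real_speech, noise):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_speech = generator(noise, training=True)
        p_real = discriminator(real_speech, training=True)
        p_fake = discriminator(fake_speech, training=True)
        # Discriminator: push real data toward "true" (1), generated data toward "false" (0).
        d_loss = bce(tf.ones_like(p_real), p_real) + bce(tf.zeros_like(p_fake), p_fake)
        # Generator: try to make the discriminator label generated speech as "true".
        g_loss = bce(tf.ones_like(p_fake), p_fake)
    # Update kernel weights and batch-norm gamma/beta of both networks.
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss
```

This step is repeated for the set number of iterations, as in S102b4.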
Further, S102a1 (constructing the training set, the training set being a library of target speech-like signals) comprises the following specific steps:
s102a11: integrating the THCHS30 data set into tfrecords files, and labeling the target speech data in the files as the wav class;
s102a12: determining the optimizer of the generative adversarial network, and reading the target speech from the tfrecords file;
s102a13: changing the amplitude of the target speech, and applying pre-emphasis in the range 0.9-1 to the target speech;
s102a14: putting the target speech into a queue and, in each iteration, taking out the required batch of target speech together with the random noise generated by the program.
It will be appreciated that the optimizer acts to update and calculate the network parameters that affect the model training and model output to approximate or reach optimal values.
In step S102a11, the data type in the tfrecords file is int, with values ranging from -32767 to 32767. The sampling rate of the input data set is 16 kHz, so each example is set to 16384 samples; this length is not a limitation and can be adjusted according to the data sampling rate.
Illustratively, in the step S102a12, the optimizer is determined to be RMSProp.
It should be understood that the amplitude range of the random noise and the clean speech is scaled to [-1, 1] to prevent problems such as gradient explosion, and a pre-emphasis coefficient of 0.95 is used so that the high-frequency characteristics of the speech are better represented.
Illustratively, in step S102a14 each batch consists of 150 segments of 16384 sample points.
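An illustrative tf.data input pipeline for these steps is sketched below; the feature key "wav", the int64 storage type and the shuffle buffer are assumptions (the exact tfrecords schema is not given), while the 16384-sample examples, the [-1, 1] scaling, the 0.95 pre-emphasis and the batch size of 150 follow the description above:

```python
import tensorflow as tf

def parse_example(serialized):
    # Assumption: each example stores one 16384-sample segment under the key "wav".
    features = {"wav": tf.io.FixedLenFeature([16384], tf.int64)}
    parsed = tf.io.parse_single_example(serialized, features)
    wav = tf.cast(parsed["wav"], tf.float32) / 32767.0                 # scale to [-1, 1]
    wav = tf.concat([wav[:1], wav[1:] - 0.95 * wav[:-1]], axis=0)      # 0.95 pre-emphasis
    return tf.reshape(wav, (16384, 1, 1))

def make_dataset(tfrecord_path: str, batch_size: int = 150) -> tf.data.Dataset:
    return (tf.data.TFRecordDataset(tfrecord_path)
            .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(1000)
            .batch(batch_size, drop_remainder=True)
            .prefetch(tf.data.AUTOTUNE))
```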
Random noise z is input into the trained generator, which produces a speech-like signal through the following process (a code sketch follows these steps):
1. Read the random noise file and judge whether the sampling rate is 16 kHz. If yes, go to the next step; if not, end.
2. Load the weights of the trained model.
3. Convert the amplitude of the read data to the range -1 to 1.
The input random noise is 16-bit, i.e. its amplitude lies in the range -32767 to 32767, so dividing the random noise by 32767 changes the amplitude range to -1 to 1.
4. Determine the data size with a Python instruction.
5. Feed the data to the generator in segments of 16384 samples and save the generated result.
Each input consists of 16384 samples, which corresponds to approximately one second at a sampling rate of 16 kHz.
6. Write the saved data into a wav file.
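A hedged end-to-end sketch of this inference procedure is given below; reading and writing wav files with scipy.io.wavfile, the file names and the int16 output conversion are our own choices, the trained weights (step 2) are assumed to be loaded into `generator` beforehand, and the 16 kHz check, the division by 32767 and the 16384-sample segmentation follow the steps above:

```python
import numpy as np
from scipy.io import wavfile

def generate_masking_signal(generator, noise_path="noise.wav", out_path="masking.wav"):
    rate, noise = wavfile.read(noise_path)
    if rate != 16000:                                   # step 1: require 16 kHz input
        raise ValueError("expected a 16 kHz noise file")
    noise = noise.astype(np.float32) / 32767.0          # step 3: scale amplitude to [-1, 1]
    segments = []
    for start in range(0, len(noise) - 16383, 16384):   # step 5: ~1 s chunks of 16384 samples
        chunk = noise[start:start + 16384].reshape(1, 16384, 1, 1)
        segments.append(generator.predict(chunk).reshape(-1))
    masking = np.concatenate(segments)
    wavfile.write(out_path, 16000, (masking * 32767).astype(np.int16))  # step 6: save wav
```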
The method is based on a generative adversarial network. A segment of random noise defined by the program is converted by the generator's multilayer convolutional neural network into an interfering (speech-like) output. The inputs of the discriminator are the speech-like signal produced by the generator and target speech signals formed from data in a known data set; the discriminator judges, through its multilayer convolutional neural network, the probability that its input is a target speech signal. The mutual game between the generator and the discriminator improves the ability of the generated speech-like signal to approximate the target signal. The masking signal obtained in this way has higher naturalness and is smoother than traditional masking signals, which further increases the confusion caused by the interfering speech and therefore the confidentiality and security of the conference room.
According to the method, with the generative adversarial network technique, an input clean speech signal is converted into a speech-like output by the generator's multilayer convolutional neural network; the inputs of the discriminator are the interfering speech produced by the generator and the target signal, and the discriminator judges, through its multilayer convolutional neural network, the probability that the input is the target signal. The mutual game between the generator and the discriminator improves the generator's ability to approximate the target signal. It should be noted that the generative adversarial network described in the present application is not limited to the convolutional neural network of the example; it may also use a fully convolutional neural network, a recurrent neural network, and the like.
Example two
The embodiment provides a voice privacy masking signal generation system based on a generative adversarial network.
The voice privacy masking signal generation system based on a generative adversarial network comprises:
a generation module configured to generate a random noise signal;
an output module configured to input the random noise signal into the trained generative adversarial network, whose trained generator generates a masking signal for protecting voice privacy.
It should be noted here that the generation module and the output module correspond to steps S101 to S102 in the first embodiment; the modules share the examples and application scenarios of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the modules described above, as parts of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
Example three
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (7)

1. A method for generating a voice privacy masking signal based on a generative adversarial network, characterized by comprising the following steps:
generating a random noise signal;
inputting the random noise signal into a trained generative adversarial network, a trained generator of the generative adversarial network generating a masking signal for protecting voice privacy, wherein the masking signal is a speech-like signal;
wherein the step of training the generative adversarial network comprises:
constructing a training set, the training set being a library of target speech-like signals;
inputting the random noise signal into the generator to obtain a speech-like signal generated by the generator;
inputting the generated speech-like signal and the speech-like signals of the target library into the discriminator at the same time, the discriminator outputting the probability that the generated speech-like signal is a target speech-like signal, the game between the generator and the discriminator improving the ability of the generated speech-like signal to approximate the target speech-like signals, and finally obtaining the trained generative adversarial network.
2. The method for generating a voice privacy masking signal based on a generative adversarial network according to claim 1, wherein the step of training the generative adversarial network comprises:
initializing a generator to obtain an initialized generator;
initializing the discriminator to obtain an initialized discriminator;
optimizing the weights of the generator and the discriminator;
repeating the step of optimizing the weights of the generator and the discriminator, and judging whether the set number of iterations has been reached; if so, stopping training to obtain the trained generative adversarial network; if not, continuing training;
wherein initializing the generator to obtain the initialized generator specifically comprises:
taking out the random noise separately and adjusting its dimensions;
determining the size, stride and padding mode of the two-dimensional convolution kernels, adjusting the dimensions after each two-dimensional convolution, and applying an activation function to the convolution result of each layer; concatenating the two-dimensional convolution result with Gaussian noise of the same size; performing two-dimensional deconvolution on the concatenated result, the result of each deconvolution layer using an activation function;
applying an activation function to the output of the last layer to obtain the generated speech-like signal;
wherein initializing the discriminator to obtain the initialized discriminator specifically comprises:
defining a target speech-like signal as an initial w sequence;
creating a Gaussian noise sequence with the same dimensions and size as the initial w sequence, and adding it to the initial w sequence to obtain a first w sequence;
adjusting the dimensions of the first w sequence; determining the convolution kernel size, stride and padding mode of the two-dimensional convolution layers, performing two-dimensional convolution on the first w sequence, performing virtual batch normalization on the w sequence generated after the convolution, applying an activation function to the batch-normalization result, and obtaining a second w sequence after eleven two-dimensional convolutions;
performing a one-dimensional convolution on the second w sequence and feeding the result into the fully connected layer to output the probability that the input is real data, the probability value being close to 1 for real data.
3. The method for generating a voice privacy masking signal based on a generative adversarial network according to claim 1, wherein, after the step of generating a random noise signal and before the step of inputting the random noise signal into the trained generative adversarial network, whose trained generator generates the masking signal for protecting voice privacy, the method further comprises:
preprocessing a target speech signal;
wherein preprocessing the target speech signal specifically comprises:
performing pre-emphasis on the target speech signal;
performing data normalization on the pre-emphasized signal.
4. The method for generating a voice privacy masking signal based on a generative adversarial network according to claim 1, wherein constructing the training set, the training set being a library of target speech-like signals, specifically comprises:
integrating the THCHS30 data set into tfrecords files, and labeling the target speech data in the files as the wav class;
determining the optimizer of the generative adversarial network, and reading the target speech from the tfrecords file;
changing the amplitude of the target speech, and applying pre-emphasis in the range 0.9-1 to the target speech;
putting the target speech into a queue, and taking out the required target speech and the random noise generated by the program each time.
5. A system for generating a voice privacy masking signal based on a generative adversarial network, characterized by comprising:
a generation module configured to generate a random noise signal;
an output module configured to input the random noise signal into a trained generative adversarial network, a trained generator of the generative adversarial network generating a masking signal for protecting voice privacy, wherein the masking signal is a speech-like signal;
wherein the step of training the generative adversarial network comprises:
constructing a training set, the training set being a library of target speech-like signals;
inputting the random noise signal into the generator to obtain a speech-like signal generated by the generator;
inputting the generated speech-like signal and the speech-like signals of the target library into the discriminator at the same time, the discriminator outputting the probability that the generated speech-like signal is a target speech-like signal, the game between the generator and the discriminator improving the ability of the generated speech-like signal to approximate the target speech-like signals, and finally obtaining the trained generative adversarial network.
6. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-4.
7. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 4.
CN202011450095.9A 2020-12-11 2020-12-11 Voice privacy density masking signal generation method and system based on generation countermeasure network Active CN112581929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011450095.9A CN112581929B (en) 2020-12-11 2020-12-11 Voice privacy density masking signal generation method and system based on generation countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011450095.9A CN112581929B (en) 2020-12-11 2020-12-11 Voice privacy density masking signal generation method and system based on generation countermeasure network

Publications (2)

Publication Number Publication Date
CN112581929A CN112581929A (en) 2021-03-30
CN112581929B (en) 2022-06-03

Family

ID=75131264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011450095.9A Active CN112581929B (en) 2020-12-11 2020-12-11 Voice privacy density masking signal generation method and system based on generation countermeasure network

Country Status (1)

Country Link
CN (1) CN112581929B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
CN110808057A (en) * 2019-10-31 2020-02-18 南昌航空大学 Voice enhancement method for generating confrontation network based on constraint naive
WO2020232180A1 (en) * 2019-05-14 2020-11-19 Dolby Laboratories Licensing Corporation Method and apparatus for speech source separation based on a convolutional neural network

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102014214052A1 (en) * 2014-07-18 2016-01-21 Bayerische Motoren Werke Aktiengesellschaft Virtual masking methods
CN107945811B (en) * 2017-10-23 2021-06-01 北京大学 Frequency band expansion-oriented generation type confrontation network training method and audio encoding and decoding method
CN108681774B (en) * 2018-05-11 2021-05-14 电子科技大学 Human body target tracking method based on generation of confrontation network negative sample enhancement
US20200075035A1 (en) * 2018-08-30 2020-03-05 Thomas Garth, III Sound Shield Device
CN109492416B (en) * 2019-01-07 2022-02-11 南京信息工程大学 Big data image protection method and system based on safe area
US20200242771A1 (en) * 2019-01-25 2020-07-30 Nvidia Corporation Semantic image synthesis for generating substantially photorealistic images using neural networks
US12001950B2 (en) * 2019-03-12 2024-06-04 International Business Machines Corporation Generative adversarial network based audio restoration
CN109859737A (en) * 2019-03-28 2019-06-07 深圳市升弘创新科技有限公司 Communication encryption method, system and computer readable storage medium
CN111261147B (en) * 2020-01-20 2022-10-11 浙江工业大学 Music embedding attack defense method for voice recognition system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
WO2020232180A1 (en) * 2019-05-14 2020-11-19 Dolby Laboratories Licensing Corporation Method and apparatus for speech source separation based on a convolutional neural network
CN110808057A (en) * 2019-10-31 2020-02-18 南昌航空大学 Voice enhancement method for generating confrontation network based on constraint naive

Also Published As

Publication number Publication date
CN112581929A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
Alzantot et al. Did you hear that? adversarial examples against automatic speech recognition
GB2567703A (en) Secure voice biometric authentication
US9589560B1 (en) Estimating false rejection rate in a detection system
CN102938254A (en) Voice signal enhancement system and method
US20210264906A1 (en) Intent recognition model creation from randomized intent vector proximities
US20070083361A1 (en) Method and apparatus for disturbing the radiated voice signal by attenuation and masking
US11545136B2 (en) System and method using parameterized speech synthesis to train acoustic models
US20230056680A1 (en) Integrating dialog history into end-to-end spoken language understanding systems
US20210343292A1 (en) Techniques for converting natural speech to programming code
US11335329B2 (en) Method and system for generating synthetic multi-conditioned data sets for robust automatic speech recognition
CN112581929B (en) Voice privacy density masking signal generation method and system based on generation countermeasure network
WO2022134351A1 (en) Noise reduction method and system for monophonic speech, and device and readable storage medium
ES2928736T3 (en) Low-level features compensated per channel for speaker recognition
CN114664318A (en) Voice enhancement method and system based on generation countermeasure network
Ma et al. Measuring dependence for permutation alignment in convolutive blind source separation
Cheng et al. Uniap: Protecting speech privacy with non-targeted universal adversarial perturbations
US20200219496A1 (en) Methods and systems for managing voice response systems based on signals from external devices
KR20220133993A (en) Learnable rate control of speech synthesis
Yu et al. Masker: Adaptive mobile security enhancement against automatic speech recognition in eavesdropping
US10418024B1 (en) Systems and methods of speech generation for target user given limited data
Xu et al. HAMPER: high-performance adaptive mobile security enhancement against malicious speech and image recognition
US20240135954A1 (en) Learning method for integrated noise echo cancellation system using multi-channel based cross-tower network
US20210342530A1 (en) Framework for Managing Natural Language Processing Tools
Zuo et al. Speaker-Specific Utterance Ensemble based Transfer Attack on Speaker Identification.
Wu et al. Catch Me If You Can: Blackbox Adversarial Attacks on Automatic Speech Recognition using Frequency Masking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant