CN117877501A - Voice interference noise authorization elimination method based on neural network - Google Patents

Voice interference noise authorization elimination method based on neural network

Info

Publication number
CN117877501A
CN117877501A (application CN202311818377.3A)
Authority
CN
China
Prior art keywords
noise
voice
neural network
signal
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311818377.3A
Other languages
Chinese (zh)
Inventor
王庆龙
黄鹏
巴钟杰
程鹏
秦湛
任奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZJU Hangzhou Global Scientific and Technological Innovation Center
Original Assignee
ZJU Hangzhou Global Scientific and Technological Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZJU Hangzhou Global Scientific and Technological Innovation Center filed Critical ZJU Hangzhou Global Scientific and Technological Innovation Center
Priority to CN202311818377.3A priority Critical patent/CN117877501A/en
Publication of CN117877501A publication Critical patent/CN117877501A/en
Pending legal-status Critical Current


Abstract

The invention discloses a voice interference noise authorization elimination method based on a neural network, which comprises the following steps: step 1, acquiring a voice data set; step 2, collecting voice noise data in various forms; step 3, introducing random time offsets into the noise data; step 4, introducing random impulse response characteristics into the noise data; step 5, adding noise from the noise data set to the collected voice data; and step 6, training and using the noise elimination neural network model. The method provided by the invention can realize the authorized, directional removal of interference noise in different forms so as to obtain high-quality voice data.

Description

Voice interference noise authorization elimination method based on neural network
Technical Field
The invention belongs to the field of voice privacy protection, and particularly relates to a voice interference noise authorization elimination method based on a neural network.
Background
Microphones have become ubiquitous in modern society and are widely embedded in commercial electronic products such as smart phones and smart speakers. Such ubiquitous microphone devices may be used by attackers to collect users' voice information, heightening ordinary users' concerns about voice privacy. Meanwhile, because microphone devices have a black-box characteristic, it is difficult to identify an attacker who eavesdrops by using, compromising, or misconfiguring a microphone. In fact, with automatic speech recognition techniques, an attacker can extract personal information from a user's voice data and thereby infringe their privacy.
The severity of voice privacy leakage caused by manipulated microphone devices has drawn widespread attention from device and service providers as well as ordinary users, and has created an urgent demand for anti-eavesdropping technology. Typical prior-art solutions include using electromagnetic interference to induce noise in the microphone's internal circuitry; however, this approach may affect other nearby electronic devices, such as cardiac pacemakers, with unpredictable consequences. Other methods include a speech interference method based on a microphone side channel, found by Roy, Zhang et al. in 2017, which injects audible sound signals into a microphone using ultrasound that is inaudible to humans. This approach is considered a more practical speech interference solution because of its moderate transmission distance and negligible impact on other users and devices in the vicinity. Building on this work, other researchers have devised wearable devices equipped with multiple ultrasonic emission probes that interfere with microphone devices in a recording environment by continuously emitting ultrasonic waves.
A significant disadvantage of these methods is that they mostly use simple forms of noise, such as Gaussian noise and random-frequency noise, relying only on high power to achieve audio interference. Such simple noise can easily be removed with existing techniques, so privacy leakage cannot be completely prevented. For other noise interference methods based on human speech, the added noise may likewise be removed from the recording by an attacker using speech separation techniques, since these methods use continuous speech signals. Therefore, more complex speech interference noise generation methods have emerged in recent years.
Patent document CN117059073A provides a voice recognition information acquisition method, including: S1, acquiring a voice signal; S2, converting the voice signal into an electric signal; S3, preprocessing the electric signal, by first performing energy normalization, i.e. adjusting the amplitude of the signal so that it has a consistent amplitude at the same energy, which helps eliminate differences caused by different recording volumes, then performing endpoint detection with a short-time energy method to identify active segments in the voice signal and remove silent segments, and finally applying principal component analysis (PCA) to reduce the dimensionality of the voice signal while retaining the components that best represent the voice features, thereby reducing the complexity of subsequent computation; S4, extracting features; S5, optimizing efficiency and precision; S6, data matching; S7, automatic updating; and S8, error correction based on a language model.
However, the interference noise added by the above methods affects authorized users and attackers alike; that is, authorized recording devices are disturbed at the same time as unauthorized ones, so an authorized user cannot make an authorized recording while avoiding voice privacy leakage. This problem severely limits the application scenarios of the above interference noise generation methods and cannot meet practical requirements for voice privacy protection and security.
Disclosure of Invention
The invention aims to provide a voice interference noise authorization elimination method based on a neural network, which can realize the authorized, directional elimination of interference noise in different forms so as to obtain high-quality voice data.
In order to achieve the above object of the present invention, there is provided a voice interference noise authorization cancellation method based on a neural network, comprising the steps of:
step 1, acquiring a voice data set which comprises different speakers and corresponding speaking contents;
step 2, collecting different audio noise data sets;
step 3, annotating the audio noise data set and the voice data set with corresponding time stamp information to obtain an initial noise training data set, and then introducing random time offsets into the initial noise training data set according to the audio sampling rate;
step 4, randomly selecting impulse responses from a plurality of public acoustic data sets and fusing the impulse responses with an initial noise training data set after introducing random time offset;
step 5, randomly fusing the noise data processed in the step 4 with the voice data to obtain a voice training data set containing noise;
and 6, taking the initial noise training data set obtained in the step 3 and the voice training set obtained in the step 5 as inputs of a pre-constructed noise elimination neural network, and taking the voice data set in the step 1 as an output evaluation target to obtain a noise elimination model for filtering voice signal noise.
Specifically, the voice data set is obtained with data expansion, which enlarges the voice data of under-represented speakers by means of timbre migration.
Specifically, the timbre migration process is as follows: for a speaker A with little data in the data set, a neural network is used to extract speaker A's voiceprint. The voiceprint-extraction network is expressed as e = f(x), where x is a voice signal longer than 1.6 seconds and e is the output one-dimensional voiceprint feature. The voice data of any other speaker, together with speaker A's voiceprint feature e, is then fed into the timbre-migration neural network, yielding a large amount of voice data that matches speaker A's timbre.
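A minimal sketch of this augmentation step is given below. It assumes a d-vector-style recurrent speaker encoder as the voiceprint network e = f(x) and a simple frame-wise conditioning network as the timbre-migration model; the module names, layer sizes, and the mel-spectrogram inputs are illustrative stand-ins, since the patent does not specify the concrete architectures.

```python
# Sketch of the timbre-migration augmentation (all modules are placeholders).
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Maps a speech feature sequence to a one-dimensional voiceprint e = f(x)."""
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, emb_dim, num_layers=3, batch_first=True)

    def forward(self, mel):                       # mel: (batch, frames, n_mels), >= 1.6 s of audio
        _, (h, _) = self.lstm(mel)
        e = h[-1]                                 # last hidden state as the voiceprint
        return e / e.norm(dim=-1, keepdim=True)   # L2-normalised embedding

class VoiceConverter(nn.Module):
    """Placeholder timbre-migration network: re-synthesises input speech features
    conditioned on the voiceprint e (any-to-one conversion)."""
    def __init__(self, n_mels=80, emb_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels))

    def forward(self, mel, e):                    # condition every frame on the voiceprint
        e_tiled = e.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.net(torch.cat([mel, e_tiled], dim=-1))

# Usage: extract speaker A's voiceprint once, then convert any other utterance.
enc_spk, vc = SpeakerEncoder(), VoiceConverter()
mel_A = torch.randn(1, 200, 80)                   # ~2 s of speaker A (stand-in features)
mel_other = torch.randn(1, 400, 80)               # utterance from another speaker
e_A = enc_spk(mel_A)
augmented = vc(mel_other, e_A)                    # pseudo data in speaker A's timbre
```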
Specifically, the noise signal includes single-frequency noise, sweep noise, or voice noise.
Specifically, in addition to the existing noise types, interference noise based on human voice phonemes is also generated, which ensures the denoising network's ability to remove complex noise.
Specifically, in step 5, the expression of the speech signal after adding noise in the speech training dataset is as follows:
r(t) = s(t) + α·h₁(t)*n(t) + a(t)
wherein r(t) represents the voice signal after noise is added, h₁(t) represents the impulse response of signal propagation in different environments, a(t) represents the ambient noise, α represents the amplification factor of the noise, s(t) represents the clean speech signal, n(t) represents the audio noise, and * denotes convolution.
Specifically, the speech and the noise are uniformly resampled to a 16 kHz sampling rate, so the time stamp of the noise is accurate to a single sampling point at 16 kHz, i.e. 1/16000 s = 0.0625 ms.
Specifically, in step 3, a random offset is introduced into the noise signal, with the offset interval [-fs, fs], where fs represents the audio sampling rate. Introducing the random offset into the training data forces the model, through optimization of the training loss function, to automatically align the noise-added recording with the noise signal in the time domain.
Specifically, in step 4, a candidate pool is constructed from a plurality of public acoustic CIR data sets, and an impulse response is randomly selected from it in each model training step. Introducing rich CIR data during training encourages the model to learn impulse response characteristics and to eliminate the impact of the impulse response when generating the mask.
Specifically, the noise cancellation neural network is constructed based on a Transformer architecture.
Specifically, the noise cancellation neural network comprises an encoder, a mask network and a decoder;
the encoder is used for converting an input voice signal into a two-dimensional characteristic diagram;
the mask network comprises two Sepformer modules for learning the short-term and long-term audio features contained in the audio feature-map representation, such that the mask network can capture the inherent relationship between the feature maps of the recorded signal with added interference noise and of the noise signal;
the decoder adopts a model framework symmetrical to the encoder and is used for recovering the audio feature map processed by the mask network into a clean audio signal with noise removed.
Specifically, a loss function is used to train the noise cancellation neural network.
Specifically, the loss function is designed based on the scale-invariant signal-to-noise ratio (SI-SNR), and its expression is as follows:
L = -10·log₁₀( ||s_target||² / ||ŝ - s_target||² ), with s_target = (ŝᵀs / ||s||²)·s
wherein s is the target clean speech during training, ŝ is the estimated clean speech output by the neural network, T denotes the matrix transpose, and ||x|| denotes the L2 norm of x.
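A compact PyTorch version of this loss is sketched below; the zero-mean normalization and the eps constant are common implementation details assumed here rather than taken from the patent.

```python
# Negative scale-invariant SNR between estimated and target clean speech.
import torch

def si_snr_loss(est, target, eps=1e-8):
    est = est - est.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target: s_target = (<s_hat, s> / ||s||^2) * s
    dot = torch.sum(est * target, dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    si_snr = 10 * torch.log10(
        s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps) + eps)
    return -si_snr.mean()    # minimise the negative SI-SNR
```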
Compared with the prior art, the invention has the beneficial effects that:
based on the existing voice interference noise generation technology, a training data time domain random offset processing algorithm is adopted to improve the robustness of a noise elimination neural network model, and the universality of noise elimination is stronger.
Drawings
Fig. 1 is a flowchart of a voice interference noise authorization elimination method based on a neural network according to the present embodiment;
fig. 2 is a block diagram of a design for removing interference noise through a neural network model according to the present embodiment;
FIG. 3 is a comparison of the recognizability of noise frequencies after processing by different algorithms in the digital domain provided by the present embodiment;
fig. 4 is a comparison of the recognizability of noise frequencies after processing by different algorithms in the real world scene provided by the present embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, a voice interference noise authorization elimination method based on a neural network includes the following steps:
and 1, constructing a voice data set S.
The speech data set should be as rich as possible, covering speakers of different ages, genders, accents, and moods, and the speaking content should likewise be as varied as possible. Public data sets such as LibriSpeech and GigaSpeech may be utilized, as in the sketch below.
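As one possible way to assemble such a data set, the sketch below pulls a LibriSpeech split through torchaudio's dataset wrapper; the root directory and split name are assumptions, and GigaSpeech or other corpora would be added analogously.

```python
# Loading a public clean-speech corpus for data set S (illustrative paths).
import torchaudio

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data", url="train-clean-100", download=True)   # ~6 GB download

# Each item: (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(speaker_id, sample_rate, transcript[:40])             # varied speakers at 16 kHz
```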
Step 2, collecting different audio noise data sets.
A variety of audio noise data in different forms is collected, including single-frequency noise, swept-frequency noise, speech noise, and the like. In addition, noise based on voice phoneme fragments is generated: a large number of utterances in the data set are first segmented into phoneme-level fragments, and phonemes are randomly selected from these fragments and spliced together, yielding phoneme noise whose structure is highly similar to that of real speech.
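The phoneme-splicing procedure can be sketched as follows. It assumes phoneme boundaries are already available, for example from a forced aligner such as the Montreal Forced Aligner; the function name and the (start, end) boundary format are illustrative assumptions.

```python
# Splicing random phoneme fragments into a speech-like noise signal.
import random
import numpy as np

def phoneme_noise(utterances, boundaries, target_len, seed=None):
    """utterances : list of 1-D numpy waveforms
    boundaries : list (per utterance) of (start, end) sample indices of phonemes
    target_len : desired noise length in samples"""
    rng = random.Random(seed)
    pieces, total = [], 0
    while total < target_len:
        i = rng.randrange(len(utterances))
        start, end = rng.choice(boundaries[i])
        frag = utterances[i][start:end]
        if len(frag) == 0:
            continue                      # skip degenerate fragments
        pieces.append(frag)
        total += len(frag)
    return np.concatenate(pieces)[:target_len]
```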
Step 3, annotating the audio noise data set and the voice data set with corresponding time stamp information to obtain an initial noise training data set N, and introducing random time offsets into the noise according to the audio sampling rate.
A random offset is introduced into the added noise signal according to the audio sampling rate fs, with the offset interval [-fs, fs]. The random offset is introduced so that the distribution of the voice data received in the model operation stage is as close as possible to that of the voice data seen in the training stage.
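A possible implementation of the random time offset is sketched below; it assumes the shift is realized by zero-padding and cropping rather than circular rotation, which the description does not specify.

```python
# Random time-domain offset of the noise track, drawn from [-fs, fs] samples.
import numpy as np

def random_time_offset(noise, fs, rng=None):
    """Shift a 1-D noise array by a random number of samples in [-fs, fs]."""
    if rng is None:
        rng = np.random.default_rng()
    shift = int(rng.integers(-fs, fs + 1))    # fs = sampling rate in Hz, e.g. 16000
    if shift > 0:                             # delay the noise: pad in front, crop the tail
        return np.concatenate([np.zeros(shift), noise])[:len(noise)]
    if shift < 0:                             # advance the noise: crop the front, pad the tail
        return np.concatenate([noise[-shift:], np.zeros(-shift)])
    return noise
```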
Step 4, randomly selecting impulse responses from a plurality of public acoustic data sets H and fusing them with the noise training data after the random time offsets have been introduced.
An impulse response is randomly selected from a plurality of public acoustic CIR data sets. The introduced random impulse response features help the model adaptively learn the impulse response of the input audio when generating the mask during training.
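The per-example impulse response fusion might look like the following sketch, assuming the public CIR data sets have already been loaded into a list of one-dimensional arrays; the peak normalization of the response is an added assumption.

```python
# Convolve the offset noise with a randomly chosen channel impulse response.
import numpy as np
from scipy.signal import fftconvolve

def apply_random_rir(noise, rir_bank, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    rir = rir_bank[rng.integers(len(rir_bank))]
    rir = rir / (np.max(np.abs(rir)) + 1e-8)            # normalise the response peak
    return fftconvolve(noise, rir, mode="full")[:len(noise)]
```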
Step 5, randomly fusing the audio noise data obtained in step 4 with the voice data to obtain a voice training data set R containing noise.
The collected interference noise is added to the collected voice data, and the time stamp information corresponding to the added noise is recorded. Assuming the clean speech signal without added interference noise is s(t) and the added interference noise is n(t), where t denotes time, the speech signal r(t) after adding noise can be expressed as:
r(t) = s(t) + α·h₁(t)*n(t) + a(t),
wherein h₁(t) represents the random impulse response (channel impulse response) added in step 4, a(t) represents the ambient noise, and α represents the amplification factor of the noise, which is an unknown coefficient. The goal of noise removal can thus be expressed as solving for s(t) in the above equation, i.e. removing the noise term α·h₁(t)*n(t) + a(t) from the actually recorded noise-added speech signal r(t).
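The construction of one noisy training example r(t) can be sketched as below. The reverberant noise term α·h₁(t)*n(t) is assumed to come from step 4, and α is chosen here to hit a target signal-to-noise ratio, which is one plausible way to set the unknown amplification factor.

```python
# Build r(t) = s(t) + alpha * (h1 * n)(t) + a(t) from equal-length 16 kHz arrays.
import numpy as np

def mix_example(speech, reverberant_noise, ambient, snr_db):
    """speech: s(t); reverberant_noise: h1(t)*n(t) from step 4; ambient: a(t)."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(reverberant_noise ** 2) + 1e-12
    # Pick alpha so that the speech-to-interference power ratio matches snr_db.
    alpha = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + alpha * reverberant_noise + ambient
```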
Step 6, training and using the noise elimination neural network model, with the training set {N, R} as input and the voice data set {S} as the output evaluation target.
As shown in fig. 2, the neural network used in this design is constructed based on a Transformer architecture and comprises three sub-networks: an encoder, a mask network, and a decoder.
The recorded signal with added interference noise and the added noise signal collected in steps (1) and (2) serve as the model inputs, and the model outputs an estimate of the recorded signal without the added interference noise.
An encoder sub-network is constructed from the speech data representation collected in step (1); the encoder uses two convolutional layers to convert the input speech signal into a two-dimensional feature map. The recorded signal with added interference noise and the added noise signal collected in steps (1) and (2) are input into the encoder to obtain feature-map representations of the noise-added recorded signal and of the noise signal, respectively. The two feature maps are then concatenated and fed into the mask network.
The mask network is designed according to the concatenated speech-signal feature map output by the encoder. The mask network comprises two Sepformer modules that learn the short-term and long-term audio features contained in the feature-map representation, so that the mask network captures the inherent relationship between the feature maps of the noise-added recording signal and of the noise signal and generates a mask for filtering the noise components out of the noise-added recording's feature map.
A decoder is constructed according to the encoder architecture; it adopts a model architecture symmetric to the encoder and recovers the audio feature map processed by the mask network into a clean audio signal with the noise removed.
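A schematic PyTorch layout of the encoder, mask network, and decoder is given below. The patent does not disclose hyper-parameters, so the channel counts, kernel sizes, and the use of generic Transformer encoder layers in place of the two Sepformer blocks are illustrative assumptions only.

```python
# Schematic encoder / mask network / decoder (sizes are illustrative).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Two 1-D convolution layers turn a waveform into a 2-D feature map (channels x frames)."""
    def __init__(self, ch=256, kernel=16, stride=8):
        super().__init__()
        self.conv1 = nn.Conv1d(1, ch, kernel, stride=stride)
        self.conv2 = nn.Conv1d(ch, ch, 3, padding=1)

    def forward(self, wav):                            # wav: (batch, samples)
        x = torch.relu(self.conv1(wav.unsqueeze(1)))
        return torch.relu(self.conv2(x))               # (batch, ch, frames)

class MaskNet(nn.Module):
    """Stand-in for the two Sepformer blocks: generic Transformer layers over the
    concatenated feature maps of the noisy recording and of the known noise."""
    def __init__(self, ch=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=2 * ch, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)
        self.proj = nn.Linear(2 * ch, ch)

    def forward(self, feat_noisy, feat_noise):         # each: (batch, ch, frames)
        x = torch.cat([feat_noisy, feat_noise], dim=1).transpose(1, 2)
        mask = torch.sigmoid(self.proj(self.blocks(x))).transpose(1, 2)
        return feat_noisy * mask                       # masked feature map

class Decoder(nn.Module):
    """Transposed convolution, symmetric to the encoder, mapping features back to a waveform."""
    def __init__(self, ch=256, kernel=16, stride=8):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(ch, 1, kernel, stride=stride)

    def forward(self, feat):
        return self.deconv(feat).squeeze(1)            # (batch, samples)

enc, masker, dec = Encoder(), MaskNet(), Decoder()
noisy = torch.randn(2, 16000)                          # 1 s noise-added recordings
noise = torch.randn(2, 16000)                          # the authorized noise reference
est_clean = dec(masker(enc(noisy), enc(noise)))        # estimated clean speech, (2, 16000)
```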
In a practical scenario, an attacker can only obtain r(t), so it is difficult for the attacker to recover s(t) from r(t). Meanwhile, since the authorized user can obtain the detailed information and response of the interference noise signal n(t), s(t) can be recovered from r(t) using the neural network model. Furthermore, the authorized user can store the noise-added voice data and the corresponding noise time stamps for a long time, so that the clean audio data can be recovered at any time. Thus, in the training phase, the data from the previous steps are input into the neural network model and model training is performed according to the following loss function, i.e. the negative scale-invariant SNR defined above:
L = -10·log₁₀( ||s_target||² / ||ŝ - s_target||² ), with s_target = (ŝᵀs / ||s||²)·s.
the fully trained model receives noise-added voice data, authorizes the obtained noise-added data, takes the introduced random time offset and impact response characteristics as input, and outputs clean voice signals after noise filtering.
In order to verify the effect of the invention, experiments are carried out on the voice interference noise authorization elimination method based on the neural network.
To simulate the use scenario of the method, it is assumed that the user is holding a conversation in the environment. To protect privacy, the user may use a noise-interfering device to inject an interfering signal into recording devices in the environment. At the same time, the user needs to record the current conversation. Because the interfering device has no capability of selective interference, the user's own recording is also disturbed. The user can therefore apply the present method to denoise the received interfered recording and restore clean recorded audio. The experiment considers an attacker who attempts to recover the recording without access to information about the added interference noise, and simulates the attacker with a model architecture similar to the noise-removal design; specifically, the encoder branch that takes the added interference noise is removed, so only the noisy recording is used as input. Based on the constructed simulated-attacker model, the experiment processes noisy recordings with three different methods: denoising with the noise-removal model of this design followed by speech enhancement, denoising with the simulated-attacker model followed by speech enhancement, and speech enhancement only. The original noisy audio without any processing is used as a comparison.
The experiment considers two scenarios, the digital domain and a real-world scene, and randomly selects 200 and 50 audio clips, respectively, from the LibriSpeech test-clean data set as test sets. In the real-world scenario, the experiment records the audio and the modulated noise separately and then mixes the collected audio at various signal-to-noise ratios.
As shown in figs. 3 and 4, when the signal-to-noise ratio is < 1, the noise-removal model of this design significantly improves recognition accuracy, and the improvement is larger at lower signal-to-noise ratios. When the signal-to-noise ratio is >= 1, the results show a slight degradation in recognition accuracy. For the other two methods, namely denoising with the simulated-attacker model followed by enhancement and speech enhancement only, recognition accuracy drops markedly. In addition, after the speech signal is recovered by noise cancellation, the audio quality (measured by SI-SNR) is significantly improved. These results show that the noise-removal model in this design can effectively remove interference noise from noisy audio, allowing a user to record normally even under attacker interference.

Claims (8)

1. The voice interference noise authorization elimination method based on the neural network is characterized by comprising the following steps of:
step 1, acquiring a voice data set which comprises different speakers and corresponding speaking contents;
step 2, collecting different audio noise data sets;
step 3, corresponding time stamp information is marked on the audio noise data set and the voice data set to obtain an initial noise training data set, and then random time offset is introduced to the initial noise data set according to the audio sampling rate;
step 4, randomly selecting impulse responses from a plurality of public acoustic data sets and fusing the impulse responses with an initial noise training data set after introducing random time offset;
step 5, randomly fusing the noise data processed in the step 4 with the voice data to obtain a voice training data set containing noise;
and 6, taking the initial noise training data set obtained in the step 3 and the voice training set obtained in the step 5 as inputs of a pre-constructed noise elimination neural network, and taking the voice data set in the step 1 as an output evaluation target to obtain a noise elimination model for filtering voice signal noise.
2. The neural network-based voice interference noise authorized elimination method of claim 1, wherein the step of obtaining the voice data set further comprises data expansion, wherein the data expansion expands the voice data of under-represented speakers by means of timbre migration.
3. The neural network-based voice interference noise cancellation method of claim 1, wherein the noise signal comprises single frequency noise, swept frequency noise or voice noise.
4. The neural network-based voice interference noise authorized cancellation method of claim 1, wherein in step 5, the expression of the voice signal after adding noise in the voice training data set is as follows:
r(t) = s(t) + α·h₁(t)*n(t) + a(t)
wherein r(t) represents the voice signal after noise is added, h₁(t) represents the impulse response of signal propagation in different environments, a(t) represents the ambient noise, α represents the amplification factor of the noise, s(t) represents the clean speech signal, n(t) represents the audio noise, and * denotes convolution.
5. The neural network-based voice interference noise cancellation method of claim 1, wherein the noise cancellation neural network is constructed based on a Transformer architecture.
6. The neural network-based voice interference noise authorized cancellation method according to claim 1 or 5, wherein the noise cancellation neural network comprises an encoder, a masking network and a decoder;
the encoder is used for converting an input voice signal into a two-dimensional characteristic diagram;
the masking network comprises two Sepformer modules for learning the short-term and long-term audio features contained in the audio feature-map representation, such that the masking network is able to capture the inherent relationship between the feature maps of the recorded signal with added interference noise and of the noise signal;
the decoder adopts a model framework symmetrical to the encoder and is used for recovering the audio feature map processed by the mask network into a clean audio signal with noise removed.
7. The neural network-based voice interference noise cancellation method of claim 1, wherein the noise cancellation neural network is trained using a loss function.
8. The neural network-based voice interference noise cancellation method of claim 7, wherein the loss function is designed based on the scale-invariant signal-to-noise ratio and is expressed as follows:
L = -10·log₁₀( ||s_target||² / ||ŝ - s_target||² ), with s_target = (ŝᵀs / ||s||²)·s
wherein s is the target clean speech during training, ŝ is the estimated clean speech output by the neural network, T denotes the matrix transpose, and ||x|| denotes the L2 norm of x.
CN202311818377.3A 2023-12-27 2023-12-27 Voice interference noise authorization elimination method based on neural network Pending CN117877501A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311818377.3A CN117877501A (en) 2023-12-27 2023-12-27 Voice interference noise authorization elimination method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311818377.3A CN117877501A (en) 2023-12-27 2023-12-27 Voice interference noise authorization elimination method based on neural network

Publications (1)

Publication Number Publication Date
CN117877501A true CN117877501A (en) 2024-04-12

Family

ID=90580472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311818377.3A Pending CN117877501A (en) 2023-12-27 2023-12-27 Voice interference noise authorization elimination method based on neural network

Country Status (1)

Country Link
CN (1) CN117877501A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination