CN115602190A - Forged voice detection algorithm and system based on main body filtering - Google Patents
- Publication number: CN115602190A
- Application number: CN202211217858.4A
- Authority
- CN
- China
- Prior art keywords
- filtering
- masking
- voice
- spectrogram
- amplitude
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
- G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
- G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination
Abstract
The robustness of existing forged voice detection methods in re-encoding and noise-mismatch scenarios is weak, and existing research on forged voice detection has proposed data augmentation of the training data set to improve this robustness. However, the data augmentation strategy increases the amount of training data, reduces model training efficiency, and applies only to scenarios with known coding algorithms and noise differences. The invention relates to the field of forged voice detection, in particular to forged voice detection in re-encoding and noise-interference scenarios, and provides a forged voice detection algorithm and system based on main body filtering.
Description
Technical field:
The invention relates to the field of forged voice detection, in particular to forged voice detection in re-encoding and noise-interference scenarios, and more particularly to a forged voice detection algorithm and system based on main body filtering.
Technical background:
The non-robustness of a forged voice detection model means that when the data set used to train the detection model and the data set used to evaluate it are mismatched, detection performance drops sharply. The mismatch between training and evaluation data sets can be divided into several scenarios: speaker mismatch, forgery-algorithm mismatch, re-encoding mismatch, noise-interference mismatch, and so on. The speaker mismatch scenario means that the evaluation data set contains speakers that do not exist in the training data set; the forgery-algorithm mismatch scenario means that the forged voices in the evaluation data set were produced with speech synthesis algorithms not used when the training data set was constructed; the re-encoding mismatch scenario means that the voice data in the evaluation data set may have been processed by several unknown coding algorithms, while the voice data in the training data set is either uncoded or processed by a limited set of algorithms; the noise-interference mismatch scenario means that the speech in the evaluation data set contains various noise interferences, while the training data set has only clean, noise-free speech. These mismatch scenarios may coexist; for example, the LA evaluation data set and the training data set of ASVspoof 2019 exhibit both speaker mismatch and forgery-algorithm mismatch.
Improving the robustness of forged voice detection models is a gradual process. Early research on forged voice detection focused on robustness in speaker mismatch and forgery-algorithm mismatch scenarios, and a number of targeted detection models and novel loss functions were proposed. These methods achieve high detection accuracy in those two scenarios. However, existing methods pay little attention to the robustness of forged voice detection models in re-encoding mismatch and noise mismatch scenarios. Coding is a common processing mode for digital audio, and audio inevitably encounters various noise interferences during acquisition, transmission, and playback. Therefore, a forged voice detection model aimed at practical scenarios should consider its performance under re-encoding mismatch and noise mismatch.
Experiments show that existing forged voice detection models have poor robustness in re-encoding mismatch and noise mismatch scenarios. The distribution mismatch between the LA evaluation data set of ASVspoof 2021 and the LA training data set of ASVspoof 2019 is mainly a re-encoding mismatch. When an existing forged voice detection model is trained on the LA training data set of ASVspoof 2019, its EER is about 4%-7% when evaluated on the LA evaluation data set of ASVspoof 2019, but generally close to 20% when evaluated on the LA evaluation data set of ASVspoof 2021. The LF data set of the ADD challenge simulates the noise mismatch of real environments: the EER of an existing forged voice detection model is close to 10% on the ADD training data but about 30% at evaluation, showing that the model's discrimination ability in noise mismatch scenarios is weak. The performance of existing algorithms in re-encoding mismatch and noise mismatch scenarios therefore still needs improvement.
Disclosure of Invention
The technical problem of the invention is mainly solved by the following technical scheme:
a forged voice detection algorithm based on main body filtering comprises
Collecting voice data, extracting characteristics, and dividing the voice data into a training set and a testing set;
respectively performing masking filtering main body extraction and amplitude filtering main body extraction on the training set and the test set, eliminating, in the spectrogram, the interference introduced into the speech data by re-encoding and noise, to obtain the extracted training set and test set;
training the detection model by using a training set to obtain a trained detection model;
and detecting the forged voice in real time by using the trained detection model.
In the above forged voice detection algorithm based on main body filtering, the feature extraction is to extract the spectrogram feature of the voice to be detected by using a general feature extraction algorithm.
In the foregoing forged speech detection algorithm based on main body filtering, when the masking filtering main body is extracted:
masking the spectrogram characteristics to calculate a spectrogram characteristic masking curve;
and eliminating the masked frequency components in the original speech spectrum features according to the masking curve to obtain a non-masking power spectrogram.
In the above forged voice detection algorithm based on the main body filtering, when the amplitude filtering main body is extracted:
performing frequency band division on the non-masking power spectrogram, dividing it into several frequency band parts according to the characteristics of human speech production and hearing;
and eliminating the noise signals by adopting a self-adaptive amplitude filtering algorithm on each frequency band part to obtain a main signal power spectrogram.
In the forged voice detection algorithm based on main body filtering, the Bark-domain spectrogram, the sound pressure level (SPL) of the spectrogram amplitude, and the local peak points of the frequency curve are respectively calculated from the speech power spectrogram;
the Bark band, the SPL of the spectrogram amplitude, and the local peak points are substituted into a masking transfer function to calculate the masking curve;
and the frequency components whose amplitude lies below the masking curve are eliminated, yielding the non-masking speech power spectrogram.
In the above forged voice detection algorithm based on the main body filtering, the non-masking power spectrogram is divided into three frequency bands, namely a high frequency band, a middle frequency band and a low frequency band;
and carrying out amplitude filtering according to the adaptive energy level in the frequency band region for different frequency bands.
In the above forged voice detection algorithm based on main body filtering, the Bark-domain spectrogram is calculated by converting the frequency domain to the Bark domain as shown in formula 1,
where f_hz represents the frequency value and f_bark represents the corresponding value on the Bark scale.
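Formula 1 itself is rendered as an image in the original and is not reproduced here. The sketch below uses the widely known Zwicker approximation of the Hz-to-Bark conversion as a stand-in; it plays the same role as the patent's formula 1 but is not necessarily the exact formula claimed:

```python
import numpy as np

def hz_to_bark(f_hz):
    """Frequency (Hz) -> Bark scale.

    Stand-in for the patent's formula 1: the common Zwicker approximation,
    not necessarily the exact conversion used in the patent.
    """
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)
```

The conversion is monotonic and maps the audible range onto roughly 24 Bark bands, matching the 24 critical-band regions the description refers to.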
In the above forged voice detection algorithm based on main body filtering, the core of the adaptive filtering is shown in formula 2:
f_abs represents the absolute value of the amplitude; Top_10% sorts the amplitudes of all frequency components within the band from high to low and selects the amplitude falling at the 10th percentile; f_h takes two parameters, the first being all frequency components of the band and the second being the amplitude threshold computed by Top_10%, and f_h sets the amplitude of every frequency component below the second parameter to a minimal value.
A forged voice detection system based on main body filtering comprises
A first module: configured to collect voice data, extract features from it, and divide it into a training set and a test set;
a second module: configured to respectively perform masking filtering main body extraction and amplitude filtering main body extraction on the training set and the test set, eliminating, in the spectrogram, the interference introduced into the speech data by re-encoding and noise, to obtain the extracted training set and test set;
a third module: configured to train the detection model with the training set to obtain a trained detection model;
a fourth module: configured to detect forged voice in real time using the trained detection model.
Therefore, the invention has the following advantages: 1. Aiming at the common problem of data distribution mismatch in the field of forged voice detection, the main body extraction module eliminates the easily changed non-main-body parts of the voice signal, effectively improving the robustness of existing forged voice detection models in re-encoding and noise-interference scenarios. 2. The main body extraction module exploits the auditory masking effect to filter out the non-main-body part of the voice content while retaining the main body part, so the semantics and naturalness of the original voice are preserved. When the training and evaluation data sets are not mismatched, the main body extraction module does not noticeably reduce the detection accuracy of existing forged voice detection models.
Drawings
Figure 1 is a schematic of a subject extraction scheme versus a general process.
Fig. 2 is a flow of a subject extraction calculation based on auditory masking effects.
Fig. 3 is a specific calculation flow of the principal extraction scheme based on spectrogram amplitude filtering.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.
Example (b):
By deeply analyzing the coding flow of voice and the characteristics of signal and noise in the spectrogram, the invention provides a masking filtering main body extraction scheme based on the human auditory masking effect and an amplitude filtering main body extraction scheme based on the high energy difference between signal and noise. These schemes eliminate, in the spectrogram, the interference components introduced into the voice signal by re-encoding and noise, while retaining the main body part of speech production, thereby achieving more robust forged voice detection. The method specifically comprises the following steps:
1. and extracting the original spectrogram characteristics of the voice sample to be detected by using an STFT algorithm.
2. Processing the original spectrogram characteristics by using a main body filtering module, wherein the main body filtering module comprises the following steps:
2.1 calculating a masking curve of the original spectrogram characteristic according to a calculation formula of the masking effect.
And 2.2, removing the masked frequency components in the original spectrogram features according to the masking curve to obtain a non-masking power spectrogram.
2.3 according to human auditory characteristics, applying adaptive amplitude filtering to different frequency bands in the non-masking power spectrogram, and eliminating noise signals to obtain main body signals.
3. The subject signal is used as input to train a model for detecting forged voice (many deep neural networks can be used for forged voice detection; any network can be chosen here, as the choice of network is not within the scope of the present invention).
4. The trained model can be used for detecting the forged voice. Before detection, however, the main signal is still extracted by the main filtering manner described in step 2.
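Steps 1 through 4 above can be sketched as follows. The STFT parameters are illustrative assumptions, the main body filter is shown as an identity placeholder (its two stages are detailed in the schemes below), and the detection network is omitted since the text notes any network may be used:

```python
import numpy as np
from scipy.signal import stft

def extract_spectrogram(wave, sr=16000, n_fft=512):
    """Step 1: power spectrogram of the speech sample via STFT."""
    _, _, Z = stft(wave, fs=sr, nperseg=n_fft)
    return np.abs(Z) ** 2            # shape: (n_fft // 2 + 1, n_frames)

def subject_filter(power_spec):
    """Steps 2.1-2.3 placeholder: masking filtering followed by adaptive
    amplitude filtering would go here; identity stand-in for this sketch."""
    return power_spec

# Steps 3-4: the filtered spectrogram feeds any forged-voice detection network,
# and the same filtering is applied again before detection.
wave = np.random.default_rng(1).standard_normal(16000)   # 1 s of dummy audio
features = subject_filter(extract_spectrogram(wave))
```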
The core of the method is the main body feature extraction module; its overall structure is shown in figure 1, and its workflow is as follows. First, the spectrogram feature of the voice to be detected is extracted with a general feature extraction algorithm. Second, masking measurement is performed on the spectrogram feature in order to calculate its masking curve. Then, the masking elimination module removes the masked frequency components from the original spectrogram feature according to the masking curve, yielding a non-masking power spectrogram. Finally, the non-masking power spectrogram is divided into several frequency band parts according to the characteristics of human speech production and hearing, and an adaptive amplitude filtering algorithm removes the noise signal from each band. The main body extraction module processes the spectrogram features of both the training data and the evaluation data, eliminating the interference information caused by re-encoding and noise signals between the features, and preventing the forged voice detection model from relying on unstable interference information during training and evaluation, thereby improving its robustness.
The main body extraction module firstly eliminates interference signals which can not be sensed by human ears based on masking effect, and secondly performs amplitude filtering operation according to the amplitude relation between noise and main body signals. This processing sequence is because the computation of the masking curve is required to maintain the integrity of the original speech signal, and if the amplitude filtering process is performed first, the relationship between the signals is destroyed and the computed masking curve will lose its original meaning. The masking filtering only eliminates the parts which can not be sensed by human ears, which has no influence on the subsequent noise elimination according to the amplitude, so the processing sequence of the main body extraction module is first masking filtering and then amplitude filtering.
The subject extraction module includes a subject extraction scheme based on auditory masking effect and a subject extraction scheme based on amplitude filtering, which are respectively described below.
1. Subject extraction scheme based on auditory masking effect (masking filtering subject extraction).
The specific processing flow of the subject extraction scheme based on the masking effect is shown in fig. 2. First, the Bark-domain spectrogram, the sound pressure level (SPL) of the spectrogram amplitude, and the local peak points of the frequency curve are calculated from the speech power spectrogram. The frequency-to-Bark conversion is shown in formula 1, where f_hz represents the frequency value and f_bark the value on the Bark scale. The frequency domain is converted to the Bark domain because the Bark scale better matches the human auditory system: there are 24 Bark sub-bands, corresponding to 24 critical-band regions of the human ear, and the physiological basis of the masking effect is the mutual interference of voice frequency components within each of these 24 regions. The SPL value, in dB, expresses the ratio of the sound at a point to the standard reference sound pressure; the frequency-to-SPL formula is shown in formula 2, where the square of the absolute value of the power represents the energy of the frequency component, and N_fft is the number of points used for the Fourier transform, typically slightly larger than the speech framing window length and a power of 2, which is convenient for the fast Fourier transform algorithm. A local peak point is a frequency point whose component is higher than the surrounding components in the frequency curve; algorithms for locating local peaks in a sequence are very mature, and the find_peaks function of the scipy library is used here to locate the peak points. When calculating the masking curve, each peak point is treated as a non-noise part.
The reason for calculating the peak points is that, in the auditory masking effect, the masking of noise parts by non-noise parts differs from the masking of noise parts by other noise, and generally only the masking effect of the non-noise parts on the noise parts needs to be counted.
The Bark band, SPL, and local peak points are substituted into the masking transfer function to calculate the masking curve. The masking transfer function is an iterative process: the masking effect of each non-noise point on the surrounding signal is calculated and accumulated, finally yielding the masking curve of the whole frequency curve. The formula for the global masking effect of each non-noise point in the masking transfer function is shown in equation 3. SPL_i is the SPL value of the peak point of the current iteration; only non-noise parts with SPL greater than 40 produce a masking effect. sf_j temporarily stores the masking effect of the i-th peak on the global j-th frequency component. dz is the difference between the Bark value of the i-th frequency and the Bark value of the j-th frequency, taken in absolute value when used. θ depends on dz: it takes 1 if dz is positive and 0 otherwise, meaning the level-dependent term applies when the energy of the local peak point i exceeds that of point j. The masking elimination module then removes the frequency components whose amplitude lies below the masking curve, yielding the non-masking speech power spectrogram.
sf_j = abs(dz) · (-27 + 0.37 · max(SPL_i - 40, 0) · θ) (3)
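A minimal sketch of this iteration, assuming equation 3's form sf_j = abs(dz) · (-27 + 0.37 · max(SPL_i - 40, 0) · θ) with θ = 1 on the upper (dz > 0) side of the masker, and using scipy's find_peaks as the text indicates. Combining the peaks by taking the maximum over their individual curves is one plausible reading of "accumulated":

```python
import numpy as np
from scipy.signal import find_peaks

def masking_curve(spl, bark):
    """Accumulate the spread of every local peak into a global masking curve.

    Assumed eq. 3 form: sf_j = |dz| * (-27 + 0.37 * max(SPL_i - 40, 0) * theta).
    """
    peaks, _ = find_peaks(spl)               # local peak points = non-noise parts
    curve = np.full_like(spl, -np.inf)
    for i in peaks:
        dz = bark - bark[i]                  # signed Bark-scale distance to peak i
        theta = (dz > 0).astype(float)       # level-dependent term, upper side only
        sf = np.abs(dz) * (-27.0 + 0.37 * max(spl[i] - 40.0, 0.0) * theta)
        curve = np.maximum(curve, spl[i] + sf)   # combine across all peaks
    return curve
```

Frequency components whose SPL falls below the returned curve would then be removed to form the non-masking power spectrogram.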
2. Magnitude filtering based subject extraction scheme (magnitude filtering subject extraction).
The specific flow of the principal extraction scheme based on spectrogram amplitude filtering is shown in fig. 3. The non-masking power spectrogram is first divided into high, middle, and low frequency bands, and amplitude filtering is applied to each band according to the adaptive energy level within it. The core of the adaptive filtering is shown in equation 4: f_abs represents the absolute value of the amplitude; Top_10% sorts the amplitudes of all frequency components within the band from high to low and selects the amplitude that falls exactly at the 10th percentile (Top_30% and Top_5% are defined analogously); f_h takes two parameters, the first being all frequency components of the band and the second being the amplitude threshold computed by Top_10%, and f_h sets the amplitude of every frequency component below the second parameter to a minimal value.
In the different band components, different percentages are retained according to the amplitude distribution of each utterance itself. Low frequencies are the main component of human speech, and only frequencies with sufficient energy need to be preserved. The middle frequencies are an important basis for the human ear to distinguish different sounds, so more frequency components are retained there. The high frequencies are mostly consonants, noise, etc., so only minimal information needs to be retained.
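The band-wise filtering can be sketched as below. The band split points and the assignment of the Top_10%/Top_30%/Top_5% retention rates to the low/mid/high bands are illustrative assumptions (the text does not state which band uses which rate), and sub-threshold amplitudes are zeroed as a stand-in for the minimal value used in the patent:

```python
import numpy as np

def amplitude_filter(band, keep_fraction):
    """Keep only components whose amplitude reaches the top `keep_fraction`
    of the band; everything below the threshold is zeroed."""
    mag = np.abs(band)
    threshold = np.quantile(mag, 1.0 - keep_fraction)  # Top 10% -> 90th percentile
    return np.where(mag >= threshold, band, 0.0)

def band_split_filter(power_spec):
    """Split the non-masking power spectrogram into low/mid/high bands and
    filter each adaptively. Split points and retention rates are illustrative."""
    n_bins = power_spec.shape[0]
    low, mid = n_bins // 8, n_bins // 2
    return np.vstack([
        amplitude_filter(power_spec[:low], 0.10),    # low: strongest components only
        amplitude_filter(power_spec[low:mid], 0.30), # mid: keep more detail
        amplitude_filter(power_spec[mid:], 0.05),    # high: minimal information
    ])
```

Because the threshold is a quantile of each band's own amplitude distribution, the filter adapts per utterance, matching the "adaptive energy level" described above.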
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments, or alternatives may be employed, by those skilled in the art, without departing from the spirit or ambit of the invention as defined in the appended claims.
Claims (9)
1. A forged voice detection algorithm based on main body filtering is characterized by comprising
Collecting voice data, extracting characteristics, and dividing the voice data into a training set and a testing set;
respectively performing masking filtering main body extraction and amplitude filtering main body extraction on the training set and the test set, eliminating, in the spectrogram, the interference introduced into the speech data by re-encoding and noise, to obtain the extracted training set and test set;
training the detection model by using a training set to obtain a trained detection model;
and detecting the forged voice in real time by using the trained detection model.
2. The algorithm for detecting counterfeit voice based on body filtering as claimed in claim 1, wherein the feature extraction is a general feature extraction algorithm to extract spectrogram features of the voice to be detected.
3. A forged voice detection algorithm based on body filtering according to claim 1, characterized in that, when extracting the masking filtering body:
masking measurement is carried out on spectrogram features, and a spectrogram feature masking curve is calculated;
and eliminating the masked frequency components in the original speech spectrum characteristics according to the masking curve to obtain a non-masking power spectrogram.
4. The forged voice detection algorithm based on main body filtering according to claim 1, wherein, when extracting the amplitude filtering main body:
performing frequency band division on the non-masking power spectrogram, dividing it into several frequency band parts according to the characteristics of human speech production and hearing;
and eliminating the noise signals by adopting a self-adaptive amplitude filtering algorithm on each frequency band part to obtain a main signal power spectrogram.
5. A subject filtering based counterfeit speech detection algorithm according to claim 1,
respectively calculating, from the speech power spectrogram, the Bark-domain spectrogram, the sound pressure level (SPL) of the spectrogram amplitude, and the local peak points of the frequency curve;
substituting the Bark band, the SPL of the spectrogram amplitude, and the local peak points into a masking transfer function to calculate the masking curve;
and eliminating frequency components with amplitude lower than the masking curve to obtain a non-masking voice power spectrogram.
6. A subject filtering based counterfeit speech detection algorithm according to claim 1,
dividing the frequency band of the non-masking power spectrogram into a high frequency band, a middle frequency band and a low frequency band;
and carrying out amplitude filtering according to the adaptive energy level in the frequency band region for different frequency bands.
7. A subject filtering based counterfeit speech detection algorithm according to claim 1,
calculating the Bark-domain spectrogram, i.e., converting the frequency domain to the Bark domain as shown in formula 1,
where f_hz represents the frequency value and f_bark represents the corresponding value on the Bark scale.
8. The algorithm for detecting forged speech based on body filtering as claimed in claim 1, wherein the core of the adaptive filtering is as shown in equation 2:
f_abs represents the absolute value of the amplitude; Top_10% sorts the amplitudes of all frequency components within the band from high to low; f_h takes two parameters, the first being all frequency components of the band and the second being the amplitude threshold computed by Top_10%, and f_h sets the amplitude of every frequency component below the second parameter to a minimal value.
9. A system for detecting forged voice based on main body filtering is characterized by comprising
A first module: configured to collect voice data, extract features from it, and divide it into a training set and a test set;
a second module: configured to respectively perform masking filtering main body extraction and amplitude filtering main body extraction on the training set and the test set, eliminating, in the spectrogram, the interference introduced into the speech data by re-encoding and noise, to obtain the extracted training set and test set;
a third module: configured to train the detection model with the training set to obtain a trained detection model;
a fourth module: configured to detect forged voice in real time using the trained detection model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211217858.4A CN115602190A (en) | 2022-09-30 | 2022-09-30 | Forged voice detection algorithm and system based on main body filtering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115602190A true CN115602190A (en) | 2023-01-13 |
Family
ID=84845344
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211217858.4A Pending CN115602190A (en) | 2022-09-30 | 2022-09-30 | Forged voice detection algorithm and system based on main body filtering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115602190A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108447495B (en) | Deep learning voice enhancement method based on comprehensive feature set | |
CN105513605B (en) | The speech-enhancement system and sound enhancement method of mobile microphone | |
CN108831499A (en) | Utilize the sound enhancement method of voice existing probability | |
CN109256127B (en) | Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter | |
CN111261189B (en) | Vehicle sound signal feature extraction method | |
CN108198545B (en) | Speech recognition method based on wavelet transformation | |
CN112201255A (en) | Voice signal spectrum characteristic and deep learning voice spoofing attack detection method | |
CN110299141B (en) | Acoustic feature extraction method for detecting playback attack of sound record in voiceprint recognition | |
CN105427859A (en) | Front voice enhancement method for identifying speaker | |
CN107293306B (en) | A kind of appraisal procedure of the Objective speech quality based on output | |
Mallidi et al. | Novel neural network based fusion for multistream ASR | |
Jangjit et al. | A new wavelet denoising method for noise threshold | |
CN112542174A (en) | VAD-based multi-dimensional characteristic parameter voiceprint identification method | |
CN103544961A (en) | Voice signal processing method and device | |
Couvreur et al. | Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models | |
Wang et al. | Low pass filtering and bandwidth extension for robust anti-spoofing countermeasure against codec variabilities | |
CN110197657B (en) | Dynamic sound feature extraction method based on cosine similarity | |
CN111261192A (en) | Audio detection method based on LSTM network, electronic equipment and storage medium | |
CN115602190A (en) | Forged voice detection algorithm and system based on main body filtering | |
CN112863517B (en) | Speech recognition method based on perceptual spectrum convergence rate | |
Rao et al. | Speech enhancement using sub-band cross-correlation compensated Wiener filter combined with harmonic regeneration | |
CN114613391B (en) | Snore identification method and device based on half-band filter | |
CN115064182A (en) | Fan fault feature identification method of self-adaptive Mel filter in strong noise environment | |
CN111933140A (en) | Method, device and storage medium for detecting voice of earphone wearer | |
CN112750451A (en) | Noise reduction method for improving voice listening feeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||