CN109767776B - Deception voice detection method based on dense neural network - Google Patents

Deception voice detection method based on dense neural network

Info

Publication number
CN109767776B
CN109767776B (application CN201910033384.XA)
Authority
CN
China
Prior art keywords
speech
neural network
layer
dense
spoofed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910033384.XA
Other languages
Chinese (zh)
Other versions
CN109767776A (en)
Inventor
王泳
苏卓艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN201910033384.XA
Publication of CN109767776A
Application granted
Publication of CN109767776B
Legal status: Active

Landscapes

  • Circuit For Audible Band Transducer (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a deception voice detection method based on a dense neural network, relating to the technical field of information security. The method comprises the following detection steps. Step one: construct a VT spoofed speech conversion model, in which the STFT is used to break the conventional coupling between time and frequency characteristics while keeping the rhythm unchanged. Step two: construct a convolutional neural network in which the output of each layer is passed to the next layer as input and transformed into that layer's output by a nonlinear operation. By building a dense convolutional network, the invention maximizes information flow between layers and strengthens feature propagation; the dense connections have a regularizing effect that reduces overfitting on tasks with small training sets; the network layers can be made narrower, markedly reducing the number of parameters and alleviating degradation; and limited neurons are reused, so redundant feature maps need not be relearned, which eases training.

Description

Deception voice detection method based on dense neural network
Technical Field
The invention relates to the technical field of information security, in particular to a deception voice detection method based on a dense neural network.
Background
Speech fraud is common in today's society and poses great challenges to social security, so distinguishing disguised voices from genuine ones is important. Most research focuses on voice conversion (VC), speech synthesis and replay attacks. However, another spoofing method exists: changing the voice of speaker A into a different voice (no target speaker is needed) so that a recognition system can no longer attribute the speech to A. This transformation is called VT (Voice Transformation, or voice morphing), and it has received far less attention.
The invention patent with application publication number CN 106875007 A discloses a convolutional long short-term memory end-to-end deep neural network for speech spoofing detection. The adopted network can directly optimize feature extraction and classification for the task at hand, so that a given input is represented more robustly and effectively and the detection result is improved overall; suitable features are evaluated directly through joint classifier training, so the model can adapt to any related task; and removing the front-end program greatly simplifies the pipeline, in particular the API calls: by combining classification and optimization within a single model, that invention eliminates the need to invoke multiple parameters for separate classifiers and feature extraction methods.
However, in practical use there are still many drawbacks: degradation occurs as the number of layers increases, and this connection pattern leaves many network layers contributing little while consuming a great deal of computation.
Disclosure of Invention
In order to overcome the above drawbacks of the prior art, an embodiment of the present invention provides a spoofed speech detection method based on a dense neural network. By building a dense convolutional network, the method maximizes information flow between layers and strengthens feature propagation, while the dense connections have a regularizing effect that reduces overfitting on tasks with small training sets. The dense convolutional network can also narrow the network layers, significantly reducing the number of parameters, alleviating degradation, and supporting reuse of limited neurons without relearning redundant feature maps, which eases training and addresses the problems identified in the background art above.
In order to achieve the above purpose, the present invention provides the following technical solutions: a deception voice detection method based on a dense neural network specifically comprises the following detection steps:
Step one: construction of the VT spoofed speech conversion model: the STFT is used to break the conventional coupling between time and frequency characteristics while keeping the rhythm unchanged, where VT spoofed speech can be described as follows:
Let x_t(n) be a frame of length N taken from the input speech signal at time t. It is first windowed by w(n), and an FFT is then applied to the windowed signal to obtain F(k), as given by equation (1):
F(k) = Σ_{n=0}^{N−1} x_t(n)·w(n)·e^{−j2πkn/N}   (1)
where w(n) denotes a Hamming or Hanning window and k denotes the frequency bin index.
The instantaneous magnitude |F(k)| and the instantaneous frequency ω(k) are then computed as shown in equations (2) and (3), respectively:
|F(k)| = √(Re{F(k)}² + Im{F(k)}²)   (2)
ω(k) = (k + δ)·F_S/N   (3)
where δ denotes the deviation of the k-th frequency bin and F_S denotes the sampling frequency.
For VT spoofed speech, the instantaneous frequency ω(k) is modified according to equation (4), where α denotes a scale factor, i.e. the spoofing factor:
ω′(k) = α·ω(k)   (4)
Linear interpolation is used to modify the instantaneous magnitudes, as shown in equation (5):
|F′(k′)| = (1 − μ)·|F(k)| + μ·|F(k+1)|   (5)
where 0 ≤ k, k′ < N/2, k = ⌊k′/α⌋ and μ = k′/α − k.
Another method of modifying the instantaneous magnitude is the energy-preserving correction shown in equation (6).
For simplicity, k is still used as the frequency bin index for the modified instantaneous frequency ω′ and instantaneous magnitude F′.
The instantaneous phase φ′(k) is then computed from the instantaneous frequency ω′(k), and the converted FFT coefficients are obtained through equation (7):
F′(k) = |F′(k)|·e^{jφ′(k)}   (7)
Finally, an inverse FFT is applied to F′(k) to obtain the VT spoofed speech.
As equations (4) and (5) show, VT spoofing changes the spectral amplitude, so implicit features may be introduced into the spoofed speech. These depth features can be classified by using the spectrogram of the speech as input to a deep neural network, and a spectrogram of the input speech signal is obtained via the short-time Fourier transform (STFT), as given in equation (8),
where the window size is 175 and the overlap is 50%. The VT spoofing perturbation is measured by a spoofing factor α spanning 12 semitones, as shown in equation (9):
α(s) = 2^{s/12}   (9)
where s takes any integer value in the range [−12, +12];
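For concreteness, the following is a minimal numpy sketch of the VT transformation described in step one, written as a phase-vocoder-style pitch shifter. The frame length, hop size and function names are illustrative assumptions, not values from the patent; only the spoofing factor of equation (9) and the window-175, 50%-overlap spectrogram of equation (8) come from the text above.

import numpy as np
from scipy.signal import stft

def vt_spoof(x, s, n_fft=1024, hop=256, fs=8000):
    """Shift the voice of signal x by s semitones (spoofing factor alpha = 2**(s/12))."""
    alpha = 2.0 ** (s / 12.0)                        # equation (9)
    w = np.hanning(n_fft)                            # w(n): a Hanning window
    n_bins = n_fft // 2 + 1
    bin_freq = np.arange(n_bins) * fs / n_fft        # nominal centre frequency of each bin
    prev_phase = np.zeros(n_bins)
    acc_phase = np.zeros(n_bins)
    y = np.zeros(len(x) + n_fft)
    for t in range(0, len(x) - n_fft, hop):
        F = np.fft.rfft(w * x[t:t + n_fft])          # equation (1)
        mag, phase = np.abs(F), np.angle(F)          # |F(k)|, equation (2)
        # instantaneous frequency omega(k), equation (3): nominal bin frequency
        # plus the deviation recovered from the frame-to-frame phase increment
        dphi = phase - prev_phase - 2 * np.pi * hop * np.arange(n_bins) / n_fft
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        omega = bin_freq + dphi * fs / (2 * np.pi * hop)
        prev_phase = phase
        omega_mod = alpha * omega                    # equation (4): omega'(k) = alpha * omega(k)
        # equation (5): linear interpolation of the magnitudes,
        # |F'(k')| = (1 - mu)|F(k)| + mu|F(k+1)| with k = floor(k'/alpha)
        src = np.arange(n_bins) / alpha
        k0 = np.clip(np.floor(src).astype(int), 0, n_bins - 2)
        mu = src - k0
        mag_mod = (1 - mu) * mag[k0] + mu * mag[k0 + 1]
        acc_phase += 2 * np.pi * omega_mod * hop / fs            # accumulate phi'(k)
        frame = np.fft.irfft(mag_mod * np.exp(1j * acc_phase))   # equation (7), then inverse FFT
        y[t:t + n_fft] += w * frame                  # overlap-add resynthesis
    return y[:len(x)]

def spectrogram(x, fs=8000, win=175):
    """Equation (8): spectrogram magnitude with window size 175 and 50% overlap."""
    _, _, Z = stft(x, fs=fs, nperseg=win, noverlap=win // 2)
    return np.abs(Z)

Applying vt_spoof with s drawn from [−12, +12] and feeding the resulting spectrogram to a classifier reproduces, under these assumptions, the data-generation pipeline described in step one.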
Step two: construct a convolutional neural network in which the output X_{l−1} of the previous layer is passed to the next layer as input and transformed by a nonlinear operation H_l(·) into the output X_l, which can be expressed as equation (10):
X_l = H_l(X_{l−1})   (10)
Degradation occurs as the number of layers increases; residual networks, highway networks and fractal networks all create a short path X_{l−n} from early layers to later layers, which effectively suppresses degradation, as shown in equation (11):
X_l = H_l(X_{l−1}) + X_{l−n}   (11)
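As a toy PyTorch illustration of the two connection patterns above, the sketch below contrasts the plain chain of equation (10) with a residual short path (equation (11), shown here for n = 1); the layer widths and depth are arbitrary choices for the sketch.

import torch
import torch.nn as nn

# four identical nonlinear operations H_l: convolution + batch norm + ReLU
H = nn.ModuleList([
    nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
    for _ in range(4)])

x = torch.randn(1, 16, 32, 32)

out = x
for layer in H:           # plain chain, equation (10): X_l = H_l(X_{l-1})
    out = layer(out)

out = x
for layer in H:           # residual short path, equation (11): X_l = H_l(X_{l-1}) + X_{l-1}
    out = layer(out) + out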
Step three: performance measurement: the detection accuracy of the VT spoofed voice is tested through an experimental corpus, wherein the detection can be described as follows:
d=(G d +S d )/(G+S)
where d is the detection accuracy, G and S are the number of real and spoofed segments in the test set, respectively, and Gd and Sd are the number of real segments correctly detected from G and spoofed segments correctly detected from S, respectively.
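This measure translates directly into code; the segment counts in the example call below are illustrative, not experimental results from the patent.

def detection_accuracy(g_correct, s_correct, g_total, s_total):
    """d = (G_d + S_d) / (G + S): fraction of test segments classified correctly."""
    return (g_correct + s_correct) / (g_total + s_total)

# e.g. 950 of 1000 genuine and 930 of 1000 spoofed segments detected correctly
print(detection_accuracy(950, 930, 1000, 1000))  # 0.94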
In a preferred embodiment, step two further comprises a dense convolutional network of improved structure, in which every layer is directly connected to all subsequent layers, expressed as
X_l = H_l([X_0, X_1, …, X_{l−1}])
where X_0, X_1, …, X_{l−1} denote the outputs of all layers preceding layer l and [·] denotes concatenation; furthermore, the output of each layer has q feature maps, where q is a natural number.
In a preferred embodiment, the input to the dense convolutional network is a single-channel spectrogram obtained by STFT, of size 90 × 88. The network consists of an initialization layer, three dense modules, two transition layers, a global pooling layer and a linear layer; the three dense modules contain 6, 12 and 48 bottleneck layers, respectively. The linear layer is a fully connected layer followed by a normalized exponential function whose two outputs represent the probabilities of "true" and "spoof", respectively. Each bottleneck layer comprises 2 convolution layers, so the entire dense convolutional network contains 2 × (6 + 12 + 48) + 1 + 1 + 1 = 135 convolution layers.
In a preferred embodiment, each bottleneck layer comprises a 1 × 1 convolution layer and a 3 × 3 convolution layer, and each transition layer connects two adjacent dense modules to further reduce the size of the feature maps, as sketched below.
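A PyTorch sketch of this preferred embodiment follows. The single-channel 90 × 88 input, the three dense modules of 6, 12 and 48 bottleneck layers, the 1 × 1 plus 3 × 3 bottleneck structure, the two transition layers, the global pooling layer and the two-output linear layer with a normalized exponential are taken from the description above; the growth rate q = 12, the initial channel count, the 4q bottleneck width and the average-pool transitions are assumptions of this sketch.

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 convolution followed by 3x3 convolution; output is concatenated with the input."""
    def __init__(self, in_ch, growth):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(),
            nn.Conv2d(in_ch, 4 * growth, kernel_size=1, bias=False),
            nn.BatchNorm2d(4 * growth), nn.ReLU(),
            nn.Conv2d(4 * growth, growth, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        # dense connection: every layer sees the outputs of all earlier layers
        return torch.cat([x, self.body(x)], dim=1)

def transition(in_ch, out_ch):
    """Transition layer between dense modules, shrinking the feature maps."""
    return nn.Sequential(nn.BatchNorm2d(in_ch), nn.ReLU(),
                         nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                         nn.AvgPool2d(2))

def build_network(growth=12, init_ch=24):
    layers = [nn.Conv2d(1, init_ch, kernel_size=3, padding=1, bias=False)]  # initialization layer
    ch = init_ch
    for i, n in enumerate((6, 12, 48)):            # the three dense modules
        layers += [Bottleneck(ch + j * growth, growth) for j in range(n)]
        ch += n * growth
        if i < 2:                                  # the two transition layers
            layers.append(transition(ch, ch // 2))
            ch //= 2
    layers += [nn.BatchNorm2d(ch), nn.ReLU(),
               nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # global pooling layer
               nn.Linear(ch, 2)]                   # linear layer: "true" / "spoof" logits
    return nn.Sequential(*layers)

net = build_network()
spec = torch.randn(8, 1, 90, 88)                   # batch of single-channel spectrograms
probs = torch.softmax(net(spec), dim=1)            # normalized exponential over the two classes

Counting convolutions as the text does gives 2 × (6 + 12 + 48) = 132 in the bottlenecks plus one initialization convolution and two transition convolutions, i.e. 135.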
In a preferred embodiment, the experimental corpora in step three include Timit, NIST and UME, all in WAV format with an 8 kHz sampling rate, 16-bit quantization and mono speech.
In a preferred embodiment, Timit, NIST and UME each comprise a training set and a test set: the training sets are Timit-1, NIST-1 and UME-1, and the test sets are Timit-2, NIST-2 and UME-2, respectively.
The invention has the technical effects and advantages that:
the invention ensures the maximum information flow between layers by establishing the dense convolution network, enhances the feature propagation, has regularization effect by dense connection, reduces the overfitting of tasks with smaller training set, can narrow the network layer, obviously reduces the parameter number, reduces the degradation problem, supports the reuse of limited neurons, does not need to relearn redundant feature graphs, is convenient for training, does not need to manually select specific one or more features like the traditional machine learning method, and then carries out classification by using a classifier, but can spontaneously extract related features including some shallow edge features and deep features by using the proposed dense neural network and then classify the features, thereby simplifying the whole flow and achieving better effect.
Drawings
FIG. 1 is a flow chart of the voice detection of the present invention;
FIG. 2 is a block diagram of a dense neural network of the present invention;
fig. 3 is an internal structural diagram of the dense neural network of the present invention.
Detailed Description
The following clearly and completely describes the embodiments of the present invention with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Example 1
The invention provides a deception voice detection method based on a dense neural network as shown in fig. 1-3, which specifically comprises the following detection steps:
Step one: construction of the VT spoofed speech conversion model: the STFT is used to break the conventional coupling between time and frequency characteristics while keeping the rhythm unchanged, where VT spoofed speech can be described as follows:
Let x_t(n) be a frame of length N taken from the input speech signal at time t. It is first windowed by w(n), and an FFT is then applied to the windowed signal to obtain F(k), as given by equation (1):
F(k) = Σ_{n=0}^{N−1} x_t(n)·w(n)·e^{−j2πkn/N}   (1)
where w(n) denotes a Hamming or Hanning window and k denotes the frequency bin index.
The instantaneous magnitude |F(k)| and the instantaneous frequency ω(k) are then computed as shown in equations (2) and (3), respectively:
|F(k)| = √(Re{F(k)}² + Im{F(k)}²)   (2)
ω(k) = (k + δ)·F_S/N   (3)
where δ denotes the deviation of the k-th frequency bin and F_S denotes the sampling frequency.
For VT spoofed speech, the instantaneous frequency ω(k) is modified according to equation (4), where α denotes a scale factor, i.e. the spoofing factor:
ω′(k) = α·ω(k)   (4)
Linear interpolation is used to modify the instantaneous magnitudes, as shown in equation (5):
|F′(k′)| = (1 − μ)·|F(k)| + μ·|F(k+1)|   (5)
where 0 ≤ k, k′ < N/2, k = ⌊k′/α⌋ and μ = k′/α − k.
Another method of modifying the instantaneous magnitude is the energy-preserving correction shown in equation (6).
For simplicity, k is still used as the frequency bin index for the modified instantaneous frequency ω′ and instantaneous magnitude F′.
The instantaneous phase φ′(k) is then computed from the instantaneous frequency ω′(k), and the converted FFT coefficients are obtained through equation (7):
F′(k) = |F′(k)|·e^{jφ′(k)}   (7)
Finally, an inverse FFT is applied to F′(k) to obtain the VT spoofed speech.
As equations (4) and (5) show, VT spoofing changes the spectral amplitude, so implicit features may be introduced into the spoofed speech. These depth features can be classified by using the spectrogram of the speech as input to a deep neural network, and a spectrogram of the input speech signal is obtained via the short-time Fourier transform (STFT), as given in equation (8),
where the window size is 175 and the overlap is 50%. The VT spoofing perturbation is measured by a spoofing factor α spanning 12 semitones, as shown in equation (9):
α(s) = 2^{s/12}   (9)
where s takes any integer value in the range [−12, +12];
Step two: construct a convolutional neural network (CNN) in which the output X_{l−1} of the previous layer is passed to the next layer as input and transformed by a nonlinear operation H_l(·) into the output X_l, which can be expressed as equation (10):
X_l = H_l(X_{l−1})   (10)
Degradation occurs as the number of layers increases; residual networks (ResNet), highway networks (Highway Networks) and fractal networks (FractalNets) all create a short path X_{l−n} from early layers to later layers, which effectively suppresses degradation, as shown in equation (11):
X_l = H_l(X_{l−1}) + X_{l−n}   (11)
Step three: performance measurement: the detection accuracy of the VT spoofed voice is tested through an experimental corpus, wherein the detection can be described as follows:
d=(G d +S d )/(G+S)
where G and S are the number of real and spoofed segments in the test set, respectively, and Gd and Sd are the number of real segments correctly detected from G and spoofed segments correctly detected from S, respectively.
Further, the experimental corpora in step three include Timit (6300 segments, 630 speakers), NIST (3560 segments, 356 speakers) and UME (4040 segments, 202 speakers), all in WAV format with an 8 kHz sampling rate, 16-bit quantization and mono speech.
Further, each corpus is divided into a training set and a test set: the training sets are Timit-1 (3000 segments), NIST-1 (2000 segments) and UME-1 (2040 segments), and the test sets are Timit-2 (3300 segments), NIST-2 (1560 segments) and UME-2 (2000 segments), respectively.
Example 2
Unlike Example 1, step two further comprises a dense convolutional network (DenseNet) of improved structure, in which every layer is directly connected to all subsequent layers, expressed as
X_l = H_l([X_0, X_1, …, X_{l−1}])
where X_0, X_1, …, X_{l−1} denote the outputs of all layers preceding layer l and [·] denotes concatenation; furthermore, the output of each layer has k feature maps, where k is kept small.
Further, the input to the dense convolutional network (DenseNet) is a single-channel spectrogram obtained by STFT, of size 90 × 88. The network consists of an initialization layer, three dense modules, two transition layers, a global pooling layer and a linear layer; the three dense modules contain 6, 12 and 48 bottleneck layers, respectively. The linear layer is a fully connected layer followed by a normalized exponential function whose two outputs represent the probabilities of "true" and "spoof", respectively. Each bottleneck layer comprises 2 convolution layers, so the entire dense convolutional network contains 2 × (6 + 12 + 48) + 1 + 1 + 1 = 135 convolution layers, through which the depth features can be extracted automatically with improved computational efficiency.
Further, each bottleneck layer includes one 1 × 1 convolution layer and one 3 × 3 convolution layer instead of two 3 × 3 convolution layers to reduce computation, and each transition layer connects two adjacent dense modules to further reduce the size of the feature maps.
Based on Example 2, homologous-corpus evaluation and cross-corpus evaluation were performed with the test sets and training sets as follows:
(1) Homologous corpus evaluation
In the internal-database case, where the test set and the training set come from the same corpus, the test results of this method and of other methods are shown in the following table.
as can be seen from the data in the table, the average detection precision of the method provided by the invention is 2.58% higher than that of the traditional CNN model and 3.66% higher than that of the SVM model, so that in the dense convolution network, the decision has depth characteristics and refers to early edge characteristics, and the precision can be further improved.
(2) Cross-corpus evaluation
In a real-world scenario, the test speech and the training speech may come from different sources. One of the three corpora is therefore selected as the test set while the other two serve as the training set; the experimental results are shown in the following table.
from the data in the table, the results of the first two cases are good, but scheme 3 is not ideal, one possible reason is that the data amount of NIST is greater than the other two groups shown in table 1, indicating that the NIST-trained model has better generalization ability, and that in the GNN method, the accuracy of scheme 1 is 94.37% and our accuracy is 96.45%, indicating that our proposed method is superior to the GNN method.
Finally, it should be noted that: the foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (4)

1. A deception voice detection method based on a dense neural network, characterized in that it specifically comprises the following detection steps:
step one: construction of the VT spoofed speech conversion model: the STFT is used to break the conventional coupling between time and frequency characteristics while keeping the rhythm unchanged, wherein VT spoofed speech can be described as follows:
let x_t(n) be a frame of length N taken from the input speech signal at time t; it is first windowed by w(n), and an FFT is then applied to the windowed signal to obtain F(k), as given by equation (1):
F(k) = Σ_{n=0}^{N−1} x_t(n)·w(n)·e^{−j2πkn/N}   (1)
where w(n) denotes a Hamming or Hanning window and k denotes the frequency bin index;
the instantaneous magnitude |F(k)| and the instantaneous frequency ω(k) are then computed as shown in equations (2) and (3), respectively:
|F(k)| = √(Re{F(k)}² + Im{F(k)}²)   (2)
ω(k) = (k + δ)·F_S/N   (3)
where δ denotes the deviation of the k-th frequency bin and F_S denotes the sampling frequency;
for VT spoofed speech, the instantaneous frequency ω(k) is modified according to equation (4), where α denotes a scale factor, i.e. the spoofing factor:
ω′(k) = α·ω(k)   (4)
linear interpolation is used to modify the instantaneous magnitudes, as shown in equation (5):
|F′(k′)| = (1 − μ)·|F(k)| + μ·|F(k+1)|   (5)
where 0 ≤ k, k′ < N/2, k = ⌊k′/α⌋ and μ = k′/α − k;
another method of modifying the instantaneous magnitude is the energy-preserving correction shown in equation (6);
for simplicity, k is still used as the frequency bin index for the modified instantaneous frequency ω′ and instantaneous magnitude F′;
the instantaneous phase φ′(k) is then computed from the instantaneous frequency ω′(k), and the converted FFT coefficients are obtained through equation (7):
F′(k) = |F′(k)|·e^{jφ′(k)}   (7)
finally, an inverse FFT is applied to F′(k) to obtain the VT spoofed speech;
as can be seen from equations (4) and (5), the VT spoofing perturbation changes the spectral amplitude, so that implicit features can be introduced into the VT spoofed speech; these implicit features are extracted by using the spectrogram of the speech as the input to the deep neural network, and a spectrogram of the input speech signal is obtained by the short-time Fourier transform (STFT), as given in equation (8),
where the window size is 175 and the overlap is 50%; the VT spoofing perturbation is measured by a spoofing factor α spanning 12 semitones, as shown in equation (9):
α(s) = 2^{s/12}   (9)
where s takes any integer value in the range [−12, +12];
step two: construct a convolutional neural network in which the output X_{l−1} of the previous layer is passed to the next layer as input and transformed by a nonlinear operation H_l(·) into the output X_l, which can be expressed as equation (10):
X_l = H_l(X_{l−1})   (10)
degradation occurs as the number of layers increases, and the constructed convolutional neural network suppresses this degradation well by means of the short path X_{l−n} shown in equation (11):
X_l = H_l(X_{l−1}) + X_{l−n}   (11)
Step three: performance measurement: the detection accuracy of the VT spoofed voice is tested through an experimental corpus, wherein the detection can be described as follows:
d=(G d +S d )/(G+S)
where d is the detection accuracy, G and S are the number of real and spoofed segments in the test set, respectively, and Gd and Sd are the number of real segments correctly detected from G and spoofed segments correctly detected from S, respectively.
2. The method for detecting spoofed speech based on a dense neural network of claim 1, characterized in that: step two further comprises a dense convolutional network of improved structure, in which every layer is directly connected to all subsequent layers, expressed as
X_l = H_l([X_0, X_1, …, X_{l−1}])
where X_0, X_1, …, X_{l−1} denote the outputs of all layers preceding layer l and [·] denotes concatenation; furthermore, the output of each layer has q feature maps, where q is a natural number.
3. The method for detecting spoofed speech based on a dense neural network of claim 1, characterized in that: the experimental corpora in step three include Timit, NIST and UME, all in WAV format with an 8 kHz sampling rate and 16-bit quantization.
4. The method for detecting spoofed speech based on a dense neural network of claim 3, characterized in that: Timit, NIST and UME each comprise a training set and a test set, wherein the training sets are Timit-1, NIST-1 and UME-1, and the test sets are Timit-2, NIST-2 and UME-2, respectively.
CN201910033384.XA 2019-01-14 2019-01-14 Deception voice detection method based on dense neural network Active CN109767776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910033384.XA CN109767776B (en) 2019-01-14 2019-01-14 Deception voice detection method based on dense neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910033384.XA CN109767776B (en) 2019-01-14 2019-01-14 Deception voice detection method based on dense neural network

Publications (2)

Publication Number Publication Date
CN109767776A CN109767776A (en) 2019-05-17
CN109767776B (en) 2023-12-15

Family

ID=66452939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910033384.XA Active CN109767776B (en) 2019-01-14 2019-01-14 Deception voice detection method based on dense neural network

Country Status (1)

Country Link
CN (1) CN109767776B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232928B (en) * 2019-06-13 2021-05-25 思必驰科技股份有限公司 Text-independent speaker verification method and device
CN110211604A (en) * 2019-06-17 2019-09-06 广东技术师范大学 A kind of depth residual error network structure for voice deformation detection
CN110390952B (en) * 2019-06-21 2021-10-22 江南大学 City sound event classification method based on dual-feature 2-DenseNet parallel connection
CN111243621A (en) * 2020-01-14 2020-06-05 四川大学 Construction method of GRU-SVM deep learning model for synthetic speech detection
CN111933154B (en) * 2020-07-16 2024-02-13 平安科技(深圳)有限公司 Method, equipment and computer readable storage medium for recognizing fake voice
CN112767951A (en) * 2021-01-22 2021-05-07 广东技术师范大学 Voice conversion visual detection method based on deep dense network
CN113506583B (en) * 2021-06-28 2024-01-05 杭州电子科技大学 Camouflage voice detection method using residual error network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231277A (en) * 2011-06-29 2011-11-02 电子科技大学 Method for protecting mobile terminal privacy based on voiceprint recognition
CN105845127A (en) * 2015-01-13 2016-08-10 阿里巴巴集团控股有限公司 Voice recognition method and system
CN105869630A (en) * 2016-06-27 2016-08-17 上海交通大学 Method and system for detecting voice spoofing attack of speakers on basis of deep learning
CN106875007A (en) * 2017-01-25 2017-06-20 上海交通大学 End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection
CN107293302A (en) * 2017-06-27 2017-10-24 苏州大学 A kind of sparse spectrum signature extracting method being used in voice lie detection system
CN108597540A (en) * 2018-04-11 2018-09-28 南京信息工程大学 A kind of speech-emotion recognition method based on variation mode decomposition and extreme learning machine
CN108806698A (en) * 2018-03-15 2018-11-13 中山大学 A kind of camouflage audio recognition method based on convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2493875A (en) * 2010-04-26 2013-02-20 Trustees Of Stevens Inst Of Technology Systems and methods for automatically detecting deception in human communications expressed in digital form

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231277A (en) * 2011-06-29 2011-11-02 电子科技大学 Method for protecting mobile terminal privacy based on voiceprint recognition
CN105845127A (en) * 2015-01-13 2016-08-10 阿里巴巴集团控股有限公司 Voice recognition method and system
CN105869630A (en) * 2016-06-27 2016-08-17 上海交通大学 Method and system for detecting voice spoofing attack of speakers on basis of deep learning
CN106875007A (en) * 2017-01-25 2017-06-20 上海交通大学 End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection
CN107293302A (en) * 2017-06-27 2017-10-24 苏州大学 A kind of sparse spectrum signature extracting method being used in voice lie detection system
CN108806698A (en) * 2018-03-15 2018-11-13 中山大学 A kind of camouflage audio recognition method based on convolutional neural networks
CN108597540A (en) * 2018-04-11 2018-09-28 南京信息工程大学 A kind of speech-emotion recognition method based on variation mode decomposition and extreme learning machine

Also Published As

Publication number Publication date
CN109767776A (en) 2019-05-17

Similar Documents

Publication Publication Date Title
CN109767776B (en) Deception voice detection method based on dense neural network
CN111261147B (en) Music embedding attack defense method for voice recognition system
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN104900235A (en) Voiceprint recognition method based on pitch period mixed characteristic parameters
WO2020024396A1 (en) Music style recognition method and apparatus, computer device, and storage medium
CN112802484A (en) Panda sound event detection method and system under mixed audio frequency
CN108091345B (en) Double-ear voice separation method based on support vector machine
Zheng et al. When automatic voice disguise meets automatic speaker verification
US10522160B2 (en) Methods and apparatus to identify a source of speech captured at a wearable electronic device
KR101140896B1 (en) Method and apparatus for speech segmentation
Yadav et al. ASSD: Synthetic Speech Detection in the AAC Compressed Domain
Chakravarty et al. Data augmentation and hybrid feature amalgamation to detect audio deep fake attacks
CN112767951A (en) Voice conversion visual detection method based on deep dense network
CN109545198A (en) A kind of Oral English Practice mother tongue degree judgment method based on convolutional neural networks
CN116153337B (en) Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
CN116665649A (en) Synthetic voice detection method based on prosody characteristics
Lu et al. Detecting Unknown Speech Spoofing Algorithms with Nearest Neighbors.
CN112309404B (en) Machine voice authentication method, device, equipment and storage medium
CN113012684B (en) Synthesized voice detection method based on voice segmentation
CN115293214A (en) Underwater sound target recognition model optimization method based on sample expansion network
CN108269566A (en) A kind of thorax mouth wave recognition methods based on multiple dimensioned sub-belt energy collection feature
CN104575518B (en) Rhythm event detecting method and device
Nosek et al. Synthesized speech detection based on spectrogram and convolutional neural networks
Dou et al. Dynamically mitigating data discrepancy with balanced focal loss for replay attack detection
CN117393000B (en) Synthetic voice detection method based on neural network and feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 510665 293 Zhongshan Avenue, Tianhe District, Guangzhou, Guangdong.
Applicant after: GUANGDONG POLYTECHNIC NORMAL University
Address before: 510665 293 Zhongshan Avenue, Tianhe District, Guangzhou, Guangdong.
Applicant before: GUANGDONG POLYTECHNIC NORMAL University

GR01 Patent grant