CN114898773A - Synthetic speech detection method based on deep self-attention neural network classifier - Google Patents


Info

Publication number
CN114898773A
Authority
CN
China
Prior art keywords: neural network, voice, time, attention, output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210401440.2A
Other languages
Chinese (zh)
Inventor
李长涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN202210401440.2A
Publication of CN114898773A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention relates to the field of synthesized voice identification, in particular to a method and a system for detecting synthesized voice based on a deep self-attention neural network classifier, comprising the following steps: step 1) obtaining a voice signal to be distinguished with a fixed length; step 2) extracting the time-frequency characteristics of the preprocessed voice signal to be distinguished; step 3) performing pattern recognition on the time-frequency characteristics of the voice signal to be distinguished through a one-dimensional convolution neural network module, so as to reduce the time resolution of the extracted characteristics; and step 4) identifying the resulting low-time-resolution voice features through a deep self-attention neural network classifier to determine whether the voice signal to be distinguished is synthesized voice. The invention uses the deep self-attention neural network classifier to learn the long-time correlation of the input voice features, and distinguishes real voice from synthesized voice according to this long-time correlation, thereby improving the accuracy of synthesized voice detection.

Description

Synthetic speech detection method based on deep self-attention neural network classifier
Technical Field
The invention relates to the field of synthesized voice identification, in particular to a method and a system for detecting synthesized voice based on a deep self-attention neural network classifier.
Background
The automatic speaker authentication system is a convenient and practical biometric verification method and is widely used in telephone and network access control systems such as telephone banking, health management and smart homes. However, synthesized speech generated by speech synthesis and voice conversion can often fool a speaker authentication system, which raises concerns about the reliability of such systems. In recent years, the development of deep learning technology has markedly improved the performance of speech synthesis and voice conversion, which poses a serious challenge to speaker authentication systems. How to improve the stability of automatic speaker authentication systems in the face of synthetic speech attacks has therefore become very important.
Existing research shows that the security of a speaker recognition system can be enhanced by placing an independent synthesized speech detection system in front of the automatic speaker recognition system. A synthesized speech detection system is a system for distinguishing synthesized speech from real speech; it improves the reliability of the speaker recognition system without requiring substantial changes to the existing speaker recognition system. In general, a synthesized speech detection system consists of a front end for feature extraction and a back-end classifier. A neural-network-based back-end classifier, combined with suitable front-end speech features, can achieve good detection performance.
Most existing synthesized speech detection systems have large model parameter counts and do not make good use of the long-time dependencies of the input front-end speech features. These problems affect both the practical deployment of a synthesized speech detection system and its performance. Most existing synthesized speech detection systems are based on convolutional neural networks and recurrent neural networks, such as residual networks and long short-term memory networks. In general, a convolutional neural network, limited by the size of its convolution kernels, cannot efficiently and accurately learn the long-time correlations of the input data. Meanwhile, recurrent neural networks, including long short-term memory networks, have difficulty handling long-time dependencies in the data because of the gradient explosion problem during training, and the parallel efficiency of long short-term memory networks is low. Yet the long-time correlation of speech is precisely what matters for synthesized speech detection. Existing research indicates that most synthesized speech can be very close to real speech locally, but its consistency deteriorates at longer time scales. Therefore, the synthesized speech detection task is likely to be accomplished more effectively by exploiting the long-time correlation of speech.
In summary, synthesized speech detection systems in the prior art suffer from large model parameter counts and difficulty in learning the long-time correlation of the input speech.
Disclosure of Invention
The invention aims to solve the problems that synthesized speech detection systems in the prior art have large model parameter counts and cannot learn the long-time correlation of the input speech. To this end, a synthesized speech detection method and system based on a deep self-attention neural network classifier are provided, in which the deep self-attention neural network classifier compensates for these shortcomings.
With the continuous development of deep learning technology, deep self-attention neural networks based on the self-attention mechanism have attracted attention. Compared with convolutional and recurrent neural networks, the self-attention mechanism can learn the long-time correlations of the input data well, and its parallel processing efficiency is high. Deep self-attention neural networks based on the self-attention mechanism were first used to solve sequence-to-sequence problems such as neural machine translation. Recently, it has been found that using only the encoder part of a deep self-attention neural network can learn the large-scale correlations of an image well, yielding an image classifier with high accuracy. This suggests that deep self-attention neural networks can be successfully applied to various classification problems.
The invention provides a synthetic voice detection method based on a deep self-attention neural network classifier, which utilizes the deep self-attention neural network classifier to learn the long-time correlation of input voice characteristics and judges real voice and synthetic voice according to the long-time correlation of the voice characteristics, and comprises the following steps:
step 1) preprocessing an input voice signal to be distinguished through a voice preprocessing module to obtain a voice signal to be distinguished with a fixed length;
step 2) extracting the time-frequency characteristics of the preprocessed voice signal to be distinguished through a voice time-frequency characteristic extraction module;
step 3) performing pattern recognition on the time-frequency characteristics of the voice signal to be distinguished through a one-dimensional convolution neural network module to reduce the time resolution of the characteristics, and inputting the obtained voice features to a deep self-attention neural network classifier;
and 4) identifying the input low-time resolution voice features through a deep self-attention neural network classifier to determine whether the voice signal to be judged is synthesized voice.
As a modification of the above method, step 1) specifically includes: for a voice signal to be distinguished whose length is less than the target fixed length, repeatedly and cyclically padding the signal through the voice preprocessing module until it reaches the target fixed length; and for a voice signal to be distinguished whose length is greater than the target fixed length, randomly intercepting a segment of the target fixed length from the signal through the voice preprocessing module.
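By way of illustration only, the fixed-length preprocessing of step 1) can be sketched as follows. The 4 s target length and 16 kHz sampling rate are values taken from the embodiment below, and the function name is illustrative rather than part of the described method.

```python
import numpy as np

def fix_length(wave: np.ndarray, target_len: int = 4 * 16000) -> np.ndarray:
    """Cyclically pad short signals and randomly crop long ones (illustrative sketch)."""
    if len(wave) < target_len:
        n = int(np.ceil(target_len / len(wave)))   # number of repetitions needed
        wave = np.tile(wave, n)                    # cyclic filling to at least target_len
    start = np.random.randint(0, len(wave) - target_len + 1)
    return wave[start:start + target_len]          # segment of the target fixed length
```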
As an improvement of the above method, the step 2) specifically includes the following steps:
step 201) performing framing on the preprocessed voice signals to be distinguished according to a preset framing frame length and a frame shift to obtain a plurality of frame voice signals;
step 202) multiplying each frame of voice signal by a Hanning window to reduce frequency spectrum leakage;
step 203) performing discrete Fourier transform on each windowed frame of the voice signal to obtain the magnitude spectra of the frames;
step 204) applying a bank of linear filters to the magnitude spectrum of each frame and taking the logarithm to obtain the filtered logarithmic magnitude spectrum;
step 205) performing discrete cosine transform on each logarithmic magnitude spectrum to obtain a linear frequency cepstrum coefficient of each voice signal;
step 206) calculating the first-order difference and the second-order difference of the linear frequency cepstrum coefficients of each voice signal, and splicing them with the linear frequency cepstrum coefficients along the frequency axis to form the input time-frequency characteristic of the one-dimensional convolution neural network module; wherein the input time-frequency characteristic of the one-dimensional convolution neural network module has
⌊(SignalLength − FrameLength) / HopLength⌋ + 1 time frames, each of dimension 3 × NumCepstral,
where SignalLength is the length of the preprocessed signal, FrameLength is the framing frame length, HopLength is the frame shift, and NumCepstral is the number of retained cepstral coefficients.
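By way of illustration only, steps 201) to 206) can be sketched in numpy as follows. The sampling rate (16 kHz), frame length (20 ms), frame shift (10 ms) and number of retained cepstral coefficients (20) are taken from the embodiment below; the number of linear filters, the simple one-frame differences used for the deltas, and the function name are assumptions made for this sketch.

```python
import numpy as np
from scipy.fftpack import dct

def lfcc_with_deltas(wave, sr=16000, frame_ms=20, hop_ms=10, n_filters=60, n_ceps=20):
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = (len(wave) - frame_len) // hop + 1
    window = np.hanning(frame_len)                                # step 202: Hanning window
    frames = np.stack([wave[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])                 # step 201: framing
    mag = np.abs(np.fft.rfft(frames, axis=1))                     # step 203: magnitude spectrum
    n_bins = mag.shape[1]
    edges = np.linspace(0, n_bins - 1, n_filters + 2)             # step 204: linearly spaced filters
    fbank = np.zeros((n_filters, n_bins))
    k = np.arange(n_bins)
    for m in range(1, n_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        fbank[m - 1] = np.clip(np.minimum((k - left) / (center - left),
                                          (right - k) / (right - center)), 0, None)
    log_spec = np.log(mag @ fbank.T + 1e-10)                      # filtered log-magnitude spectrum
    lfcc = dct(log_spec, type=2, axis=1, norm='ortho')[:, :n_ceps]  # step 205: DCT -> LFCC
    d1 = np.diff(lfcc, axis=0, prepend=lfcc[:1])                  # step 206: first-order difference
    d2 = np.diff(d1, axis=0, prepend=d1[:1])                      #           second-order difference
    return np.concatenate([lfcc, d1, d2], axis=1)                 # (time frames, 3 * n_ceps)
```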
As an improvement of the above method, the step 4) specifically includes:
step 401) embedding position information into the low-time-resolution voice features through a position information embedding layer, and obtaining the sequence output Ô of the deep self-attention neural network encoder layer for the position-embedded features;
step 402) inputting the sequence output Ô into a linear layer having only one node and performing a soft-max operation along its first (time) axis through this linear layer, to obtain the weighting weights w of the sequence output Ô on the time axis;
step 403) using the weighting weights w to perform a weighted average of the sequence output Ô over its time frames, obtaining the feature z used for discrimination, inputting z into a linear layer containing two units, and outputting the probabilities of real voice and synthesized voice respectively, so as to determine whether the voice signal to be distinguished is synthesized voice.
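By way of illustration only, steps 401) to 403) can be sketched as follows in PyTorch, assuming an encoder embedding dimension E; the class and attribute names are illustrative. The two-unit layer returns two scores whose soft-max gives the probabilities of real voice and synthesized voice.

```python
import torch
import torch.nn as nn

class SequencePoolingHead(nn.Module):
    """Soft-max weighting over time, weighted average, then a two-unit output layer."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)   # linear layer with only one node
        self.out = nn.Linear(embed_dim, 2)     # units for real voice / synthesized voice

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, T, E) sequence output of the encoder layer
        w = torch.softmax(self.score(seq), dim=1)   # weighting weights along the time axis
        z = (w * seq).sum(dim=1)                    # weighted average over all time frames
        return self.out(z)                          # two scores; soft-max gives class probabilities
```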
As an improvement of the above method, the specific steps by which the deep self-attention neural network classifier learns the long-time correlation of the input voice features are as follows:
S1) using three linear layers each containing E nodes, each frame of the one-dimensional convolution neural network output time-frequency characteristic ε is independently mapped to the embedding dimension E of the deep self-attention neural network encoder, giving the output Q of the first linear layer, the output K of the second linear layer and the output V of the third linear layer; Q, K and V have the same number of time frames as the output time-frequency characteristic ε but a per-frame dimension E instead of N; that is, Q, K, V ∈ ℝ^(T×E), a real-number matrix whose first dimension is T and whose second dimension is E, where E is the embedding dimension of the deep self-attention neural network encoder and T is the number of time frames of the one-dimensional convolution neural network output time-frequency characteristic ε;
S2) calculating the unnormalized attention matrix A:
A = Q Kᵀ / √E
where Kᵀ is the transpose of the output K of the second linear layer; performing a soft-max operation on each row of the unnormalized attention matrix A to obtain the attention matrix Â, which contains the long-time correlation of the one-dimensional convolution neural network output time-frequency characteristic ε;
S3) calculating the attention output O that has not yet been mapped to the target space, so as to update the feature of the output time-frequency characteristic ε on each time frame; wherein the attention output O not mapped to the target space is:
O = Â V
S4) mapping the feature of the attention output O on each time frame, through a linear layer containing E nodes, to the embedding dimension of the deep self-attention neural network encoder, to obtain the output Ô of the deep self-attention neural network encoder, wherein Ô ∈ ℝ^(T×E).
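By way of illustration only, steps S1) to S4) can be sketched as a single-head self-attention layer in PyTorch. The division by √E follows the standard scaled dot-product formulation assumed in the reconstruction above, and the class name is illustrative.

```python
import math
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """S1)-S4): project each frame to Q, K, V, attend over all time frames, map back to E."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(in_dim, embed_dim)     # first linear layer  -> Q
        self.k_proj = nn.Linear(in_dim, embed_dim)     # second linear layer -> K
        self.v_proj = nn.Linear(in_dim, embed_dim)     # third linear layer  -> V
        self.o_proj = nn.Linear(embed_dim, embed_dim)  # S4: map O to the embedding space

    def forward(self, eps: torch.Tensor) -> torch.Tensor:
        # eps: (batch, T, N) time-frequency features output by the one-dimensional CNN
        q, k, v = self.q_proj(eps), self.k_proj(eps), self.v_proj(eps)
        a = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # S2: unnormalized attention matrix
        attn = torch.softmax(a, dim=-1)                        # row-wise soft-max
        o = attn @ v                                           # S3: attention output
        return self.o_proj(o)                                  # S4: encoder output, (batch, T, E)
```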
In order to achieve another object of the present invention, the present invention provides a synthesized speech detection system based on a deep self-attention neural network classifier, used in the above synthesized speech detection method based on a deep self-attention neural network classifier, the system comprising: a voice preprocessing module, a voice time-frequency feature extraction module, a one-dimensional convolution neural network module and a deep self-attention neural network classifier; wherein,
the voice preprocessing module is used for preprocessing an input voice signal to be distinguished so as to obtain a voice signal to be distinguished with a fixed length;
the voice time-frequency feature extraction module is used for extracting the time-frequency features of the preprocessed voice signals to be distinguished;
the one-dimensional convolution neural network module is used for performing pattern recognition on the time-frequency characteristics of the voice signal to be distinguished so as to reduce the time resolution of the characteristics, and inputting the obtained voice features to the deep self-attention neural network classifier;
the deep self-attention neural network classifier is used for identifying input low-time-resolution voice features so as to determine whether the voice signal to be distinguished is synthesized voice.
As an improvement of the above system, the one-dimensional convolutional neural network module includes: a plurality of cascaded basic units, wherein each basic unit consists of a one-dimensional convolution layer, a linear rectification function and a maximum pooling layer.
As an improvement of the above system, the one-dimensional convolutional neural network module adopts a shallow neural network, and includes: 1 to 3 basic units.
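By way of illustration only, one basic unit of the one-dimensional convolutional neural network module can be sketched as follows in PyTorch. The default kernel and pooling sizes are the values used in the embodiment below, and the 'same' convolution padding is an assumption of this sketch.

```python
import torch.nn as nn

def conv_unit(in_ch: int, out_ch: int = 128, k: int = 3, pool_k: int = 3, pool_s: int = 2):
    """One basic unit: one-dimensional convolution + linear rectification + max pooling."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=k, stride=1, padding=k // 2),
        nn.ReLU(),
        nn.MaxPool1d(kernel_size=pool_k, stride=pool_s),
    )

# a shallow module of 1 to 3 cascaded basic units, e.g. a single unit mapping
# (batch, 60, 399) LFCC features to (batch, 128, 199) low-time-resolution features
cnn_module = nn.Sequential(conv_unit(60))
```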
As an improvement of the above system, the deep self-attention neural network classifier includes: a position information embedding layer, a deep self-attention neural network encoder layer, a pooling layer, a linear layer with one node and a linear layer with two nodes; wherein,
the position information embedding layer is used for embedding position information into the low-time-resolution voice features;
the deep self-attention neural network encoder layer is used for obtaining the sequence output Ô of the low-time-resolution voice features after the position information has been embedded;
the linear layer with one node is used for performing a soft-max operation on the sequence output Ô along the first axis to obtain the weighting weights w of the sequence output Ô on the time axis, performing a weighted average of Ô over its time frames with the weights w to obtain the feature z used for discrimination, and inputting z into the linear layer containing two units;
the linear layer with two units is used for outputting the probabilities of real voice and synthesized voice respectively, so as to determine whether the voice signal to be distinguished is real voice or synthesized voice.
As an improvement of the above system, the deep self-attention neural network classifier includes no more than 3 of the deep self-attention neural network encoder layers, and each encoder layer uses multi-head self-attention with no more than 2 heads.
Compared with the prior art, the invention has the advantages that:
the invention combines the advantages of the deep self-attention neural network classifier to learn the long-time correlation relationship of the input voice features, and the deep self-attention neural network encoder can simultaneously process all time frames of the input features, so that the long-time correlation of the input time-frequency features can be efficiently learned by the deep self-attention neural network encoder; and further, the real voice and the synthesized voice are recognized according to the long-time correlation of the voice characteristics, so that the accuracy of the detection of the synthesized voice is improved. Further, the invention also improves the generalization capability of the deep self-attention neural network classifier on smaller data sets: effective priori knowledge is introduced by serially connecting a plurality of layers of one-dimensional convolutional neural network layers in front of the deep self-attention neural network classifier, and the overall generalization capability of the system is improved. In addition, the model parameter quantity of the whole system is small, the complexity is low, and the system can be conveniently deployed in practical application.
Drawings
FIG. 1 is a flowchart of a synthesized speech detection method based on a deep self-attention neural network classifier according to an embodiment of the present invention;
FIG. 2 is a diagram of a one-dimensional convolutional neural network structure provided in an embodiment of the present invention;
FIG. 3 is a block diagram of a deep self-attention neural network classifier according to an embodiment of the present invention;
FIG. 4(a) is a diagram illustrating linear frequency cepstral coefficients of real speech according to an embodiment of the present invention;
FIG. 4(b) is a diagram of linear frequency cepstrum coefficients of synthesized speech according to an embodiment of the present invention.
Detailed Description
The technical scheme provided by the invention is further illustrated by combining the following embodiments.
The invention discloses a synthetic speech detection method and a synthetic speech detection system based on a deep self-attention neural network classifier, which are used for distinguishing synthetic speech from real speech. The synthetic voice detection method provided by the invention comprises the following steps: carrying out voice preprocessing to obtain a voice signal with fixed duration; carrying out feature extraction on a voice signal to obtain time-frequency features of the voice signal; inputting the time-frequency characteristics of the voice signal into a one-dimensional convolution neural network, and outputting voice characteristics with lower time resolution; and inputting the output of the one-dimensional convolution neural network of the last stage into a deep self-attention neural network classifier, and identifying real voice and synthesized voice.
By exploiting the deficiencies of synthesized voice in long-time correlation, and by means of the one-dimensional convolutional neural network and the deep self-attention neural network classifier, the method effectively improves the accuracy of synthesized voice detection.
Example 1
As shown in fig. 1, the technical solution adopted by the synthesized speech detection method based on the deep self-attention neural network classifier provided by the present invention is as follows:
step 1) a voice preprocessing module preprocesses a voice signal to be distinguished;
step 2), extracting time-frequency characteristics from the preprocessed voice by a voice time-frequency characteristic extraction module;
step 3) the extracted time-frequency features pass through a one-dimensional convolution neural network module to reduce the time resolution of the voice features;
and 4) inputting the output of the one-dimensional convolutional neural network into a classifier based on a deep self-attention neural network encoder to obtain a final classification result of the voice to be distinguished.
In the above technical solution, in step 3) and step 4) a synthesized voice signal pattern library is stored in the one-dimensional convolutional neural network module and the classifier based on a deep self-attention neural network encoder, and the step of training the pattern classifier includes: (a) collecting a plurality of synthesized voice and real voice signal samples respectively; (b) preprocessing the collected voice signal samples; (c) extracting time-frequency characteristics from the preprocessed voice signal samples; (d) according to the voice time-frequency characteristics and the corresponding label information, establishing the synthesized voice signal pattern library by training a deep neural network system formed by cascading the one-dimensional convolutional neural network module and the classifier based on a deep self-attention neural network encoder.
In the above technical solution, the ratio of the synthesized speech sample to the real speech sample collected in the step (a) may be unbalanced, for example, the ratio of the synthesized speech to the real speech may be 10: 1. Furthermore, the kind of synthesized speech encountered in practical applications may not be included in the collected synthesized speech for training.
In the above technical solution, the preprocessing of the speech signal in step 1) refers to processing the speech signal to obtain a speech signal with a fixed length: for shorter speech signals, cyclic filling is adopted; for longer speech signals, fixed length segments of the speech signal are randomly truncated from the signal.
In the above technical solution, the one-dimensional convolutional neural network module adopted in step 3) is composed of a one-dimensional convolutional layer, a linear rectification function and a maximum pooling layer, and these three operations constitute a basic unit of the one-dimensional convolutional neural network.
In the above technical solution, the number of basic units used by the one-dimensional convolutional neural network module adopted in step 3) is small, and generally does not exceed three, thereby ensuring that the model has a small number of parameters as a whole.
In the above technical solution, the classifier described in step 4) is based on a deep self-attention neural network encoder, and the classifier includes a position information embedding layer, a deep self-attention neural network encoder layer, a pooling layer, and a linear layer for obtaining a final output classification probability.
In the above technical solution, the classifier based on the depth self-attention neural network encoder used in step 4) uses fewer layers of the depth self-attention neural network encoder, the number of layers of the depth self-attention neural network encoder generally used is not more than three, and the number of multi-head self-attention heads in the depth self-attention neural network encoder layer is also smaller and generally not more than two, so as to reduce the probability of the overall overfitting of the model.
Example 2
A synthesized speech detection system based on a deep self-attention neural network classifier, as shown in fig. 1, mainly comprising:
the voice preprocessing module 101 is used for circularly filling or cutting the obtained voice signals and fixing the signal length of the voice signals so as to facilitate network batch processing and accelerate the processing speed;
the voice time-frequency feature extraction module 102 is used for extracting the time-frequency features of the processed voice and using the time-frequency features as the input of the neural network classifier;
and the one-dimensional convolutional neural network 103 is connected with the output end of the time-frequency feature extraction module 102 and is used for reducing the time resolution of the voice time-frequency features and increasing the generalization capability of the whole system on a small-scale data set. As shown in fig. 2, one basic component unit of the one-dimensional convolutional neural network includes a one-dimensional convolutional layer, a linear rectification activation function, and a max-pooling operation. Specifically, the number of convolution kernels of the one-dimensional convolution layer is set to N, the convolution kernel length is set to K, the convolution step size is set to S1, and the pooling kernel size of the maximum pooling operation is set to P, and the pooling step size is set to S2. The above three operations constitute one basic constituent unit of the one-dimensional convolutional neural network, and the one-dimensional convolutional neural network 103 may be composed of a plurality of basic constituent units in cascade;
the classifier 104 based on the deep self-attention neural network is connected with the output end of the one-dimensional convolution neural network 103. The deep self-attention neural network classifier 104 receives the output of the one-dimensional convolutional neural network 103 and classifies it to get the final result. The deep self-attention neural network classifier 104 models the long-time correlation of the input speech features and finally obtains the judgment of the speech class, as shown in fig. 3, and mainly includes a deep self-attention neural network encoder, a pooling layer, and a binary classifier based on a linear layer. The embedding dimension of the deep self-attention neural network encoder is E, the number of heads of self-attention is H, the type of the selected pooling layer is sequence pooling, and two nodes of the binary classifier based on the linear layer respectively output the probability that the voice belongs to real voice or synthesized voice. According to the output of the binary classifier based on the linear layer, the final judgment of the synthetic speech detection system on the class of the input speech can be obtained.
In particular, given a one-dimensional convolutional neural network output time-frequency feature ε ∈ ℝ^(T×N), where T represents the number of time frames of the feature and N represents the dimension of each frame feature (equal to the number of convolution kernels of the one-dimensional convolutional neural network 103), the specific steps by which the deep self-attention neural network encoder models the output time-frequency feature ε comprise:
S1) using three linear layers each containing E nodes, each frame of ε is independently mapped to the embedding dimension E of the deep self-attention neural network encoder, giving the output Q of the first linear layer, the output K of the second linear layer and the output V of the third linear layer; Q, K and V have the same number of time frames as ε but a per-frame dimension E instead of N; that is, Q, K, V ∈ ℝ^(T×E), where E is the embedding dimension of the deep self-attention neural network encoder and T is the number of time frames of ε;
S2) calculating the unnormalized attention matrix A:
A = Q Kᵀ / √E
where Kᵀ is the transpose of the output K of the second linear layer; performing a soft-max operation on each row of the unnormalized attention matrix A to obtain the attention matrix Â, which contains the long-time correlation of the output time-frequency feature ε;
S3) calculating the attention output O that has not yet been mapped to the target space, so as to update the feature of ε on each time frame; the attention output O not mapped to the target space is:
O = Â V
S4) mapping the feature of the attention output O on each time frame, through a linear layer containing E nodes, to the embedding dimension of the deep self-attention neural network encoder, obtaining the encoder output Ô ∈ ℝ^(T×E). The output Ô has learned the long-time correlation of the input time-frequency feature ε.
Through the steps, the deep self-attention neural network encoder can process all time frames of the input features at the same time, and therefore long-time correlation of the input time-frequency features can be efficiently learned through the deep self-attention neural network encoder.
The above neural network models of each part, if not otherwise noted, are prepared by conventional methods well known to those skilled in the art.
Example 3
The method provided by the embodiment 1 is combined with the system provided by the embodiment 2 to distinguish the input voice, and the specific steps are as follows:
step 1) the voice preprocessing module 101 preprocesses input voice, and the specific steps include:
step 101) judging the voice length, if the voice length T is less than the target filling length 4s, turning to step 102), otherwise, turning to step 103); the target filling length of 4s is chosen to more conveniently illustrate the present embodiment, not to be a hard requirement, and the reasons for choosing other specific parameters appearing hereinafter are the same;
step 102) calculating n = ⌈4 s / T⌉, repeating the voice signal n times to obtain a padded signal of length nT, randomly intercepting a segment of length 4 s from the padded signal, and returning the segment;
step 103) randomly truncates a segment of length 4s from the speech signal and returns the segment.
Step 2), the voice time-frequency feature extraction module 102 extracts the time-frequency features of the processed voice. It should be noted that there are many ways to choose the feature vector, such as a time-domain envelope, frequency-domain energy, or coefficients of the signal after various transforms (such as the discrete cosine transform, wavelet transform, etc.). In this example, the linear frequency cepstrum coefficient features of the voice signal are selected as the input to the network. Research has found that linear frequency cepstrum coefficient features are very effective for the synthesized voice detection problem. Fig. 4 shows the linear frequency cepstrum coefficient features of synthesized voice and real voice; no clear difference between the two can be seen by inspection, which illustrates the necessity of pattern recognition based on deep neural networks. The specific steps are as follows:
step 201) framing a voice signal, wherein the length of each frame is 20ms, and the frame shift is 10 ms;
step 202) multiplying each obtained frame of voice signal by a Hanning window to reduce frequency spectrum leakage;
step 203) performing discrete Fourier transform on each frame of voice signals subjected to windowing to obtain a magnitude spectrum of the signals;
step 204) applying a group of linear filters to the amplitude spectrum of the signal and taking the logarithm of the filtered amplitude spectrum;
step 205) performing discrete cosine transform on the logarithmic magnitude spectrum obtained in the previous step to obtain linear frequency cepstrum coefficient characteristics of the voice signal, wherein the linear frequency cepstrum coefficient time-frequency characteristics of the real voice are shown in fig. 4(a), and the linear frequency cepstrum coefficient time-frequency characteristics of the synthesized voice are shown in fig. 4 (b);
step 206) calculating the first-order difference and the second-order difference of the linear frequency cepstrum coefficient features of the voice signal, and splicing them with the linear frequency cepstrum coefficient features along the frequency axis, finally obtaining the input features of the network. The size of this feature is determined by the length of the signal (SignalLength), the frame length chosen in framing (FrameLength), the frame shift (HopLength) and the number of retained cepstral coefficients (NumCepstral): the feature has ⌊(SignalLength − FrameLength) / HopLength⌋ + 1 time frames, each of dimension 3 × NumCepstral.
In theory, the larger the dimension of the linear frequency cepstrum coefficient features, the better the description of the signal; but the larger the dimension, the greater the computation and the greater the risk of overfitting when training the deep neural network. Therefore, the dimension of the linear frequency cepstrum coefficient features should not be too large, and a suitable value should be selected. Experimental results in the existing literature show that the best effect is obtained when the number of retained cepstral coefficients is 20.
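With the parameter values of this example (a 4 s signal, 20 ms frames, a 10 ms frame shift and 20 retained cepstral coefficients), the relation above gives
⌊(4000 ms − 20 ms) / 10 ms⌋ + 1 = 399 time frames and 3 × 20 = 60 coefficients per frame,
so the network input feature has shape 399 × 60, which matches the feature shape used in step 3) below.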
Step 3) as shown in fig. 2, the one-dimensional convolutional neural network module 103 processes the linear frequency cepstrum coefficient features of the voice signal to reduce the time resolution of the LFCC features. The one-dimensional convolution in the network architecture is very important to the overall performance of the network: the synthesized voice detection system using one-dimensional convolution together with the deep self-attention neural network classifier shows a clear performance advantage over a system that uses no convolution and over a system that uses two-dimensional convolution;
Specifically, consider an input voice feature X ∈ ℝ^(399×60), where 399 is the number of time frames of the input time-frequency feature and 60 is the dimension of each frame feature. The one-dimensional convolution layer 201 is set to have 128 convolution kernels, each of size 3, with a convolution stride of 1; the pooling kernel size of the max-pooling operation 203 is set to 3 and the pooling stride to 2. Passing the feature through the one-dimensional convolutional neural network yields the low-time-resolution voice feature X′ ∈ ℝ^(199×128), which is input to the next module for further discrimination.
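The reduction from 399 to 199 time frames follows from the pooling parameters. Assuming the convolution uses 'same' padding so that it preserves the 399 time frames (the padding is not stated above, so this is an assumption), the max-pooling with kernel size 3 and stride 2 gives
⌊(399 − 3) / 2⌋ + 1 = 199 time frames.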
Step 4) as shown in fig. 3, the classifier 104 based on the deep self-attention neural network encoder further discriminates the lower-time-resolution features of the voice signal and finally outputs the judgment result. With the embedding dimension of the deep self-attention neural network encoder set to 128 and the number of attention heads set to 1, the discrimination process of the deep self-attention neural network encoder includes:
the low-time-resolution voice signal features first pass through the position information embedding module to embed timing information; specifically, the position information embedding module holds 199 learnable 128-dimensional vectors, one for each time position, and adds them to the features of the low-time-resolution voice feature X′ ∈ ℝ^(199×128) at the corresponding 199 time positions, giving the low-time-resolution features of the voice signal with embedded position information.
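By way of illustration only, the additive learnable position embedding described above can be sketched as follows in PyTorch; the shapes follow this example, and the class name and initialization scale are illustrative.

```python
import torch
import torch.nn as nn

class LearnablePositionEmbedding(nn.Module):
    """Adds a learnable 128-dimensional vector to each of the 199 time positions."""
    def __init__(self, n_frames: int = 199, embed_dim: int = 128):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(1, n_frames, embed_dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 199, 128) low-time-resolution voice features
        return x + self.pos
```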
Then, the position-embedded low-time-resolution features of the voice signal pass through the deep self-attention neural network encoder to obtain the sequence output Ô ∈ ℝ^(199×128), where 199 is the number of time frames of the sequence output feature and 128 is its embedding dimension. This sequence output cannot be used directly for classification; it needs to pass through sequence pooling to obtain the voice features that can be used for the final discrimination.
The sequence pooling specifically comprises:
step 401) inputting the sequence output Ô into a linear layer with only one node to obtain g ∈ ℝ^(199×1);
step 402) performing a soft-max operation on g along the first (time) axis to obtain the weighting weights w ∈ ℝ^(199×1) of the sequence output Ô on the time axis;
step 403) using the weights w to perform a weighted average of the sequence output Ô over its time frames, obtaining the feature z ∈ ℝ^128 finally used for discrimination.
The output z of the sequence pooling is input into a linear layer containing two units, from which the classification probabilities of the neural network for the voice are finally obtained.
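Putting the pieces together, a forward pass with the shapes of this example could look like the sketch below. It reuses the illustrative modules sketched earlier in this description (cnn_module, LearnablePositionEmbedding, SingleHeadSelfAttention, SequencePoolingHead), which are assumptions made for illustration rather than the exact modules of the embodiment.

```python
import torch

# illustrative shapes for one 4 s utterance of this example
lfcc = torch.randn(1, 60, 399)                        # LFCC + deltas as (batch, dim, frames)
feat = cnn_module(lfcc).transpose(1, 2)               # -> (1, 199, 128) low-time-resolution features
feat = LearnablePositionEmbedding()(feat)             # embed timing information
encoded = SingleHeadSelfAttention(128, 128)(feat)     # encoder sequence output, (1, 199, 128)
logits = SequencePoolingHead(128)(encoded)            # sequence pooling + two-unit layer
probs = torch.softmax(logits, dim=-1)                 # (1, 2): P(real voice), P(synthesized voice)
```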
Since neural network training is required, the synthesized voice detection method based on the deep self-attention neural network classifier is realized according to the following steps:
1) training a deep self-attention neural network classifier;
a synthesized speech detection system is often trained and evaluated using the common synthesized speech data set ASVspoof. Thus, an example of a synthesized speech detection system is trained according to the LA data set of ASVspoof 2019.
First, the training set and the development set of ASVspoof 2019 LA are merged to form a new training sample set containing 50224 voice samples in total, and the equal error rate of the system's predictions is calculated on the ASVspoof 2019 LA test set to evaluate the final performance of the system. The 50224 training voices are divided into batches of 64 and input batch by batch into the synthesized voice detection system described above, obtaining for each voice in the batch the probability that it belongs to real voice or synthesized voice. After the discrimination probabilities of the voices are obtained, the loss function is calculated against the true labels of the voices, and the gradients of the network parameters are computed by back-propagation to train the neural network. The 50224 voices are repeatedly input into the neural network in batches for training until convergence, and the finally converged neural network is used for synthesized voice detection. Table 1 lists the equal error rates obtained on the test set for different configurations of the synthesized voice detection system.
TABLE 1
Network architecture Equal error rate
One-dimensional convolution 1.06%
Two-dimensional convolution 2.31%
Without using convolution 2.62%
As can be seen from table 1, the depth self-attention neural network classifier model using the one-dimensional convolutional layer achieves the optimal detection effect.
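By way of illustration only, the batch training described above might be sketched as follows in PyTorch, assuming a model assembled from the modules sketched earlier and a data loader yielding batches of 64 feature tensors with 0/1 labels; the optimizer, learning rate and number of epochs are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs: int = 100, lr: float = 1e-4, device: str = "cpu"):
    """Batch training with a cross-entropy loss and back-propagation (illustrative sketch)."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                        # repeat over all utterances until convergence
        for feats, labels in train_loader:         # batches of 64 utterances and 0/1 labels
            logits = model(feats.to(device))       # two-unit output: real vs. synthesized voice
            loss = criterion(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()                        # gradients of the network parameters
            optimizer.step()
    return model
```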
And finally, performing a synthesized voice detection task by using the trained synthesized voice detection system.
To further illustrate that the present invention can be used effectively in practice, the following comparison is made. Table 2 compares the detection performance and model parameter counts of the proposed model with existing synthesized voice detection systems.
TABLE 2
[Table 2: comparison of detection performance and model parameter counts between the proposed model and existing synthesized voice detection systems; the table is provided as an image and is not reproduced here.]
As can be seen from the recognition results shown in Table 2, the present invention has the smallest number of model parameters and at the same time has very good synthesized speech detection capability.
2) After training of a synthetic speech detection system based on a deep self-attention neural network classifier is completed, actual detection is carried out;
3) the voice preprocessing module 101 preprocesses a voice signal to be detected;
4) the voice time-frequency feature extraction module 102 extracts feature vectors from voice signals to be detected;
5) the input voice features are recognized using the one-dimensional convolutional neural network 103 and the deep self-attention neural network classifier 104.
As can be seen from the above detailed description, the present invention uses the deep self-attention neural network classifier to learn the long-time correlation of the input voice features and recognizes real voice and synthesized voice according to this long-time correlation, thereby improving the accuracy of synthesized voice detection. Furthermore, the invention also improves the generalization ability of the deep self-attention neural network classifier on smaller data sets: effective prior knowledge is introduced by the several one-dimensional convolutional neural network layers cascaded in front of the deep self-attention neural network classifier, which improves the overall generalization ability of the system. In addition, the model parameter count of the whole system is small and its complexity is low, so the system can be conveniently deployed in practical applications.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A synthetic voice detection method based on a deep self-attention neural network classifier learns the long-time correlation of input voice features by using the deep self-attention neural network classifier and discriminates real voice and synthetic voice according to the long-time correlation of the voice features, comprising the following steps:
step 1) preprocessing an input voice signal to be distinguished through a voice preprocessing module to obtain a voice signal to be distinguished with a fixed length;
step 2) extracting the time-frequency characteristics of the preprocessed voice signal to be distinguished through a voice time-frequency characteristic extraction module;
step 3) performing pattern recognition on the time-frequency characteristics of the voice signal to be distinguished through a one-dimensional convolution neural network module to reduce the time resolution of the characteristics, and inputting the obtained voice features to a deep self-attention neural network classifier;
and 4) identifying the input low-time resolution voice features through a deep self-attention neural network classifier to determine whether the voice signal to be judged is synthesized voice.
2. The method for detecting the synthesized speech based on the deep self-attention neural network classifier according to claim 1, wherein the step 1) specifically comprises: for a voice signal to be distinguished whose length is less than the target fixed length, repeatedly and cyclically padding the signal through the voice preprocessing module until it reaches the target fixed length; and for a voice signal to be distinguished whose length is greater than the target fixed length, randomly intercepting a segment of the target fixed length from the signal through the voice preprocessing module.
3. The method according to claim 1, wherein the step 2) specifically comprises the following steps:
step 201) performing framing on the preprocessed voice signals to be distinguished according to a preset framing frame length and a frame shift to obtain a plurality of frame voice signals;
step 202) multiplying each frame of voice signal by a Hanning window to reduce frequency spectrum leakage;
step 203) performing discrete Fourier transform on each windowed frame of the voice signal to obtain the magnitude spectra of the frames;
step 204) applying a bank of linear filters to the magnitude spectrum of each frame and taking the logarithm to obtain the filtered logarithmic magnitude spectrum;
step 205) performing discrete cosine transform on each logarithmic magnitude spectrum to obtain a linear frequency cepstrum coefficient of each voice signal;
step 206) calculating the first-order difference and the second-order difference of the linear frequency cepstrum coefficients of each voice signal, and splicing them with the linear frequency cepstrum coefficients along the frequency axis to form the input time-frequency characteristic of the one-dimensional convolution neural network module; wherein the input time-frequency characteristic of the one-dimensional convolution neural network module has
⌊(SignalLength − FrameLength) / HopLength⌋ + 1 time frames, each of dimension 3 × NumCepstral,
where SignalLength is the length of the preprocessed signal, FrameLength is the framing frame length, HopLength is the frame shift, and NumCepstral is the number of retained cepstral coefficients.
4. the method according to claim 1, wherein the step 4) specifically comprises:
step 401) embedding position information into the low-time-resolution voice features through a position information embedding layer, and obtaining the sequence output Ô of the deep self-attention neural network encoder layer for the position-embedded features;
step 402) inputting the sequence output Ô into a linear layer having only one node and performing a soft-max operation along its first (time) axis through this linear layer, to obtain the weighting weights w of the sequence output Ô on the time axis;
step 403) using the weighting weights w to perform a weighted average of the sequence output Ô over its time frames, obtaining the feature z used for discrimination, inputting z into a linear layer containing two units, and outputting the probabilities of real voice and synthesized voice respectively, so as to determine whether the voice signal to be distinguished is synthesized voice.
5. The method for detecting the synthesized speech based on the deep self-attention neural network classifier as claimed in claim 1, wherein the deep self-attention neural network classifier learns the long-time correlation of the input speech features by the following specific steps:
S1) using three linear layers each containing E nodes, each frame of the one-dimensional convolution neural network output time-frequency characteristic ε is independently mapped to the embedding dimension E of the deep self-attention neural network encoder, giving the output Q of the first linear layer, the output K of the second linear layer and the output V of the third linear layer; Q, K and V have the same number of time frames as the output time-frequency characteristic ε but a per-frame dimension E instead of N; that is, Q, K, V ∈ ℝ^(T×E), a real-number matrix whose first dimension is T and whose second dimension is E, where E is the embedding dimension of the deep self-attention neural network encoder and T is the number of time frames of the one-dimensional convolution neural network output time-frequency characteristic ε;
S2) calculating the unnormalized attention matrix A:
A = Q Kᵀ / √E
where Kᵀ is the transpose of the output K of the second linear layer; performing a soft-max operation on each row of the unnormalized attention matrix A to obtain the attention matrix Â, which contains the long-time correlation of the one-dimensional convolution neural network output time-frequency characteristic ε;
S3) calculating the attention output O that has not yet been mapped to the target space, so as to update the feature of the output time-frequency characteristic ε on each time frame; wherein the attention output O not mapped to the target space is:
O = Â V
S4) mapping the feature of the attention output O on each time frame, through a linear layer containing E nodes, to the embedding dimension of the deep self-attention neural network encoder, to obtain the output Ô of the deep self-attention neural network encoder, wherein Ô ∈ ℝ^(T×E).
6. a synthesized speech detection system based on a deep self-attention neural network classifier, for implementing the synthesized speech detection method based on the deep self-attention neural network classifier as claimed in any one of claims 1 to 5, comprising: the system comprises a voice preprocessing module, a voice time-frequency feature extraction module, a one-dimensional convolution neural network module and a depth self-attention neural network classifier; wherein the content of the first and second substances,
the voice preprocessing module is used for preprocessing an input voice signal to be distinguished so as to obtain a voice signal to be distinguished with a fixed length;
the voice time-frequency feature extraction module is used for extracting the time-frequency features of the preprocessed voice signals to be distinguished;
the one-dimensional convolution neural network module is used for performing pattern recognition on the time-frequency characteristics of the voice signal to be distinguished so as to reduce the time resolution of the characteristics, and inputting the obtained voice features to the deep self-attention neural network classifier;
the deep self-attention neural network classifier is used for identifying input low-time-resolution voice features so as to determine whether the voice signal to be distinguished is synthesized voice.
7. The deep self-attention neural network classifier-based synthesized speech detection system of claim 6, wherein the one-dimensional convolutional neural network module comprises: a plurality of cascaded basic units, wherein each basic unit consists of a one-dimensional convolution layer, a linear rectification function and a maximum pooling layer.
8. The system according to claim 7, wherein the one-dimensional convolutional neural network module employs a shallow neural network, comprising: 1 to 3 basic units.
9. The deep self-attention neural network classifier based synthesized speech detection system of claim 6, wherein the deep self-attention neural network classifier comprises: a position information embedding layer, a deep self-attention neural network encoder layer, a pooling layer, a linear layer with one node and a linear layer with two nodes; wherein,
the position information embedding layer is used for embedding position information into the low-time-resolution voice features;
the deep self-attention neural network encoder layer is used for obtaining the sequence output Ô of the low-time-resolution voice features after the position information has been embedded;
the linear layer with one node is used for performing a soft-max operation on the sequence output Ô along the first axis to obtain the weighting weights w of the sequence output Ô on the time axis, performing a weighted average of Ô over its time frames with the weights w to obtain the feature z used for discrimination, and inputting z into the linear layer containing two units;
the linear layer with two units is used for outputting the probabilities of real voice and synthesized voice respectively, so as to determine whether the voice signal to be distinguished is real voice or synthesized voice.
10. The deep self-attention neural network classifier based synthesized speech detection system of claim 9, wherein the deep self-attention neural network classifier comprises no more than 3 of the deep self-attention neural network encoder layers, and each encoder layer uses multi-head self-attention with no more than 2 heads.
CN202210401440.2A 2022-04-18 2022-04-18 Synthetic speech detection method based on deep self-attention neural network classifier Pending CN114898773A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210401440.2A CN114898773A (en) 2022-04-18 2022-04-18 Synthetic speech detection method based on deep self-attention neural network classifier

Publications (1)

Publication Number Publication Date
CN114898773A true CN114898773A (en) 2022-08-12

Family

ID=82717215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210401440.2A Pending CN114898773A (en) 2022-04-18 2022-04-18 Synthetic speech detection method based on deep self-attention neural network classifier

Country Status (1)

Country Link
CN (1) CN114898773A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116660992A (en) * 2023-06-05 2023-08-29 北京石油化工学院 Seismic signal processing method based on multi-feature fusion
CN116660992B (en) * 2023-06-05 2024-03-05 北京石油化工学院 Seismic signal processing method based on multi-feature fusion

Similar Documents

Publication Publication Date Title
CN107731233B (en) Voiceprint recognition method based on RNN
CN110176248B (en) Road voice recognition method, system, computer device and readable storage medium
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN111986699B (en) Sound event detection method based on full convolution network
CN111653267A (en) Rapid language identification method based on time delay neural network
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN111833906B (en) Sound scene classification method based on multi-path acoustic characteristic data enhancement
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN114495950A (en) Voice deception detection method based on deep residual shrinkage network
CN114898773A (en) Synthetic speech detection method based on deep self-attention neural network classifier
CN116884435A (en) Voice event detection method and device based on audio prompt learning
CN110444225B (en) Sound source target identification method based on feature fusion network
CN116593980B (en) Radar target recognition model training method, radar target recognition method and device
CN113963718B (en) Voice conversation segmentation method based on deep learning
CN115312080A (en) Voice emotion recognition model and method based on complementary acoustic characterization
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN113870896A (en) Motion sound false judgment method and device based on time-frequency graph and convolutional neural network
CN113488069A (en) Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
CN113948107A (en) Engine fault diagnosis method based on end-to-end CNN fault diagnosis model
CN117393000B (en) Synthetic voice detection method based on neural network and feature fusion
CN116153337B (en) Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
CN116230012B (en) Two-stage abnormal sound detection method based on metadata comparison learning pre-training
CN117292437B (en) Lip language identification method, device, chip and terminal
CN116129913A (en) Fake voice detection method based on frequency band selection
CN114283849A (en) System and method for detecting disguised voice

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination