CN111564163B - RNN-based multiple fake operation voice detection method - Google Patents
- Publication number
- CN111564163B (application CN202010382185.2A)
- Authority
- CN
- China
- Prior art keywords
- voice
- lfcc
- rnn
- matrix
- original
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an RNN-based method for detecting multiple forgery operations on speech, comprising the following steps: 1) obtain original voice samples, apply M kinds of forgery processing to them to obtain M kinds of forged voice plus 1 kind of unprocessed original voice, extract features from these voices to obtain the LFCC matrix of each training voice sample, and feed the LFCC matrices into an RNN classifier network for training to obtain a multi-classification model; 2) obtain a section of test voice, extract its features to obtain the LFCC matrix of the test voice data, and feed it into the RNN classifier trained in step 1) for classification; each test voice yields an output probability, and all output probabilities are combined into the final prediction: if the prediction is the original class, the test voice is recognized as original voice; if the prediction is a class corresponding to some forgery operation, the test voice is recognized as forged voice that has undergone that operation.
Description
Technical Field
The invention relates to a voice detection method, and in particular to an RNN-based voice detection method for multiple forgery operations.
Background
With the increasing power of voice editing software, modifying voice content is now easy even for non-professionals. If lawbreakers maliciously forge and modify voice, and the modified voice is then used in fields such as news reporting, judicial forensics, and scientific research, it can pose a huge threat and even have unpredictable effects on social stability. Digital voice forensics is the detection of such forgery operations; it plays a crucial role in verifying the originality and authenticity of audio material and is an important research subject in the current multimedia forensics field.
Most existing digital voice forensic detection techniques detect a single forgery operation; that is, the forensic examiner assumes that the voice to be detected has undergone one specific forgery operation. Mengyu Qiao et al. proposed a detection algorithm based on statistical features of quantized MDCT coefficients and their derivatives for detecting up-converted and down-converted MP3 audio files: a reference audio signal is generated by recompressing and calibrating the audio, and a support vector machine is then used for classification. Experimental results show that the method effectively detects MP3 double compression and can recover the audio processing history of digital evidence. Wang Lihua et al. proposed a convolutional neural network-based detection of pitch-shifting processing history, which applies four different pitch-shifting programs to three voice corpora and uses a CNN to detect the pitch-shifting factor within and across corpora and pitch-shifting methods, with a detection rate above 90%.
Existing digital voice forensic techniques can thus detect a single forgery operation with a very high detection rate. In practical applications, however, the forensic examiner often cannot predict which specific forgery operation was used, and misjudgments may occur when a classifier trained for one specific operation is used for detection.
At present, most digital forensic work applicable to multiple forgery operations is concentrated in the field of digital images, and research on digital voice forensics is still relatively scarce. In the digital voice field, Luo Weiqi's team designed a convolutional neural network model that can detect audio processing operations applied with default settings in two different audio editing programs, with good results. However, although that work pioneered the detection of multiple voice forgery operations, it has some problems, such as excessive computational complexity and an overly idealized application scenario for the forgery operations.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an RNN-based speech detection method for multiple forgery operations which can improve detection accuracy.
The technical scheme adopted for solving the technical problems is as follows: a speech detection method for multiple fake operations based on RNN is characterized in that: the method comprises the following steps:
1) Training network: obtaining an original voice sample, performing M kinds of forging treatment on the original voice sample to obtain M kinds of forged voices and 1 untreated original voice, extracting features of the M kinds of forged voices and 1 original voice to obtain an LFCC matrix of a training voice sample, and sending the LFCC matrix into an RNN classifier network for training to obtain a multi-classification training model;
2) And (3) voice recognition: obtaining a section of test voice, extracting features of the test voice to obtain an LFCC matrix of test voice data, sending the LFCC matrix into the RNN classifier trained in the step 1) to classify, obtaining an output probability for each test voice, and combining all the output probabilities as a final prediction result: if the predicted result is the original voice, the test voice is recognized as the original voice; if the predicted result is a voice that has undergone a certain falsification operation, the test voice is recognized as a falsified voice that has undergone the corresponding falsification operation.
Preferably, in steps 1) and 2), the step of obtaining the LFCC matrix is:
1) FFT: firstly, the voice is preprocessed (framed and windowed), and the spectral energy E(i, k) of each voice frame after the FFT is calculated:

E(i, k) = |Σ_{m=0}^{N−1} x_i(m)·e^{−j2πkm/N}|²

where i is the frame index, k is the frequency component, x_i(m) is the voice signal data of the i-th frame, and N is the number of FFT points;
the energy of the per-frame spectral energy E(i, k) after passing through the triangular filter bank is then calculated:

S(i, l) = Σ_k E(i, k)·H_l(k),  l = 1, 2, …, L

where H_l(k) is the frequency response of the l-th triangular filter, f(l) is the filter function of the l-th triangular filter, S(i, l) is the spectral-line energy after the filter bank, l is the filter index, and L is the total number of triangular filters;
2) DCT: the output data lfcc(i, n) of the triangular filter bank is calculated using the discrete cosine transform (DCT):

lfcc(i, n) = Σ_{l=1}^{L} log S(i, l)·cos(πn(2l − 1)/(2L))

where n is the index of the spectral line of the i-th frame after the DCT;
3) Obtaining the LFCC statistical moment: 12th-order LFCC coefficients are taken from lfcc(i, n), and their mean values and correlation coefficients are calculated, giving the LFCC matrix extracted from a section of voice:

X = [x_{i,j}] (s rows, n columns)

where x_{s,1} … x_{s,n} are the n LFCCs computed for the s-th frame of voice data.
Preferably, the RNN classifier includes an LSTM network, a Dropout layer, a full connection layer, and a Softmax layer sequentially connected, where the Dropout layer is connected to the last LSTM network.
Preferably, the parameters of the two LSTM networks are set to (64, 128) and (128, 64), respectively.
Preferably, the LSTM network uses a tanh activation function.
Preferably, the dropout rate of the Dropout layer is 0.5.
Preferably, the original speech is in WAV format.
Compared with the prior art, the invention has the following advantages: voice cepstral features are used and the class probabilities are output by a recurrent-neural-network classifier, which improves the accuracy of voice detection, suits digital voice carriers better, and can identify the traces of different forgery operations; moreover, compared with existing deep-learning methods, the parameter sharing in the RNN greatly reduces the computational complexity.
Drawings
Fig. 1 is a process diagram of LFCC statistical moment extraction in the voice detection method according to the embodiment of the present invention;
FIG. 2 is a schematic diagram of the overall framework of a voice detection method according to an embodiment of the present invention;
fig. 3 is a network configuration diagram of a voice detection method according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", "circumferential", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings are merely for purposes of describing the present invention and simplifying the description, and do not indicate or imply that the device or element being referred to must have a specific orientation, be configured and operated in a specific orientation, and because the disclosed embodiments of the present invention may be arranged in different orientations, these directional terms are merely for illustration and should not be construed as limitations, such as "upper", "lower" are not necessarily limited to orientations opposite or coincident with the direction of gravity. Furthermore, features defining "first", "second" may include one or more such features, either explicitly or implicitly.
The RNN (recurrent neural network)-based detection method for multiple speech forgery operations is realized by constructing a recurrent-neural-network framework on top of cepstral features. Referring to fig. 2, the framework consists of two parts: first, the cepstral features of the voice sample are extracted; then the features are fed into the designed network framework for classification, accomplishing the task of identifying multiple forgery operations.
Specifically, feature extraction of speech in the invention is performed as follows. The cepstral features adopted are Linear Frequency Cepstral Coefficients (LFCC). Cepstral features are among the most commonly used feature parameters in speech technology; they characterize human auditory properties and are widely used for speaker recognition.
LFCC distributes its band-pass filters evenly from the low to the high frequencies. The extraction process of the LFCC statistical moments is shown in fig. 1:
1) FFT: firstly, the voice is preprocessed (framed and windowed), and the spectral energy E(i, k) of each voice frame after the Fast Fourier Transform (FFT) is calculated:

E(i, k) = |Σ_{m=0}^{N−1} x_i(m)·e^{−j2πkm/N}|²

where i is the frame index, k is the frequency component, x_i(m) is the voice signal data of the i-th frame, and N is the number of FFT points.
The energy of each frame's spectral energy E(i, k) after passing through the triangular filter bank is then calculated:

S(i, l) = Σ_k E(i, k)·H_l(k),  l = 1, 2, …, L

where H_l(k) is the frequency response of the l-th triangular filter, f(l) is the filter function of the l-th triangular filter, S(i, l) is the spectral-line energy after the filter bank, l is the filter index, and L is the total number of triangular filters.
2) DCT: then, the output data lfcc(i, n) of the triangular filter bank is calculated using the Discrete Cosine Transform (DCT):

lfcc(i, n) = Σ_{l=1}^{L} log S(i, l)·cos(πn(2l − 1)/(2L))

where n is the index of the spectral line of the i-th frame after the DCT.
3) Obtaining the LFCC statistical moment: 12th-order LFCC coefficients are taken from lfcc(i, n), and their mean values and correlation coefficients are calculated (these steps can be implemented with existing MATLAB functions). Assuming the preprocessed voice segment has s frames in total, the extracted LFCC matrix is:

X = [x_{i,j}] (s rows, n columns)

where x_{s,1} … x_{s,n} are the n LFCCs computed for the s-th frame of voice data.
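The three extraction steps above can be sketched in Python as follows. This is a minimal illustration assuming the standard LFCC formulas; the frame length, hop size, and filter count below are hypothetical choices for demonstration, not values fixed by the patent:

```python
import numpy as np

def lfcc_matrix(signal, frame_len=512, hop=256, n_filters=24, n_ceps=12):
    """Sketch of the LFCC statistical-moment extraction described above."""
    # 1) Framing + FFT: spectral energy E(i, k) of each windowed frame
    frames = [signal[t:t + frame_len] * np.hamming(frame_len)
              for t in range(0, len(signal) - frame_len + 1, hop)]
    E = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2      # shape (s, frame_len//2 + 1)

    # Triangular filters H_l(k) spaced *linearly* (not mel) over the spectrum
    n_bins = E.shape[1]
    centers = np.linspace(0, n_bins - 1, n_filters + 2)
    k = np.arange(n_bins)
    H = np.zeros((n_filters, n_bins))
    for l in range(1, n_filters + 1):
        left, mid, right = centers[l - 1], centers[l], centers[l + 1]
        H[l - 1] = np.clip(np.minimum((k - left) / (mid - left + 1e-9),
                                      (right - k) / (right - mid + 1e-9)), 0, None)
    S = E @ H.T                                            # filter-bank energies S(i, l)

    # 2) DCT of the log filter-bank energies -> lfcc(i, n)
    l_idx = np.arange(1, n_filters + 1)
    n_idx = np.arange(n_ceps).reshape(-1, 1)
    dct_basis = np.cos(np.pi * n_idx * (2 * l_idx - 1) / (2 * n_filters))
    lfcc = np.log(S + 1e-9) @ dct_basis.T                  # shape (s, n_ceps)

    # 3) The 12th-order coefficients form the s x n LFCC matrix fed to the RNN
    return lfcc
```

The returned s × 12 matrix corresponds to the LFCC matrix X above, one row per frame.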
Referring to fig. 3, the network framework employs an RNN classifier. The choice of the number of network layers is critical: a deeper network can learn more, but takes longer to train and may overfit. The proposed network structure of the RNN classifier is therefore as shown in fig. 3. It comprises 2 LSTM networks, with parameters set to (64, 128) and (128, 64) respectively, using the tanh activation function to improve model performance. These are followed in sequence by a Dropout layer, a fully connected (dense) layer, and a Softmax layer, with the Dropout layer connected to the last LSTM network. Setting the dropout rate to 0.5 helps reduce overfitting; after dimensionality reduction by the fully connected layer, the Softmax layer (Softmax classifier) outputs the class probabilities. The overall iterative training of the network framework is set to 50 epochs, which can be adjusted during actual training.
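As a sketch, the classifier described above could be written in Keras roughly as follows. The framework choice and the mapping of the (64,128)/(128,64) parameter pairs onto LSTM unit counts of 128 and 64 are assumptions — the patent gives the pairs without naming a framework or their exact meaning:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_rnn_classifier(n_frames, n_ceps=12, n_classes=5):
    """Sketch of the RNN classifier: 2 LSTM layers -> Dropout -> Dense -> Softmax."""
    model = models.Sequential([
        tf.keras.Input(shape=(n_frames, n_ceps)),  # LFCC matrix: s frames x n coefficients
        layers.LSTM(128, activation="tanh", return_sequences=True),
        layers.LSTM(64, activation="tanh"),        # last LSTM feeds the Dropout layer
        layers.Dropout(0.5),                       # dropout rate 0.5 to reduce overfitting
        layers.Dense(n_classes),                   # fully connected (dense) layer
        layers.Softmax(),                          # output probabilities for M+1 classes
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```

Training for the 50 epochs mentioned above would then be `model.fit(X_train, y_train, epochs=50)`.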
Referring back to fig. 2, the voice detection method includes the following steps:
1) First, the network framework must be trained. Assuming there are M kinds of forgery operations, each of the M forgery processes is applied to the original voice, yielding M + 1 kinds of voice samples: M kinds of forged voice and 1 kind of unprocessed original voice. The invention places a constraint on the original voice input: a sufficiently large library of WAV-format audio samples must be provided as training data for the network framework. Features are extracted from the M + 1 kinds of voice samples to obtain the LFCC matrices of the training voice samples, which are fed into the designed RNN classifier network for training to obtain a multi-classification model. Multiple original voice samples can be stored in a database, with each original voice sample undergoing feature extraction before being sent to the RNN classifier for training.
2) Then, the detection and identification result is obtained through the trained network framework: when a section of test voice is obtained, its features are extracted to obtain the LFCC matrix of the test voice data, which is fed into the trained RNN classifier for classification. Each test voice yields an output probability, and all output probabilities are combined into the final prediction. If the prediction is the original class, the test voice is recognized as original voice; if the prediction is a class corresponding to some forgery operation, the test voice is recognized as forged voice. From this result, the forensic examiner can judge whether a section of voice has been forged.
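The decision step — combining the per-segment output probabilities into one final prediction — can be sketched as follows. The combination rule (averaging) and the class names are illustrative assumptions, since the text only says the output probabilities are combined:

```python
import numpy as np

def classify_speech(prob_segments, class_names):
    """Combine per-segment Softmax outputs into one prediction.

    prob_segments: array of shape (n_segments, n_classes), each row a
    Softmax output; class 0 is taken to be the original (untampered) voice.
    """
    probs = np.asarray(prob_segments, dtype=float)
    combined = probs.mean(axis=0)          # combine all output probabilities
    label = int(np.argmax(combined))
    if label == 0:
        verdict = "original voice"
    else:
        verdict = f"forged voice ({class_names[label]})"
    return label, verdict

# Example with M = 2 hypothetical forgery operations (3 classes in total)
names = ["original", "pitch-shifted", "recompressed"]
label, verdict = classify_speech([[0.2, 0.7, 0.1],
                                  [0.1, 0.8, 0.1]], names)
```

Here the averaged probabilities favour class 1, so the voice would be reported as forged by the pitch-shifting operation.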
Claims (6)
1. A speech detection method for multiple fake operations based on RNN is characterized in that: the method comprises the following steps:
1) Training network: obtaining an original voice sample, performing M kinds of forging treatment on the original voice sample to obtain M kinds of forged voices and 1 untreated original voice, extracting features of the M kinds of forged voices and 1 original voice to obtain an LFCC matrix of a training voice sample, and sending the LFCC matrix into an RNN classifier network for training to obtain a multi-classification training model;
2) And (3) voice recognition: obtaining a section of test voice, extracting features of the test voice to obtain an LFCC matrix of test voice data, sending the LFCC matrix into the RNN classifier trained in the step 1) to classify, obtaining an output probability for each test voice, and combining all the output probabilities as a final prediction result: if the predicted result is the original voice, the test voice is recognized as the original voice; if the predicted result is the voice which is subjected to a certain fake operation, the test voice is recognized as the fake voice which is subjected to the corresponding fake operation;
in steps 1) and 2), the step of obtaining the LFCC matrix is:
1) FFT: firstly, the voice is preprocessed, and the spectral energy E(i, k) of each voice frame after the FFT is calculated:

E(i, k) = |Σ_{m=0}^{N−1} x_i(m)·e^{−j2πkm/N}|²

where i is the frame index, k is the frequency component, x_i(m) is the voice signal data of the i-th frame, and N is the number of FFT points;
the energy of the per-frame spectral energy E(i, k) after passing through the triangular filter bank is then calculated:

S(i, l) = Σ_k E(i, k)·H_l(k),  l = 1, 2, …, L

where H_l(k) is the frequency response of the l-th triangular filter, f(l) is the filter function of the l-th triangular filter, S(i, l) is the spectral-line energy after the filter bank, l is the filter index, and L is the total number of triangular filters;
2) DCT: the output data lfcc(i, n) of the triangular filter bank is calculated using the discrete cosine transform (DCT):

lfcc(i, n) = Σ_{l=1}^{L} log S(i, l)·cos(πn(2l − 1)/(2L))

where n is the index of the spectral line of the i-th frame after the DCT;
3) Obtaining the LFCC statistical moment: 12th-order LFCC coefficients are taken from lfcc(i, n), and their mean values and correlation coefficients are calculated, giving the LFCC matrix extracted from a section of voice:

X = [x_{i,j}] (s rows, n columns)

where x_{s,1} … x_{s,n} are the n LFCCs computed for the s-th frame of voice data.
2. The RNN-based multiple counterfeit operation voice detection method according to claim 1, wherein: the RNN classifier comprises an LSTM network, a Dropout layer, a full connection layer and a Softmax layer which are sequentially connected, wherein the Dropout layer is connected with the last LSTM network.
3. The RNN-based multiple counterfeit operation voice detection method according to claim 2, wherein: the parameters of the two LSTM networks are set to (64, 128) and (128, 64), respectively.
4. The RNN-based multiple counterfeit operation voice detection method according to claim 2, wherein: the LSTM network uses a tanh activation function.
5. The RNN-based multiple counterfeit operation voice detection method according to claim 2, wherein: the Dropout function value of the Dropout layer is 0.5.
6. The RNN-based multiple counterfeit operation voice detection method according to claim 1, wherein: the original speech is in WAV format.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010382185.2A CN111564163B (en) | 2020-05-08 | 2020-05-08 | RNN-based multiple fake operation voice detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010382185.2A CN111564163B (en) | 2020-05-08 | 2020-05-08 | RNN-based multiple fake operation voice detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111564163A CN111564163A (en) | 2020-08-21 |
CN111564163B true CN111564163B (en) | 2023-12-15 |
Family
ID=72071821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010382185.2A Active CN111564163B (en) | 2020-05-08 | 2020-05-08 | RNN-based multiple fake operation voice detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111564163B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113488073B (en) * | 2021-07-06 | 2023-11-24 | 浙江工业大学 | Fake voice detection method and device based on multi-feature fusion |
CN113299315B (en) * | 2021-07-27 | 2021-10-15 | 中国科学院自动化研究所 | Method for generating voice features through continuous learning without original data storage |
CN113362814B (en) * | 2021-08-09 | 2021-11-09 | 中国科学院自动化研究所 | Voice identification model compression method fusing combined model information |
CN113488027A (en) * | 2021-09-08 | 2021-10-08 | 中国科学院自动化研究所 | Hierarchical classification generated audio tracing method, storage medium and computer equipment |
CN113555007B (en) * | 2021-09-23 | 2021-12-14 | 中国科学院自动化研究所 | Voice splicing point detection method and storage medium |
CN115249487B (en) * | 2022-07-21 | 2023-04-14 | 中国科学院自动化研究所 | Incremental generated voice detection method and system for playback boundary load sample |
CN116229960B (en) * | 2023-03-08 | 2023-10-31 | 江苏微锐超算科技有限公司 | Robust detection method, system, medium and equipment for deceptive voice |
CN117690455A (en) * | 2023-12-21 | 2024-03-12 | 合肥工业大学 | Sliding window-based partial synthesis fake voice detection method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201514943D0 (en) * | 2015-08-21 | 2015-10-07 | Validsoft Uk Ltd | Replay attack detection |
US9299364B1 (en) * | 2008-06-18 | 2016-03-29 | Gracenote, Inc. | Audio content fingerprinting based on two-dimensional constant Q-factor transform representation and robust audio identification for time-aligned applications |
KR20160125628A (en) * | 2015-04-22 | 2016-11-01 | (주)사운드렉 | A method for recognizing sound based on acoustic feature extraction and probabillty model |
WO2018107810A1 (en) * | 2016-12-15 | 2018-06-21 | 平安科技(深圳)有限公司 | Voiceprint recognition method and apparatus, and electronic device and medium |
CN108806698A (en) * | 2018-03-15 | 2018-11-13 | 中山大学 | A kind of camouflage audio recognition method based on convolutional neural networks |
CN109599116A (en) * | 2018-10-08 | 2019-04-09 | 中国平安财产保险股份有限公司 | The method, apparatus and computer equipment of supervision settlement of insurance claim based on speech recognition |
CN110491391A (en) * | 2019-07-02 | 2019-11-22 | 厦门大学 | A kind of deception speech detection method based on deep neural network |
CN110931022A (en) * | 2019-11-19 | 2020-03-27 | 天津大学 | Voiceprint identification method based on high-frequency and low-frequency dynamic and static characteristics |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10860858B2 (en) * | 2018-06-15 | 2020-12-08 | Adobe Inc. | Utilizing a trained multi-modal combination model for content and text-based evaluation and distribution of digital video content to client devices |
- 2020-05-08 CN CN202010382185.2A patent/CN111564163B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9299364B1 (en) * | 2008-06-18 | 2016-03-29 | Gracenote, Inc. | Audio content fingerprinting based on two-dimensional constant Q-factor transform representation and robust audio identification for time-aligned applications |
KR20160125628A (en) * | 2015-04-22 | 2016-11-01 | (주)사운드렉 | A method for recognizing sound based on acoustic feature extraction and probabillty model |
GB201514943D0 (en) * | 2015-08-21 | 2015-10-07 | Validsoft Uk Ltd | Replay attack detection |
WO2018107810A1 (en) * | 2016-12-15 | 2018-06-21 | 平安科技(深圳)有限公司 | Voiceprint recognition method and apparatus, and electronic device and medium |
CN108806698A (en) * | 2018-03-15 | 2018-11-13 | 中山大学 | A kind of camouflage audio recognition method based on convolutional neural networks |
CN109599116A (en) * | 2018-10-08 | 2019-04-09 | 中国平安财产保险股份有限公司 | The method, apparatus and computer equipment of supervision settlement of insurance claim based on speech recognition |
CN110491391A (en) * | 2019-07-02 | 2019-11-22 | 厦门大学 | A kind of deception speech detection method based on deep neural network |
CN110931022A (en) * | 2019-11-19 | 2020-03-27 | 天津大学 | Voiceprint identification method based on high-frequency and low-frequency dynamic and static characteristics |
Non-Patent Citations (4)
Title |
---|
Combining Phase-based Features for Replay Spoof Detection System; Kantheti Srinivas; 2018 11th International Symposium on Chinese Spoken Language Processing; pp. 151-155 *
Mapping model of network scenarios and routing metrics in DTN; Qin Zhenzhen; Journal of Nanjing University of Science and Technology; Vol. 40, No. 3; pp. 291-296 *
Digital speech forensics algorithm for multiple forgery operations; Wu Tingting et al.; Wireless Communication Technology; No. 3; pp. 37-45 *
Chen Zhuxin. Research on voiceprint spoofing detection based on deep neural networks. China Master's Theses Full-text Database, Information Science and Technology; 2020; No. 1; pp. I136-340. *
Also Published As
Publication number | Publication date |
---|---|
CN111564163A (en) | 2020-08-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111564163B (en) | RNN-based multiple fake operation voice detection method | |
CN108198574B (en) | Sound change detection method and device | |
CN110310666B (en) | Musical instrument identification method and system based on SE convolutional network | |
US20040260550A1 (en) | Audio processing system and method for classifying speakers in audio data | |
CN110033756B (en) | Language identification method and device, electronic equipment and storage medium | |
CN109949824B (en) | City sound event classification method based on N-DenseNet and high-dimensional mfcc characteristics | |
CN111986699B (en) | Sound event detection method based on full convolution network | |
CN110120230B (en) | Acoustic event detection method and device | |
CN111275165A (en) | Network intrusion detection method based on improved convolutional neural network | |
CN107145778B (en) | Intrusion detection method and device | |
Halkias et al. | Classification of mysticete sounds using machine learning techniques | |
CN106910495A (en) | A kind of audio classification system and method for being applied to abnormal sound detection | |
CN113571067A (en) | Voiceprint recognition countermeasure sample generation method based on boundary attack | |
CN114495950A (en) | Voice deception detection method based on deep residual shrinkage network | |
Lu et al. | Unsupervised speaker segmentation and tracking in real-time audio content analysis | |
CN113299315B (en) | Method for generating voice features through continuous learning without original data storage | |
CN115910045B (en) | Model training method and recognition method for voice wake-up word | |
CN116229960B (en) | Robust detection method, system, medium and equipment for deceptive voice | |
Qin et al. | Multi-branch feature aggregation based on multiple weighting for speaker verification | |
WO2021088176A1 (en) | Binary multi-band power distribution-based low signal-to-noise ratio sound event detection method | |
CN115331661A (en) | Voiceprint recognition backdoor attack defense method based on feature clustering analysis and feature dimension reduction | |
CN115064175A (en) | Speaker recognition method | |
CN112967712A (en) | Synthetic speech detection method based on autoregressive model coefficient | |
Chun et al. | Research on music classification based on MFCC and BP neural network | |
CN111783534B (en) | Sleep stage method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |