CN114048770A

CN114048770A - Automatic detection method and system for digital audio deletion and insertion tampering operation

Info

Publication number: CN114048770A
Application number: CN202111315681.7A
Authority: CN
Inventors: 曾春艳; 孔帅; 王志锋; 冯世雄; 余琰; 夏诗言
Original assignee: Hubei University of Technology
Current assignee: Hubei University of Technology
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2022-02-15

Abstract

The invention belongs to the technical field of digital audio tamper detection and discloses an automatic detection method and system for digital audio deletion and insertion tampering operations. A trained general background model of the power grid frequency is used to extract a power grid frequency spectrum feature supervector from each digital audio signal; the extracted supervector is input into a deep representation learning network composed of an attention mechanism and a residual network to learn shallow features; the learned shallow features are input into a classification network, which judges whether the audio has undergone deletion or insertion tampering. By extracting power grid frequency spectrum feature supervectors and training a deep neural network on them, the invention automates tamper detection and applies deep neural networks to the task effectively, achieving higher accuracy and better robustness.

Description

Automatic detection method and system for digital audio deletion and insertion tampering operation

Technical Field

The invention belongs to the technical field of digital audio signal tampering detection, and particularly relates to an automatic detection method and system for digital audio deleting and inserting tampering operations.

Background

At present, with the rapid development of internet information technology, intelligent mobile devices are gradually popularized, and digital multimedia data (such as audio, images, texts, etc.) has become a main information carrier. The recording and storing cost of the digital audio files becomes lower and lower, and the digital audio files are more and more convenient to obtain from the internet, so that people have increasingly rising demands for collecting and sharing the digital audio files. Meanwhile, various audio editing software is also developed, so that the audio signal is easier to edit. Accordingly, there is an increasing need for effective protection and authentication of audio recordings, particularly where the recordings may be involved in digital rights management and law enforcement cases. A large amount of false information with a sense of reality may be generated on the internet or in the court, thereby affecting social stability and public safety. Audio forensics are therefore becoming increasingly important for verifying the authenticity, integrity and source of audio information.

Tamper detection based on the power grid frequency, also called the electric network frequency (ENF), is widely cited in forensic practice. From a forensic perspective, the grid frequency signal is often embedded in the audio recordings of eavesdropping devices, and its high availability and well-behaved characteristics make it an attractive feature, which is why it is widely used. Grid frequency fluctuations within one region are stable and unique over long periods, and non-periodic fluctuations in the grid frequency affect all devices connected to the grid in the same way. The grid frequency signal is a well-known standard signal that is typically present in devices powered by the grid. Its nominal value is 50 Hz or 60 Hz, depending on the region: European countries, Australia, and most countries in Asia and Africa use 50 Hz; North America and Central America use 60 Hz; in South America some countries use 50 Hz and others 60 Hz; Japan uses both. Ideally, the grid signal is a sinusoid oscillating at the nominal frequency, but in practice the instantaneous frequency varies with fluctuations in power supply and demand. Over time, the frequency and phase of the grid change gradually rather than abruptly. Because the grid frequency signal is stable and unique, inserting an audio segment into a recording or deleting one from it may cause an abrupt change in the estimated grid frequency signal. The grid frequency signal is extracted from the audio file by band-pass filtering, and the abrupt change in instantaneous frequency and phase that a tampering operation causes at the tampering point is used to identify whether tampering has occurred.
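The phase discontinuity described above can be illustrated with a small sketch. It assumes a synthetic, noise-free 50 Hz tone and tracks the phase of each frame with a single-bin DFT at the nominal frequency; the function names, the frame length and the sampling rate are illustrative choices, not the estimators used by the invention.

```python
import cmath
import math

def frame_phases(signal, fs, f0, frame_len, hop):
    """Phase (radians) of the f0 component in each frame via a single-bin DFT."""
    phases = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        acc = 0j
        for n in range(frame_len):
            # Absolute time keeps the phase reference fixed across frames.
            t = (start + n) / fs
            acc += signal[start + n] * cmath.exp(-2j * math.pi * f0 * t)
        phases.append(cmath.phase(acc))
    return phases

def max_phase_jump(phases):
    """Largest wrapped phase difference between consecutive frames."""
    jumps = []
    for a, b in zip(phases, phases[1:]):
        d = (b - a + math.pi) % (2 * math.pi) - math.pi
        jumps.append(abs(d))
    return max(jumps)

fs, f0 = 1000, 50.0
tone = [math.sin(2 * math.pi * f0 * n / fs) for n in range(4000)]
# Simulated deletion: remove 7 samples (0.35 of one 20-sample cycle).
tampered = tone[:2000] + tone[2007:]

clean_jump = max_phase_jump(frame_phases(tone, fs, f0, 200, 200))
tampered_jump = max_phase_jump(frame_phases(tampered, fs, f0, 200, 200))
```

For the untouched tone the frame phases stay constant, while the deletion shifts the phase of every later frame by 2π·f0·7/fs radians, which shows up as a single large jump at the edit point.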

Meanwhile, the prior art provides a series of methods for detecting audio tampering. Applications of the power grid frequency to tamper detection fall into two types: the first compares the power grid frequency signal with a large-scale power grid frequency database; the second extracts features from the power grid frequency signal and analyzes their consistency or regularity. Other researchers analyze tampering operations without using the grid frequency at all.

1) Comparison against a grid frequency database: Grigoras first proposed an audio tampering detection algorithm based on the power grid frequency, which compares the grid frequency fluctuations in the audio under test with data of a reference year to judge whether the audio has been tampered with. Prior art 1 builds a standard power grid frequency database using B-spline basis functions and inverse interpolation, based on an analysis of the North American power grid frequency monitoring network. It estimates the frequency of the grid frequency component by short-time Fourier transform and matches the grid frequency sequence of the audio under test against the standard database. It also proposes an iterative oscillator-error correction algorithm to obtain accurate time-frequency pairs for the signal under test, which solves the problem that the grid frequency sequence could not be matched with the standard database. Prior art 2 extracts the power grid frequency signal from the audio signal by frequency demodulation and further studies the extraction stage.

2) Extraction of power grid frequency features: Prior art 3 proposes a method that detects splicing by revealing abnormal differences in local noise levels; it determines whether heterogeneous splicing is present in the audio by comparing the similarity of the background noise variances of the syllables. Prior art 4 uses statistical features of MDCT coefficients and a study of the MP3 file structure to detect multiply compressed files and identify encoder types, and verifies the algorithm on large speech databases. Prior art 5 uses the pitch sequence as the audio feature, since the pitch sequences extracted from different syllables are usually completely different; it computes the difference between syllables and compares it with a set threshold to determine whether copy-move forgery exists for the corresponding syllables. Prior art 6 extracts the pitch sequence and the first two formant sequences as the feature set of each speech segment, computes the similarity between feature sets with the dynamic time warping (DTW) algorithm, and detects and localizes copy-move forgery in voice recordings by comparing the similarity with a threshold.

Through the above analysis, the problems and defects of the prior art are as follows:

(1) the quality and recording environment of the signal under test are subject to certain restrictions, and there is no consistent criterion for judging the detection result;

(2) the extracted power grid frequency features do not reflect tampering information well, and the classifiers used cannot fully exploit and learn the important information in the features;

(3) some detection methods require fuzzy threshold decision conditions to be set from professional experience, so automatic detection is not well realized;

(4) existing features do not mine the tampering information in the power grid frequency deeply enough;

(5) traditional methods generalize poorly, and the robustness and accuracy of audio detection need to be improved.

The difficulty in solving the above problems and defects is: for automatic detection of digital audio deletion and insertion tampering operations, features better suited to training in a deep network need to be extracted, and a network well suited to tamper detection has not yet been established.

The significance of solving the problems and the defects is as follows:

Compared with existing methods, the power grid frequency spectrum feature supervectors extracted from phase and frequency information better reflect and more deeply mine tampering information; training a deep neural network on them to obtain shallow features allows the important information in the features to be learned more fully; and classification by a classification network gives the detection result a definite decision criterion and makes detection automatic. The designed detection system for digital audio deletion and insertion tampering markedly improves the robustness and accuracy of audio detection and has been verified on several databases.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides an automatic detection method and system for digital audio deletion and insertion tampering operation.

The invention is realized as follows. The automatic detection method for digital audio deletion and insertion tampering operations comprises:

extracting a power grid frequency spectrum feature supervector of each digital audio signal by using a trained general background model of the power grid frequency;

inputting the extracted power grid frequency spectrum feature supervector into a deep representation learning network composed of an attention mechanism and a residual network to learn shallow features;

inputting the learned shallow features into a classification network, and judging whether the audio has undergone deletion or insertion tampering.

Further, the automatic detection method for the digital audio deletion and insertion tampering operation comprises the following steps:

step one, preprocessing the original digital audio signal with a band-pass filter and extracting the power grid frequency component of the signal under test; extracting phase features and fitted feature parameters, and constructing a general background model of the power grid frequency;

step two, adaptively updating the parameters of the obtained general background model with the digital audio signals of the training data set, and constructing a feature matrix of the power grid frequency spectrum feature supervectors of the digital audio signals according to a target database;

step three, inputting the obtained power grid frequency spectrum feature supervector into a deep neural network for shallow feature representation learning to obtain shallow features of the power grid frequency spectrum feature supervector;

step four, inputting the obtained shallow features into a pre-constructed tamper detection classification network, and distinguishing original voice from tampered voice through a sigmoid function to obtain a tamper detection result.

Further, in step one, preprocessing the original digital audio signal with the band-pass filter, extracting the power grid frequency component of the signal under test, and extracting the phase features and fitted feature parameters comprise:

band-pass filtering the original digital audio signal $f[n]$ with a 10000-order linear-phase FIR filter to obtain the power grid frequency component $F_{ENFC}[n]$ of the signal under test;

obtaining phase fluctuation features F1 and F2 based on the DFT⁰ and DFT¹ transforms, and obtaining an instantaneous frequency feature F3 based on the Hilbert transform;

fitting the phase curve and the frequency curve with Sum of Sines and Gaussian expressions respectively, and combining the phase features with the fitted feature parameters to obtain a feature vector.

Further, the constructing the general background model of the power grid frequency includes:

(1) determining a Gaussian mixture model:

$$p(f) = \sum_{j=1}^{L} \phi_j \, \mathcal{N}(f; \mu_j, \Sigma_j)$$

where $f = \{f_1, f_2, \ldots, f_N\}$ is the N-dimensional feature vector composed of the phase features and fitted feature parameters; $\phi_j$, $j = 1, \ldots, L$, are the mixing weights; $\Sigma_j$ is the covariance matrix; and $\mu_j$ is the mean vector;

(2) estimating the parameters of the Gaussian mixture model with the EM algorithm:

(2.1) finding the θ and z that maximize the log-likelihood function:

where $x = (x_1, x_2, x_3, \ldots, x_m)$ denotes the voice feature vectors and m is the number of mutually independent feature vectors; λ denotes the digital audio signal model and θ the known model parameters; $z_i$, $z_i \in (z_1, z_2, z_3, \ldots, z_i)$, is the hidden variable corresponding to feature vector $x_i$, chosen so that $p(x_i, z_i \mid \theta)$ is maximal;

(2.2) computing the values of θ and z: with Q(z) taken as the distribution of the hidden variable z given the samples and the model parameters, $Q_i(z_i)$ is determined for fixed θ, which gives the lower bound of $L(\theta, z)$, i.e.,

The lower bound is then maximized by adjusting θ; maximizing the likelihood function yields new model parameters, which are substituted back into step (2.1), and iterating gives increasingly accurate GMM parameters and hence a good general background model of the power grid frequency.
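For reference, the quantities named in steps (2.1) and (2.2) can be written in the standard textbook form of the EM derivation, using the symbols defined above. It is an assumption that the equations of the invention take exactly this form.

```latex
\begin{aligned}
L(\theta, z) &= \sum_{i=1}^{m} \log \sum_{z_i} p(x_i, z_i \mid \theta) \\
             &= \sum_{i=1}^{m} \log \sum_{z_i} Q_i(z_i)\,
                \frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)} \\
             &\ge \sum_{i=1}^{m} \sum_{z_i} Q_i(z_i)
                \log \frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)},
                \qquad \sum_{z_i} Q_i(z_i) = 1,\quad Q_i(z_i) \ge 0 \\[4pt]
Q_i(z_i) &= \frac{p(x_i, z_i \mid \theta)}{\sum_{z} p(x_i, z \mid \theta)}
          = p(z_i \mid x_i, \theta)
\end{aligned}
```

The inequality is Jensen's inequality applied to the concave logarithm; it becomes an equality when the ratio inside the logarithm is constant in $z_i$, which gives the posterior in the last line.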

Further, in step two, adaptively updating the mean parameters of the obtained general background model with the digital audio signals of the training data set comprises:

first, calculating the probability that the j-th feature vector $f_j$ belongs to the i-th Gaussian component $p_i(f)$ of the UBM:

next, using the calculated $P(i \mid f_j)$ to compute the mean parameters of the GMM of each untampered target digital audio signal:

finally, updating the sufficient statistics of the i-th mixture component of the UBM with the new sufficient statistics generated from the training data:

where the adaptive coefficients control the balance between the new mean and the old estimate, and k denotes a fixed factor.
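The three adaptation steps can be sketched as follows. This is a minimal illustration that assumes the widely used mean-only MAP adaptation for GMM-UBM systems with diagonal covariances, an adaptation coefficient of the form n_i / (n_i + k), and a default k of 16; these specifics and all function names are assumptions, not values stated by the invention.

```python
import math

def gaussian_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian at point x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def map_adapt_means(features, weights, means, variances, k=16.0):
    """Adapt the UBM means towards `features`; weights and variances stay fixed."""
    n_comp, dim = len(means), len(means[0])
    counts = [0.0] * n_comp                        # soft counts n_i
    sums = [[0.0] * dim for _ in range(n_comp)]    # sum_j P(i | f_j) * f_j
    for f in features:
        likes = [w * gaussian_diag(f, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        for i, like in enumerate(likes):
            post = like / total                    # P(i | f_j)
            counts[i] += post
            for d in range(dim):
                sums[i][d] += post * f[d]
    adapted = []
    for i in range(n_comp):
        alpha = counts[i] / (counts[i] + k)        # assumed adaptation coefficient
        new_mean = ([s / counts[i] for s in sums[i]]
                    if counts[i] > 0 else list(means[i]))
        adapted.append([alpha * e + (1 - alpha) * m
                        for e, m in zip(new_mean, means[i])])
    return adapted

ubm_weights = [0.5, 0.5]
ubm_means = [[0.0], [10.0]]
ubm_vars = [[1.0], [1.0]]
recording = [[1.0]] * 16          # 16 frames, all close to the first component
adapted = map_adapt_means(recording, ubm_weights, ubm_means, ubm_vars)
```

With 16 frames and k = 16 the first component moves halfway from its UBM mean towards the data, while the second component, which explains none of the frames, stays essentially where it was.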

Further, the constructing a feature matrix of the power grid frequency spectrum feature supervector of the digital audio signal according to the target database includes:

taking the mean matrix of the GMM-UBM model derived from each voice as its power grid frequency spectrum feature supervector, establishing the correspondence between each voice and this high-dimensional vector, and adjusting and reshaping the mean matrix of each voice to obtain the power grid frequency spectrum feature supervector.
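A minimal sketch of this construction is given below. It assumes the adapted means of one recording are available as a list of M mean vectors of dimension N; the helper names are hypothetical, and both the N × M matrix form and the flattened vector form are shown because the text refers to the same object as a mean matrix and as a supervector.

```python
def mean_matrix(adapted_means):
    """Arrange M adapted mean vectors (each N-dimensional) as an N x M matrix."""
    n_features = len(adapted_means[0])
    return [[mean[d] for mean in adapted_means] for d in range(n_features)]

def supervector(adapted_means):
    """Concatenate the M mean vectors into one (M * N)-dimensional vector."""
    return [value for mean in adapted_means for value in mean]

# Toy example: M = 2 Gaussian components, N = 3 features per component.
means_for_one_recording = [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]]
matrix = mean_matrix(means_for_one_recording)
vector = supervector(means_for_one_recording)
```

Stacking one such row per recording gives the feature matrix of the target database.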

Further, in step three, the deep neural network is provided with an attention mechanism and a residual network;

the attention mechanism comprises a convolution layer, a pooling layer, a fully connected layer and a dot-multiplication module, and is used to reconstruct the features of the power grid frequency spectrum feature supervector and to assign different weights to the features within it;

the residual network is used to learn the specific feature structure of the power grid frequency spectrum feature supervector; the feature matrix fed to the residual network has size N × M, where N is the number of extracted fitted features (31) and M is the number of Gaussian components; the input size is 224 × 224;

the convolution layer of the residual network is a 5 × 5 convolution layer;

the residual block is as follows:

$$x_{l+1} = h(x_l) + F(x_l, W_l)$$

where $h(x_l) = W_l' x_l$; $W_l'$ denotes a 1 × 1 convolution operation; and $F(x_l, W_l)$ denotes the residual part.
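The residual block formula can be illustrated on a one-dimensional, single-channel signal. In this sketch the 1 × 1 convolution reduces to multiplication by a scalar, and the residual branch is assumed to be one zero-padded convolution followed by ReLU; the real network operates on two-dimensional feature maps with 5 × 5 kernels, so this is a simplification for illustration only.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def conv1d_same(x, kernel):
    """Zero-padded sliding dot product (odd-length kernel), output length == len(x)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(x))]

def residual_block(x, kernel, shortcut_weight=1.0):
    """x_{l+1} = h(x_l) + F(x_l, W_l) for a single-channel 1-D input."""
    shortcut = [shortcut_weight * v for v in x]    # h(x_l): 1x1 convolution
    residual = relu(conv1d_same(x, kernel))        # F(x_l, W_l): assumed conv + ReLU
    return [s + r for s, r in zip(shortcut, residual)]

x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, [0.0, 0.0, 0.0, 0.0, 0.0])  # residual branch is zero
mixed_out = residual_block(x, [0.0, 1.0, 0.0])
```

When the residual branch outputs zero, the block passes its input through unchanged, which is the property that lets the skip connection carry information past the convolution.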

Further, the attention mechanism includes:

the first convolution layer, whose convolution kernel K is an n × n matrix and whose activation function is the ReLU function, is used for shallow feature extraction, with the formula:

where $M_{ij}$ denotes the elements of the input feature map covered by the convolution kernel during convolution, and R denotes the ReLU activation function;

the max pooling layer performs a second extraction on the shallow features to obtain a pooled feature map, with the formula:

$$H = E(Y_\alpha) + b_2$$

where $Y_\alpha$ denotes the original feature map, E denotes the pooling-domain matrix of the feature map, and $b_2$ denotes the bias;

the full connection layer is used for integrating the pooled feature maps;

and the dot multiplication module is used for performing dot multiplication on the feature map processed by the full connection layer and the original feature map.
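The four stages of the attention mechanism can be sketched on a one-dimensional input. The layer sizes, the weights and the helper names below are illustrative assumptions; the sketch only shows the data flow of convolution with ReLU, max pooling with a bias, a fully connected layer, and element-wise multiplication with the original input.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def conv1d_same(x, kernel):
    """Zero-padded sliding dot product (odd-length kernel)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(x))]

def max_pool(x, size):
    """Non-overlapping max pooling."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def fully_connected(x, weights, bias):
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def attention_reweight(x, kernel, pool_size, pool_bias, fc_weights, fc_bias):
    shallow = relu(conv1d_same(x, kernel))                        # convolution + ReLU
    pooled = [v + pool_bias for v in max_pool(shallow, pool_size)]  # H = E(Y) + b2
    gate = fully_connected(pooled, fc_weights, fc_bias)           # one weight per input
    return [a * g for a, g in zip(x, gate)]                       # dot multiplication

x = [1.0, 2.0, 3.0, 4.0]
fc_weights = [[0.5, 0.0], [0.0, 0.0], [0.0, 0.25], [0.0, 0.0]]
reweighted = attention_reweight(x, [0.0, 1.0, 0.0], 2, 0.0,
                                fc_weights, [0.0, 0.0, 0.0, 0.0])
```

In this toy configuration the fully connected layer produces the weights [1, 0, 1, 0], so the first and third features are kept and the others are suppressed.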

Further, the tamper detection classification network consists of a convolution layer, a pooling layer, a fully connected layer and an output layer; the activation function of the output layer is the sigmoid function;

the loss function of the tamper detection classification network is binary cross-entropy, expressed as:

where N denotes the number of features, y is the label value of each voice, and p(y) denotes the probability that the output belongs to label y.
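As an illustration, the binary cross-entropy loss in its standard form can be computed as follows. The sketch assumes labels of 0 (original) and 1 (tampered) and that each predicted value is the probability of label 1; the clipping constant `eps` is an added numerical safeguard, not part of the invention.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])
near_perfect = binary_cross_entropy([1, 0], [1.0, 0.0])
```

A classifier that always outputs 0.5 scores log 2, and the loss approaches zero as the predicted probabilities approach the labels.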

Further, in step four, inputting the obtained shallow features into the pre-constructed tamper detection classification network and distinguishing original voice from tampered voice through the sigmoid function comprises:

1) strengthening the shallow features with the convolution layer, the pooling layer and the fully connected layer through local receptive fields, weight sharing and down-sampling;

2) distinguishing original voice from tampered voice with the sigmoid function of the output layer:

$$H = \mathrm{Sigmoid}(P \cdot W + b)$$

where H denotes the output, W the weights, b the bias, and P the output of the fully connected layer.
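The output layer can be sketched as a dot product followed by a sigmoid. Treating P and W as plain vectors and cutting the output at 0.5 to obtain a label are assumptions made for the illustration; the text only specifies the sigmoid itself.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tamper_probability(p, w, b):
    """H = Sigmoid(P * W + b) for a fully connected output P and weights W."""
    return sigmoid(sum(pi * wi for pi, wi in zip(p, w)) + b)

def is_tampered(p, w, b, cutoff=0.5):
    # The 0.5 cutoff is the usual default for a sigmoid output, assumed here.
    return tamper_probability(p, w, b) >= cutoff

h_balanced = tamper_probability([1.0, 2.0], [0.5, -0.25], 0.0)   # logit is 0
h_positive = tamper_probability([1.0, 2.0], [0.5, -0.25], 2.0)   # logit is 2
```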

Another object of the present invention is to provide an automatic detection system for digital audio deletion and insertion tampering operations, which implements the automatic detection method for digital audio deletion and insertion tampering operations.

It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the automatic detection method for digital audio deletion and insertion tampering operations.

By combining all the technical schemes, the invention has the advantages and positive effects that:

The invention extracts power grid frequency spectrum feature supervectors and trains a deep neural network on them; it thereby automates tamper detection and applies deep networks to the task effectively. The method obtains the power grid frequency spectrum feature supervector of each voice by building a background model and adaptively updating its parameters, learns shallow feature representations with a deep neural network, and feeds them into a classification network for classification. No empirical threshold selection is involved, and the method achieves higher accuracy and better robustness.

To verify the robustness of the invention, it was evaluated on several public databases with good results. Automatic detection of digital audio deletion and insertion tampering is valuable only if the method can be applied to varied databases and scenarios, and to guarantee this the detection scheme must be robust under a range of practical conditions.

The method builds a general background model of the power grid frequency whose parameters are estimated with the EM (expectation-maximization) algorithm; applying the MAP (maximum a posteriori) algorithm allows the model to be adapted with a small amount of data, so that each original audio recording in the database can be adapted into a GMM-UBM model. Based on the power grid frequency spectrum feature supervector, the method builds a deep network for shallow feature representation learning, and the shallow features are input into a classification network for binary classification of tampering.

An attention mechanism module is added to the network to reconstruct the features, which increases the weight of important features and strengthens the feature map. The invention builds a network for tamper detection on the basis of a residual network; the residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks and pass the input directly to the output, preserving the integrity of the information. The method classifies with the sigmoid activation function and evaluates the model with the binary cross-entropy loss function, thereby automating tamper detection.

Drawings

Fig. 1 is a schematic diagram of the automatic detection method for digital audio deletion and insertion tampering operations according to an embodiment of the present invention.

Fig. 2 is a flowchart of the automatic detection method for digital audio deletion and insertion tampering operations according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a deep neural network according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Aiming at the problems in the prior art, the invention provides an automatic detection method for digital audio deletion and insertion tampering operation, and the invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the method for automatically detecting operations of deleting and inserting digital audio according to an embodiment of the present invention includes:

As shown in fig. 2, the method for automatically detecting operations of deleting and inserting digital audio according to an embodiment of the present invention includes the following steps:

S101, preprocessing the original digital audio signal with a band-pass filter and extracting the power grid frequency component of the signal under test; extracting phase features and fitted feature parameters, and constructing a general background model of the power grid frequency;

S102, adaptively updating the parameters of the obtained general background model with the digital audio signals of the training data set, and constructing a feature matrix of the power grid frequency spectrum feature supervectors of the digital audio signals according to a target database;

S103, inputting the obtained power grid frequency spectrum feature supervector into a deep neural network for shallow feature representation learning to obtain shallow features of the power grid frequency spectrum feature supervector;

S104, inputting the obtained shallow features into a pre-constructed tamper detection classification network, and distinguishing original voice from tampered voice through a sigmoid function to obtain a tamper detection result.

The method for preprocessing the original digital audio signal by using the band-pass filter provided by the embodiment of the invention to extract the power grid frequency component of the signal to be detected, wherein the extracting of the phase characteristic and the fitting characteristic parameter comprises the following steps:

The general background model for constructing the power grid frequency provided by the embodiment of the invention comprises the following steps:

(1) determining a Gaussian mixture model:

(2.1) determining a suitable θ and z-maximization log-likelihood function:

The method for updating the mean value parameter of the obtained general background model by the training data set digital audio signal through self-adaption comprises the following steps:

next, using the calculated $P(i \mid f_j)$ to compute the mean parameters of the GMM of each untampered target digital audio signal:

where the adaptive coefficient controls the balance between the new mean and the old estimate, and k denotes a fixed factor.

The feature matrix for constructing the power grid frequency spectrum feature supervectors of the digital audio signals according to the target database provided by the embodiment of the invention comprises the following steps:

As shown in fig. 3, the deep neural network provided by the embodiment of the present invention is provided with an attention mechanism and a residual network;

the residual network is used to learn the specific feature structure of the power grid frequency spectrum feature supervector; the feature matrix fed to the residual network has size N × M, where N is the number of extracted fitted features (31) and M is the number of Gaussian components; the input size is 224 × 224.

The attention mechanism provided by the embodiment of the invention comprises:

H＝E(Y_α)+b₂；

the full connection layer is used for integrating the pooled feature maps;

The convolution layer of the residual network provided by the embodiment of the invention is a 5 × 5 convolution layer;

the residual block is as follows:

$$x_{l+1} = h(x_l) + F(x_l, W_l)$$

The tamper detection classification network provided by the embodiment of the invention consists of a convolution layer, a pooling layer, a fully connected layer and an output layer; the activation function of the output layer is the sigmoid function;

the loss function of the tamper detection classification network provided by the embodiment of the invention is binary cross-entropy, expressed as:

Inputting the trained shallow features into the pre-constructed tamper detection classification network provided by the embodiment of the invention and distinguishing original voice from tampered voice through the sigmoid function comprises:

H＝Sigmoid(P*W+b)；

The technical solution of the present invention is further described with reference to the following specific embodiments.

Example 1:

The invention aims to provide an automatic detection method for digital audio deletion and insertion tampering operations. The power grid frequency component of the signal under test is extracted, phase features and fitted feature parameters are then extracted from it, and a general background model is trained. The mean parameters of the obtained background model are adaptively updated with the digital audio signals of the training data set, so that a target GMM-UBM model is derived from each voice, and the mean matrix of each GMM-UBM is used as the power grid frequency spectrum feature supervector. The supervector is input into a deep neural network for shallow feature representation learning. The deep neural network consists of an attention mechanism and a residual network, has good feature extraction and representation learning capability, and can further train the shallow features. The shallow features obtained through the representation learning of the deep network are then input into the classification network. The classification network consists of a convolution layer, a pooling layer, a fully connected layer and an output layer, and the activation function of the output layer is the sigmoid function. After further training through convolution, pooling and full connection, the sigmoid function finally decides whether tampering has occurred, so that tamper detection is automated.

Referring to fig. 1, the automatic detection method for digital audio deletion and insertion tampering operation of the present invention comprises the following steps:

Step 1: extracting the power grid frequency component of the original digital audio with the designed band-pass filter, further extracting phase features and fitted feature parameters, and establishing a general background model of the power grid frequency;

the specific implementation comprises the following substeps:

Step one): the original digital audio signal $f[n]$ is band-pass filtered to obtain the power grid frequency component $F_{ENFC}[n]$ of the signal under test. The band-pass filter designed by the invention is a 10000-order linear-phase FIR filter; the high order is used to obtain an ideal narrow-band signal. The center frequency is the ENF standard frequency, the bandwidth is 0.6 Hz, the passband ripple is 0.5 dB, and the stopband attenuation is 100 dB. Phase fluctuation features F1 and F2 are obtained with the DFT⁰ and DFT¹ transforms, and the instantaneous frequency feature F3 is obtained with the Hilbert transform. The phase curve and the frequency curve are fitted with Sum of Sines and Gaussian expressions respectively, and the phase features are combined with the fitted feature parameters to obtain the feature vector.
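A linear-phase band-pass FIR filter of the kind described can be sketched with the windowed-sinc method. The Hamming window, the helper names and the small demonstration parameters (401 taps, 40 to 60 Hz passband, 1 kHz sampling rate) are assumptions chosen so the example runs quickly; this design does not by itself meet the 0.5 dB ripple and 100 dB attenuation figures quoted above and is not the filter design of the invention.

```python
import cmath
import math

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def bandpass_fir(num_taps, f_low, f_high, fs):
    """Symmetric (linear-phase) band-pass taps: ideal response times a Hamming window."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        k = n - mid
        # Ideal band-pass = low-pass at f_high minus low-pass at f_low.
        ideal = ((2 * f_high / fs) * sinc(2 * f_high * k / fs)
                 - (2 * f_low / fs) * sinc(2 * f_low * k / fs))
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def gain_at(taps, freq, fs):
    """Magnitude of the filter's frequency response at `freq` (Hz)."""
    return abs(sum(h * cmath.exp(-2j * math.pi * freq * n / fs)
                   for n, h in enumerate(taps)))

taps = bandpass_fir(401, 40.0, 60.0, 1000.0)
passband_gain = gain_at(taps, 50.0, 1000.0)
stopband_gain = gain_at(taps, 150.0, 1000.0)
```

A 10000-order filter has 10001 taps, so a call such as `bandpass_fir(10001, 49.7, 50.3, fs)` would correspond to the 0.6 Hz bandwidth around a 50 Hz nominal frequency, under the same assumptions.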

Step two): building the UBM model;

The universal background model (UBM) is a Gaussian mixture model (GMM), that is, a linear combination of L Gaussian distribution functions:

$$p(f) = \sum_{j=1}^{L} \phi_j \, \mathcal{N}(f; \mu_j, \Sigma_j)$$

where $f = \{f_1, f_2, \ldots, f_N\}$ is the N-dimensional feature vector composed of the phase features and fitted feature parameters, $\phi_j$, $j = 1, \ldots, L$, are the mixing weights, $\Sigma_j$ is the covariance matrix, and $\mu_j$ is the mean vector. The complete Gaussian mixture model consists of the weight parameters, mean vectors and covariance matrices, and is represented as:

$$\lambda = \{\phi_j, \mu_j, \Sigma_j\}, \quad j = 1, \ldots, L$$

The parameters of the Gaussian mixture model are then estimated with the EM algorithm.

The EM algorithm consists of two steps. The first is the E-step. Given m independent speech feature vectors x = (x₁, x₂, x₃, …, x_m) and a model λ of the digital audio signal whose parameters are denoted θ, each feature vector x_i has a corresponding hidden variable z_i, z_i ∈ (z₁, z₂, z₃, …, z_m), and the joint probability p(x_i, z_i | θ) is to be maximized. The goal is to find the θ and z that maximize the log-likelihood function:

The second is the M-step. Solving directly for θ and z is a difficult mathematical problem, so, based on an analysis of the likelihood function, the following expression is constructed:

Let

Then

That is, the above equation introduces a new, unknown distribution Q_i(z_i) that satisfies the following conditions:

Bounding the expression with the Jensen inequality gives:

Substituting this into the original expression gives:

By the Jensen inequality, equality holds when the random variable is a constant, that is:

Also,

It is possible to obtain:

It follows that Q(z) is the distribution of the hidden variable z given the sample and the model parameters. Thus, with the parameter θ fixed, Q_i(z_i) is determined, which establishes a lower bound for L(θ, z), namely:

This lower bound is maximized by adjusting θ.
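The lower-bound construction described above follows the standard EM argument, which can be written out as below. This is the textbook form, offered as an assumption about the intended derivation rather than a reproduction of the original equations; x_i, z_i, θ and Q_i(z_i) are the symbols of the text, and c denotes a constant.

```latex
\begin{aligned}
L(\theta, z) &= \sum_{i=1}^{m} \log \sum_{z_i} p(x_i, z_i \mid \theta)
  = \sum_{i=1}^{m} \log \sum_{z_i} Q_i(z_i)\,
    \frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)} \\
&\ge \sum_{i=1}^{m} \sum_{z_i} Q_i(z_i)\,
    \log \frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)},
  \qquad \sum_{z_i} Q_i(z_i) = 1,\quad Q_i(z_i) \ge 0, \\[4pt]
\frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)} &= c
  \;\Longrightarrow\;
  Q_i(z_i) = \frac{p(x_i, z_i \mid \theta)}{\sum_{z} p(x_i, z \mid \theta)}
           = p(z_i \mid x_i, \theta).
\end{aligned}
```

With Q_i(z_i) chosen in this way the bound is tight at the current θ, so increasing the bound over θ cannot decrease the log-likelihood.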

After the likelihood function has been maximized to obtain new model parameters, these parameters are fed back into the first step, and increasingly accurate GMM parameters are obtained by repeated iteration. This yields a good UBM model.
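The alternation of the two steps can be illustrated with a minimal EM loop for a one-dimensional Gaussian mixture. This is a sketch under simplifying assumptions (scalar features, two components, synthetic data, a fixed iteration count); the function name `em_gmm_1d` and all numeric values are hypothetical and are not taken from the invention.

```python
import math
import random

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture by alternating E- and M-steps."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: Q_i(z_i = j) = p(z_i = j | x_i, theta) with theta fixed.
        resp = []
        for x in data:
            joint = [
                weights[j]
                * math.exp(-0.5 * (x - means[j]) ** 2 / variances[j])
                / math.sqrt(2.0 * math.pi * variances[j])
                for j in range(k)
            ]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: closed-form maximizer of the lower bound over theta.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            weights[j] = n_j / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / n_j
            variances[j] = max(var, 1e-6)  # guard against a collapsing component
    return means, variances, weights

random.seed(0)
data = [random.gauss(-2.0, 0.5) for _ in range(200)]
data += [random.gauss(3.0, 0.5) for _ in range(200)]
means, variances, weights = em_gmm_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```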

Step 2: constructing a feature matrix of a power grid frequency spectrum feature super vector of the digital audio signal according to the target database;

the specific implementation comprises the following substeps:

To obtain the GMM-UBM model, the UBM parameters obtained in Step 1 are updated on a target database containing original and tampered speech by the MAP adaptation method, so that a Gaussian mixture model can be derived for each digital audio signal under test.

1) The adaptation process is a parameter-update process and proceeds as follows. First, compute the probability that the j-th feature vector f_j belongs to the i-th Gaussian component p_i(f) of the UBM:

2) Second, use the computed P(i | f_j) to calculate the mean parameters of the GMM of the untampered target digital audio signal:

3) Finally, use these new sufficient statistics generated from the training data to update the sufficient statistics of the i-th mixture component of the UBM:

where

is the adaptive coefficient that controls the balance between the new mean and the old estimate. The adaptive coefficient is defined as

k is a fixed parameter factor, for which the invention uses the empirical value 16. The mean matrix of each GMM-UBM is taken as the power grid frequency spectrum feature super vector: a feature relation between each speech signal and the high-dimensional vector is constructed, the mean matrix of each speech signal is adjusted, and the power grid frequency spectrum feature super vector is obtained by reconstruction.
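The three adaptation steps and the stacking of the adapted means into a super vector can be sketched as follows. The sketch assumes the commonly used form of mean-only MAP adaptation, in which the soft count is n_i = Σ_j P(i | f_j), the adaptive coefficient is α_i = n_i / (n_i + k), and the adapted mean is α_i E_i(f) + (1 − α_i) μ_i. Those expressions, the diagonal covariances, the toy UBM and the function names are assumptions and are not quoted from the invention.

```python
import math

def posteriors(f, weights, means, variances):
    """P(i | f) for every UBM component i (diagonal covariances assumed)."""
    joint = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, v in zip(f, mu, var):
            log_p -= 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        joint.append(math.exp(log_p))
    total = sum(joint)
    return [p / total for p in joint]

def map_adapt_means(features, weights, means, variances, k=16.0):
    """Return the MAP-adapted mean of every component for one recording."""
    dim = len(means[0])
    n = [0.0] * len(means)                    # soft counts n_i
    first = [[0.0] * dim for _ in means]      # sum_j P(i | f_j) * f_j
    for f in features:
        for i, p in enumerate(posteriors(f, weights, means, variances)):
            n[i] += p
            for d in range(dim):
                first[i][d] += p * f[d]
    adapted = []
    for i, mu in enumerate(means):
        alpha = n[i] / (n[i] + k)             # assumed adaptive coefficient
        new_mean = [first[i][d] / n[i] if n[i] > 0.0 else mu[d] for d in range(dim)]
        adapted.append([alpha * e + (1.0 - alpha) * m for e, m in zip(new_mean, mu)])
    return adapted

def supervector(adapted_means):
    """Concatenate the adapted component means into one vector."""
    return [v for mean in adapted_means for v in mean]

# Toy UBM with two components in two dimensions (values are made up).
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0, 0.0], [10.0, 10.0]]
ubm_vars = [[1.0, 1.0], [1.0, 1.0]]
frames = [[1.0, 1.0]] * 16
adapted = map_adapt_means(frames, ubm_weights, ubm_means, ubm_vars)
sv = supervector(adapted)
```

With 16 frames assigned almost entirely to the first component and k = 16, the coefficient is about 0.5, so the first adapted mean lies halfway between the UBM mean and the data mean, while the second component, which receives almost no data, keeps its UBM mean.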

Step 3: Input the power grid frequency spectrum feature super vector into the designed tamper detection deep network for shallow feature representation learning;

the method specifically comprises the following steps:

The power grid frequency spectrum feature super vector obtained above is input to a deep neural network for shallow feature representation learning. Deep neural networks are well suited to feature extraction and representation learning, so the shallow features, namely the power grid frequency spectrum feature super vectors, can be further trained by modeling the representation of the input signal.

Step A1: input to the attention mechanism

As shown in the schematic diagram of the attention mechanism network in fig. 3, weights are constructed by convolution, pooling and dot multiplication to re-weight the feature map. Different weights are assigned to the features in the power grid frequency spectrum feature super vector so as to strengthen important features and weaken marginal ones. M denotes the two-dimensional feature map obtained by transforming the power grid frequency spectrum feature super vector, K is the n × n convolution kernel matrix of the first convolutional layer, and Y is the result of filtering M with the kernel. The convolution is calculated as:

where M_ij denotes the elements of the input feature map covered by the convolution kernel during convolution, and R denotes the ReLU activation function.
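A minimal sketch of the convolution stage follows, assuming the usual form in which each output element is the ReLU of the sum of the products M_ij K_ij over one kernel position plus a bias. The bias term, the stride of 1, the absence of padding and the function name `conv2d_relu` are assumptions made for the example. As in most deep learning frameworks, the kernel is applied without flipping.

```python
def conv2d_relu(feature_map, kernel, bias=0.0):
    """Slide an n x n kernel over a 2-D feature map and apply ReLU."""
    n = len(kernel)
    rows = len(feature_map) - n + 1
    cols = len(feature_map[0]) - n + 1
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            s = bias
            for i in range(n):
                for j in range(n):
                    s += feature_map[r + i][c + j] * kernel[i][j]
            out_row.append(max(0.0, s))  # R: ReLU activation
        out.append(out_row)
    return out

m = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
y_pos = conv2d_relu(m, [[0.0, 1.0], [1.0, 0.0]])
y_neg = conv2d_relu(m, [[1.0, 0.0], [0.0, -1.0]])
```

In the second call every pre-activation sum is negative, so ReLU sets the whole output to zero.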

The convolution is followed by a pooling layer, which performs a secondary extraction of features. The invention uses max pooling, in which the maximum value in each region is selected to represent that region. The high-level feature map obtained after pooling reduces the dimensionality and parameter count of the original feature map and also helps to avoid problems such as overfitting. The pooling formula is:

H＝E(Y_α)+b₂

where Y_α denotes the original feature map, E is the matrix of the pooling domain of the feature map, and b₂ is the bias; traversing the pooling domains of the original feature map yields the pooled feature map H. The original feature map M is passed through convolution, pooling and a fully connected layer, and the result is multiplied with the original feature map to reconstruct it.
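The pooling and re-weighting steps can be sketched as follows. The 2 × 2 pooling size, the sigmoid squashing of the fully connected output, the element-wise multiplication with the original feature map and all function names are assumptions for illustration; the convolution stage is omitted to keep the example short.

```python
import math

def max_pool2d(y, size=2, bias=0.0):
    """Maximum over each size x size pooling domain, plus a bias."""
    pooled = []
    for r in range(0, len(y) - size + 1, size):
        row = []
        for c in range(0, len(y[0]) - size + 1, size):
            window = [y[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(window) + bias)
        pooled.append(row)
    return pooled

def attention_reweight(m, pooled, fc_weights, fc_bias):
    """Map pooled features to one weight per element of m, then multiply."""
    flat = [v for row in pooled for v in row]
    cols = len(m[0])
    out = []
    for r, m_row in enumerate(m):
        out_row = []
        for c, value in enumerate(m_row):
            k = r * cols + c
            z = sum(w * v for w, v in zip(fc_weights[k], flat)) + fc_bias[k]
            gate = 1.0 / (1.0 + math.exp(-z))  # assumed squashing to (0, 1)
            out_row.append(gate * value)
        out.append(out_row)
    return out

m = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
pooled = max_pool2d(m)
fc_weights = [[0.0] * 4 for _ in range(16)]  # untrained toy weights
fc_bias = [0.0] * 16
reweighted = attention_reweight(m, pooled, fc_weights, fc_bias)
```

With all-zero toy weights every gate equals 0.5, so the output is the input scaled by one half; trained weights would instead emphasize some elements over others.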

Step A2: input to the residual network

After feature reconstruction by the attention mechanism, the power grid frequency spectrum feature super vector is input into the residual module, which trains the features into a specific structure. The residual module is based on ResNet18; the invention removes the high-level convolutional layers, which reduces the number of parameters and saves computing resources. For image-related tasks, image pixels are input directly into the neural network, whereas for the speech tampering detection task of the invention a series of features must first be extracted from the original waveform, and the extracted two-dimensional features are then converted into three-dimensional features for input to the neural network. The size of the input feature matrix is N × M, where N = 31 is the number of extracted fitted features and M is the number of Gaussian components.

The residual block can be represented as:

x_{l+1} = h(x_l) + F(x_l, W_l)

where h(x_l) = W_l' x_l; W_l' is a 1 × 1 convolution operation; and F(x_l, W_l) is the residual part.
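The residual relation can be illustrated with a toy block in which each spatial position holds a vector of channel values, so that a 1 × 1 convolution reduces to one matrix product per position. The residual branch used here (a single linear map followed by ReLU) and the names `matvec` and `residual_block` are assumptions; the residual branch of the invention is a stack of convolutional layers.

```python
def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]

def residual_block(x, w_res, w_short):
    """Shortcut branch plus residual branch, applied at every position."""
    out = []
    for pos in x:
        shortcut = matvec(w_short, pos)                        # 1 x 1 convolution
        residual = [max(0.0, v) for v in matvec(w_res, pos)]   # toy residual branch
        out.append([s + r for s, r in zip(shortcut, residual)])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [[1.0, -2.0]]
y = residual_block(x, identity, identity)
y_skip = residual_block(x, zeros, identity)
```

When the residual branch outputs zero, the block passes its input through unchanged, which is what makes deep stacks of such blocks easy to train.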

Compared with the 224 × 224 input of the conventional ResNet18 network, the feature dimension input by the invention is much smaller than that of an image. The convolutional layers repeatedly down-sample, increasing the number of channels and reducing the size of the feature map. Because the input is smaller than the recommended input size, the generated feature map can become too small and part of the features is lost. To further reduce parameters and computation, the invention replaces the 7 × 7 convolutional layers with 5 × 5 convolutional layers, which greatly reduces the number of parameters.
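The effect of the kernel size on the parameter count, and of down-sampling on a small input, can be checked with simple arithmetic. In the sketch below the channel counts (1 input channel, 64 output channels) and the padding of 2 for the 5 × 5 kernel are assumptions; the stride of 2 and padding of 3 for the 7 × 7 kernel are those of the standard ResNet18 stem.

```python
def conv_weights(kernel, in_ch, out_ch):
    """Number of weights in a kernel x kernel convolutional layer (no bias)."""
    return kernel * kernel * in_ch * out_ch

def conv_out(size, kernel, stride, padding):
    """Output length along one axis after a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1

w7 = conv_weights(7, in_ch=1, out_ch=64)
w5 = conv_weights(5, in_ch=1, out_ch=64)

image_out = conv_out(224, kernel=7, stride=2, padding=3)
feat_out_7 = conv_out(31, kernel=7, stride=2, padding=3)
feat_out_5 = conv_out(31, kernel=5, stride=2, padding=2)
```

Under these assumptions the 5 × 5 layer holds 25/49, or about half, of the weights of the 7 × 7 layer, and a 31-row input is already reduced to 16 rows after a single stride-2 layer, compared with 112 rows for a 224-row image.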

Step 4: Input the shallow features into the constructed tamper detection classification network, and distinguish original speech from tampered speech through a sigmoid function.

The specific implementation comprises the following substeps:

The power grid frequency spectrum feature super vector yields shallow features through the representation learning of the deep network, and the classification network then determines whether the signal has been tampered with. The tamper detection classification network is shown in fig. 2.

1) Features are further learned through a convolutional layer and a pooling layer. The number of shallow feature parameters obtained is too large for the features to give the best results when used directly for tamper classification. The classification network therefore adds a convolutional layer, a pooling layer and a fully connected layer, which enhance the shallow features through local receptive fields, weight sharing and down-sampling.

2) The activation function of the output layer is the sigmoid function, whose formula is:

Sigmoid(x) = \frac{1}{1 + e^{-x}}

The formula shows that the sigmoid function maps its input into (0, 1) and is monotonic and continuous; its bounded output range makes optimization stable and makes it convenient for binary classification. The sigmoid layer of the invention is expressed as:

H＝Sigmoid(P*W+b)

where H is the output, W is the weight, b is the bias, and P is the output of the fully connected layer.
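A minimal sketch of the output layer H = Sigmoid(P*W + b) for a single output unit follows. Treating P and W as vectors combined by a dot product, the numerically stable two-branch form of the sigmoid, and the function names are assumptions made for the example.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_layer(p, w, b):
    """Single output unit: sigmoid of the weighted sum of p plus a bias."""
    return sigmoid(sum(pi * wi for pi, wi in zip(p, w)) + b)

h = sigmoid_layer([1.0, 2.0], [0.5, -0.25], 0.0)
```

An output above 0.5 would then be read as one class and an output below 0.5 as the other; which class is labeled 1 is a convention of the training data.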

Binary cross-entropy is a loss function commonly used in binary classification problems. Its expression is:

where N is the number of features, y is the label value of each speech sample, and p(y) is the predicted probability that the output belongs to label y. Loss is the value of the binary cross-entropy loss function and is used to judge the quality of the model.
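The loss can be sketched in a few lines, assuming the usual definition Loss = −(1/N) Σ [y log p + (1 − y) log(1 − p)], where p is the predicted probability of label 1. That form, the clipping constant `eps` and the function name `binary_cross_entropy` are assumptions of this example.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

loss_uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])
loss_confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```

A model that always outputs 0.5 has a loss of log 2, and the loss falls toward zero as the predicted probabilities approach the true labels.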

The technical effects of the present invention will be further explained in conjunction with simulation experiments.

The invention uses 2397 speech recordings from the Ahumada-25 database as original speech to extract signal features and build the UBM of original speech. The model was evaluated in experiments on three target databases: Carioca (consisting of the Carioca1 and Carioca2 databases), the New Spanish database, and the self-made database ENF-HG. The four databases contain 3253 samples in total, and the power grid frequency spectrum feature super vector obtained from each sample has 31 × 32 dimensions. The super vectors are extracted on the MATLAB platform, stored in CSV format, and input into a network built in Keras for training. The number of Gaussian components affects the dimensionality of the extracted super vectors, so the influence of 16, 32, 64 and 128 Gaussian components on the model was verified. As shown in Table 1, the highest accuracy on the three databases reaches 95.0%, 94.2% and 93.7% respectively.

Table 1:

Number of Gaussians	Carioca	New Spanish	ENF-HG	All data
16	0.942	0.933	0.932	0.938
32	0.950	0.942	0.937	0.951
64	0.928	0.914	0.937	0.928
128	0.895	0.911	0.923	0.932

The positive effects of the present invention will be further described below with reference to specific experimental data.

1) Different network architectures

In order to verify the feasibility of the extracted features, the extracted features are respectively input into a traditional machine learning classifier and a deep network for training. To better illustrate the feasibility of the features of the present invention, experiments were performed on different data sets and all data were validated together.

Experiments were first carried out on traditional machine learning models, comparing SVM, random forest, decision tree, logistic regression and XGBoost. As shown in Table 2, the features of the invention perform poorly on decision trees. The Carioca database performs relatively well on the SVM, reaching 90.6%. The New Spanish database performs relatively well on XGBoost, reaching 92.1%. The self-made database ENF-HG also performs relatively well on XGBoost, reaching 92.3%.

TABLE 2

To compare different models, the invention compares a self-designed CNN, ResNet50, ResNet34, ResNet18 and the tamper detection network, where the tamper detection network is the deep neural network and classification network system designed by the invention. The performance of the different databases in these neural networks is also compared. Table 3 shows that the power grid frequency spectrum feature super vector performs best in the tamper detection network. Compared with Table 2, the super vector also performs better in deep networks. Overall, the proposed model outperforms the other model structures and features on the data sets.

TABLE 3

2) Comparison of existing methods

The best method proposed by the invention was also compared with the results reported by other researchers on the public databases Carioca1, Carioca2 and New Spanish. The results are shown in Table 4.

TABLE 4

From table 4, it can be seen that the accuracy of using a single phase feature or frequency feature is not very high, and the grid frequency spectrum super vector used in the invention has higher accuracy.

In the description of the present invention, "a plurality" means two or more unless otherwise specified; the terms "upper", "lower", "left", "right", "inner", "outer", "front", "rear", "head", "tail", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are only for convenience in describing and simplifying the description, and do not indicate or imply that the device or element referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, should not be construed as limiting the invention. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD-or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier, for example. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.

The above description is only for the purpose of illustrating the present invention and the appended claims are not to be construed as limiting the scope of the invention, which is intended to cover all modifications, equivalents and improvements that are within the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for automatically detecting operation of digital audio deletion and insertion tampering, which is characterized by comprising the following steps:

2. The method for automatically detecting digital audio deletion and insertion manipulation operation according to claim 1, wherein the method for automatically detecting digital audio deletion and insertion manipulation operation comprises the steps of:

and step four, inputting the trained shallow features into a pre-constructed tampering detection classification network, and distinguishing the original voice and the tampered voice through a sigmoid function to obtain a tampering detection result.

3. The method for automatically detecting digital audio deletion and insertion tampering operation according to claim 2, wherein in step one, the preprocessing of the original digital audio signal by the band-pass filter is performed to extract the grid frequency component of the signal to be detected, and the extracting of the phase characteristic and the fitting characteristic parameter includes:

performing band-pass filtering on the original digital audio signal f[n] using a 10000-order linear-phase FIR filter to obtain the power grid frequency component F_ENFC[n] of the signal to be measured;

4. The method for automatic detection of digital audio deletion and insertion tampering operations as defined in claim 2, wherein said constructing a generic background model of the grid frequency comprises:

(1) determining a Gaussian mixture model:

(2.1) determining the θ and z that maximize the log-likelihood function:

5. The method for automatically detecting digital audio deletion and insertion tampering operations as claimed in claim 2, wherein in step two, the adaptively updating the mean parameter of the training data set digital audio signal to the obtained general background model comprises:

wherein

representing adaptive coefficients; k denotes a factor of a fixed parameter.

6. The method for automatically detecting digital audio deletion and insertion tampering operations as defined in claim 2, wherein the constructing a feature matrix of a power grid frequency spectral feature supervector of the digital audio signal from the target database comprises:

taking the mean matrix of each GMM-UBM model derived from each voice as a power grid frequency spectrum characteristic super vector, constructing a characteristic relation between each voice and a high-dimensional vector, adjusting the mean matrix of each voice, and reconstructing to obtain a power grid frequency spectrum characteristic super vector;

in the third step, the deep neural network is provided with an attention mechanism and a residual error network;

the residual error network convolution layer is a convolution layer of 5 x 5;

the residual block is as follows:

x_{l+1} = h(x_l) + F(x_l, W_l);

wherein h(x_l) = W_l' x_l; W_l' denotes a 1 × 1 convolution operation; and F(x_l, W_l) denotes the residual part;

the attention mechanism includes:

H＝E(Y_α)+b₂；

the full connection layer is used for integrating the pooled feature maps;

7. The automatic detection method for digital audio deletion and insertion tampering operations of claim 2, wherein the tamper detection classification network is composed of a convolutional layer, a pooling layer, a full connection layer, and an output layer; the activation function of the output layer adopts a sigmoid function;

8. The automatic detection method for digital audio deletion and insertion tampering operation of claim 2, wherein in step four, the shallow feature obtained is input into a pre-constructed tampering detection classification network, and the distinguishing between the original speech and the tampered speech by the sigmoid function comprises:

1) reinforcing shallow layer characteristics by using a convolution layer, a pooling layer and a full connection layer of the tamper detection classification network through local receptive field and weight sharing and down-sampling;

2) utilizing a Sigmoid function of a tamper detection classification network output layer to distinguish original voice from tampered voice:

H＝Sigmoid(P*W+b)；

9. An automatic detection system for digital audio deletion and insertion tampering operation, which implements the automatic detection method for digital audio deletion and insertion tampering operation according to any one of claims 1 to 8.

10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the method for automatic detection of digital audio deletion and insertion tampering oriented operations according to any one of claims 1 to 8.