CN115472179A

CN115472179A - Automatic detection method and system for digital audio deletion and insertion tampering operation

Info

Publication number: CN115472179A
Application number: CN202210932618.6A
Authority: CN
Inventors: 曾春艳; 孔帅; 王志锋; 万相奎; 李坤; 赵宇豪
Original assignee: Hubei University of Technology
Current assignee: Hubei University of Technology
Priority date: 2022-08-04
Filing date: 2022-08-04
Publication date: 2022-12-13

Abstract

The invention belongs to the technical field of digital audio signal tampering detection and discloses an automatic detection method and system for digital audio deletion and insertion tampering operations. A trained universal background model (UBM) of the power grid frequency is used to extract a power grid frequency spectrum feature supervector from each digital audio signal; the extracted supervector is input into a deep representation learning network composed of an attention mechanism and a residual network to learn shallow features; and the learned shallow features are input into a classification network, which judges whether the audio has undergone deletion or insertion tampering. By extracting power grid frequency spectrum feature supervectors and training a deep neural network on them, the invention automates tamper detection and applies deep neural networks to tamper detection effectively. The invention offers higher accuracy and better robustness.

Description

Automatic detection method and system for digital audio deletion and insertion tampering operation

Technical Field

The invention belongs to the technical field of digital audio signal tampering detection, and particularly relates to an automatic detection method and system for digital audio deleting and inserting tampering operations.

Background

At present, with the rapid development of internet information technology, smart mobile devices have become widespread, and digital multimedia data (such as audio, images, and text) has become a main information carrier. Digital audio files are increasingly cheap to record and store and increasingly easy to obtain from the internet, so the demand for collecting and sharing them keeps rising. Meanwhile, a wide range of audio editing software has been developed, which makes audio signals easier to edit. Accordingly, there is a growing need for effective protection and authentication of audio recordings, particularly where the recordings may be involved in digital rights management or law enforcement cases. Large amounts of realistic-seeming false information may appear on the internet or in court, affecting social stability and public safety. Audio forensics is therefore becoming increasingly important for verifying the authenticity, integrity, and source of audio information.

Tamper detection based on the power grid frequency (electric network frequency, ENF) is widely cited in forensic practice. From a forensic perspective, power grid frequency signals are often embedded in the audio recordings of eavesdropping devices, and their high availability together with their well-behaved characteristics makes them an attractive feature, which is why they are widely used. Power grid frequency fluctuations within a region are stable and unique over long periods of time, and non-periodic fluctuations in the grid frequency affect all devices connected to the grid in the same way. A power grid frequency signal is typically present in equipment powered by the grid, and its nominal value is a well-known standard: 50 Hz or 60 Hz, depending on the region. European countries, Australia, and most countries in Asia and Africa use 50 Hz. North America and Central America use 60 Hz. In South America, some countries use 50 Hz and others use 60 Hz, and Japan uses both 50 Hz and 60 Hz. Ideally, the grid signal is a sinusoid oscillating at the nominal frequency, but in reality the instantaneous frequency varies with fluctuations in power supply and demand. Over time, the frequency and phase of the grid do not change abruptly. Because the power grid frequency signal is stable and unique, inserting an audio segment into an audio file, or deleting one from it, may cause an abrupt change in the estimated power grid frequency signal. The power grid frequency signal is extracted from the audio file by band-pass filtering, and the abrupt changes in instantaneous frequency and phase that a tampering operation causes in the power grid frequency component at the tampering point are used to identify whether tampering has occurred.

The prior art provides a series of methods for detecting audio tampering. Tamper detection techniques that use the power grid frequency fall into two types: the first compares the power grid frequency signal with a large-scale power grid frequency database; the second extracts features from the power grid frequency signal and analyzes their consistency or regularity. Other researchers analyze tampering operations without using the power grid frequency.

1) Based on power grid frequency database comparison: Grigoras first proposed an audio tampering detection algorithm based on the power grid frequency, which compares the power grid frequency fluctuations in the audio under test with reference data in order to judge whether the audio has been tampered with. Prior art 1 builds a standard power grid frequency database using B-spline basis functions and inverse interpolation, based on an analysis of the North American power grid frequency monitoring network. It estimates the frequency of the power grid frequency component with the short-time Fourier transform and matches the power grid frequency sequence of the audio under test against the standard database. It also provides an iterative oscillator-error correction algorithm that obtains accurate time-frequency pairs for the signal under test, which solves the problem that the power grid frequency sequence cannot be matched with the standard power grid frequency database. Prior art 2 extracts the power grid frequency signal from the audio signal by frequency demodulation and further studies the extraction stage.

2) Based on feature extraction: prior art 3 proposes a method for detecting splicing by revealing abnormal differences in local noise levels; it determines whether heterogeneous splicing is present in the audio by comparing the similarity between the background noise variances of the syllables. Prior art 4 detects multiply compressed files and identifies the encoder type using statistical features of the MDCT coefficients and a study of the MP3 file structure, and verifies the performance of the algorithm on large speech databases. Prior art 5 uses the pitch sequence as the audio feature, observing that the pitch sequences extracted from different syllables are usually completely different. Whether copy-move forgery exists for a given syllable is determined by calculating the difference between syllables and comparing that difference with a set threshold. Prior art 6 extracts the pitch sequence and the first two formant sequences as the feature set of each speech segment and calculates the similarity of the feature sets with the Dynamic Time Warping (DTW) algorithm. Copy-move forgery in voice recordings is detected and located by comparing the similarity with a threshold.

Through the above analysis, the problems and defects of the prior art are as follows:

(1) The quality and recording environment of the signal under test are subject to certain constraints, and the detection result lacks a consistent judgment standard;

(2) The extracted power grid frequency features do not reflect tampering information well, and the classifiers used cannot make full use of the important information in the features or learn the features effectively;

(3) Some detection methods require fuzzy threshold decision conditions to be set from expert experience, and therefore cannot achieve automatic detection well;

(4) Existing features do not mine the tampering information in the power grid frequency deeply enough;

(5) Traditional methods generalize poorly, and the robustness and accuracy of audio detection need to be improved.

The difficulty in solving the above problems and defects is as follows: for automatic detection of digital audio deletion and insertion tampering operations, features better suited to training in a deep network need to be extracted, and a network better suited to tamper detection has not yet been established.

The significance of solving the above problems and defects is as follows:

Compared with existing methods, the power grid frequency spectrum feature supervectors extracted from phase information and frequency information reflect tampering information better and mine it more deeply; training the shallow features with a deep neural network allows the important information in the features to be learned better; and performing tamper detection with a classification network gives the detection result a specific judgment standard and achieves automatic detection. The designed system for detecting digital audio deletion and insertion markedly improves the robustness and accuracy of audio detection and has been verified on several databases.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides an automatic detection method and system for digital audio deletion and insertion tampering operation.

The invention is realized as follows. An automatic detection method for digital audio deletion and insertion tampering operations comprises:

extracting a power grid frequency spectrum feature supervector from each digital audio signal by using a trained universal background model (UBM) of the power grid frequency;

inputting the extracted power grid frequency spectrum feature supervector into a deep representation learning network composed of an attention mechanism and a residual network to learn shallow features;

inputting the learned shallow features into a classification network, and judging whether the audio has undergone deletion or insertion tampering.

Further, the automatic detection method for digital audio deletion and insertion tampering operations comprises the following steps:

preprocessing the original digital audio signal with a band-pass filter, and extracting the power grid frequency component of the signal to be detected; extracting phase features and fitting feature parameters, and constructing a universal background model of the power grid frequency;

adaptively updating the parameters of the obtained universal background model of the power grid frequency with the digital audio signals of the training data set, and constructing a feature matrix of power grid frequency spectrum feature supervectors of the digital audio signals according to the target database;

inputting the obtained power grid frequency spectrum feature supervector into a deep neural network for representation learning, and obtaining the shallow features of the power grid frequency spectrum feature supervector;

and inputting the obtained shallow features into a pre-constructed tamper detection classification network, and distinguishing original speech from tampered speech through a sigmoid function to obtain the tamper detection result.

Further, preprocessing the original digital audio signal with the band-pass filter, extracting the power grid frequency component of the signal to be detected, and extracting the phase features and fitting feature parameters comprise:

performing band-pass filtering on the original digital audio signal $f[n]$ with a 10000-order linear-phase FIR filter to obtain the power grid frequency component $F_{ENFC}[n]$ of the signal to be detected;

obtaining phase fluctuation features F1 and F2 based on the $\mathrm{DFT}^0$ and $\mathrm{DFT}^1$ transforms, and obtaining the instantaneous frequency feature F3 based on the Hilbert transform;

and fitting the phase curve and the frequency curve with Sum of Sines and Gaussian expressions respectively, and combining the phase features with the fitting feature parameters to obtain a feature vector.
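The Hilbert-transform step for the instantaneous frequency feature F3 can be illustrated with a minimal sketch. It is not the patented implementation: the sampling rate `FS`, the 200-sample synthetic tone, the naive O(N²) DFT, and the function names are all assumptions chosen so that the example runs with the Python standard library alone.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for a short demo)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(x):
    """Analytic signal via the Hilbert transform: keep DC, double positive bins, zero negative bins."""
    n = len(x)
    spectrum = dft(x)
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        for k in range(1, n // 2):
            gain[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    return dft([s * g for s, g in zip(spectrum, gain)], inverse=True)


def instantaneous_frequency(x, fs):
    """Instantaneous frequency (Hz) from the phase step between consecutive analytic samples."""
    z = analytic_signal(x)
    freqs = []
    for a, b in zip(z[:-1], z[1:]):
        dphi = cmath.phase(b * a.conjugate())  # wrapped phase increment
        freqs.append(dphi * fs / (2 * math.pi))
    return freqs


FS = 400        # assumed sampling rate in Hz (hypothetical, for the toy signal only)
NOMINAL = 50.0  # nominal grid frequency in Hz
# Synthetic stand-in for the filtered component F_ENFC[n]: a clean 50 Hz tone.
enf_component = [math.cos(2 * math.pi * NOMINAL * t / FS) for t in range(200)]
f3 = instantaneous_frequency(enf_component, FS)
```

For an untampered tone the estimate stays at the nominal frequency; a deletion or insertion would show up as a jump in the phase increment at the splice point.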

Further, constructing the universal background model of the power grid frequency includes:

(1) Determining a Gaussian mixture model:

$$p(f) = \sum_{j=1}^{L} \phi_j \, \mathcal{N}(f \mid \mu_j, \Sigma_j)$$

wherein $f = \{f_1, f_2, \ldots, f_N\}$ represents an N-dimensional feature vector composed of the phase features and the fitting feature parameters; $\phi_j$, $j = 1, \ldots, L$, denotes the mixing weights; $\Sigma_j$ represents a covariance matrix; and $\mu_j$ represents a mean vector;

(2) Estimating the parameters of the Gaussian mixture model with the EM algorithm:

(2.1) determining suitable θ and z that maximize the log-likelihood function:

$$L(\theta) = \sum_{i=1}^{m} \log p(x_i \mid \theta) = \sum_{i=1}^{m} \log \sum_{z_i} p(x_i, z_i \mid \theta)$$

wherein $x = (x_1, x_2, x_3, \ldots, x_m)$ represents the speech feature vectors, and m represents the number of mutually independent speech feature vectors; λ represents the digital audio signal model; θ represents the known model parameters; and $z_i$, $z_i \in (z_1, z_2, z_3, \ldots, z_i)$, represents the hidden variable corresponding to the feature vector $x_i$, chosen so that $p(x_i, z_i \mid \theta)$ is maximal;

(2.2) calculating the values of θ and z: since $Q(z)$ is the distribution of the hidden variable z given the known samples and model parameters, $Q_i(z_i)$ is determined with the parameter θ fixed, which establishes a lower bound of $L(\theta, Z)$, i.e.

$$L(\theta, Z) \ge \sum_{i=1}^{m} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)}$$

The lower bound is then maximized by adjusting θ, which maximizes the likelihood function and yields new model parameters; these are substituted back into step (2.1), and the iteration continues to obtain increasingly accurate GMM parameters and thus a good universal background model of the power grid frequency.
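The alternation between fixing θ to compute $Q_i(z_i)$ and re-estimating θ can be sketched for a one-dimensional Gaussian mixture. This is an illustration only: the scalar features, the toy data, the initial parameters, and the variance floor are assumptions, whereas the method described here uses N-dimensional feature vectors with covariance matrices.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration: posteriors Q_i(z_i) with theta fixed, then re-estimate theta."""
    resp = []
    for x in data:
        lik = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        resp.append([l / total for l in lik])
    new_w, new_m, new_v = [], [], []
    for j in range(len(weights)):
        n_j = sum(r[j] for r in resp)
        mean_j = sum(r[j] * x for r, x in zip(resp, data)) / n_j
        var_j = sum(r[j] * (x - mean_j) ** 2 for r, x in zip(resp, data)) / n_j
        new_w.append(n_j / len(data))
        new_m.append(mean_j)
        new_v.append(max(var_j, var_floor))  # floor is an assumption to avoid collapse
    return new_w, new_m, new_v


def fit_gmm(data, weights, means, variances, iterations=30):
    """Run EM and record the log-likelihood after every iteration."""
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(iterations):
        weights, means, variances = em_step(data, weights, means, variances)
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history


# Toy data: two well-separated clusters centred on -5 and +5.
toy_data = [-5 + 0.1 * k for k in range(-5, 6)] + [5 + 0.1 * k for k in range(-5, 6)]
fitted_weights, fitted_means, fitted_vars, history = fit_gmm(
    toy_data, [0.5, 0.5], [-1.0, 1.0], [4.0, 4.0]
)
```

The recorded `history` never decreases, which is the practical consequence of maximizing a lower bound that touches the likelihood at the current parameters.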

Further, adaptively updating the mean parameters of the obtained universal background model with the digital audio signals of the training data set comprises:

first, calculating the probability $P(i \mid f_j)$ that the j-th feature vector $f_j$ belongs to the i-th Gaussian component $p_i(f)$ of the UBM;

next, using the calculated $P(i \mid f_j)$ to calculate the mean parameters of the GMM of the untampered target digital audio signal;

finally, using the new sufficient statistics generated from the training data to update the sufficient statistics of the i-th mixture component of the UBM;

wherein an adaptive coefficient controls the balance between the new mean and the old estimate, and K denotes a fixed factor.
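The equations for this adaptation are not reproduced in the text, so the sketch below assumes the commonly used relevance-MAP form: a posterior $P(i \mid f_j)$ per component, a soft count $n_i$, a data mean $E_i$, an adaptive coefficient $\alpha_i = n_i / (n_i + K)$, and an updated mean $\alpha_i E_i + (1 - \alpha_i)\mu_i$. The names `map_adapt_means` and `relevance`, and the use of scalar features, are illustrative assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Adapt only the UBM means toward the data (assumed relevance-MAP update)."""
    # Posterior probability of each UBM component for each feature value.
    posteriors = []
    for x in data:
        lik = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        posteriors.append([l / total for l in lik])

    adapted = []
    for i, old_mean in enumerate(means):
        n_i = sum(p[i] for p in posteriors)  # soft count for component i
        if n_i <= 0.0:
            adapted.append(old_mean)         # no evidence: keep the UBM mean
            continue
        e_i = sum(p[i] * x for p, x in zip(posteriors, data)) / n_i
        alpha = n_i / (n_i + relevance)      # balance between new data and old mean
        adapted.append(alpha * e_i + (1.0 - alpha) * old_mean)
    return adapted


# Toy UBM with two components; four observations sit near the first component.
adapted_means = map_adapt_means(
    data=[2.0, 2.0, 2.0, 2.0],
    weights=[0.5, 0.5],
    means=[0.0, 10.0],
    variances=[1.0, 1.0],
    relevance=16.0,
)
```

With four observations and K = 16 the coefficient is 0.2, so the first mean moves one fifth of the way from 0 to 2, while the second component, which sees almost no data, keeps its UBM mean. Stacking the adapted means of all components gives the supervector for one recording.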

Further, the constructing a feature matrix of the power grid frequency spectrum feature supervector of the digital audio signal according to the target database includes:

taking the mean matrix of the GMM-UBM model derived from each utterance as its power grid frequency spectrum feature supervector, constructing the feature relationship between each utterance and a high-dimensional vector, adjusting the mean matrix of each utterance, and reconstructing it to obtain the power grid frequency spectrum feature supervector.

Further, the deep neural network comprises an attention mechanism and a residual network;

the attention mechanism comprises a convolution layer, a pooling layer, a fully connected layer, and a dot-multiplication module, and is used to reconstruct the features of the power grid frequency spectrum feature supervector and to assign different weights to the features within it;

the residual network is used to train on the specific feature structure of the power grid frequency spectrum feature supervector; the feature matrix input to the residual network has size N × M, where N represents the 31 extracted fitted features and M represents the Gaussian components; the input size is 224 × 224;

the convolution layers of the residual network are 5 × 5 convolution layers;

the residual block is as follows:

$$x_{l+1} = h(x_l) + F(x_l, W_l);$$

wherein $h(x_l) = W'_l x_l$; $W'_l$ represents a 1 × 1 convolution operation; and $F(x_l, W_l)$ represents the residual part.
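The residual relation can be made concrete with a small sketch. The shortcut $h(x_l)$ is a 1 × 1 convolution as stated; the residual branch $F(x_l, W_l)$ is replaced here by a single 1 × 1 convolution followed by ReLU purely to keep the example short, which is an assumption (the text specifies 5 × 5 convolution layers). The list-of-positions data layout and all function names are also illustrative.

```python
def conv1x1(x, weight):
    """1x1 convolution: mix channels independently at every position.

    x is a list of positions, each a list of C_in channel values;
    weight is a C_out x C_in matrix.
    """
    return [[sum(w * v for w, v in zip(row, pos)) for row in weight] for pos in x]


def relu(x):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in pos] for pos in x]


def residual_block(x, skip_weight, residual_weight):
    """x_{l+1} = h(x_l) + F(x_l, W_l), with a simplified stand-in for F."""
    shortcut = conv1x1(x, skip_weight)            # h(x_l) = W'_l x_l
    residual = relu(conv1x1(x, residual_weight))  # stand-in for F(x_l, W_l)
    return [[s + r for s, r in zip(sp, rp)] for sp, rp in zip(shortcut, residual)]


features = [[1.0, -2.0], [3.0, 4.0]]   # two positions, two channels
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With an identity shortcut and a zero residual branch, the input passes through unchanged.
passthrough = residual_block(features, identity, zeros)
```

The pass-through case shows why the skip path helps training: even if the residual branch contributes nothing, the input still reaches the next layer intact.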

Further, the attention mechanism includes:

a first convolution layer, whose convolution kernel K is an n × n matrix and whose activation function is the ReLU function, used for shallow feature extraction; in its formula, $M_{ij}$ represents the elements of the input feature map that correspond to the convolution kernel during convolution, and R indicates that the ReLU function is used as the activation function;

a max pooling layer, used for a second extraction from the shallow features to obtain a pooled feature map, with the formula:

$$H = E(Y_\alpha) + b_2;$$

wherein $Y_\alpha$ represents the original feature map, E represents the pooling-domain matrix of the feature map, and $b_2$ denotes the bias;

a fully connected layer, used to integrate the pooled feature maps;

and a dot-multiplication module, used to multiply the feature map processed by the fully connected layer element-wise with the original feature map.
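The four stages (convolution with ReLU, max pooling, fully connected layer, element-wise multiplication with the original map) can be sketched end to end. Several details are assumptions made only so the shapes line up: valid (unpadded) convolution, 2 × 2 non-overlapping pooling, a fully connected layer with one output per input element, and a sigmoid gate on that output. None of these choices is stated in the text.

```python
import math


def conv2d_relu(fmap, kernel, bias=0.0):
    """Valid 2-D convolution (cross-correlation form) followed by ReLU."""
    n = len(kernel)
    rows, cols = len(fmap) - n + 1, len(fmap[0]) - n + 1
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            acc = bias
            for i in range(n):
                for j in range(n):
                    acc += fmap[r + i][c + j] * kernel[i][j]
            row.append(max(0.0, acc))
        out.append(row)
    return out


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; incomplete trailing windows are dropped."""
    return [
        [max(fmap[r + i][c + j] for i in range(size) for j in range(size))
         for c in range(0, len(fmap[0]) - size + 1, size)]
        for r in range(0, len(fmap) - size + 1, size)
    ]


def attention(fmap, kernel, fc_weight, fc_bias):
    """Re-weight the original feature map with gates computed from its pooled features."""
    pooled = max_pool(conv2d_relu(fmap, kernel))
    flat = [v for row in pooled for v in row]
    rows, cols = len(fmap), len(fmap[0])
    gates = []
    for k in range(rows * cols):
        pre = sum(w * v for w, v in zip(fc_weight[k], flat)) + fc_bias[k]
        gates.append(1.0 / (1.0 + math.exp(-pre)))  # sigmoid gate (assumption)
    # Element-wise ("dot") multiplication with the original feature map.
    return [[fmap[r][c] * gates[r * cols + c] for c in range(cols)] for r in range(rows)]


feature_map = [[float(4 * r + c) for c in range(4)] for r in range(4)]
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
# A 4x4 input with a 3x3 kernel gives a 2x2 map, pooled to a single value.
fc_weight = [[0.0] for _ in range(16)]
fc_bias = [0.0] * 16
reweighted = attention(feature_map, kernel, fc_weight, fc_bias)
```

With untrained (zero) fully connected weights every gate is 0.5, so the output is a uniformly scaled copy of the input; training would push gates toward 1 for informative positions and toward 0 for the rest.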

Further, the tamper detection classification network is composed of a convolution layer, a pooling layer, a fully connected layer, and an output layer; the activation function of the output layer is the sigmoid function;

the loss function of the tamper detection classification network is the binary cross-entropy, with the expression:

$$\mathrm{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \Bigl[ y_i \log p(y_i) + (1 - y_i) \log\bigl(1 - p(y_i)\bigr) \Bigr]$$

wherein N represents the number of features, y corresponds to the label value of each utterance, and p(y) represents the probability that the output belongs to label y.

Further, inputting the obtained shallow features into the pre-constructed tamper detection classification network and distinguishing original speech from tampered speech through the sigmoid function includes:

1) strengthening the shallow features with the convolution layer, the pooling layer, and the fully connected layer through local receptive fields, weight sharing, and down-sampling;

2) distinguishing original speech from tampered speech with the sigmoid function of the output layer:

$$H = \mathrm{Sigmoid}(P \cdot W + b);$$

wherein H represents the output, W represents the weights, b denotes the bias, and P denotes the output of the fully connected layer.
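A minimal sketch of the output layer and its loss follows. It assumes P and W are plain vectors combined by a dot product and uses the standard binary cross-entropy averaged over samples; the clipping constant `eps` and the function names are illustrative additions, not part of the described method.

```python
import math


def sigmoid(t):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def output_layer(p, w, b):
    """H = Sigmoid(P * W + b) for one sample, with P and W as vectors."""
    return sigmoid(sum(pi * wi for pi, wi in zip(p, w)) + b)


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# Label 1 = tampered, 0 = original (the label convention is an assumption).
neutral_score = output_layer([0.0, 0.0], [1.0, 1.0], 0.0)
example_loss = binary_cross_entropy([1, 0], [0.9, 0.1])
```

Because the sigmoid output is a probability, the decision rule is a fixed comparison against 0.5 rather than a hand-tuned threshold on a raw feature.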

A system, comprising:

a first module, configured to extract a power grid frequency spectrum feature supervector from each digital audio signal by using the trained universal background model of the power grid frequency;

a second module, configured to input the extracted power grid frequency spectrum feature supervector into a deep representation learning network composed of an attention mechanism and a residual network to learn shallow features;

a third module, configured to input the learned shallow features into a classification network and judge whether the audio has undergone deletion or insertion tampering.

In combination with all of the above technical solutions, the invention has the following advantages and positive effects:

By extracting power grid frequency spectrum feature supervectors and training a deep neural network on them, the invention automates tamper detection and applies deep networks to tamper detection effectively. The method obtains the power grid frequency spectrum feature supervector of each utterance by building a background model and adaptively updating its parameters, performs representation learning of shallow features with a deep neural network, and inputs the result into a classification network for classification. No empirical threshold selection is involved, and the method offers higher accuracy and better robustness.

To verify its robustness, the invention was evaluated on several public databases, with good results. Automatic detection of digital audio deletion and insertion tampering operations matters because the detection method is meant to be applied to various databases and scenarios, and to guarantee such application the detection scheme must be robust under a variety of practical conditions.

The method builds a universal background model of the power grid frequency, whose parameters are updated with the EM (expectation-maximization) algorithm; the MAP (maximum a posteriori) algorithm allows adaptation from a small amount of data, so that each original audio recording in the database can be adapted into a GMM-UBM model. The invention establishes a deep network that performs shallow-feature representation learning on the power grid frequency spectrum feature supervectors, and the shallow features are input into a classification network for binary classification of tampering.

The invention adds an attention mechanism module to the network to reconstruct the features, which increases the weight of important features and strengthens the feature map. The invention builds a tamper detection network based on a residual network; the residual blocks use skip connections, which alleviates the vanishing-gradient problem caused by increasing depth in deep neural networks, passes the input information directly to the output, and protects the integrity of the information. The method classifies with the sigmoid activation function and evaluates the model with the binary cross-entropy loss function, thereby automating tamper detection.

Drawings

Fig. 1 is a schematic diagram of an automatic detection method for digital audio deletion and insertion tampering oriented operations according to an embodiment of the present invention.

Fig. 2 is a flowchart of an automatic detection method for digital audio deletion and insertion tampering oriented operations according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a deep neural network according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Aiming at the problems in the prior art, the invention provides an automatic detection method for digital audio deletion and insertion tampering operation, and the invention is described in detail below with reference to the accompanying drawings.

Fig. 1 shows the principle of the automatic detection method for digital audio deletion and insertion tampering operations according to an embodiment of the present invention.

As shown in fig. 2, the automatic detection method for digital audio deletion and insertion tampering operations according to an embodiment of the present invention includes the following steps:

S101, preprocessing the original digital audio signal with a band-pass filter, and extracting the power grid frequency component of the signal to be detected; extracting phase features and fitting feature parameters, and constructing a universal background model of the power grid frequency;

S102, adaptively updating the parameters of the obtained universal background model of the power grid frequency with the digital audio signals of the training data set, and constructing a feature matrix of power grid frequency spectrum feature supervectors of the digital audio signals according to the target database;

S103, inputting the obtained power grid frequency spectrum feature supervector into a deep neural network for representation learning, and obtaining the shallow features of the power grid frequency spectrum feature supervector;

S104, inputting the obtained shallow features into a pre-constructed tamper detection classification network, and distinguishing original speech from tampered speech through a sigmoid function to obtain the tamper detection result.

In the embodiment of the invention, preprocessing the original digital audio signal with the band-pass filter, extracting the power grid frequency component of the signal to be detected, and extracting the phase features and fitting feature parameters comprise the following steps:

Constructing the universal background model of the power grid frequency, as provided by the embodiment of the invention, comprises the following steps:

(1) Determining a Gaussian mixture model:

(2.1) determining suitable θ and z that maximize the log-likelihood function:

Adaptively updating the mean parameters of the obtained universal background model with the digital audio signals of the training data set comprises the following steps:

wherein the adaptive coefficient is as described above, and K denotes a fixed factor.

Constructing the feature matrix of power grid frequency spectrum feature supervectors of the digital audio signals according to the target database, as provided by the embodiment of the invention, comprises the following steps:

taking the mean matrix of the GMM-UBM model derived from each utterance as its power grid frequency spectrum feature supervector, constructing the feature relationship between each utterance and the high-dimensional vector, adjusting the mean matrix of each utterance, and reconstructing it to obtain the power grid frequency spectrum feature supervector.

As shown in fig. 3, the deep neural network provided by the embodiment of the present invention comprises an attention mechanism and a residual network;

the residual network is used to train on the specific feature structure of the power grid frequency spectrum feature supervector; the feature matrix input to the residual network has size N × M, where N represents the 31 extracted fitted features and M represents the Gaussian components; the input size is 224 × 224.

The attention mechanism provided by the embodiment of the invention comprises:

wherein $M_{ij}$ represents the elements of the input feature map that correspond to the convolution kernel during convolution, and R indicates that the ReLU function is used as the activation function;

$$H = E(Y_\alpha) + b_2;$$

wherein $Y_\alpha$ represents the original feature map, E represents the pooling-domain matrix of the feature map, and $b_2$ denotes the bias;

the fully connected layer is used to integrate the pooled feature maps;

The convolution layers of the residual network provided by the embodiment of the invention are 5 × 5 convolution layers;

the residual block is as follows:

$$x_{l+1} = h(x_l) + F(x_l, W_l);$$

wherein $h(x_l) = W'_l x_l$; $W'_l$ represents a 1 × 1 convolution operation; and $F(x_l, W_l)$ represents the residual part.

The tamper detection classification network provided by the embodiment of the invention consists of a convolution layer, a pooling layer, a fully connected layer, and an output layer; the activation function of the output layer is the sigmoid function;

the loss function of the tamper detection classification network provided by the embodiment of the invention is the binary cross-entropy, with the expression:

Inputting the learned shallow features into the pre-constructed tamper detection classification network and distinguishing original speech from tampered speech through the sigmoid function, as provided by the embodiment of the invention, comprises the following steps:

2) distinguishing original speech from tampered speech with the sigmoid function of the output layer:

$$H = \mathrm{Sigmoid}(P \cdot W + b);$$

wherein H represents the output, W represents the weights, b denotes the bias, and P denotes the output of the fully connected layer.

The technical solution of the present invention is further described with reference to the following specific embodiments.

Example 1:

The invention aims to provide an automatic detection method for digital audio deletion and insertion tampering operations. The power grid frequency component of the signal to be detected is extracted, the phase features and fitting feature parameters are then computed, and a universal background model is trained. The mean parameters of the obtained background model are adaptively updated with the digital audio signals of the training data set, so that a target GMM-UBM model can be derived from each utterance, and the mean matrix of each GMM-UBM is used as the power grid frequency spectrum feature supervector. The power grid frequency spectrum feature supervector obtained by the invention is input into a deep neural network for representation learning of shallow features. The deep neural network consists of an attention mechanism and a residual network, has good feature extraction and representation learning capability, and can further train the shallow features. The shallow features obtained through the representation learning of the deep network are then input into the classification network. The classification network consists of a convolution layer, a pooling layer, a fully connected layer, and an output layer, and the sigmoid function is used as the activation function of the output layer. Further training is carried out through convolution, pooling, and full connection, and finally the sigmoid function distinguishes whether tampering has occurred, thereby automating tamper detection.

Referring to fig. 1, the automatic detection method for digital audio deletion and insertion tampering operations of the present invention comprises the following steps:

Step 1: extracting the power grid frequency component of the original digital audio with the designed band-pass filter, then extracting the phase features and fitting feature parameters, and establishing a universal background model of the power grid frequency;

The specific implementation comprises the following substeps:

Step one): band-pass filtering is performed on the original digital audio signal $f[n]$ to obtain the power grid frequency component $F_{ENFC}[n]$ of the signal to be detected. The band-pass filter designed by the invention is a 10000-order linear-phase FIR filter; the high order is used to obtain an ideal narrow-band signal. The center frequency is at the ENF standard frequency, the bandwidth is 0.6 Hz, the passband ripple is 0.5 dB, and the stopband attenuation is 100 dB. Phase fluctuation features F1 and F2 are obtained based on the $\mathrm{DFT}^0$ and $\mathrm{DFT}^1$ transforms, and the instantaneous frequency feature F3 is obtained based on the Hilbert transform. The phase curve and the frequency curve are fitted with Sum of Sines and Gaussian expressions respectively, and the phase features are combined with the fitting feature parameters to obtain a feature vector.
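To show what a narrow linear-phase band-pass filter around the nominal frequency looks like, here is a minimal design sketch. It uses the Hamming-windowed sinc method, a 2000-order filter, and a 400 Hz sampling rate, all of which are assumptions chosen so that the example runs quickly with the standard library. It does not reproduce the 10000-order design, nor does it claim to meet the 0.5 dB ripple and 100 dB attenuation figures.

```python
import cmath
import math


def bandpass_fir(numtaps, f_low, f_high, fs):
    """Symmetric (linear-phase) band-pass FIR taps via a Hamming-windowed sinc."""
    mid = (numtaps - 1) / 2
    taps = []
    for n in range(numtaps):
        t = n - mid
        if t == 0:
            ideal = 2.0 * (f_high - f_low) / fs
        else:
            ideal = (math.sin(2 * math.pi * f_high * t / fs)
                     - math.sin(2 * math.pi * f_low * t / fs)) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (numtaps - 1))
        taps.append(ideal * window)
    return taps


def gain(taps, freq, fs):
    """Magnitude response |H(f)| of the FIR filter at a single frequency."""
    return abs(sum(h * cmath.exp(-2j * math.pi * freq * n / fs) for n, h in enumerate(taps)))


FS = 400.0        # assumed sampling rate in Hz (hypothetical)
NOMINAL = 50.0    # nominal grid frequency in Hz
BANDWIDTH = 0.6   # pass-band width in Hz, as stated in the text
NUMTAPS = 2001    # order 2000, reduced from 10000 for a quick demo

raw_taps = bandpass_fir(NUMTAPS, NOMINAL - BANDWIDTH / 2, NOMINAL + BANDWIDTH / 2, FS)
# Normalise to unit gain at the centre frequency, since the band is narrower
# than the window main lobe and the raw peak gain falls slightly below 1.
centre_gain = gain(raw_taps, NOMINAL, FS)
enf_taps = [h / centre_gain for h in raw_taps]
```

The symmetric taps give a constant group delay of (NUMTAPS - 1) / 2 samples, which is why a linear-phase design preserves the phase trajectory that the later features depend on. A longer filter narrows the transition band further, at the cost of a longer delay and start-up transient.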

Step two): Build the UBM model;

The Universal Background Model (UBM) is a Gaussian mixture model (GMM). A Gaussian mixture model is a linear combination of L Gaussian distribution functions:

$$p(f \mid \lambda) = \sum_{j=1}^{L} \phi_j \, \mathcal{N}(f;\, \mu_j, \sigma_j)$$

where $f = \{f_1, f_2, \ldots, f_N\}$ is an N-dimensional feature vector consisting of the phase features and fitted feature parameters, $\phi_j$ ($j = 1, \ldots, L$) is the mixing weight, $\sigma_j$ is the covariance matrix, and $\mu_j$ is the mean vector. The complete Gaussian mixture model consists of the weight parameters, mean vectors and covariance matrices, and is represented as:

$$\lambda = \{\phi_j, \mu_j, \sigma_j\}, \quad j = 1, \ldots, L$$

The parameters of the Gaussian mixture model are then estimated with the EM algorithm.

The EM algorithm consists of two steps. The first is the E step. Given $m$ independent speech feature vectors $x = (x_1, x_2, x_3, \ldots, x_m)$ and a model $\lambda$ of the digital audio signal with known parameters $\theta$, each feature vector $x_i$ has a corresponding hidden variable $z_i$, the hidden variables being $z = (z_1, z_2, z_3, \ldots, z_m)$, and $p(x_i, z_i \mid \theta)$ is to be made as large as possible. The goal is to find suitable $\theta$ and $z$ that maximize the log-likelihood function:

The second is the M step. Solving directly for $\theta$ and $z$ is a difficult mathematical problem, so, based on an analysis of the likelihood function, the following formula is constructed:

Let

Then

The above equation introduces a new, unknown distribution $Q_i(z_i)$ that satisfies the following conditions:

Applying the Jensen inequality gives:

Substituting this into the original formula gives:

By the Jensen inequality, equality holds when the random variable is a constant, that is:

and since

it follows that:

Hence $Q(z)$ is the distribution of the hidden variable $z$ given the known samples and model parameters. With the parameter $\theta$ fixed, $Q_i(z_i)$ is thus determined, which establishes a lower bound for $L(\theta, z)$, i.e.

This lower bound is maximized by adjusting θ.
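For reference, the standard Jensen-inequality derivation of the EM lower bound, which the steps above appear to follow, can be written as below. This is a sketch in the notation of this section and is not necessarily identical, term for term, to the formulas of the invention.

```latex
\begin{align}
L(\theta, z) &= \sum_{i=1}^{m} \log \sum_{z_i} p(x_i, z_i \mid \theta)
  = \sum_{i=1}^{m} \log \sum_{z_i} Q_i(z_i)\,
    \frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)},
  \qquad \sum_{z_i} Q_i(z_i) = 1,\quad Q_i(z_i) \ge 0 \\
 &\ge \sum_{i=1}^{m} \sum_{z_i} Q_i(z_i)
    \log \frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)}
    \qquad \text{(Jensen inequality, since $\log$ is concave)} \\
\intertext{Equality holds when the ratio inside the logarithm is a constant $c$:}
\frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)} &= c
  \quad\Longrightarrow\quad
  Q_i(z_i) = \frac{p(x_i, z_i \mid \theta)}{\sum_{z_i} p(x_i, z_i \mid \theta)}
           = p(z_i \mid x_i, \theta)
\end{align}
```

With $Q_i(z_i)$ fixed to this posterior, the right-hand side of the inequality is the lower bound that is maximized over $\theta$.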

After the likelihood function has been maximized to obtain new model parameters, these parameters are fed back into the first step, and increasingly accurate GMM parameters are obtained by iterating. A well-trained UBM model is thus obtained.
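The alternation between the two steps can be illustrated with a small, self-contained EM loop. The sketch below is an assumption-laden simplification: it fits a one-dimensional mixture, runs a fixed number of iterations, applies a variance floor, and uses hypothetical function names (`em_step`, `fit_gmm`); the invention works on multi-dimensional ENF feature vectors.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration.

    Returns the updated parameters and the log-likelihood of the data
    under the parameters that were passed in.
    """
    n_comp = len(weights)
    resp = []
    log_lik = 0.0
    # E step: posterior probability of each component for each sample.
    for x in data:
        joint = [weights[j] * normal_pdf(x, means[j], variances[j])
                 for j in range(n_comp)]
        total = sum(joint)
        log_lik += math.log(total)
        resp.append([p / total for p in joint])
    # M step: re-estimate weights, means and variances from the posteriors.
    new_w, new_m, new_v = [], [], []
    for j in range(n_comp):
        n_j = sum(r[j] for r in resp)
        mean_j = sum(r[j] * x for r, x in zip(resp, data)) / n_j
        var_j = sum(r[j] * (x - mean_j) ** 2 for r, x in zip(resp, data)) / n_j
        new_w.append(n_j / len(data))
        new_m.append(mean_j)
        new_v.append(max(var_j, var_floor))
    return new_w, new_m, new_v, log_lik


def fit_gmm(data, weights, means, variances, n_iter=50):
    """Run a fixed number of EM iterations and record the log-likelihood."""
    history = []
    for _ in range(n_iter):
        weights, means, variances, log_lik = em_step(data, weights, means, variances)
        history.append(log_lik)
    return weights, means, variances, history


if __name__ == "__main__":
    samples = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]
    w, m, v, hist = fit_gmm(samples, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
    print("weights:", w)
    print("means:", m)
    print("log-likelihood (first, last):", hist[0], hist[-1])
```

The recorded log-likelihood never decreases from one iteration to the next, which is the practical consequence of maximizing the lower bound at each step.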

Step 2: Construct the feature matrix of the power grid frequency spectrum feature supervector of the digital audio signal from the target database;

the specific implementation comprises the following substeps:

To obtain the GMM-UBM model, the UBM model parameters from Step 1 are updated on a target database containing original speech and tampered speech by the MAP adaptation method, so that a Gaussian mixture model is derived for each digital audio signal under test.

1) The adaptation process is likewise a parameter update process and comprises two steps. First, calculate the probability that the j-th feature vector $f_j$ belongs to the i-th Gaussian component $p_i(f)$ of the UBM:

2) Second, use the calculated $P(i \mid f_j)$ to compute the mean parameter of the GMM of the untampered target digital audio signal:

3) Finally, the new sufficient statistics generated from the training data are used to update the sufficient statistics of the i-th mixture component of the UBM:

where

is the adaptation coefficient that controls the balance between the new mean and the old estimate. The adaptation coefficient is defined as

where k is a fixed factor, for which the invention takes the empirical value 16. The mean matrix of each GMM-UBM is taken as the power grid frequency spectrum feature supervector: a feature relationship between each utterance and the high-dimensional vector is constructed, and the mean matrix of each utterance is adjusted and reshaped to obtain the power grid frequency spectrum feature supervector.
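The three adaptation steps can be sketched as follows. The sketch assumes the widely used relevance-MAP mean adaptation for GMM-UBM systems, which is not confirmed by the text to match the invention's formulas exactly: posterior $P(i \mid f_j) = w_i p_i(f_j) / \sum_k w_k p_k(f_j)$, count $n_i = \sum_j P(i \mid f_j)$, first-order statistic $E_i(f) = \frac{1}{n_i}\sum_j P(i \mid f_j) f_j$, coefficient $\alpha_i = n_i / (n_i + k)$, and adapted mean $\hat{\mu}_i = \alpha_i E_i(f) + (1 - \alpha_i)\mu_i$. Diagonal covariances and the names `map_adapt_means` and `supervector` are also assumptions.

```python
import math


def gaussian_diag(f, mean, var):
    """Density of a Gaussian with diagonal covariance (assumption)."""
    log_p = 0.0
    for x, m, v in zip(f, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_p)


def map_adapt_means(features, weights, means, variances, relevance=16.0):
    """Adapt only the UBM means to one recording (relevance-MAP, assumed form)."""
    n_comp, dim = len(weights), len(means[0])
    counts = [0.0] * n_comp
    first_order = [[0.0] * dim for _ in range(n_comp)]
    for f in features:
        dens = [weights[i] * gaussian_diag(f, means[i], variances[i])
                for i in range(n_comp)]
        total = sum(dens) or 1e-300  # guard against numerical underflow
        for i in range(n_comp):
            post = dens[i] / total  # P(i | f_j)
            counts[i] += post
            for d in range(dim):
                first_order[i][d] += post * f[d]
    adapted = []
    for i in range(n_comp):
        alpha = counts[i] / (counts[i] + relevance)
        if counts[i] > 0.0:
            expected = [s / counts[i] for s in first_order[i]]
        else:
            expected = list(means[i])
        adapted.append([alpha * e + (1.0 - alpha) * m
                        for e, m in zip(expected, means[i])])
    return adapted


def supervector(adapted_means):
    """Concatenate the adapted component means into one long vector."""
    return [value for mean in adapted_means for value in mean]


if __name__ == "__main__":
    ubm_weights = [0.5, 0.5]
    ubm_means = [[0.0], [10.0]]
    ubm_vars = [[1.0], [1.0]]
    frames = [[1.0]] * 16  # 16 frames, all close to the first component
    print(supervector(map_adapt_means(frames, ubm_weights, ubm_means, ubm_vars)))
```

With k = 16, a component that accounts for 16 frames is moved halfway toward the data, while a component that accounts for almost no frames keeps its UBM mean; this is the balance between the new mean and the old estimate described above.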

Step 3: Input the power grid frequency spectrum feature supervector into the designed tamper detection deep network for shallow feature representation learning;

the method specifically comprises the following steps:

The power grid frequency spectrum feature supervector obtained by the invention is input into a deep neural network for shallow feature representation learning. The deep neural network has good feature extraction and representation learning capability, and by modeling the representation of the input signal it further trains the shallow features, namely the power grid frequency spectrum feature supervectors.

Step A1: Input to the attention mechanism

As shown in the schematic diagram of the attention mechanism network in fig. 3, weights are constructed by convolution, pooling and point-wise multiplication to readjust the feature map. Different weights are assigned to the features in the power grid frequency spectrum feature supervector so as to strengthen important features and weaken marginal features. M denotes the two-dimensional feature map formed by reshaping the power grid frequency spectrum feature supervector, the first convolutional layer K is a convolution kernel of size n × n, and Y is obtained after filtering with the convolution kernel. The convolution is calculated as:

where $M_{ij}$ denotes the elements of the input feature map covered by the convolution kernel during convolution, and R denotes the ReLU activation function.

The convolution is followed by a pooling layer, which performs a second stage of feature extraction. The invention uses max pooling, in which the maximum value of each region is selected to represent that region. The high-level feature map obtained after pooling not only reduces the dimensionality and parameter count of the original feature map, but also helps to avoid problems such as overfitting. The pooling formula is:

$$H = E(Y_\alpha) + b_2$$

where $Y_\alpha$ denotes the original feature map, E is the pooling-region matrix of the feature map, and $b_2$ is the bias; traversing the pooling regions of the original feature map yields the pooled feature map H. After the original feature map M has been processed by convolution, pooling and full connection, the result is multiplied by the original feature map to reconstruct it.
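The attention path described above (convolution with ReLU, max pooling, full connection, then multiplication with the original feature map) can be sketched in plain Python. Several details are assumptions made only for illustration: unit stride without padding, a 2 × 2 pooling region with the pooling bias omitted, a fully connected layer that produces one weight per element of M, and a sigmoid that squashes those weights into (0, 1). The function names are hypothetical.

```python
import math


def conv2d_relu(m, kernel, bias=0.0):
    """Valid 2-D convolution (unit stride, no padding) followed by ReLU."""
    n = len(kernel)
    rows, cols = len(m) - n + 1, len(m[0]) - n + 1
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            s = bias + sum(m[r + i][c + j] * kernel[i][j]
                           for i in range(n) for j in range(n))
            row.append(max(0.0, s))
        out.append(row)
    return out


def max_pool(y, size=2):
    """Non-overlapping max pooling: each region is represented by its maximum."""
    return [[max(y[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(y[0]) - size + 1, size)]
            for r in range(0, len(y) - size + 1, size)]


def attention_reweight(m, kernel, fc_weights, fc_bias):
    """Re-weight every element of m with a weight derived from m itself.

    fc_weights[k] holds the fully connected weights for the k-th element of m
    (row-major order); the sigmoid squashing is an assumption.
    """
    pooled = max_pool(conv2d_relu(m, kernel))
    flat = [v for row in pooled for v in row]
    cols = len(m[0])
    out = []
    for r, row in enumerate(m):
        new_row = []
        for c, value in enumerate(row):
            k = r * cols + c
            z = sum(w * v for w, v in zip(fc_weights[k], flat)) + fc_bias[k]
            new_row.append(value * (1.0 / (1.0 + math.exp(-z))))
        out.append(new_row)
    return out


if __name__ == "__main__":
    feature_map = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[0, 0], [0, 1]]
    fc_w = [[0.0] for _ in range(9)]
    fc_b = [0.0] * 9
    print(attention_reweight(feature_map, kernel, fc_w, fc_b))
```

In a trained network the fully connected weights would differ per element, so that informative entries of the supervector are scaled up relative to marginal ones.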

Step A2: Input to the residual network

After feature reconstruction by the attention mechanism, the power grid frequency spectrum feature supervector is input into the residual module, which trains the features into a specific structure. The residual module is based on ResNet18, from which the invention removes the high-order convolutional layers, reducing both the number of parameters and the computing resources required. For image-related tasks, image pixels are input into the neural network directly; for the speech tamper detection task of the invention, however, a series of feature extraction steps must first be applied to the original waveform, and the extracted two-dimensional features are then converted into three-dimensional features for input into the neural network. The size of the input feature matrix is N × M, where N = 31 is the number of extracted fitted features and M is the number of Gaussian components.

The residual block can be represented as:

$$x_{l+1} = h(x_l) + F(x_l, W_l)$$

where $h(x_l) = W'_l x_l$, $W'_l$ is a 1 × 1 convolution operation, and $F(x_l, W_l)$ is the residual part.
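The residual relation above can be sketched for a single spatial position, where a 1 × 1 convolution reduces to a linear map over the channel vector. As a simplifying assumption, the residual part F is written here as two 1 × 1 convolutions with a ReLU in between; ResNet-style blocks normally use 3 × 3 convolutions for F. The names are hypothetical.

```python
def matvec(w, x):
    """Apply a 1 x 1 convolution at one position: a linear map over channels."""
    return [sum(w_rc * x_c for w_rc, x_c in zip(row, x)) for row in w]


def relu(v):
    return [max(0.0, a) for a in v]


def residual_block(x, w_shortcut, w1, w2):
    """Compute x_{l+1} = h(x_l) + F(x_l, W_l) for one channel vector x_l.

    h(x_l) = W'_l x_l is the shortcut (w_shortcut);
    F is sketched as w2 * relu(w1 * x_l), an assumption made for brevity.
    """
    shortcut = matvec(w_shortcut, x)
    residual = matvec(w2, relu(matvec(w1, x)))
    return [s + r for s, r in zip(shortcut, residual)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(residual_block([1.0, -2.0], identity, identity, identity))
```

When the residual weights are zero the block simply passes $h(x_l)$ through, which is why such blocks are easy to optimize: the network only has to learn the correction F.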

Compared with the 224 × 224 input size suggested for the traditional ResNet18 network, the feature matrix input by the invention is much smaller than an image. The convolutional layers repeatedly down-sample the input, increasing the number of channels and reducing the size of the feature map. Because the input of the invention is smaller than the recommended input size, the generated feature maps become too small and part of the features are lost. To further reduce parameters and computation, the invention replaces the 7 × 7 convolutional layer with a 5 × 5 convolutional layer, which significantly reduces the number of parameters.
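The effect of input size and kernel size can be checked with a short calculation. The sketch below assumes the standard ResNet18 layout (a stride-2 7 × 7 stem with padding 3, a stride-2 3 × 3 max pooling with padding 1, and three further stride-2 stages); the channel counts (1 input channel, 64 output channels) are illustrative assumptions, and the function names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_weight_count(kernel, in_channels, out_channels):
    """Number of weights in a k x k convolution (biases not counted)."""
    return kernel * kernel * in_channels * out_channels


def resnet18_final_size(size):
    """Feature-map side length after all five stride-2 stages of ResNet18."""
    size = conv_output_size(size, kernel=7, stride=2, padding=3)  # stem
    size = conv_output_size(size, kernel=3, stride=2, padding=1)  # max pooling
    for _ in range(3):  # first convolution of stages 2 to 4
        size = conv_output_size(size, kernel=3, stride=2, padding=1)
    return size


if __name__ == "__main__":
    print(resnet18_final_size(224))       # 7
    print(resnet18_final_size(32))        # 1
    print(conv_weight_count(7, 1, 64))    # 3136
    print(conv_weight_count(5, 1, 64))    # 1600
```

Under these assumptions a 31 × 32 input collapses to a single position where a 224 × 224 image would still leave a 7 × 7 map, and a 5 × 5 kernel needs 25/49, or roughly half, of the weights of a 7 × 7 kernel.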

Step 4: Input the shallow features into the constructed tamper detection classification network, and distinguish original speech from tampered speech with a sigmoid function.

The specific implementation comprises the following substeps:

Shallow features are obtained from the power grid frequency spectrum feature supervector through the representation learning of the deep network, and the classification network then decides whether the recording has been tampered with. The tamper detection classification network is shown in fig. 2.

1) Features are further learned through a convolutional layer and a pooling layer. The shallow features obtained have too many parameters, and using them directly for tamper classification does not give the best results. The classification network therefore adds a convolutional layer, a pooling layer and a fully connected layer, which strengthen the shallow features through local receptive fields, weight sharing and down-sampling.

2) The activation function of the output layer is the sigmoid function:

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

The formula shows that the sigmoid function maps its input into the interval (0, 1); it is monotonic and continuous, its output range is bounded, and optimization is stable, which makes it convenient for binary classification. The sigmoid layer of the invention is expressed as:

$$H = \mathrm{Sigmoid}(P * W + b)$$

where H is the output, W is the weight, b is the bias, and P is the output of the fully connected layer.
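A minimal sketch of this output layer is given below. It assumes a single output unit, a decision threshold of 0.5, and the convention that outputs near 1 mean "tampered"; none of these is stated in the text, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function with output in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def output_layer(p, w, b):
    """H = Sigmoid(P * W + b) for one output unit; p is the fully connected output."""
    return sigmoid(sum(p_k * w_k for p_k, w_k in zip(p, w)) + b)


def classify(p, w, b, threshold=0.5):
    """Map the sigmoid output to a label (label convention is an assumption)."""
    return "tampered" if output_layer(p, w, b) >= threshold else "original"


if __name__ == "__main__":
    print(output_layer([1.0, 2.0], [0.5, -0.25], 0.0))
    print(classify([1.0], [5.0], 0.0))
    print(classify([1.0], [-5.0], 0.0))
```

The two-branch form of `sigmoid` avoids overflow in `math.exp` for large negative inputs, which matters when the pre-activation values are not bounded.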

Binary cross entropy is a loss function commonly used in binary classification problems. The expression is as follows:

$$\mathrm{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p(y_i) + (1 - y_i) \log\bigl(1 - p(y_i)\bigr) \right]$$

where N is the number of features, y is the label value of each utterance, and p(y) is the probability that the output belongs to label y. Loss is the value of the binary cross-entropy loss function and is used to judge the quality of the model of the invention.
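The loss described above can be computed as in the following sketch, which assumes the usual averaged binary cross-entropy with labels in {0, 1} and clips the probabilities to avoid taking the logarithm of zero. The clipping constant and the function name are assumptions.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Average binary cross-entropy.

    labels: 0/1 label per utterance; probs: predicted probability of label 1.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


if __name__ == "__main__":
    print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # confident and correct
    print(binary_cross_entropy([1, 0], [0.5, 0.5]))  # uninformative
```

A lower value indicates predictions closer to the labels; an uninformative classifier that always outputs 0.5 gives a loss of log 2.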

The technical effects of the present invention will be further explained in conjunction with simulation experiments.

The invention uses 2397 utterances from the Ahumada-25 database as the original speech from which signal features are extracted, and builds the UBM model of the original speech. The model of the invention was evaluated on three target databases: Carioca (consisting of the Carioca1 and Carioca2 databases), New Spanish, and the self-built database ENF-HG. The four databases contain 3253 samples in total, and the power grid frequency spectrum feature supervector obtained for each sample has dimension 31 × 32. The supervectors are extracted on the MATLAB platform, stored in CSV format, and input into a network built in Keras for training. Because the number of Gaussians changes the dimension of the extracted fluctuation supervector, its influence on the model was verified for 16, 32, 64 and 128 Gaussians. As shown in Table 1, the highest accuracies on the three databases, 95.0%, 94.2% and 93.7% respectively, are obtained with 32 Gaussians.

Table 1:

Number of Gaussians	Carioca	New Spanish	ENF-HG	All data
16	0.942	0.933	0.932	0.938
32	0.950	0.942	0.937	0.951
64	0.928	0.914	0.937	0.928
128	0.895	0.911	0.923	0.932
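As an illustration of how the stored supervectors might be prepared for the network, the sketch below reads CSV text and reshapes each row into a 31 × 32 × 1 array. The file layout is an assumption: one sample per row, 31 × 32 = 992 values in row-major order, no header and no label column. The function name `load_supervectors` is hypothetical.

```python
import csv
import io


def load_supervectors(csv_text, n_features=31, n_gaussians=32):
    """Parse CSV text into a list of n_features x n_gaussians x 1 nested lists."""
    samples = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row]
        if len(values) != n_features * n_gaussians:
            raise ValueError("expected %d values per row, got %d"
                             % (n_features * n_gaussians, len(values)))
        # Row-major reshape with a trailing channel axis of length 1.
        matrix = [[[values[r * n_gaussians + c]] for c in range(n_gaussians)]
                  for r in range(n_features)]
        samples.append(matrix)
    return samples


if __name__ == "__main__":
    text = ",".join(str(i) for i in range(31 * 32)) + "\n"
    batch = load_supervectors(text)
    print(len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))
```

Changing the number of Gaussians changes only the second dimension, so the 16, 64 and 128 Gaussian settings in Table 1 would correspond to 31 × 16, 31 × 64 and 31 × 128 inputs under the same assumption.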

The positive effects of the present invention will be further described below with reference to specific experimental data.

1) Different network architectures

In order to verify the feasibility of the extracted features, the extracted features are respectively input into a traditional machine learning classifier and a deep network for training. To better illustrate the feasibility of the features of the present invention, experiments were performed on different data sets and all data were validated together.

Experiments were carried out on traditional machine learning models, comparing SVM, random forest, decision tree, logistic regression and XGBoost. As shown in Table 2, the features of the invention perform poorly on decision trees. The Carioca database gives better results on the SVM, reaching 90.6%. The New Spanish database performs better on XGBoost, reaching 92.1%. The self-built database ENF-HG also performs better on XGBoost, reaching 92.3%.

TABLE 2

To compare different models, the invention compares a self-designed CNN, ResNet50, ResNet34, ResNet18 and the tamper detection network, where the tamper detection network is the system of deep neural network and classification network designed by the invention. The results on the different databases are also compared. Table 3 shows that the power grid frequency spectrum feature supervector performs best in the tamper detection network. Compared with Table 2, the supervector also performs better in deep networks than in traditional classifiers. Overall, the proposed model and features outperform the other model structures and features on the data sets.

TABLE 3

2) Comparison of existing methods

The best method proposed by the invention was also compared with the results reported by other researchers on the public databases Carioca1, Carioca2 and New Spanish. The results are shown in Table 4.

TABLE 4

From table 4, it can be seen that the accuracy of using a single phase feature or frequency feature is not very high, and the grid frequency spectrum super vector used in the invention has higher accuracy.

In the description of the present invention, "a plurality" means two or more unless otherwise specified; the terms "upper", "lower", "left", "right", "inner", "outer", "front", "rear", "head", "tail", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are only for convenience in describing and simplifying the description, and do not indicate or imply that the device or element referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, should not be construed as limiting the invention. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. It will be appreciated by those skilled in the art that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, for example such code provided on a carrier medium such as a diskette, CD-or DVD-ROM, a programmable memory such as read-only memory (firmware) or a data carrier such as an optical or electronic signal carrier. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.

The above description is only for the purpose of illustrating the embodiments of the present invention, and the scope of the present invention should not be limited thereto, and any modifications, equivalents and improvements made by those skilled in the art within the technical scope of the present invention as disclosed in the present invention should be covered by the scope of the present invention.

Claims

1. An automatic detection method for digital audio deletion and insertion tampering operations, characterized by comprising the following steps:

2. The digital audio deletion and insertion tampering operation oriented automatic detection method of claim 1,

inputting the trained shallow features into a pre-constructed tamper detection classification network, and distinguishing original speech from tampered speech with a sigmoid function to obtain a tamper detection result.

3. The method for automatically detecting digital audio deletion and insertion tampering operations as defined in claim 2, wherein the pre-processing of the original digital audio signal by the band-pass filter to extract the grid frequency components of the signal under test, and the extracting phase characteristics and fitting characteristic parameters comprises:

band-pass filtering the original digital audio signal $f[n]$ with a 10000-order linear-phase FIR filter to obtain the power grid frequency component $F_{ENFC}[n]$ of the signal under test;

4. The method for automatic detection of digital audio deletion and insertion tampering operations as defined in claim 2, wherein said constructing a generic background model of the grid frequency comprises:

(1) Determining a Gaussian mixture model:

(2.1) determining suitable θ and z that maximize the log-likelihood function:

(2.2) calculating the values of θ and z: since $Q(z)$ is the distribution of the hidden variable z given the known samples and model parameters, determining $Q_i(z_i)$ with the parameter θ fixed, so as to establish the lower bound of $L(\theta, z)$ as

5. The method for automatic detection of digital audio deletion and insertion tampering operations as defined in claim 2, wherein adaptively updating the mean parameters of the obtained universal background model with the digital audio signals of the training data set comprises:

calculating the probability that the j-th feature vector $f_j$ belongs to the i-th Gaussian component $p_i(f)$ of the UBM:

using the calculated $P(i \mid f_j)$ to compute the mean parameter of the GMM of the untampered target digital audio signal:

updating the sufficient statistics of the i-th mixture component of the UBM with the new sufficient statistics generated from the training data:

wherein

6. The method for automatically detecting digital audio deletion and insertion tampering operations as defined in claim 2, wherein the constructing a feature matrix of a power grid frequency spectral feature supervector of the digital audio signal from the target database comprises:

7. The method for automatic detection of digital audio deletion and insertion tampering oriented operations according to claim 2, wherein said deep neural network is provided with an attention mechanism and a residual network, wherein,

the residual network is used to train the power grid frequency spectrum feature supervector into a specific feature structure; the size of the feature matrix input to the residual network is N × M, where N = 31 denotes the number of extracted fitted features and M denotes the number of Gaussian components; input size 224 × 224;

the convolutional layer of the residual network is a 5 × 5 convolutional layer;

the residual block is as follows:

$$x_{l+1} = h(x_l) + F(x_l, W_l);$$

wherein $h(x_l) = W'_l x_l$; $W'_l$ denotes a 1 × 1 convolution operation; $F(x_l, W_l)$ denotes the residual part;

the attention mechanism includes:

the first convolutional layer K is a convolution kernel of size n × n, and the activation function is the ReLU function; for shallow feature extraction, the formula is as follows:

$$H = E(Y_\alpha) + b_2;$$

wherein $Y_\alpha$ denotes the original feature map, E denotes the pooling-region matrix of the feature map, and $b_2$ denotes the bias;

the full connection layer is used for integrating the pooled feature maps;

8. The automatic detection method for digital audio deletion and insertion tampering operations of claim 2, wherein the tamper detection classification network is composed of a convolutional layer, a pooling layer, a full connection layer, and an output layer; the activation function of the output layer adopts a sigmoid function;

9. The automatic detection method for digital audio deletion and insertion tampering operations of claim 2, wherein in step four, inputting the obtained shallow features into the pre-constructed tamper detection classification network and distinguishing original speech from tampered speech with the sigmoid function comprises:

1) strengthening the shallow features with the convolutional layer, pooling layer and fully connected layer of the tamper detection classification network through local receptive fields, weight sharing and down-sampling;

2) distinguishing original speech from tampered speech with the sigmoid function of the output layer of the tamper detection classification network:

$$H = \mathrm{Sigmoid}(P * W + b);$$

10. A system, comprising: