CN115472179A - Automatic detection method and system for digital audio deletion and insertion tampering operation - Google Patents

Info

Publication number
CN115472179A
CN115472179A CN202210932618.6A CN202210932618A CN115472179A CN 115472179 A CN115472179 A CN 115472179A CN 202210932618 A CN202210932618 A CN 202210932618A CN 115472179 A CN115472179 A CN 115472179A
Authority
CN
China
Prior art keywords
grid frequency
power grid
digital audio
tampering
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210932618.6A
Other languages
Chinese (zh)
Inventor
曾春艳
孔帅
王志锋
万相奎
李坤
赵宇豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology
Priority to CN202210932618.6A
Publication of CN115472179A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention belongs to the technical field of digital audio signal tampering detection, and discloses an automatic detection method and system for digital audio deletion and insertion tampering operations. A trained general background model of the power grid frequency is used to extract a power grid frequency spectrum characteristic supervector from each digital audio signal; the extracted supervector is input into a deep representation learning network composed of an attention mechanism and a residual network to learn shallow features; and the learned shallow features are input into a classification network, which judges whether the audio has undergone deletion or insertion tampering. By extracting power grid frequency spectrum characteristic supervectors and training a deep neural network on them, the invention automates tamper detection and applies deep neural networks effectively to this task. The invention offers higher accuracy and better robustness.

Description

Automatic detection method and system for digital audio deletion and insertion tampering operation
Technical Field
The invention belongs to the technical field of digital audio signal tampering detection, and particularly relates to an automatic detection method and system for digital audio deleting and inserting tampering operations.
Background
With the rapid development of internet information technology, smart mobile devices have become widespread, and digital multimedia data (such as audio, images, and text) has become a main information carrier. The cost of recording and storing digital audio files keeps falling, and such files are increasingly easy to obtain from the internet, so demand for collecting and sharing them keeps rising. Meanwhile, a variety of audio editing software has been developed, making audio signals easier to edit. Accordingly, there is a growing need for effective protection and authentication of audio recordings, particularly where the recordings may be involved in digital rights management or law enforcement cases. Large amounts of realistic-seeming false information may otherwise circulate on the internet or be presented in court, affecting social stability and public safety. Audio forensics is therefore becoming increasingly important for verifying the authenticity, integrity, and source of audio information.
Power grid frequency is used for tamper detection and is widely cited in forensic practice. From a forensic perspective, the grid frequency signal is often embedded in the audio recordings of eavesdropping devices, and its high availability, together with its well-behaved characteristics, makes it an attractive feature, which is why it is so widely used. Grid frequency fluctuations within a region are stable and unique over long periods of time, and non-periodic fluctuations in the grid frequency affect all devices connected to the grid in the same way. The grid frequency signal is typically present in equipment powered by the grid and has a well-known nominal value of 50 Hz or 60 Hz, depending on the region. European countries, Australia, and most countries in Asia and Africa use 50 Hz. North America and Central America use 60 Hz. In South America, some countries use 50 Hz and others 60 Hz, and Japan uses both 50 Hz and 60 Hz. Ideally, the grid signal is a sinusoid oscillating at the nominal frequency, but in reality the instantaneous frequency varies with fluctuations in power supply and demand. Over time, the frequency and phase of the grid do not change abruptly. Because the grid frequency signal is stable and unique, inserting an audio segment into an audio file, or deleting one from it, may cause an abrupt change in the estimated grid frequency signal. The grid frequency signal is extracted from the audio file by band-pass filtering, and the abrupt changes in the instantaneous frequency and phase of the grid frequency component that a tampering operation causes at the tampering point are used to identify whether tampering has occurred.
Meanwhile, the prior art provides a series of methods for detecting audio tampering. Applications of the power grid frequency to tamper detection fall into two types: the first compares the power grid frequency signal with a large-scale power grid frequency database; the second extracts certain features from the power grid frequency signal and analyzes their consistency or regularity. Other researchers analyze tampering operations without using the grid frequency.
1) Based on grid frequency database comparison: Grigoras originally proposed an audio tampering detection algorithm based on power grid frequency, which mainly compares the fluctuation of the power grid frequency in the audio under test with data of a reference year to judge whether the audio has been tampered with. Prior art 1 obtains a standard power grid frequency database using B-spline basis functions and inverse interpolation, based on an analysis of the North American power grid frequency detection network. The frequency of the power grid frequency signal component is estimated with the short-time Fourier transform, and the power grid frequency sequence of the audio under test is matched against the standard database. An iterative oscillator error correction algorithm is provided to obtain accurate time-frequency pairs for the signal under test, which solves the problem that a power grid frequency sequence cannot be matched with the standard power grid frequency database. Prior art 2 extracts the power grid frequency signal from the audio signal by frequency demodulation and further studies the extraction stage.
2) Extracting the frequency characteristics of the power grid: Prior art 3 proposes a method for detecting splicing by revealing abnormal differences in local noise levels; it determines whether a heterogeneous splicing operation is present in the audio by comparing the background noise variances of the syllables. Prior art 4 detects multiply compressed files and identifies the encoder type using statistical features of MDCT coefficients and a study of the MP3 file structure, and verifies the algorithm's performance on large speech databases. Prior art 5 uses the pitch sequence as the audio feature, on the basis that pitch sequences extracted from different syllables are usually completely different; whether a syllable is a copy-move forgery is determined by calculating the difference between syllables and comparing it with a set threshold. Prior art 6 extracts the pitch sequence and the first two formant sequences as the feature set of each speech segment, calculates the similarity of the feature sets with the dynamic time warping (DTW) algorithm, and detects and locates copy-move forgery in voice recordings by comparing the similarity with a threshold.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) The quality and recording environment of the signal under test are subject to certain restrictions, and the detection results lack a consistent judgment standard;
(2) The extracted power grid frequency features do not reflect tampering information well, and the classifiers used do not make good use of the important information in the features or learn the features well;
(3) Some detection methods require fuzzy threshold decision conditions to be set from expert experience, and so cannot achieve automatic detection well;
(4) Existing features do not mine the tampering information in the power grid frequency deeply enough;
(5) Traditional methods generalize poorly, and the robustness and accuracy of audio detection need to be improved.
The difficulty in solving the above problems and defects is that automatic detection of digital audio deletion and insertion tampering requires features better suited to training in a deep network, and a network better suited to tamper detection has not yet been established.
The significance of solving the problems and the defects is as follows:
Compared with existing methods, the power grid frequency spectrum characteristic supervectors, extracted from phase information and frequency information, better reflect and more deeply mine tampering information; training shallow features with a deep neural network allows the important information in the features to be learned better; and performing the tamper detection classification with a classification network gives the detection result a specific judgment standard and makes detection automatic. The designed detection system for digital audio deletion and insertion tampering shows a clear improvement in the robustness and accuracy of audio detection and has been verified on several databases.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an automatic detection method and system for digital audio deletion and insertion tampering operation.
The invention is realized as follows: an automatic detection method for digital audio deletion and insertion tampering operations, comprising:
extracting a power grid frequency spectrum characteristic supervector of each digital audio signal by using a trained general background model of the power grid frequency;
inputting the extracted power grid frequency spectrum characteristic supervector into a deep representation learning network formed by an attention mechanism and a residual network to learn shallow features;
inputting the learned shallow features into a classification network, and judging whether the audio has been subjected to deletion or insertion tampering.
Further, the automatic detection method for digital audio deletion and insertion tampering operations comprises the following steps:
preprocessing the original digital audio signal with a band-pass filter, and extracting the power grid frequency component of the signal under test; extracting phase features and fitted feature parameters, and constructing a general background model (UBM) of the power grid frequency;
adaptively updating the parameters of the obtained general background model of the power grid frequency with the digital audio signals of the training data set, and constructing a feature matrix of the power grid frequency spectrum characteristic supervectors of the digital audio signals according to a target database;
inputting the obtained power grid frequency spectrum characteristic supervector into a deep neural network for representation learning, thereby obtaining the shallow features of the power grid frequency spectrum characteristic supervector;
inputting the obtained shallow features into a pre-constructed tampering detection classification network, and distinguishing original voice from tampered voice through a sigmoid function to obtain a tampering detection result.
Further, preprocessing the original digital audio signal with the band-pass filter, extracting the power grid frequency component of the signal under test, and extracting the phase features and fitted feature parameters comprise:
band-pass filtering the original digital audio signal $f[n]$ with a 10000-order linear-phase FIR filter to obtain the power grid frequency component $F_{ENFC}[n]$ of the signal under test;
obtaining phase fluctuation features F1 and F2 based on the $\mathrm{DFT}^0$ and $\mathrm{DFT}^1$ transforms, and an instantaneous frequency feature F3 based on the Hilbert transform;
fitting the phase curve and the frequency curve with Sum of Sines and Gaussian expressions, respectively, and combining the phase features and the fitted feature parameters to obtain a feature vector.
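As an illustrative sketch only, and not the implementation described in this application, the band-pass filtering and Hilbert-based instantaneous frequency steps above might look as follows. The 1001-tap windowed-sinc design, the 45 to 55 Hz pass band, the 1 kHz sampling rate, the synthetic test signal, and all function names are assumptions chosen so the example runs quickly; the text itself specifies a 10000-order linear-phase FIR filter.

```python
import numpy as np

def bandpass_fir(num_taps, f_lo, f_hi, fs):
    """Windowed-sinc linear-phase band-pass FIR (Hamming window, odd num_taps)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    # The difference of two ideal low-pass responses is an ideal band-pass.
    h = (2 * f_hi / fs) * np.sinc(2 * f_hi / fs * n) \
        - (2 * f_lo / fs) * np.sinc(2 * f_lo / fs * n)
    return h * np.hamming(num_taps)

def analytic_signal(x):
    """Analytic signal built in the frequency domain (discrete Hilbert transform)."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def instantaneous_frequency(x, fs):
    """Instantaneous frequency in Hz from the unwrapped analytic phase."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    return np.diff(phase) * fs / (2.0 * np.pi)

# Synthetic recording (assumption): a weak 50 Hz hum buried in white noise.
fs = 1000.0
t = np.arange(0, 4.0, 1.0 / fs)
rng = np.random.default_rng(0)
audio = 0.1 * np.sin(2 * np.pi * 50.0 * t) + 0.05 * rng.standard_normal(t.size)

taps = bandpass_fir(1001, 45.0, 55.0, fs)
enf_component = np.convolve(audio, taps, mode="same")   # plays the role of F_ENFC[n]
inst_freq = instantaneous_frequency(enf_component, fs)  # plays the role of feature F3
core = inst_freq[1000:-1000]                            # discard filter edge effects
print(float(np.median(core)))
```

The symmetric taps give the filter linear phase, so the filtered component is delayed uniformly rather than distorted, which matters when phase discontinuities are the evidence being sought.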
Further, the constructing of the general background model of the grid frequency includes:
(1) Determining a Gaussian mixture model:
Figure BDA0003782090370000051
wherein $f = \{f_1, f_2, \dots, f_N\}$ represents an N-dimensional feature vector composed of the phase features and the fitted feature parameters; $\phi_j$, $j = 1, \dots, L$, denotes the mixing weights; $\Sigma_j$ denotes a covariance matrix; $\mu_j$ denotes a mean vector;
(2) Estimating the parameters of the Gaussian mixture model using the EM algorithm:
(2.1) determining suitable θ and z that maximize the log-likelihood function:
Figure BDA0003782090370000052
wherein $x = (x_1, x_2, x_3, \dots, x_m)$ represents the voice feature vectors, and m is the number of mutually independent voice feature vectors; λ represents the digital audio signal model; θ represents the known model parameters; $z_i \in (z_1, z_2, z_3, \dots, z_i)$ represents the hidden variable corresponding to the feature vector $x_i$; the aim is to maximize $p(x_i, z_i \mid \theta)$;
(2.2) calculating the values of θ and z: with $Q(z)$ denoting the distribution of the hidden variable z under the known samples and model parameters, fixing the parameter θ and determining $Q_i(z_i)$ gives the lower bound of $L(\theta, z)$, i.e.,
Figure BDA0003782090370000053
Then the lower bound is maximized by adjusting θ, which maximizes the likelihood function and yields new model parameters; these are substituted back into step (2.1), and the iteration continues until accurate GMM parameters, and hence a good general background model of the power grid frequency, are obtained.
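The equation images referenced in steps (1) to (2.2) are not reproduced in this text. As a hedged reconstruction, the standard textbook forms that agree with the symbols defined above are given below; they are an assumption about what the images show, not a transcription. The text writes the bounded quantity as L(θ, z); here it is written as L(θ), since the hidden variables are summed out.

```latex
% Gaussian mixture density with L components (step (1))
p(f) = \sum_{j=1}^{L} \phi_j \, \mathcal{N}\!\left(f;\, \mu_j, \Sigma_j\right),
\qquad \sum_{j=1}^{L} \phi_j = 1

% Log-likelihood over m independent feature vectors (step (2.1))
L(\theta) = \sum_{i=1}^{m} \log p(x_i \mid \theta)
          = \sum_{i=1}^{m} \log \sum_{z_i} p(x_i, z_i \mid \theta)

% Lower bound from Jensen's inequality (step (2.2)),
% tight when Q_i(z_i) = p(z_i \mid x_i, \theta)
L(\theta) \;\ge\; \sum_{i=1}^{m} \sum_{z_i} Q_i(z_i)
          \log \frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)}
```

Choosing $Q_i$ as the posterior is the E-step, and maximizing the bound over θ with $Q_i$ held fixed is the M-step.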
Further, adaptively updating the mean parameters of the obtained general background model with the digital audio signals of the training data set comprises:
first, calculating the probability that the j-th feature vector $f_j$ belongs to the i-th joint Gaussian component $p_i(f)$ of the UBM:
Figure BDA0003782090370000054
next, using the calculated $P(i \mid f_j)$ to compute the mean parameters of the GMM of the untampered target digital audio signal:
Figure BDA0003782090370000061
Figure BDA0003782090370000062
finally, using the new sufficient statistics generated from the training data to update the sufficient statistics of the i-th mixture component of the UBM:
Figure BDA0003782090370000063
wherein
Figure BDA0003782090370000064
representing adaptive coefficients for controlling the balance between the new mean and the old estimator;
Figure BDA0003782090370000065
representing adaptive coefficients; k denotes a factor of a fixed parameter.
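The equation images for the adaptation step are likewise not reproduced here. The widely used mean-only MAP adaptation for GMM-UBM systems has the form below, which fits the quantities named in the text (the posterior $P(i \mid f_j)$, an adaptive coefficient balancing new and old estimates, and a fixed factor k). This is a hedged reconstruction under that assumption; the symbols $T$ (number of feature vectors), $n_i$, $E_i(f)$, $\hat{\mu}_i$ and $\alpha_i$ are introduced for illustration.

```latex
% Posterior probability of component i given feature vector f_j
P(i \mid f_j) = \frac{\phi_i \, p_i(f_j)}{\sum_{k'=1}^{L} \phi_{k'} \, p_{k'}(f_j)}

% Sufficient statistics for the mean
n_i = \sum_{j=1}^{T} P(i \mid f_j), \qquad
E_i(f) = \frac{1}{n_i} \sum_{j=1}^{T} P(i \mid f_j)\, f_j

% Adapted mean and data-dependent adaptive coefficient
\hat{\mu}_i = \alpha_i \, E_i(f) + (1 - \alpha_i)\, \mu_i, \qquad
\alpha_i = \frac{n_i}{n_i + k}
```

With this form, components that see many frames move toward the new data, while components that see few stay close to the UBM mean.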
Further, the constructing a feature matrix of the power grid frequency spectrum feature supervector of the digital audio signal according to the target database includes:
taking the mean matrix of the GMM-UBM model derived from each voice as the power grid frequency spectrum characteristic supervector, constructing the feature relation between each voice and the high-dimensional vector, and adjusting and reshaping the mean matrix of each voice to obtain the power grid frequency spectrum characteristic supervector.
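A minimal sketch of the supervector construction described above, assuming each recording has already been adapted into an M × N matrix of component means (M Gaussian components, N features). The function names, the random stand-in data, and the choice of M = 4 are hypothetical; N = 31 follows the feature count mentioned in the text.

```python
import numpy as np

def supervector(adapted_means):
    """Flatten an (M, N) matrix of adapted component means into one long vector."""
    means = np.asarray(adapted_means, dtype=float)
    return means.reshape(-1)

def feature_matrix(per_recording_means):
    """Stack one supervector per recording into a (num_recordings, M * N) matrix."""
    return np.stack([supervector(m) for m in per_recording_means])

# Random stand-in data (assumption): 3 recordings, M = 4 components, N = 31 features.
rng = np.random.default_rng(0)
M, N = 4, 31
recordings = [rng.standard_normal((M, N)) for _ in range(3)]
X = feature_matrix(recordings)
print(X.shape)  # (3, 124)
```

Keeping the mean matrix in its two-dimensional M × N form instead of flattening it is an equally plausible reading when the downstream network expects an image-like input.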
Further, the deep neural network is provided with an attention mechanism and a residual network;
the attention mechanism comprises a convolution layer, a pooling layer, a fully connected layer and a dot multiplication module, and is used for reconstructing the features of the power grid frequency spectrum characteristic supervector and assigning different weights to the features in it;
the residual network is used for training on the specific feature structure of the power grid frequency spectrum characteristic supervector; the feature input to the residual network has size N × M, where N is the number of extracted fitted features (31) and M is the number of Gaussian components; the input size is 224 × 224;
the convolution layer of the residual network uses 5 × 5 convolution kernels;
the residual block is as follows:
$x_{l+1} = h(x_l) + F(x_l, W_l)$;
wherein $h(x_l) = W'_l x_l$; $W'_l$ represents a 1 × 1 convolution operation; $F(x_l, W_l)$ represents the residual part.
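The residual block formula can be illustrated with a small NumPy sketch. The shortcut h is a 1 × 1 convolution as stated in the text; the residual branch F is simplified here to two 1 × 1 convolutions with a ReLU in between, which is an assumption made for brevity (the text describes 5 × 5 convolutions). All function and variable names are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: mixes channels at every position. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def residual_branch(x, w1, w2):
    """Simplified residual part F(x_l, W_l): conv1x1 -> ReLU -> conv1x1 (assumption)."""
    return conv1x1(np.maximum(conv1x1(x, w1), 0.0), w2)

def residual_block(x, w_shortcut, w1, w2):
    """x_{l+1} = h(x_l) + F(x_l, W_l), with h a 1x1 convolution on the skip path."""
    return conv1x1(x, w_shortcut) + residual_branch(x, w1, w2)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 4))       # 2 channels, 4x4 feature map
w_shortcut = np.eye(2)                   # identity shortcut
w1 = np.zeros((3, 2))                    # zero residual branch
w2 = rng.standard_normal((2, 3))
out = residual_block(x, w_shortcut, w1, w2)
print(np.allclose(out, x))  # True
```

With a zero residual branch and an identity shortcut the block passes its input through unchanged, which is the property that lets gradients bypass deep stacks of layers.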
Further, the attention mechanism includes:
the first convolution layer, whose kernel K is an n × n matrix and whose activation function is the ReLU function, is used for shallow feature extraction, with the formula:
Figure BDA0003782090370000071
wherein $M_{ij}$ represents the elements of the input feature map that correspond to the convolution kernel during convolution, and R indicates that the ReLU function is used as the activation function;
the maximum pooling layer is used for a second extraction from the shallow features to obtain a pooled feature map, with the formula:
$H = E(Y_\alpha) + b_2$
wherein $Y_\alpha$ represents the original feature map, and E represents the pooling domain matrix of the feature map; $b_2$ denotes a bias;
the fully connected layer is used for integrating the pooled feature maps;
the dot multiplication module is used for performing dot multiplication of the feature map processed by the fully connected layer with the original feature map.
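The four stages of the attention mechanism (convolution with ReLU, max pooling, fully connected layer, dot multiplication with the original feature map) can be sketched for a single-channel feature map as below. The layer sizes, the 2 × 2 pooling window, the requirement that the fully connected layer outputs one weight per input element, and every name are assumptions made for illustration.

```python
import numpy as np

def conv2d_valid(x, kernel, bias=0.0):
    """Plain 'valid' 2-D convolution (cross-correlation) of one feature map."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel) + bias
    return out

def max_pool(x, size=2, bias=0.0):
    """Non-overlapping max pooling plus a bias, in the spirit of H = E(Y) + b_2."""
    h, w = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3)) + bias

def attention(feature_map, kernel, fc_weights, fc_bias):
    shallow = np.maximum(conv2d_valid(feature_map, kernel), 0.0)  # convolution + ReLU
    pooled = max_pool(shallow)                                    # second extraction
    weights = fc_weights @ pooled.reshape(-1) + fc_bias           # fully connected layer
    return feature_map * weights.reshape(feature_map.shape)       # element-wise product

rng = np.random.default_rng(0)
fmap = rng.standard_normal((6, 6))
kernel = rng.standard_normal((3, 3))
fc_w = rng.standard_normal((36, 4)) * 0.1   # 6x6 -> conv 4x4 -> pool 2x2 -> 4 inputs
fc_b = np.ones(36)
reweighted = attention(fmap, kernel, fc_w, fc_b)
print(reweighted.shape)  # (6, 6)
```

The output has the same shape as the input, so the module can be placed in front of the residual network without changing its input size.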
Further, the tampering detection classification network is composed of a convolution layer, a pooling layer, a fully connected layer and an output layer; the activation function of the output layer is a sigmoid function;
the loss function of the tampering detection classification network is the binary cross-entropy, with the expression:
Figure BDA0003782090370000072
wherein N represents the number of features, y is the label value of each voice, and $p(y)$ represents the probability that the output belongs to label y.
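The loss equation image is not reproduced in this text. The usual binary cross-entropy is the negative mean of y·log(p) + (1 − y)·log(1 − p), and the sketch below implements that standard form as an assumption about what the image shows. Here `probs` is taken to be the predicted probability of label 1, and the small `eps` clamp is an added numerical safeguard, not something the text specifies.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy; probs[i] is the predicted probability that labels[i] == 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite at p = 0 or p = 1
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

labels = [1, 0, 1, 0]                    # hypothetical: 1 = tampered, 0 = original
confident = [0.9, 0.1, 0.8, 0.2]
uncertain = [0.5, 0.5, 0.5, 0.5]
loss_confident = binary_cross_entropy(labels, confident)
loss_uncertain = binary_cross_entropy(labels, uncertain)
print(loss_confident < loss_uncertain)  # True
```

A classifier that always outputs 0.5 scores log 2 (about 0.693), which is a convenient baseline when reading training curves.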
Further, in the fourth step, inputting the obtained shallow features into the pre-constructed tampering detection classification network and distinguishing original voice from tampered voice through a sigmoid function comprises:
1) strengthening the shallow features with the convolution layer, the pooling layer and the fully connected layer through local receptive fields, weight sharing and down-sampling;
2) distinguishing the original voice from the tampered voice with the sigmoid function of the output layer:
$H = \mathrm{Sigmoid}(P * W + b)$;
wherein H represents the output, W represents the weights, b denotes the bias, and P denotes the output of the fully connected layer.
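A short sketch of the output layer H = Sigmoid(P*W + b) for a single feature vector. The 0.5 decision point, the mapping of the larger output to "tampered", and the example weights are assumptions; the text does not state which class label 1 denotes.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def output_layer(p, w, b):
    """H = Sigmoid(P*W + b), with P the fully connected output and W a weight vector."""
    return sigmoid(sum(pi * wi for pi, wi in zip(p, w)) + b)

def classify(p, w, b, threshold=0.5):
    """Map the sigmoid output to a label; label names and threshold are assumptions."""
    return "tampered" if output_layer(p, w, b) >= threshold else "original"

print(output_layer([0.0, 0.0], [1.0, 1.0], 0.0))  # 0.5
print(classify([2.0, 1.0], [1.0, 1.0], -1.0))     # tampered
```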
A system, comprising:
a first module, configured to extract a power grid frequency spectrum characteristic supervector of each digital audio signal by using the trained general background model of the power grid frequency;
a second module, configured to input the extracted power grid frequency spectrum characteristic supervector into a deep representation learning network formed by an attention mechanism and a residual network to learn shallow features;
a third module, configured to input the learned shallow features into a classification network and judge whether the audio has been subjected to deletion or insertion tampering.
By combining all the technical schemes, the invention has the advantages and positive effects that:
By extracting power grid frequency spectrum characteristic supervectors and training a deep neural network on them, the invention automates tamper detection and applies deep networks effectively to this task. The method obtains the power grid frequency spectrum characteristic supervector of each voice by establishing a background model and adaptively updating its parameters, performs representation learning of shallow features through a deep neural network, and feeds the result into a classification network for classification. No empirical threshold selection is involved, and the method offers higher accuracy and better robustness.
To verify the robustness of the invention, it was evaluated on several public databases, where it obtained good results. The significance of automatic detection of digital audio deletion and insertion tampering is that the detection method can be applied to various databases and scenarios; to support such applications, the detection scheme must be robust under a variety of practical conditions.
The method is based on a general background model of the power grid frequency, whose parameters are updated with the EM (expectation-maximization) algorithm; the MAP algorithm allows adaptation from a small amount of data, so that each original audio recording in a database can be adapted into a GMM-UBM model. The invention establishes a deep network that performs shallow-feature representation learning on the power grid frequency spectrum characteristic supervector, and the shallow features are input into a classification network for binary tamper detection classification.
An attention mechanism module is added to the network to reconstruct the features, which increases the weight of important features and strengthens the feature map. The invention establishes a tamper detection network based on a residual network; the residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks by passing the input information directly to the output, protecting the integrity of the information. The method classifies with the sigmoid activation function and evaluates the model with the binary cross-entropy loss function, thereby automating tamper detection.
Drawings
Fig. 1 is a schematic diagram of an automatic detection method for digital audio deletion and insertion tampering oriented operations according to an embodiment of the present invention.
Fig. 2 is a flowchart of an automatic detection method for digital audio deletion and insertion tampering oriented operations according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a deep neural network according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Aiming at the problems in the prior art, the invention provides an automatic detection method for digital audio deletion and insertion tampering operation, and the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the automatic detection method for digital audio deletion and insertion tampering operations according to an embodiment of the present invention includes:
extracting a power grid frequency spectrum characteristic supervector of each digital audio signal by using a trained general background model of the power grid frequency;
inputting the extracted power grid frequency spectrum characteristic supervector into a deep representation learning network formed by an attention mechanism and a residual network to learn shallow features;
inputting the learned shallow features into a classification network, and judging whether the audio has been subjected to deletion or insertion tampering.
As shown in fig. 2, the automatic detection method for digital audio deletion and insertion tampering operations according to an embodiment of the present invention includes the following steps:
S101, preprocessing the original digital audio signal with a band-pass filter, and extracting the power grid frequency component of the signal under test; extracting phase features and fitted feature parameters, and constructing a general background model (UBM) of the power grid frequency;
S102, adaptively updating the parameters of the obtained general background model of the power grid frequency with the digital audio signals of the training data set, and constructing a feature matrix of the power grid frequency spectrum characteristic supervectors of the digital audio signals according to a target database;
S103, inputting the obtained power grid frequency spectrum characteristic supervector into a deep neural network for representation learning, thereby obtaining the shallow features of the power grid frequency spectrum characteristic supervector;
S104, inputting the obtained shallow features into a pre-constructed tampering detection classification network, and distinguishing original voice from tampered voice through a sigmoid function to obtain a tampering detection result.
In the embodiment of the invention, preprocessing the original digital audio signal with the band-pass filter, extracting the power grid frequency component of the signal under test, and extracting the phase features and fitted feature parameters comprise:
band-pass filtering the original digital audio signal $f[n]$ with a 10000-order linear-phase FIR filter to obtain the power grid frequency component $F_{ENFC}[n]$ of the signal under test;
obtaining phase fluctuation features F1 and F2 based on the $\mathrm{DFT}^0$ and $\mathrm{DFT}^1$ transforms, and an instantaneous frequency feature F3 based on the Hilbert transform;
fitting the phase curve and the frequency curve with Sum of Sines and Gaussian expressions, respectively, and combining the phase features and the fitted feature parameters to obtain a feature vector.
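To show why phase features are informative, the sketch below estimates a per-frame phase from the DFT bin at the nominal grid frequency and compares a clean 50 Hz tone with a copy from which five samples (a quarter cycle at the assumed 1 kHz sampling rate) were deleted. This is a simplified peak-bin estimate chosen for illustration, not the DFT-based estimators used in the application; the frame length, Hann window, and all names are assumptions.

```python
import numpy as np

def frame_phases(signal, fs, f0, frame_len):
    """Phase of the DFT bin nearest f0 for each non-overlapping Hann-windowed frame."""
    window = np.hanning(frame_len)
    bin_index = int(round(f0 * frame_len / fs))
    n_frames = len(signal) // frame_len
    phases = []
    for m in range(n_frames):
        frame = signal[m * frame_len:(m + 1) * frame_len]
        spectrum = np.fft.rfft(frame * window)
        phases.append(np.angle(spectrum[bin_index]))
    return np.unwrap(np.array(phases))

fs, f0, frame_len = 1000.0, 50.0, 200     # each frame holds exactly 10 cycles
t = np.arange(2000) / fs
original = np.sin(2 * np.pi * f0 * t)
tampered = np.delete(original, np.arange(1003, 1008))  # delete 5 samples

jump_original = np.max(np.abs(np.diff(frame_phases(original, fs, f0, frame_len))))
jump_tampered = np.max(np.abs(np.diff(frame_phases(tampered, fs, f0, frame_len))))
print(jump_original < 1e-6, jump_tampered > 1.0)  # True True
```

In the untampered tone the frame-to-frame phase is constant, while the deletion produces a jump of roughly a quarter cycle at the edit point. Real recordings show slow phase drift rather than a constant, so the discontinuity has to be judged against that drift.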
Constructing the general background model of the power grid frequency, as provided by the embodiment of the invention, comprises:
(1) Determining a Gaussian mixture model:
Figure BDA0003782090370000101
wherein $f = \{f_1, f_2, \dots, f_N\}$ represents an N-dimensional feature vector composed of the phase features and the fitted feature parameters; $\phi_j$, $j = 1, \dots, L$, denotes the mixing weights; $\Sigma_j$ denotes a covariance matrix; $\mu_j$ denotes a mean vector;
(2) Estimating the parameters of the Gaussian mixture model using the EM algorithm:
(2.1) determining suitable θ and z that maximize the log-likelihood function:
Figure BDA0003782090370000111
wherein $x = (x_1, x_2, x_3, \dots, x_m)$ represents the voice feature vectors, and m is the number of mutually independent voice feature vectors; λ represents the digital audio signal model; θ represents the known model parameters; $z_i \in (z_1, z_2, z_3, \dots, z_i)$ represents the hidden variable corresponding to the feature vector $x_i$; the aim is to maximize $p(x_i, z_i \mid \theta)$;
(2.2) calculating the values of θ and z: with $Q(z)$ denoting the distribution of the hidden variable z under the known samples and model parameters, fixing the parameter θ and determining $Q_i(z_i)$ gives the lower bound of $L(\theta, z)$, i.e.,
Figure BDA0003782090370000112
Then the lower bound is maximized by adjusting θ, which maximizes the likelihood function and yields new model parameters; these are substituted back into step (2.1), and the iteration continues until accurate GMM parameters, and hence a good general background model of the power grid frequency, are obtained.
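The alternation between computing the hidden-variable distribution and re-estimating the parameters can be sketched with a small EM loop. For brevity the example fits a one-dimensional mixture to synthetic data, whereas the text uses N-dimensional feature vectors; the quantile initialization, the variance floor, the iteration count, and the names are assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def fit_gmm_em(x, n_components, n_iter=100):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    weights = np.full(n_components, 1.0 / n_components)
    # Deterministic start (assumption): spread the initial means over data quantiles.
    means = np.quantile(x, (np.arange(n_components) + 0.5) / n_components)
    variances = np.full(n_components, np.var(x))
    for _ in range(n_iter):
        # E-step: Q_i(z_i) is the posterior responsibility of each component.
        joint = weights * gaussian_pdf(x[:, None], means, variances)  # shape (m, L)
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters to maximize the lower bound.
        n_k = resp.sum(axis=0)
        weights = n_k / len(x)
        means = (resp * x[:, None]).sum(axis=0) / n_k
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / n_k + 1e-6
    return weights, means, variances

# Synthetic data (assumption): two well-separated clusters.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 0.5, 500)])
weights, means, variances = fit_gmm_em(data, n_components=2)
print(np.sort(means))
```

A background model trained this way on pooled features from many recordings plays the role of the UBM, which is later adapted to each individual recording.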
In the embodiment of the invention, adaptively updating the mean parameters of the obtained general background model with the digital audio signals of the training data set comprises:
first, calculating the probability that the j-th feature vector $f_j$ belongs to the i-th joint Gaussian component $p_i(f)$ of the UBM:
Figure BDA0003782090370000113
next, using the calculated $P(i \mid f_j)$ to compute the mean parameters of the GMM of the untampered target digital audio signal:
Figure BDA0003782090370000114
Figure BDA0003782090370000115
finally, using the new sufficient statistics generated from the training data to update the sufficient statistics of the i-th mixture component of the UBM:
Figure BDA0003782090370000121
wherein
Figure BDA0003782090370000122
denotes the adaptive coefficient that controls the balance between the new mean estimate and the old one;
Figure BDA0003782090370000123
defines the adaptive coefficient; k denotes a fixed parameter factor.
Constructing the feature matrix of the power grid frequency spectrum feature supervector of the digital audio signal from the target database, as provided by the embodiment of the invention, comprises:
taking the mean matrix of the GMM-UBM model derived from each speech sample as the power grid frequency spectrum feature supervector, constructing the feature relation between each speech sample and the high-dimensional vector, adjusting the mean matrix of each speech sample, and obtaining the power grid frequency spectrum feature supervector by reconstruction.
As shown in fig. 3, the deep neural network provided by the embodiment of the present invention is provided with an attention mechanism and a residual network;
the attention mechanism comprises a convolution layer, a pooling layer, a fully connected layer and a dot multiplication module, and is used for performing feature reconstruction of the power grid frequency spectrum feature supervector and assigning different weights to the features in the supervector;
the residual network is used for training the power grid frequency spectrum feature supervector into a specific feature structure; the size of the feature vector input to the residual network is N × M, where N = 31 is the number of extracted fitting features and M is the number of Gaussian components; the input size is 224 × 224.
The attention mechanism provided by the embodiment of the invention comprises:
a first convolution layer K, which is a matrix with a convolution kernel of size n × n and uses the ReLU function as its activation function, for shallow feature extraction, with the formula:
Figure BDA0003782090370000124
wherein $M_{ij}$ denotes the elements of the input feature map covered by the convolution kernel during convolution, and R denotes that the ReLU function is used as the activation function;
a max pooling layer, used for secondary extraction of the shallow features to obtain a pooled feature map, with the formula:
$H = E(Y_\alpha) + b_2$
wherein $Y_\alpha$ denotes the original feature map, E denotes the pooling-domain matrix of the feature map, and $b_2$ denotes the bias;
the full connection layer is used for integrating the pooled feature maps;
and the dot multiplication module is used for performing dot multiplication on the feature map processed by the full connection layer and the original feature map.
The convolution layer of the residual network provided by the embodiment of the invention is a 5 × 5 convolution layer;
the residual block is as follows:
$x_{l+1} = h(x_l) + F(x_l, W_l)$;
wherein $h(x_l) = W'_l x_l$; $W'_l$ represents a 1 × 1 convolution operation; $F(x_l, W_l)$ represents the residual part.
The tamper detection classification network provided by the embodiment of the invention consists of a convolution layer, a pooling layer, a fully connected layer and an output layer; the activation function of the output layer is the sigmoid function;
the loss function of the tamper detection classification network provided by the embodiment of the invention is the binary cross entropy, expressed as:
Figure BDA0003782090370000131
wherein N represents the number of features, y is the label value of each speech sample, and p(y) represents the probability that the output belongs to label y.
Inputting the trained shallow features into the pre-constructed tamper detection classification network and distinguishing original speech from tampered speech through the sigmoid function, as provided by the embodiment of the invention, comprises:
1) strengthening the shallow features with the convolution layer, the pooling layer and the fully connected layer through local receptive fields, weight sharing and down-sampling;
2) distinguishing original speech from tampered speech with the sigmoid function of the output layer:
H = Sigmoid(P*W + b);
wherein H represents the output, W represents the weight, b represents the bias, and P represents the output of the fully connected layer.
The technical solution of the present invention is further described with reference to the following specific embodiments.
Example 1:
The invention aims to provide an automatic detection method for digital audio deletion and insertion tampering operations. The power grid frequency component of the signal under test is extracted, phase features and fitting feature parameters are then extracted from it, and a universal background model is trained. The mean parameters of the resulting background model are adaptively updated with the digital audio signals of the training data set, so that a target GMM-UBM model can be derived from each speech sample, and the mean matrix of each GMM-UBM is used as the power grid frequency spectrum feature supervector. The supervector is input into a deep neural network for representation learning of shallow features. The deep neural network consists of an attention mechanism and a residual network, has good feature extraction and representation learning capability, and can further train the shallow features. The shallow features obtained through the representation learning of the deep network are then input into the classification network. The classification network consists of a convolution layer, a pooling layer, a fully connected layer and an output layer, and a sigmoid function is used as the activation function of the output layer. The features are further trained through convolution, pooling and full connection, and the sigmoid function finally determines whether tampering has occurred, thereby automating tamper detection.
Referring to fig. 1, the automatic detection method for digital audio deletion and insertion tampering operation of the present invention comprises the following steps:
Step 1: extracting the power grid frequency component of the original digital audio with the designed band-pass filter, then extracting the phase features and fitting feature parameters, and establishing a universal background model of the power grid frequency;
the specific implementation comprises the following substeps:
Step one): band-pass filtering is applied to the original digital audio signal $f[n]$ to obtain the power grid frequency component $F_{ENFC}[n]$ of the signal under test. The band-pass filter designed by the invention is a 10000-order linear-phase FIR filter; a high-order filter is used in order to obtain an ideal narrow-band signal. The center frequency is the ENF standard frequency, the bandwidth is 0.6 Hz, the passband ripple is 0.5 dB, and the stopband attenuation is 100 dB. The phase fluctuation features F1 and F2 are obtained from the DFT0 and DFT1 transforms, and the instantaneous frequency feature F3 is obtained from the Hilbert transform. The phase curve and the frequency curve are fitted with Sum of Sines and Gaussian expressions respectively, and the phase features and fitting feature parameters are combined to obtain the feature vector.
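The filtering and instantaneous-frequency steps above can be sketched as follows. This is a minimal illustration, not the patent's filter design: the 50 Hz nominal frequency, the 1000 Hz sampling rate, the default Hamming window used by `firwin`, and the function names `extract_enf_component` and `instantaneous_frequency` are all assumptions, and a windowed design like this one is not tuned to the stated 0.5 dB ripple or 100 dB stopband attenuation.

```python
import numpy as np
from scipy.signal import firwin, fftconvolve, hilbert


def extract_enf_component(x, fs, f0=50.0, bandwidth=0.6, numtaps=10001):
    """Band-pass x around the nominal grid frequency f0 with a linear-phase FIR filter."""
    # numtaps = order + 1; a symmetric odd-length design has exactly linear phase.
    h = firwin(numtaps, [f0 - bandwidth / 2, f0 + bandwidth / 2],
               pass_zero=False, fs=fs)
    # mode="same" centres the symmetric kernel, which removes the group delay.
    return fftconvolve(x, h, mode="same")


def instantaneous_frequency(x, fs):
    """Instantaneous frequency in Hz from the phase of the analytic (Hilbert) signal."""
    phase = np.unwrap(np.angle(hilbert(x)))
    return np.diff(phase) * fs / (2.0 * np.pi)


if __name__ == "__main__":
    fs = 1000.0  # assumed sampling rate
    t = np.arange(0, 30.0, 1.0 / fs)
    # Weak 50 Hz hum buried under a stronger 220 Hz tone (synthetic stand-in for audio).
    audio = 0.1 * np.sin(2 * np.pi * 50.0 * t) + np.sin(2 * np.pi * 220.0 * t)
    enf = extract_enf_component(audio, fs)
    f_inst = instantaneous_frequency(enf, fs)
    core = f_inst[len(f_inst) // 4: 3 * len(f_inst) // 4]  # skip filter edge effects
    print("median instantaneous frequency (Hz):", np.median(core))
```

Because the filter is 10001 taps long, roughly the first and last 5000 output samples are affected by edge transients, so features are best computed from the interior of the filtered signal.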
Step two), building a UBM model;
The universal background model (UBM) is a Gaussian mixture model (GMM). A Gaussian mixture model is a linear combination of L Gaussian distribution functions, given by:
Figure BDA0003782090370000151
where $f = \{f_1, f_2, \ldots, f_N\}$ is an N-dimensional feature vector consisting of the phase features and fitting feature parameters, $\phi_j$, $j = 1, \ldots, L$, are the mixing weights, $\sigma_j$ is the covariance matrix, and $\mu_j$ is the mean vector. The complete Gaussian mixture model consists of the weight parameters, mean vectors and covariance matrices, and is represented as:
Figure BDA0003782090370000152
The parameters of the Gaussian mixture model are then estimated with the EM algorithm.
The EM algorithm has two steps. The first is the E-step: given m mutually independent speech feature vectors $x = (x_1, x_2, x_3, \ldots, x_m)$ and a digital audio signal model λ with known model parameters θ, each feature vector $x_i$ has a corresponding hidden variable $z_i \in (z_1, z_2, z_3, \ldots, z_i)$ such that $p(x_i, z_i \mid \theta)$ is maximized. The goal is to find suitable θ and z that maximize the log-likelihood function:
Figure BDA0003782090370000153
The second is the M-step. Solving for θ and z directly is a difficult mathematical problem; based on an analysis of the likelihood function, the following expression is constructed:
Figure BDA0003782090370000154
Let
Figure BDA0003782090370000155
Then
Figure BDA0003782090370000156
This shows that the above expression introduces an unknown new distribution $Q_i(z_i)$ that satisfies:
Figure BDA0003782090370000157
Applying Jensen's inequality yields:
Figure BDA0003782090370000158
Substituting this into the original expression gives:
Figure BDA0003782090370000161
By Jensen's inequality, equality holds when the random variable is a constant, that is:
Figure BDA0003782090370000162
and since
Figure BDA0003782090370000163
it follows that:
Figure BDA0003782090370000164
Hence Q(z) is the distribution of the hidden variable z given the known samples and model parameters. $Q_i(z_i)$ is thus obtained with the parameter θ held fixed, which establishes a lower bound on $L(\theta, z)$, i.e.
Figure BDA0003782090370000165
This lower bound is maximized by adjusting θ.
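For reference, the lower-bound argument described above follows the standard textbook EM derivation, which can be written in the notation of this section as below. This is a reconstruction under that assumption and not a transcription of the figures in the source.

```latex
\begin{aligned}
L(\theta) &= \sum_{i=1}^{m} \log \sum_{z_i} p(x_i, z_i \mid \theta)
           = \sum_{i=1}^{m} \log \sum_{z_i} Q_i(z_i)\,
             \frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)} \\
          &\ge \sum_{i=1}^{m} \sum_{z_i} Q_i(z_i)
             \log \frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)},
          \qquad \sum_{z_i} Q_i(z_i) = 1,\quad Q_i(z_i) \ge 0, \\
\text{with equality when}\quad
\frac{p(x_i, z_i \mid \theta)}{Q_i(z_i)} &= c
\;\Longrightarrow\;
Q_i(z_i) = \frac{p(x_i, z_i \mid \theta)}{\sum_{z} p(x_i, z \mid \theta)}
         = p(z_i \mid x_i, \theta).
\end{aligned}
```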
After the likelihood function has been maximized to obtain new model parameters, these parameters are substituted back into the first step, and more accurate GMM parameters are obtained through repeated iteration. A well-trained UBM is thus obtained.
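The alternation between the two steps can be illustrated with a small NumPy sketch. It assumes diagonal covariance matrices and random initialisation from the data, neither of which is stated in the source; `fit_gmm_em` and `log_gaussians` are hypothetical names chosen for this example.

```python
import numpy as np


def log_gaussians(X, means, variances):
    """Log density of every row of X under each diagonal Gaussian, shape (n, L)."""
    diff = X[:, None, :] - means[None, :, :]
    quad = np.sum(diff ** 2 / variances[None, :, :], axis=2)
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    return -0.5 * (quad + log_det[None, :])


def fit_gmm_em(X, n_components, n_iter=50, seed=0, var_floor=1e-6):
    """Fit an L-component diagonal GMM to X (n, d) with plain EM."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumed initialisation: random data points as means, global variance, equal weights.
    means = X[rng.choice(n, size=n_components, replace=False)].copy()
    variances = np.tile(np.maximum(X.var(axis=0), var_floor), (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: posterior Q_i(z_i) = p(z_i | x_i, theta) for the current parameters.
        log_joint = np.log(weights)[None, :] + log_gaussians(X, means, variances)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        resp = np.exp(log_joint - log_norm[:, None])
        log_likelihoods.append(float(log_norm.sum()))
        # M-step: maximise the lower bound with respect to theta.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        second_moment = (resp.T @ (X ** 2)) / nk[:, None]
        variances = np.maximum(second_moment - means ** 2, var_floor)
    return weights, means, variances, log_likelihoods
```

The returned `log_likelihoods` list records the log-likelihood before each update; for EM it should be non-decreasing, which makes it a convenient convergence check.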
Step 2: constructing the feature matrix of the power grid frequency spectrum feature supervector of the digital audio signal from the target database;
the specific implementation comprises the following substeps:
To obtain the GMM-UBM model, the UBM parameters from Step 1 are updated on a target database containing original and tampered speech using the MAP adaptation method, so that a Gaussian mixture model can be derived from each digital audio signal under test.
1) The adaptation process is likewise a parameter-update process and comprises the following steps. First, the probability that the j-th feature vector $f_j$ belongs to the i-th Gaussian component $p_i(f)$ of the UBM is calculated:
Figure BDA0003782090370000166
2) Second, the calculated $P(i \mid f_j)$ is used to compute the mean parameters of the GMM of the untampered target digital audio signal:
Figure BDA0003782090370000171
Figure BDA0003782090370000172
3) Finally, these new sufficient statistics generated from the training data are used to update the sufficient statistics of the i-th mixture component of the UBM:
Figure BDA0003782090370000173
wherein
Figure BDA0003782090370000174
is the adaptive coefficient that controls the balance between the new mean estimate and the old one. The adaptive coefficient is defined as
Figure BDA0003782090370000175
where k is a fixed parameter factor, for which the invention uses the empirical value 16. The mean matrix of each GMM-UBM is taken as the power grid frequency spectrum feature supervector; the feature relation between each speech sample and the high-dimensional vector is constructed, the mean matrix of each speech sample is adjusted, and the power grid frequency spectrum feature supervector is obtained by reconstruction.
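A minimal sketch of the mean adaptation and supervector construction follows. Since the source equations are not reproduced in the text, it assumes the commonly used relevance-MAP form (soft counts $n_i$, first-order statistics $E_i$, coefficient $\alpha_i = n_i / (n_i + k)$, and adapted mean $\alpha_i E_i + (1 - \alpha_i)\mu_i$) together with diagonal covariances; the function name `map_adapt_supervector` is hypothetical.

```python
import numpy as np


def map_adapt_supervector(F, weights, means, variances, k=16.0):
    """Adapt UBM means to the frames F (n, d); return the mean matrix and its flattening."""
    # Posterior P(i | f_j) of each diagonal-covariance UBM component.
    diff = F[:, None, :] - means[None, :, :]
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)
             - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :])
    post = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1)[:, None])

    n_i = post.sum(axis=0)                                 # soft counts per component
    E_i = (post.T @ F) / np.maximum(n_i, 1e-10)[:, None]   # data mean seen by each component
    alpha = n_i / (n_i + k)                                # adaptive coefficient, k = 16 in the text
    adapted = alpha[:, None] * E_i + (1.0 - alpha[:, None]) * means
    return adapted, adapted.reshape(-1)
```

Components that receive almost no data keep their UBM means (alpha near 0), while well-observed components move toward the data mean (alpha near 1), which is the balance the adaptive coefficient is meant to control.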
Step 3: inputting the power grid frequency spectrum feature supervector into the designed tamper detection deep network for representation learning of shallow features;
the method specifically comprises the following steps:
The power grid frequency spectrum feature supervector obtained above is input into a deep neural network for representation learning of shallow features. The deep neural network has good feature extraction and representation learning capability, and by modeling the representation of the input signal it can further train the shallow features, namely the power grid frequency spectrum feature supervector.
Step A1: input to the attention mechanism
As shown in the attention mechanism network diagram of fig. 3, weights are constructed through convolution, pooling and dot multiplication to readjust the feature map. Different weights are assigned to the features in the power grid frequency spectrum feature supervector in order to strengthen important features and weaken marginal ones. M denotes the two-dimensional feature map obtained by transforming the power grid frequency spectrum feature supervector, the first convolution layer K is a matrix with a convolution kernel of size n × n, and Y is obtained after filtering with the convolution kernel. The convolution is calculated as:
Figure BDA0003782090370000176
wherein $M_{ij}$ denotes the elements of the input feature map covered by the convolution kernel during convolution, and R denotes the ReLU activation function.
The convolution is followed by a pooling layer, which performs a secondary extraction of features. The invention uses max pooling, in which the maximum value within each region is selected to represent that region. The high-level feature map obtained after pooling not only reduces the dimensionality and parameter count of the original feature map but also helps avoid problems such as overfitting. The pooling formula is:
$H = E(Y_\alpha) + b_2$
wherein $Y_\alpha$ denotes the original feature map, E is the pooling-domain matrix of the feature map, and $b_2$ is the bias; traversing the pooling domains of the original feature map yields the pooled feature map H. After the original feature map M has been processed by convolution, pooling and full connection, the resulting feature map is dot-multiplied with the original feature map to reconstruct it.
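The attention path described above (convolution with ReLU, max pooling, full connection, then element-wise multiplication with the original map) can be sketched in NumPy as follows. The 2 × 2 pooling window, the absence of any extra normalisation on the weights, and the names `conv2d_relu`, `max_pool` and `attention_reweight` are assumptions made for illustration; the source does not give these details.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_relu(M, K, b=0.0):
    """Valid 2-D convolution (cross-correlation, as in deep learning frameworks) plus ReLU."""
    windows = sliding_window_view(M, K.shape)        # (H-n+1, W-n+1, n, n)
    Y = np.einsum("ijkl,kl->ij", windows, K) + b
    return np.maximum(Y, 0.0)


def max_pool(Y, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h, w = (Y.shape[0] // size) * size, (Y.shape[1] // size) * size
    blocks = Y[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


def attention_reweight(M, K, W_fc, b_fc):
    """Build one weight per element of M and multiply it back onto M."""
    H = max_pool(conv2d_relu(M, K))
    weights = (H.reshape(-1) @ W_fc + b_fc).reshape(M.shape)  # fully connected layer
    return weights * M                                        # element-wise multiplication
```

For a 31 × 32 supervector and a 3 × 3 kernel, the pooled map has 14 × 15 = 210 entries, so `W_fc` would have shape (210, 992) in this sketch.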
Step A2: input to the residual network
After feature reconstruction by the attention mechanism, the power grid frequency spectrum feature supervector is input into the residual module, which trains the features into a specific structure. The residual module is based on ResNet18, from which the invention removes the high-order convolution layers, reducing both the number of parameters and the computing resources required. For image tasks, image pixels are fed to the neural network directly; for the speech tamper detection task of the invention, a series of features must first be extracted from the original waveform, and the extracted two-dimensional features are then converted into three-dimensional features for input to the neural network. The size of the input feature vector is N × M, where N = 31 is the number of extracted fitting features and M is the number of Gaussian components.
The residual block can be expressed as:
$x_{l+1} = h(x_l) + F(x_l, W_l)$
where $h(x_l) = W'_l x_l$, $W'_l$ is a 1 × 1 convolution operation, and $F(x_l, W_l)$ is the residual part.
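The structure of this block can be shown with a short NumPy sketch in which the residual branch is passed in as a function. The channel-first layout (C, H, W) and the names `conv1x1` and `residual_block` are assumptions for illustration; the actual residual branch of the network is not specified beyond the formula.

```python
import numpy as np


def conv1x1(x, W):
    """1x1 convolution: mixes channels at each position. x: (C_in, H, W); W: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", W, x)


def residual_block(x, W_shortcut, residual_fn):
    """x_{l+1} = h(x_l) + F(x_l, W_l), with h a 1x1 convolution on the shortcut path."""
    return conv1x1(x, W_shortcut) + residual_fn(x)
```

With an identity `W_shortcut` and a zero residual branch the block returns its input unchanged, which is the property that makes residual networks easy to optimise.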
In addition, compared with the 224 × 224 input suggested for the conventional ResNet18 network, the feature dimension input by the invention is much smaller than that of an image. The convolution layers repeatedly down-sample, increasing the number of channels and reducing the size of the feature map. Because the input of the invention is smaller than the recommended input size, the generated feature map becomes too small and part of the features are lost. To further reduce parameters and computation, the invention replaces the 7 × 7 convolution layer with a 5 × 5 convolution layer, which significantly reduces the number of parameters.
Step 4: inputting the shallow features into the constructed tamper detection classification network, and distinguishing original speech from tampered speech through the sigmoid function.
The specific implementation comprises the following substeps:
The shallow features obtained from the power grid frequency spectrum feature supervector through the representation learning of the deep network are passed to a classification network, which determines whether tampering has occurred. The tamper detection classification network is shown in fig. 2.
1) The features are further learned through a convolution layer and a pooling layer. The number of shallow feature parameters obtained is too large for them to be used directly for tamper classification with the best results. The classification network therefore adds a convolution layer, a pooling layer and a fully connected layer, which strengthen the shallow features through local receptive fields, weight sharing and down-sampling.
2) The activation function of the output layer is the sigmoid function, given by:
Figure BDA0003782090370000191
The formula shows that the sigmoid function maps its output into (0, 1) and is monotonic and continuous; its output range is bounded, so optimization is stable, and it is well suited to binary classification. The sigmoid layer of the invention is expressed as:
H = Sigmoid(P*W + b)
where H is the output, W is the weight, b is the bias, and P is the output of the fully connected layer.
Binary cross entropy is a loss function commonly used in binary classification problems. Its expression is:
Figure BDA0003782090370000192
where N is the number of features, y is the label value of each speech sample, and p(y) is the probability that the output belongs to label y. Loss is the value of the binary cross entropy loss function and is used to judge the quality of the model of the invention.
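The output layer and loss can be sketched in NumPy as follows. The sketch assumes the standard definitions, sigmoid(z) = 1 / (1 + exp(-z)) and the mean binary cross entropy over the label/probability pairs, because the formula figures are not reproduced in the text; the clipping constant `eps` and the function names are illustrative choices.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping any real input into (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


def output_layer(P, W, b):
    """H = Sigmoid(P*W + b), with P the output of the fully connected layer."""
    return sigmoid(P @ W + b)


def binary_cross_entropy(y, p, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)]; p is clipped to avoid log(0)."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

A prediction of 0.5 for every sample gives a loss of ln 2 regardless of the labels, which is a useful baseline when reading training curves.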
The technical effects of the present invention will be further explained in conjunction with simulation experiments.
The invention uses 2397 speech recordings from the Ahumada-25 database as the original speech from which signal features are extracted, and establishes the UBM of the original speech. The model of the invention was evaluated in experiments on three target databases: Carioca (consisting of the Carioca1 and Carioca2 databases), New Spanish, and the self-built database ENF-HG. The four databases contain 3253 samples in total, and the power grid frequency spectrum feature supervector obtained for each sample has dimension 31 × 32. The supervectors are extracted on the MATLAB platform, saved in csv format, and input into a network built in Keras for training. Because the number of Gaussian components affects the dimension of the extracted supervector, the influence of the number of Gaussians on the model was verified for 16, 32, 64 and 128 components. As shown in Table 1, the highest accuracies on the three databases reach 95.0%, 94.2% and 93.7% respectively.
Table 1:
Number of Gaussians | Carioca | New Spanish | ENF-HG | All data
16 | 0.942 | 0.933 | 0.932 | 0.938
32 | 0.950 | 0.942 | 0.937 | 0.951
64 | 0.928 | 0.914 | 0.937 | 0.928
128 | 0.895 | 0.911 | 0.923 | 0.932
The positive effects of the present invention will be further described below with reference to specific experimental data.
1) Different network architectures
To verify the feasibility of the extracted features, they are input into both traditional machine learning classifiers and deep networks for training. To better demonstrate their feasibility, experiments were performed on the individual data sets and on all data combined.
For the traditional machine learning models, SVM, random forest, decision tree, logistic regression and XGBoost were compared. As shown in Table 2, the features of the invention perform poorly on decision trees. The Carioca database performs better on the SVM, reaching 90.6%. The New Spanish database performs better on XGBoost, reaching 92.1%, and the self-built database ENF-HG also performs better on XGBoost, at 92.3%.
TABLE 2
Figure BDA0003782090370000211
To compare results across deep models, the invention compares a CNN (self-designed), ResNet50, ResNet34, ResNet18 and the tamper detection network, where the tamper detection network is the deep neural network and classification network system designed by the invention. The performance of the different databases in each neural network is also compared. Table 3 shows that the power grid frequency spectrum feature supervector performs best in the tamper detection network. Compared with Table 2, the supervector also performs better in deep networks. Overall, the proposed model outperforms the other model structures and features on the data sets.
TABLE 3
Figure BDA0003782090370000212
2) Comparison of existing methods
The best method proposed by the invention was also compared with the experimental results of other researchers on the public databases Carioca1, Carioca2 and New Spanish. The results are shown in Table 4.
TABLE 4
Figure BDA0003782090370000221
Table 4 shows that the accuracy obtained with a single phase feature or frequency feature is not very high, whereas the power grid frequency spectrum feature supervector used in the invention achieves higher accuracy.
In the description of the present invention, "a plurality" means two or more unless otherwise specified; the terms "upper", "lower", "left", "right", "inner", "outer", "front", "rear", "head", "tail", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are only for convenience in describing and simplifying the description, and do not indicate or imply that the device or element referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, should not be construed as limiting the invention. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. It will be appreciated by those skilled in the art that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, for example such code provided on a carrier medium such as a diskette, CD-or DVD-ROM, a programmable memory such as read-only memory (firmware) or a data carrier such as an optical or electronic signal carrier. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.
The above description is only for the purpose of illustrating the embodiments of the present invention, and the scope of the present invention should not be limited thereto, and any modifications, equivalents and improvements made by those skilled in the art within the technical scope of the present invention as disclosed in the present invention should be covered by the scope of the present invention.

Claims (10)

1. An automatic detection method for digital audio deletion and insertion tampering operations, characterized by comprising the following steps:
extracting a power grid frequency spectrum feature supervector of each digital audio signal by using a trained universal background model of the power grid frequency;
inputting the extracted power grid frequency spectrum feature supervector into a deep representation learning network formed by an attention mechanism and a residual network to learn shallow features;
inputting the trained shallow features into a classification network, and judging whether deletion or insertion tampering has occurred.
2. The automatic detection method for digital audio deletion and insertion tampering operations of claim 1, wherein the method comprises:
preprocessing the original digital audio signal with a band-pass filter and extracting the power grid frequency component of the signal under test; extracting phase features and fitting feature parameters, and constructing a universal background model of the power grid frequency;
adaptively updating the parameters of the obtained universal background model of the power grid frequency with the digital audio signals of the training data set, and constructing the feature matrix of the power grid frequency spectrum feature supervector of the digital audio signal from a target database;
inputting the obtained power grid frequency spectrum feature supervector into a deep neural network for representation learning of shallow features, to obtain the shallow features, namely the power grid frequency spectrum feature supervector;
inputting the trained shallow features into a pre-constructed tamper detection classification network, and distinguishing original speech from tampered speech through a sigmoid function to obtain a tamper detection result.
3. The automatic detection method for digital audio deletion and insertion tampering operations of claim 2, wherein preprocessing the original digital audio signal with the band-pass filter to extract the power grid frequency component of the signal under test, and extracting the phase features and fitting feature parameters, comprises:
band-pass filtering the original digital audio signal $f[n]$ with a 10000-order linear-phase FIR filter to obtain the power grid frequency component $F_{ENFC}[n]$ of the signal under test;
obtaining the phase fluctuation features F1 and F2 from the DFT0 and DFT1 transforms, and obtaining the instantaneous frequency feature F3 from the Hilbert transform;
fitting the phase curve and the frequency curve with Sum of Sines and Gaussian expressions respectively, and combining the phase features and the fitting feature parameters to obtain the feature vector.
4. The method for automatic detection of digital audio deletion and insertion tampering operations as defined in claim 2, wherein said constructing a generic background model of the grid frequency comprises:
(1) Determining a Gaussian mixture model:
Figure FDA0003782090360000021
wherein $f = \{f_1, f_2, \ldots, f_N\}$ represents an N-dimensional feature vector composed of the phase features and the fitting feature parameters; $\phi_j$, $j = 1, \ldots, L$, denotes the mixing weights; $\sigma_j$ denotes the covariance matrix; $\mu_j$ denotes the mean vector;
(2) Estimating the parameters of the Gaussian mixture model using the EM algorithm:
(2.1) finding suitable θ and z that maximize the log-likelihood function:
Figure FDA0003782090360000022
wherein $x = (x_1, x_2, x_3, \ldots, x_m)$ represents the speech feature vectors, and m represents the number of mutually independent speech feature vectors; λ represents the digital audio signal model; θ represents the known model parameters; $z_i \in (z_1, z_2, z_3, \ldots, z_i)$ represents the hidden variable corresponding to the feature vector $x_i$, chosen such that $p(x_i, z_i \mid \theta)$ is maximized;
(2.2) calculating the values of θ and z: since Q(z) is the distribution of the hidden variable z given the known samples and model parameters, $Q_i(z_i)$ is determined with the parameter θ held fixed, which establishes a lower bound on $L(\theta, z)$, namely
Figure FDA0003782090360000023
The lower bound is maximized by adjusting θ; after the likelihood function has been maximized to obtain new model parameters, these are substituted back into step (2.1), and iteration continues until more accurate GMM parameters are obtained, yielding a well-trained universal background model of the power grid frequency.
5. The automatic detection method for digital audio deletion and insertion tampering operations of claim 2, wherein adaptively updating the mean parameters of the obtained universal background model with the digital audio signals of the training data set comprises:
computing the probability that the j-th feature vector $f_j$ belongs to the i-th Gaussian component $p_i(f)$ of the UBM:
Figure FDA0003782090360000031
using the calculated $P(i \mid f_j)$ to compute the mean parameters of the GMM of the untampered target digital audio signal:
Figure FDA0003782090360000032
Figure FDA0003782090360000033
updating the sufficient statistics of the ith mixing member of the UBM with the new sufficient statistics generated from the training data:
Figure FDA0003782090360000034
wherein
Figure FDA0003782090360000035
denotes the adaptive coefficient that controls the balance between the new mean estimate and the old one;
Figure FDA0003782090360000036
defines the adaptive coefficient; k denotes a fixed parameter factor.
6. The automatic detection method for digital audio deletion and insertion tampering operations of claim 2, wherein constructing the feature matrix of the power grid frequency spectrum feature supervector of the digital audio signal from the target database comprises:
taking the mean matrix of the GMM-UBM model derived from each speech sample as the power grid frequency spectrum feature supervector, constructing the feature relation between each speech sample and the high-dimensional vector, adjusting the mean matrix of each speech sample, and obtaining the power grid frequency spectrum feature supervector by reconstruction.
7. The method for automatically detecting digital audio deletion and insertion tampering operations according to claim 2, wherein the deep neural network is provided with an attention mechanism and a residual network, wherein:
the attention mechanism comprises a convolution layer, a pooling layer, a fully connected layer and a dot multiplication module, and is used for reconstructing the features of the power grid frequency spectrum feature supervector and assigning different weights to the features in the power grid frequency spectrum feature supervector;
the residual network is used for learning the specific feature structure of the power grid frequency spectrum feature supervector; the size of the feature input to the residual network is N × M, where N represents the extracted fitted features (31) and M represents the Gaussian components; the input size is 224 × 224;
the convolution layer of the residual network is a 5 × 5 convolution layer;
the residual block is as follows:
x_{l+1} = h(x_l) + F(x_l, W_l);
wherein h(x_l) = W_l' x_l; W_l' represents a 1 × 1 convolution operation; and F(x_l, W_l) represents the residual part;
the attention mechanism includes:
a first convolution layer, whose kernel K is an n × n matrix and whose activation function is the ReLU function, used for shallow feature extraction according to the following formula:
Figure FDA0003782090360000041
wherein M_ij represents the elements of the input feature map that correspond to the convolution kernel during convolution, and R indicates that the ReLU function is used as the activation function;
a max pooling layer, used for a second extraction of the shallow features to obtain a pooled feature map according to the following formula:
H = E(Y_α) + b_2
wherein Y_α represents the original feature map, E represents the pooling-domain matrix of the feature map, and b_2 represents the bias;
a fully connected layer, used for integrating the pooled feature maps; and
a dot multiplication module, used for dot-multiplying the feature map processed by the fully connected layer with the original feature map.
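The residual block and the attention steps recited in this claim can be sketched in plain NumPy as follows. This is an illustrative toy, not the claimed network: the function names, the single 5 × 5 convolution used for F, the 2 × 2 pooling window, the sigmoid squashing of the attention weights, and the reading of "dot multiplication" as an element-wise product are all assumptions.

```python
import numpy as np

def conv2d_same(x, kernels):
    """Naive zero-padded 'same' convolution (cross-correlation).

    x: (C_in, H, W); kernels: (C_out, C_in, k, k) with odd k.
    """
    c_out, _, k, _ = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, w = x.shape
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            # Add the contribution of kernel tap (i, j) at every output position.
            out += np.einsum('oc,chw->ohw', kernels[:, :, i, j],
                             xp[:, i:i + h, j:j + w])
    return out

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w_res, w_proj):
    """x_{l+1} = h(x_l) + F(x_l, W_l), with h a 1x1 convolution."""
    shortcut = conv2d_same(x, w_proj)        # h(x_l) = W_l' x_l, 1x1 kernels
    residual = relu(conv2d_same(x, w_res))   # F(x_l, W_l): one 5x5 conv + ReLU (assumed)
    return shortcut + residual

def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def attention(x, w_conv, w_fc, b_fc):
    """Conv + ReLU -> max pool -> fully connected -> multiply with the input."""
    shallow = relu(conv2d_same(x, w_conv))
    pooled = max_pool2x2(shallow)
    # The fully connected layer produces one weight per element of the input.
    scores = w_fc @ pooled.reshape(-1) + b_fc
    weights = 1.0 / (1.0 + np.exp(-scores))  # squashing to (0, 1) is an assumption
    return x * weights.reshape(x.shape)      # element-wise product with the original map
```

With a 1 × 1 projection weight of 1 and all-zero residual kernels, `residual_block` returns its input unchanged, which is a quick check that the shortcut path is wired correctly.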
8. The automatic detection method for digital audio deletion and insertion tampering operations of claim 2, wherein the tamper detection classification network is composed of a convolution layer, a pooling layer, a fully connected layer, and an output layer; the output layer uses a sigmoid activation function;
the loss function of the tamper detection classification network is binary cross-entropy, expressed as follows:
Figure FDA0003782090360000042
wherein N represents the number of features, y is the label value of each speech sample, and p(y) represents the probability that the output belongs to label y.
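The loss expression itself is only available as a figure reference. The sketch below implements the usual form of binary cross-entropy, the negative mean of y·log p + (1 − y)·log(1 − p), which is assumed to match the figure; the function name and the clipping constant `eps` are illustrative.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

A classifier that outputs 0.5 for every sample scores log 2 ≈ 0.693 regardless of the labels, which is a useful baseline when reading training curves.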
9. The automatic detection method for digital audio deletion and insertion tampering operations of claim 2, wherein in step four, inputting the obtained shallow features into a pre-constructed tamper detection classification network and distinguishing the original speech from the tampered speech by the sigmoid function comprises:
1) Reinforcing the shallow features using the convolution layer, pooling layer and fully connected layer of the tamper detection classification network, through local receptive fields, weight sharing and down-sampling;
2) Distinguishing original speech from tampered speech using the sigmoid function of the output layer of the tamper detection classification network:
H = Sigmoid(P * W + b);
wherein H represents the output, W represents the weight, b represents the bias, and P represents the output of the fully connected layer.
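The output-layer step H = Sigmoid(P * W + b) can be sketched as below. Reading P * W as a matrix-vector product, thresholding H at 0.5, and treating label 1 as "tampered" are assumptions made for illustration; the claim does not fix any of them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(P, W, b, threshold=0.5):
    """Return H = Sigmoid(P @ W + b) and hard 0/1 decisions.

    P: (batch, F) outputs of the fully connected layer; W: (F,); b: scalar.
    Label 1 is taken to mean "tampered" (an assumption).
    """
    H = sigmoid(P @ W + b)
    return H, (H >= threshold).astype(int)
```

For instance, with W = [1, 1] and b = 0, the rows [2, 0] and [-2, 0] give H ≈ 0.88 and H ≈ 0.12, so they are assigned labels 1 and 0 respectively.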
10. A system, comprising:
a first module, configured to extract the power grid frequency spectrum feature supervector of each digital audio signal using the trained power grid frequency universal background model;
a second module, configured to input the extracted power grid frequency spectrum feature supervector into a deep representation learning network formed by an attention mechanism and a residual network to learn shallow features; and
a third module, configured to input the learned shallow features into a classification network and determine whether the digital audio signal has been subjected to deletion or insertion tampering.
CN202210932618.6A 2022-08-04 2022-08-04 Automatic detection method and system for digital audio deletion and insertion tampering operation Pending CN115472179A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210932618.6A CN115472179A (en) 2022-08-04 2022-08-04 Automatic detection method and system for digital audio deletion and insertion tampering operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210932618.6A CN115472179A (en) 2022-08-04 2022-08-04 Automatic detection method and system for digital audio deletion and insertion tampering operation

Publications (1)

Publication Number Publication Date
CN115472179A true CN115472179A (en) 2022-12-13

Family

ID=84367257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210932618.6A Pending CN115472179A (en) 2022-08-04 2022-08-04 Automatic detection method and system for digital audio deletion and insertion tampering operation

Country Status (1)

Country Link
CN (1) CN115472179A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795370A (en) * 2023-02-10 2023-03-14 南昌大学 Electronic digital information evidence obtaining method and system based on resampling trace

Similar Documents

Publication Publication Date Title
CN107610707B (en) A kind of method for recognizing sound-groove and device
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN110534101B (en) Mobile equipment source identification method and system based on multimode fusion depth features
Rantzsch et al. Signature embedding: Writer independent offline signature verification with deep metric learning
CN107564513A (en) Audio recognition method and device
CN108694346B (en) Ship radiation noise signal identification method based on two-stage CNN
CN103077720B (en) Speaker identification method and system
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN110929836B (en) Neural network training and image processing method and device, electronic equipment and medium
CN108806718B (en) Audio identification method based on analysis of ENF phase spectrum and instantaneous frequency spectrum
CN104795064A (en) Recognition method for sound event under scene of low signal to noise ratio
CN108959474B (en) Entity relation extraction method
CN108766464B (en) Digital audio tampering automatic detection method based on power grid frequency fluctuation super vector
CN108898181B (en) Image classification model processing method and device and storage medium
WO2022257453A1 (en) Training method and apparatus for semantic analysis model, terminal device, and storage medium
CN115062678B (en) Training method of equipment fault detection model, fault detection method and device
CN113646833A (en) Voice confrontation sample detection method, device, equipment and computer readable storage medium
WO2023093346A1 (en) Exogenous feature-based model ownership verification method and apparatus
CN108776795A (en) Method for identifying ID, device and terminal device
CN112767927A (en) Method, device, terminal and storage medium for extracting voice features
CN114332500A (en) Image processing model training method and device, computer equipment and storage medium
CN114548586A (en) Short-term power load prediction method and system based on hybrid model
CN115577357A (en) Android malicious software detection method based on stacking integration technology
CN115472179A (en) Automatic detection method and system for digital audio deletion and insertion tampering operation
Fan et al. Modeling voice pathology detection using imbalanced learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination