CN113990297A - ENF-based audio tampering identification method - Google Patents

ENF-based audio tampering identification method

Info

Publication number
CN113990297A
CN113990297A (application number CN202111376438.6A)
Authority
CN
China
Prior art keywords
audio
enf
tampering
data
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111376438.6A
Other languages
Chinese (zh)
Inventor
申兴发
刘立立
赵海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202111376438.6A
Publication of CN113990297A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks

Abstract

The invention discloses an ENF-based audio tampering identification method. The audio data are first preprocessed and then tampered with; the audio is down-sampled and filtered, windowed and framed, and denoised by Butterworth low-pass filtering before the ENF signal data are extracted. The extracted ENF signal data are marked by class to obtain a marked ENF signal data set, the model is trained on this data set, and the trained model finally performs the audio tampering identification. The invention abandons the manually set thresholds of traditional signal processing and instead feeds the ENF signal features extracted from the audio data into the model, which makes the audio forensics and identification process simpler and avoids the heavy signal-processing computation of the original methods. The results show that the model can accurately identify the tampering type of the audio, and the robustness of the classification model is improved.

Description

ENF-based audio tampering identification method
Technical Field
The invention belongs to the field of audio forensics, and in particular relates to an identification model that combines deep learning with the grid-frequency (ENF) signal features captured incidentally during digital audio recording.
Background
With people's growing legal awareness and the wide adoption of mobile phones, digital recordings are used more and more often in litigation and criminal cases. At the same time, the rapid development of audio signal processing technology and the continuous emergence of powerful audio editing software make it very convenient to edit and modify audio content and then present it as evidence or publish it on the network, which poses a great challenge to audio forensics, whose core concern is the authenticity of recordings. Audio forensics, the study of the validity, authenticity and relevance of audio signals, is therefore an urgent problem in judicial evidence examination.
Frequency is an important parameter for the operation and control of an electric power system: all points of the same grid share the same frequency variation trend, and the dynamic balance between generation and load makes that variation unique within a certain time range. In 2005, Dr. Grigoras made sound-card recordings in three cities more than 300 km apart and, surprisingly, found that the grid-frequency characteristics at the three locations followed the same variation pattern. This was the first observation that recorded audio contains grid-frequency components, and it showed that the ENF characteristic can be used to establish the authenticity and the recording time of an audio file. The grid frequency is emitted by AC-driven appliances and equipment, can be captured by nearby microphones and cameras, and is thus recorded in audio and video, which makes it a key technology for audio forensics.
Current grid-frequency based audio forensics schemes are limited by the inability to obtain a synchronized grid-frequency reference data set from the grid. Most existing detection methods rely on classical signal processing and classify audio types by manually calibrated thresholds. This brings several disadvantages: the manually calibrated threshold is ambiguous and easily inaccurate. In addition, the conventional signal-processing approach requires a high frequency resolution, and reaching that resolution entails a high computational complexity, so a new technical solution is needed.
To address the problems of the existing methods, the invention proposes an identification model that combines deep learning with the grid-frequency signal features captured autonomously during digital audio recording. In this audio forensics scheme based on the automatically captured grid frequency, the ENF phase in the audio is extracted as a feature to detect the abrupt changes of tampered frames and the discontinuities in the grid-frequency variation, thereby identifying audio tampering. The three tampering modes of insertion, deletion and exchange can be recognized with low computational complexity, manual threshold setting is avoided, and the efficiency of tampering identification and detection is greatly improved.
Disclosure of Invention
The main object of the invention is to overcome the low efficiency of existing audio tampering identification methods based on traditional digital signal processing and their difficulty in determining the tampering type, by providing an identification model that combines deep learning with the grid-frequency signal features captured autonomously during digital audio recording.
An ENF-based audio tampering identification method comprises the following steps:
Step 1, preprocessing the audio data.
Step 2, tampering the audio data.
Step 3, down-sampling and filtering the audio data.
Step 4, windowing and framing the audio data obtained in step 3.
Step 5, denoising based on Butterworth low-pass filtering.
The windowed and framed audio data are passed through a Butterworth low-pass filter to separate the interfering noise signal.
Step 6, extracting the ENF signal data.
Step 7, marking the extracted ENF signal data by class to obtain a marked ENF signal data set.
Step 8, training the model with the marked ENF signal data set.
Step 9, performing audio tampering identification with the trained model.
The specific method of step 1 is as follows:
First, the initial data set Carioca 1 is expanded. The original data set is a database of 100 public telephone recordings, sampled at 44.1 kHz as 16-bit single-channel WAV files; the nominal (pivot) value of the ENF signal in these recordings is about 50 Hz and the audio is untampered. The data set is expanded to increase the number of samples.
The specific method of step 2 is as follows:
and (3) respectively carrying out three tampering of deletion, insertion and exchange sequences on the expanded data set obtained in the step (1) to obtain audio data after insertion and tampering, audio data after deletion and tampering and audio data after exchange sequence tampering, wherein the original data set which is not expanded by the Carioca 1 is not included. And (3) merging the audio data expanded in the step (1) and the tampered audio data to serve as an audio data set to be verified.
The specific method in step 3 is as follows:
To keep the number of sampling points in each ENF cycle the same, the down-sampling frequency is set to 20 times the nominal ENF frequency, so the audio data set to be verified is down-sampled to 1 kHz. Band-pass filtering then removes the signal components outside the grid-frequency band, with the pass-band centred on the nominal (pivot) value and extending 0.5 Hz on each side.
The specific method of step 4 is as follows:
Let the audio sinusoidal signal be s(n); s(n) is passed through a smooth Hanning window function w(n):
w(n) = 0.5[1 − cos(2πn/(N − 1))],  0 ≤ n ≤ N − 1
where N is the total length of the audio and n is the index of the current sample. The windowed audio signal x(n) is then:
x(n) = s(n)·w(n)
the specific method of step 6 is as follows:
A short-time Fourier transform (STFT) is applied to the audio data denoised by the Butterworth low-pass filter; the discrete-time Fourier transform of the m-th frame of the speech signal x(n) gives the STFT:
X(m, ω) = Σ_n x(n)·w(m − n)·e^(−jωn)
where m denotes the current frame index and the window function w(m − n) is a sliding window that moves along the sequence x(n) as m varies; e^(−jωn) is the Fourier kernel applied to the current frame. Finally, the frequency extremum of each frame is connected across frames to obtain the grid-frequency data corresponding to the audio, i.e. the ENF signal data.
The specific method of step 7 is as follows:
and carrying out sample marking on the audio data after capacity expansion, the audio data after insertion and tampering, the audio data after deletion and the audio data after exchange sequence tampering in the audio data set to be verified, marking the 0 th bit of the obtained ENF signal data, marking the audio data after capacity expansion as '0', marking the audio data after deletion and tampering as '1', marking the audio data after insertion and tampering as '2', marking the audio data after exchange sequence tampering as '3', and obtaining the marked ENF signal data set.
The specific method of step 8 is as follows:
and training the 1D-CNN model by using the marked ENF signal data set to obtain the trained 1D-CNN model.
The convolution kernel size in the 1D-CNN is smaller than the audio length N, the stride is 1 and the number of filters is n_filter; the output dimension is:
N_out = (N − kernel_length) / stride + 1
where N is the audio length, kernel_length is the convolution kernel size, stride is the convolution step and N_out is the output dimension.
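As an illustration of the output-dimension formula, the following minimal Python sketch computes N_out for hypothetical values of N and kernel_length (the actual sizes used by the model are not specified here):

```python
# Minimal check of the valid-convolution output-length formula used above.
# n and kernel_length below are illustrative placeholders, not values from the patent.
def conv1d_output_length(n, kernel_length, stride=1):
    """N_out = floor((N - kernel_length) / stride) + 1."""
    return (n - kernel_length) // stride + 1

print(conv1d_output_length(n=100, kernel_length=5, stride=1))  # -> 96
```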
The specific method of step 9 is as follows:
First, the audio data to be identified are processed with steps 3-6 to obtain the corresponding ENF signal data; the ENF signal data are then fed into the 1D-CNN model trained in step 8 to complete the audio tampering identification.
compared with the prior art, the invention has the following beneficial effects:
according to the method, the power grid frequency signals captured in the audio are combined with the deep learning one-dimensional convolutional neural network model 1D-CNN, the original mode of manually setting a threshold value in the traditional signal processing is abandoned, ENF signal characteristics extracted from the audio data are used as data input of the model, a simpler audio evidence obtaining and identifying process is brought, and meanwhile, the complex signal processing operation amount in the original method is avoided. The result shows that the model can accurately identify the tampering type of the audio, and the robustness of the classification model is improved.
Drawings
FIG. 1 is an overall block diagram of an embodiment of the present invention;
FIG. 2 is an exemplary diagram of ENF extracted from audio;
FIG. 3 is a diagram of a network model of the present invention;
FIG. 4 is a comparison graph of an ENF phase jump after tampering;
FIG. 5 is a comparison graph of Butterworth low pass filtering;
FIG. 6 is a comparison graph of audio deletion tampering;
FIG. 7 is a graph comparing audio insertion tampering;
FIG. 8 is a comparison of audio exchange sequence tampering;
Detailed Description
The invention is further illustrated by the following examples.
As shown in FIG. 1, the implementation steps of the present invention are as follows:
and step 1, audio data capacity expansion.
The audio of the data set Carioca 1 is expanded. Each of the 100 recordings lasts about thirty seconds. In PyCharm the segment size is set to segment_size = 10 × 1000 ms, so the audio is cut into 10 s segments, and any remainder shorter than 10 s is merged directly into the preceding segment. The data set is thereby expanded to 126 male-voice recordings and 132 female-voice recordings.
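A minimal Python sketch of this segmentation step is given below; the 10 s segment length follows the text, while the soundfile I/O library and the function name split_recording are assumptions for illustration:

```python
import numpy as np
import soundfile as sf  # assumed WAV reader; any audio I/O library would do

SEGMENT_MS = 10 * 1000  # segment_size = 10 x 1000 ms, as in the embodiment

def split_recording(path):
    """Cut one ~30 s recording into 10 s segments; a tail shorter than 10 s
    is merged into the preceding segment instead of forming its own segment."""
    audio, sr = sf.read(path)
    seg_len = int(sr * SEGMENT_MS / 1000)
    segments = [audio[i:i + seg_len] for i in range(0, len(audio), seg_len)]
    if len(segments) > 1 and len(segments[-1]) < seg_len:
        segments[-2] = np.concatenate([segments[-2], segments[-1]])
        segments.pop()
    return segments
```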
Step 2, tampering the audio data.
The expanded data set obtained in step 1 is subjected to the three tampering operations of deletion, insertion and exchange of sequence, yielding insertion-tampered audio data, deletion-tampered audio data and exchange-tampered audio data; the unexpanded original Carioca 1 data set is not included. The audio data expanded in step 1 and the tampered audio data are then merged to form the audio data set to be verified.
The data set from step 1 is modified with the three tampering types of deletion, insertion and exchange, which expands the data to 1032 samples; the three operations are sketched in the code example after step 2-3.
Step 2-1, deletion: a segment of the original audio is deleted, with the deletion position chosen at random; a comparison of audio before and after deletion tampering is shown in FIG. 6.
Step 2-2, insertion: a fixed audio clip with the same sampling rate is inserted at a random position; a comparison of audio before and after insertion tampering is shown in FIG. 7.
Step 2-3, exchange: certain segments of the audio are swapped, with the exchange positions chosen at random; a comparison of audio before and after exchange-sequence tampering is shown in FIG. 8.
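The following Python sketch illustrates the three tampering operations on a waveform array; the segment lengths, random positions and function names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def delete_tamper(audio, length):
    """Deletion: remove `length` samples starting at a random position."""
    start = rng.integers(0, len(audio) - length)
    return np.concatenate([audio[:start], audio[start + length:]])

def insert_tamper(audio, clip):
    """Insertion: splice a fixed clip (same sampling rate) in at a random position."""
    start = rng.integers(0, len(audio))
    return np.concatenate([audio[:start], clip, audio[start:]])

def swap_tamper(audio, length):
    """Exchange: swap two equally long segments taken from random positions
    in the first and second halves of the recording."""
    a = int(rng.integers(0, len(audio) // 2 - length))
    b = int(rng.integers(len(audio) // 2, len(audio) - length))
    out = audio.copy()
    out[a:a + length] = audio[b:b + length]
    out[b:b + length] = audio[a:a + length]
    return out
```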
Step 3, down-sampling and filtering the audio data to be verified.
To guarantee synchronous sampling, keep the number of sampling points in each ENF cycle the same, and reduce the amount of analysis computation, the new sampling frequency is set to 20 times the nominal ENF frequency so that an exact number of samples falls in each ENF cycle. Since the nominal ENF value of the data set is 50 Hz, the audio to be verified is down-sampled to 1 kHz; band-pass filtering then removes the signal components outside the grid-frequency band, with the pass-band centred on the nominal (pivot) value and extending 0.5 Hz on each side, i.e. lower and upper bounds l1 = (50 − 0.5) Hz = 49.5 Hz and l2 = (50 + 0.5) Hz = 50.5 Hz.
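A minimal SciPy sketch of this step; the 4th-order Butterworth band-pass is an assumption, since the text does not state the filter order:

```python
from scipy import signal

ENF_NOMINAL = 50.0  # Hz, nominal grid frequency of the data set
FS_DOWN = 1000      # 20 x the nominal value, i.e. 1 kHz

def downsample_and_bandpass(audio, sr):
    """Resample the recording to 1 kHz and keep only the 49.5-50.5 Hz band."""
    x = signal.resample_poly(audio, FS_DOWN, sr)
    sos = signal.butter(4, [ENF_NOMINAL - 0.5, ENF_NOMINAL + 0.5],
                        btype="bandpass", fs=FS_DOWN, output="sos")
    return signal.sosfiltfilt(sos, x)
```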
Step 4, windowing and framing the data.
The speech signal is time-varying, but its characteristics remain essentially unchanged and relatively stable over a short time range, so short-time analysis is applied: the audio data are windowed and framed with a Hanning window function, using a frame length of 1 s and a frame shift of 0.1 s.
Let the audio sinusoidal signal be s(n); s(n) is passed through a smooth Hanning window function w(n):
w(n) = 0.5[1 − cos(2πn/(N − 1))],  0 ≤ n ≤ N − 1
where N is the total length of the audio and n is the index of the current sample. The windowed audio signal x(n) is then:
x(n) = s(n)·w(n)
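A sketch of the windowing and framing step (1 s frames, 0.1 s shift at 1 kHz), using NumPy's built-in Hanning window; the helper name frame_and_window is an assumption:

```python
import numpy as np

def frame_and_window(x, fs=1000, frame_s=1.0, hop_s=0.1):
    """Cut the filtered signal into 1 s frames with a 0.1 s shift and multiply
    each frame by the Hanning window w(n) = 0.5*(1 - cos(2*pi*n/(N-1)))."""
    frame_len = int(frame_s * fs)
    hop = int(hop_s * fs)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])
```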
and 5, denoising the audio based on Butterworth low-pass filtering.
To eliminate the influence of noise on the experimental result, Butterworth low-pass filtering is applied to the audio data to be verified for denoising, as shown in FIG. 5.
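A hedged sketch of the denoising step; the cut-off frequency and filter order are assumptions, since the text does not state them:

```python
from scipy import signal

def butterworth_denoise(frames, fs=1000, cutoff=100.0, order=4):
    """Zero-phase Butterworth low-pass filtering of each frame (along its last axis).
    cutoff=100 Hz and order=4 are illustrative assumptions."""
    sos = signal.butter(order, cutoff, btype="lowpass", fs=fs, output="sos")
    return signal.sosfiltfilt(sos, frames)
```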
Step 6, extracting the ENF signal data. An STFT (short-time Fourier transform) is applied to the audio to be verified and the frequency extrema of the frames are connected to obtain the grid-frequency time-frequency data, as shown in FIG. 2.
A short-time Fourier transform (STFT) is applied to the audio data denoised by the Butterworth low-pass filter; the discrete-time Fourier transform of the m-th frame of the speech signal x(n) gives the STFT:
X(m, ω) = Σ_n x(n)·w(m − n)·e^(−jωn)
where m denotes the current frame index and the window function w(m − n) is a sliding window that moves along the sequence x(n) as m varies; e^(−jωn) is the Fourier kernel applied to the current frame. Finally, the frequency extremum of each frame is connected across frames to obtain the grid-frequency data corresponding to the audio, i.e. the ENF signal data.
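The per-frame peak picking can be sketched as follows; the zero-padding factor is an assumption added to obtain sub-Hz frequency resolution, which the text implies but does not quantify:

```python
import numpy as np

def extract_enf(frames, fs=1000, zero_pad_factor=16):
    """For every windowed frame take a zero-padded DFT, pick the magnitude peak
    inside the 49.5-50.5 Hz band, and concatenate the peak frequencies into the
    ENF sequence of the recording."""
    n_fft = frames.shape[1] * zero_pad_factor
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    band = np.flatnonzero((freqs >= 49.5) & (freqs <= 50.5))
    enf = []
    for frame in frames:
        spectrum = np.abs(np.fft.rfft(frame, n=n_fft))
        enf.append(freqs[band[np.argmax(spectrum[band])]])
    return np.asarray(enf)
```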
Step 7, marking the extracted ENF signal data by class to obtain the marked ENF signal data set.
The expanded audio data, the insertion-tampered audio data, the deletion-tampered audio data and the exchange-tampered audio data in the audio data set to be verified are marked as samples: the class label is written at position 0 of the corresponding ENF signal data, with the expanded (untampered) audio marked '0', the deletion-tampered audio marked '1', the insertion-tampered audio marked '2' and the exchange-tampered audio marked '3', obtaining the marked ENF signal data set.
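A small sketch of this marking convention (class label written at position 0 of each ENF vector); the dictionary of class names is an illustrative assumption:

```python
import numpy as np

LABELS = {"original": 0, "deletion": 1, "insertion": 2, "swap": 3}

def label_enf(enf, tamper_type):
    """Prepend the class label at position 0 of the ENF signal vector."""
    return np.concatenate([[LABELS[tamper_type]], enf])
```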
Step 8, feeding the obtained data into the model.
The one-dimensional convolutional neural network model of the invention is shown in FIG. 3; compared with traditional signal-processing algorithms it is more robust to noise. As the comparison in FIG. 4 shows, the tampered position exhibits an obvious phase jump: deletion tampering causes 1 phase jump, insertion tampering causes 2 phase jumps, and exchange-sequence tampering causes 4 phase jumps. The CNN learns to capture these phase discontinuities, which expose a forged audio signal, and thereby classifies the tampering. The results show that the model accurately identifies the tampering type of the audio and improves the robustness of the classification model.
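A minimal Keras sketch of a 1D-CNN of this kind is shown below. Only the Conv1D structure (stride 1, kernel smaller than the input length N, n_filter filters) follows the text; the kernel size, filter count, pooling and dense head are assumptions, since FIG. 3 is not reproduced here:

```python
import tensorflow as tf

def build_1d_cnn(n, kernel_length=5, n_filter=32):
    """Illustrative 1D-CNN for 4-class tampering identification
    (original / deletion / insertion / exchange)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n, 1)),
        tf.keras.layers.Conv1D(n_filter, kernel_length, strides=1, activation="relu"),
        tf.keras.layers.MaxPooling1D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# marked_set: one row per sample, [label, enf_0, ..., enf_{N-1}] from step 7
# X, y = marked_set[:, 1:, None], marked_set[:, 0]
# model = build_1d_cnn(n=X.shape[1]); model.fit(X, y, epochs=30, validation_split=0.2)
```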
Step 9, completing audio tampering identification through the trained 1D-CNN model;
First, the audio data to be identified are processed with steps 3-6 to obtain the corresponding ENF signal data; the ENF signal data are then fed into the 1D-CNN model trained in step 8 to complete the audio tampering identification.
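Putting the pieces together, a hedged end-to-end sketch of the identification step, reusing the helper functions from the earlier sketches:

```python
import numpy as np

def identify_tampering(audio, sr, model):
    """Run steps 3-6 on a recording and classify it with the trained 1D-CNN."""
    x = downsample_and_bandpass(audio, sr)     # step 3: 1 kHz + 49.5-50.5 Hz band
    frames = frame_and_window(x)               # step 4: 1 s frames, 0.1 s shift
    frames = butterworth_denoise(frames)       # step 5: low-pass denoising
    enf = extract_enf(frames)                  # step 6: ENF sequence
    probs = model.predict(enf[None, :, None])  # trained model from step 8
    classes = ["original", "deletion", "insertion", "swap"]
    return classes[int(np.argmax(probs))]
```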

Claims (9)

1. An ENF-based audio tampering identification method is characterized by comprising the following steps:
step 1, preprocessing audio data;
step 2, tampering the audio data;
step 3, down-sampling and signal filtering the audio data;
step 4, windowing and framing the audio data obtained in the step 3;
step 5, denoising based on Butterworth low-pass filtering;
performing Butterworth low-pass filtering on the windowed and framed audio data to separate the interfering noise signal;
step 6, extracting ENF signal data;
step 7, carrying out classified data marking on the extracted ENF signal data to obtain a marked ENF signal data set;
step 8, training the model through the marked ENF signal data set;
and 9, completing audio tampering identification through the trained model.
2. The ENF-based audio tampering identification method according to claim 1, wherein the specific method in step 1 is as follows:
First, the initial data set Carioca 1 is expanded. The original data set is a database of 100 public telephone recordings, sampled at 44.1 kHz as 16-bit single-channel WAV files; the nominal (pivot) value of the ENF signal in these recordings is about 50 Hz and the audio is untampered. The data set is expanded to increase the number of samples.
3. The ENF-based audio tampering identification method according to claim 2, wherein the specific method in step 2 is as follows:
the expanded data set obtained in step 1 is subjected to the three tampering operations of deletion, insertion and exchange of sequence, yielding insertion-tampered audio data, deletion-tampered audio data and exchange-tampered audio data; the unexpanded original Carioca 1 data set is not included; the audio data expanded in step 1 and the tampered audio data are then merged to form the audio data set to be verified.
4. The ENF-based audio tampering identification method according to claim 3, wherein the specific method in step 3 is as follows:
to keep the number of sampling points in each ENF cycle the same, the down-sampling frequency is set to 20 times the nominal ENF frequency, so the audio data set to be verified is down-sampled to 1 kHz; band-pass filtering then removes the signal components outside the grid-frequency band, with the pass-band centred on the nominal (pivot) value and extending 0.5 Hz on each side.
5. The ENF-based audio tampering identification method according to claim 4, wherein the specific method in step 4 is as follows:
let the audio sinusoidal signal be s(n); s(n) is passed through a smooth Hanning window function w(n):
w(n) = 0.5[1 − cos(2πn/(N − 1))],  0 ≤ n ≤ N − 1
where N is the total length of the audio and n is the index of the current sample; the windowed audio signal x(n) is then:
x(n) = s(n)·w(n).
6. the ENF-based audio tampering identification method according to claim 5, wherein the specific method in step 6 is as follows:
a short-time Fourier transform (STFT) is applied to the audio data denoised by the Butterworth low-pass filter; the discrete-time Fourier transform of the m-th frame of the speech signal x(n) gives the STFT:
X(m, ω) = Σ_n x(n)·w(m − n)·e^(−jωn)
where m denotes the current frame index and the window function w(m − n) is a sliding window that moves along the sequence x(n) as m varies; e^(−jωn) is the Fourier kernel applied to the current frame; finally, the frequency extremum of each frame is connected across frames to obtain the grid-frequency data corresponding to the audio, i.e. the ENF signal data.
7. The ENF-based audio tampering identification method according to claim 6, wherein the specific method in step 7 is as follows:
and carrying out sample marking on the audio data after capacity expansion, the audio data after insertion and tampering, the audio data after deletion and the audio data after exchange sequence tampering in the audio data set to be verified, marking the 0 th bit of the obtained ENF signal data, marking the audio data after capacity expansion as '0', marking the audio data after deletion and tampering as '1', marking the audio data after insertion and tampering as '2', marking the audio data after exchange sequence tampering as '3', and obtaining the marked ENF signal data set.
8. The ENF-based audio tampering identification method according to claim 7, wherein the specific method in step 8 is as follows:
training the 1D-CNN model by using the marked ENF signal data set to obtain a trained 1D-CNN model;
the convolution kernel size in 1D-CNN is smaller than the audio length N, the step size is 1, the number of filters is N _ filter, and is expressed as:
Figure FDA0003364038550000031
where N represents the audio length, kernel _ length represents the convolution kernel size, stride represents the convolution step size, NoutRepresenting the output dimension.
9. The ENF-based audio tampering identification method according to claim 8, wherein the specific method in step 9 is as follows:
first, the audio data to be identified are processed with steps 3-6 to obtain the corresponding ENF signal data; the ENF signal data are then fed into the 1D-CNN model trained in step 8 to complete the audio tampering identification.
Application CN202111376438.6A, priority date 2021-11-19, filing date 2021-11-19: ENF-based audio tampering identification method, published as CN113990297A (pending).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111376438.6A CN113990297A (en) 2021-11-19 2021-11-19 ENF-based audio tampering identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111376438.6A CN113990297A (en) 2021-11-19 2021-11-19 ENF-based audio tampering identification method

Publications (1)

Publication Number Publication Date
CN113990297A true CN113990297A (en) 2022-01-28

Family

ID=79749509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111376438.6A Pending CN113990297A (en) 2021-11-19 2021-11-19 ENF-based audio tampering identification method

Country Status (1)

Country Link
CN (1) CN113990297A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155875A (en) * 2022-02-09 2022-03-08 中国科学院自动化研究所 Method and device for identifying voice scene tampering, electronic equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination