CN113990297A - ENF-based audio tampering identification method - Google Patents

ENF-based audio tampering identification method

Info

Publication number
CN113990297A
CN113990297A (application number CN202111376438.6A)
Authority
CN
China
Prior art keywords
audio
enf
tampering
data
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111376438.6A
Other languages
Chinese (zh)
Inventor
申兴发
刘立立
赵海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202111376438.6A
Publication of CN113990297A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks

Abstract

The invention discloses an ENF-based audio tampering identification method. The audio data are first preprocessed and then tampered with; the audio is down-sampled and filtered, windowed and framed, and denoised by Butterworth low-pass filtering before the ENF signal data are extracted. The extracted ENF signal data are marked by class to obtain a marked ENF signal data set, the model is trained on this data set, and the trained model finally performs the audio tampering identification. The invention abandons the manually set thresholds of traditional signal processing and instead feeds the ENF signal features extracted from the audio data into the model, which makes the audio forensics and identification process simpler and avoids the heavy signal-processing computation of the original methods. The results show that the model can accurately identify the tampering type of the audio, and the robustness of the classification model is improved.

Description

ENF-based audio tampering identification method
Technical Field
The invention belongs to the field of audio forensics, and in particular relates to an identification model that combines deep learning with the grid-frequency (ENF) signal features captured incidentally during digital audio recording.
Background
With people's growing legal awareness and the wide adoption of mobile phones, digital recordings are used more and more often in litigation and criminal cases. At the same time, the rapid development of audio signal processing technology and the continuous emergence of powerful audio editing software make it very convenient to edit and modify audio content and then present it as evidence or publish it on the network, which poses a great challenge to audio forensics, whose core concern is the authenticity of recordings. Audio forensics, the study of the validity, authenticity and relevance of audio signals, is therefore an urgent problem in judicial evidence examination.
Frequency is an important parameter for the operation and control of an electric power system: all points of the same grid share the same frequency variation trend, and the dynamic balance between generation and load makes that variation unique within a certain time range. In 2005, Dr. Grigoras made sound-card recordings in three cities more than 300 km apart and, surprisingly, found that the grid-frequency characteristics at the three locations followed the same variation pattern. This was the first observation that recorded audio contains grid-frequency components, and it showed that the ENF characteristic can be used to establish the authenticity and the recording time of an audio file. The grid frequency is emitted by AC-driven appliances and equipment, can be captured by nearby microphones and cameras, and is thus recorded in audio and video, which makes it a key technology for audio forensics.
Current grid-frequency based audio forensics schemes are limited by the inability to obtain a synchronized grid-frequency reference data set from the grid. Most existing detection methods rely on classical signal processing and classify audio types by manually calibrated thresholds. This brings several disadvantages: the manually calibrated threshold is ambiguous and easily inaccurate. In addition, the conventional signal-processing approach requires a high frequency resolution, and reaching that resolution entails a high computational complexity, so a new technical solution is needed.
To address the problems of the existing methods, the invention proposes an identification model that combines deep learning with the grid-frequency signal features captured autonomously during digital audio recording. In this audio forensics scheme based on the automatically captured grid frequency, the ENF phase in the audio is extracted as a feature to detect the abrupt changes of tampered frames and the discontinuities in the grid-frequency variation, thereby identifying audio tampering. The three tampering modes of insertion, deletion and exchange can be recognized with low computational complexity, manual threshold setting is avoided, and the efficiency of tampering identification and detection is greatly improved.
Disclosure of Invention
The main object of the invention is to overcome the low efficiency of existing audio tampering identification methods based on traditional digital signal processing and their difficulty in determining the tampering type, by providing an identification model that combines deep learning with the grid-frequency signal features captured autonomously during digital audio recording.
An ENF-based audio tampering identification method comprises the following steps:
Step 1, preprocessing the audio data.
Step 2, tampering the audio data.
Step 3, down-sampling and filtering the audio data.
Step 4, windowing and framing the audio data obtained in step 3.
Step 5, denoising based on Butterworth low-pass filtering.
The windowed and framed audio data are passed through a Butterworth low-pass filter to separate the interfering noise signal.
Step 6, extracting the ENF signal data.
Step 7, marking the extracted ENF signal data by class to obtain a marked ENF signal data set.
Step 8, training the model with the marked ENF signal data set.
Step 9, performing audio tampering identification with the trained model.
The specific method of step 1 is as follows:
First, the initial data set Carioca 1 is expanded. The original data set is a database of 100 public telephone recordings, sampled at 44.1 kHz as 16-bit single-channel WAV files; the nominal (pivot) value of the ENF signal in these recordings is about 50 Hz and the audio is untampered. The data set is expanded to increase the number of samples.
The specific method of step 2 is as follows:
and (3) respectively carrying out three tampering of deletion, insertion and exchange sequences on the expanded data set obtained in the step (1) to obtain audio data after insertion and tampering, audio data after deletion and tampering and audio data after exchange sequence tampering, wherein the original data set which is not expanded by the Carioca 1 is not included. And (3) merging the audio data expanded in the step (1) and the tampered audio data to serve as an audio data set to be verified.
The specific method in step 3 is as follows:
To keep the number of sampling points in each ENF cycle the same, the down-sampling frequency is set to 20 times the nominal ENF frequency, so the audio data set to be verified is down-sampled to 1 kHz. Band-pass filtering then removes the signal components outside the grid-frequency band, with the pass-band centred on the nominal (pivot) value and extending 0.5 Hz on each side.
The specific method of step 4 is as follows:
Let the audio sinusoidal signal be s(n); s(n) is passed through a smooth Hanning window function w(n):
w(n) = 0.5[1 − cos(2πn/(N − 1))],  0 ≤ n ≤ N − 1
where N is the total length of the audio and n is the index of the current sample. The windowed audio signal x(n) is then:
x(n) = s(n)·w(n)
the specific method of step 6 is as follows:
A short-time Fourier transform (STFT) is applied to the audio data denoised by the Butterworth low-pass filter; the discrete-time Fourier transform of the m-th frame of the speech signal x(n) gives the STFT:
X(m, ω) = Σ_n x(n)·w(m − n)·e^(−jωn)
where m denotes the current frame index and the window function w(m − n) is a sliding window that moves along the sequence x(n) as m varies; e^(−jωn) is the Fourier kernel applied to the current frame. Finally, the frequency extremum of each frame is connected across frames to obtain the grid-frequency data corresponding to the audio, i.e. the ENF signal data.
The specific method of step 7 is as follows:
and carrying out sample marking on the audio data after capacity expansion, the audio data after insertion and tampering, the audio data after deletion and the audio data after exchange sequence tampering in the audio data set to be verified, marking the 0 th bit of the obtained ENF signal data, marking the audio data after capacity expansion as '0', marking the audio data after deletion and tampering as '1', marking the audio data after insertion and tampering as '2', marking the audio data after exchange sequence tampering as '3', and obtaining the marked ENF signal data set.
The specific method of step 8 is as follows:
and training the 1D-CNN model by using the marked ENF signal data set to obtain the trained 1D-CNN model.
The convolution kernel size in the 1D-CNN is smaller than the audio length N, the stride is 1 and the number of filters is n_filter; the output dimension is:
N_out = (N − kernel_length) / stride + 1
where N is the audio length, kernel_length is the convolution kernel size, stride is the convolution step and N_out is the output dimension.
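As an illustration of the output-dimension formula, the following minimal Python sketch computes N_out for hypothetical values of N and kernel_length (the actual sizes used by the model are not specified here):

```python
# Minimal check of the valid-convolution output-length formula used above.
# n and kernel_length below are illustrative placeholders, not values from the patent.
def conv1d_output_length(n, kernel_length, stride=1):
    """N_out = floor((N - kernel_length) / stride) + 1."""
    return (n - kernel_length) // stride + 1

print(conv1d_output_length(n=100, kernel_length=5, stride=1))  # -> 96
```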
The specific method of step 9 is as follows:
First, the audio data to be identified are processed with steps 3-6 to obtain the corresponding ENF signal data; the ENF signal data are then fed into the 1D-CNN model trained in step 8 to complete the audio tampering identification.
compared with the prior art, the invention has the following beneficial effects:
according to the method, the power grid frequency signals captured in the audio are combined with the deep learning one-dimensional convolutional neural network model 1D-CNN, the original mode of manually setting a threshold value in the traditional signal processing is abandoned, ENF signal characteristics extracted from the audio data are used as data input of the model, a simpler audio evidence obtaining and identifying process is brought, and meanwhile, the complex signal processing operation amount in the original method is avoided. The result shows that the model can accurately identify the tampering type of the audio, and the robustness of the classification model is improved.
Drawings
FIG. 1 is an overall block diagram of an embodiment of the present invention;
FIG. 2 is an exemplary diagram of ENF extracted from audio;
FIG. 3 is a diagram of a network model of the present invention;
FIG. 4 is a comparison graph of an ENF phase jump after tampering;
FIG. 5 is a comparison graph of Butterworth low pass filtering;
FIG. 6 is a comparison graph of audio deletion tampering;
FIG. 7 is a graph comparing audio insertion tampering;
FIG. 8 is a comparison of audio exchange sequence tampering;
Detailed Description
The invention is further illustrated by the following examples.
As shown in FIG. 1, the implementation steps of the present invention are as follows:
and step 1, audio data capacity expansion.
The audio of the data set Carioca 1 is expanded. Each of the 100 recordings lasts about thirty seconds. In PyCharm the segment size is set to segment_size = 10 × 1000 ms, so the audio is cut into 10 s segments, and any remainder shorter than 10 s is merged directly into the preceding segment. The data set is thereby expanded to 126 male-voice recordings and 132 female-voice recordings.
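A minimal Python sketch of this segmentation step is given below; the 10 s segment length follows the text, while the soundfile I/O library and the function name split_recording are assumptions for illustration:

```python
import numpy as np
import soundfile as sf  # assumed WAV reader; any audio I/O library would do

SEGMENT_MS = 10 * 1000  # segment_size = 10 x 1000 ms, as in the embodiment

def split_recording(path):
    """Cut one ~30 s recording into 10 s segments; a tail shorter than 10 s
    is merged into the preceding segment instead of forming its own segment."""
    audio, sr = sf.read(path)
    seg_len = int(sr * SEGMENT_MS / 1000)
    segments = [audio[i:i + seg_len] for i in range(0, len(audio), seg_len)]
    if len(segments) > 1 and len(segments[-1]) < seg_len:
        segments[-2] = np.concatenate([segments[-2], segments[-1]])
        segments.pop()
    return segments
```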
Step 2, tampering the audio data.
The expanded data set obtained in step 1 is subjected to the three tampering operations of deletion, insertion and exchange of sequence, yielding insertion-tampered audio data, deletion-tampered audio data and exchange-tampered audio data; the unexpanded original Carioca 1 data set is not included. The audio data expanded in step 1 and the tampered audio data are then merged to form the audio data set to be verified.
The data set from step 1 is modified with the three tampering types of deletion, insertion and exchange, which expands the data to 1032 samples; the three operations are sketched in the code example after step 2-3.
Step 2-1, deletion: a segment of the original audio is deleted, with the deletion position chosen at random; a comparison of audio before and after deletion tampering is shown in FIG. 6.
Step 2-2, insertion: a fixed audio clip with the same sampling rate is inserted at a random position; a comparison of audio before and after insertion tampering is shown in FIG. 7.
Step 2-3, exchange: certain segments of the audio are swapped, with the exchange positions chosen at random; a comparison of audio before and after exchange-sequence tampering is shown in FIG. 8.
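The following Python sketch illustrates the three tampering operations on a waveform array; the segment lengths, random positions and function names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def delete_tamper(audio, length):
    """Deletion: remove `length` samples starting at a random position."""
    start = rng.integers(0, len(audio) - length)
    return np.concatenate([audio[:start], audio[start + length:]])

def insert_tamper(audio, clip):
    """Insertion: splice a fixed clip (same sampling rate) in at a random position."""
    start = rng.integers(0, len(audio))
    return np.concatenate([audio[:start], clip, audio[start:]])

def swap_tamper(audio, length):
    """Exchange: swap two equally long segments taken from random positions
    in the first and second halves of the recording."""
    a = int(rng.integers(0, len(audio) // 2 - length))
    b = int(rng.integers(len(audio) // 2, len(audio) - length))
    out = audio.copy()
    out[a:a + length] = audio[b:b + length]
    out[b:b + length] = audio[a:a + length]
    return out
```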
Step 3, down-sampling and filtering the audio data to be verified.
To guarantee synchronous sampling, keep the number of sampling points in each ENF cycle the same, and reduce the amount of analysis computation, the new sampling frequency is set to 20 times the nominal ENF frequency so that an exact number of samples falls in each ENF cycle. Since the nominal ENF value of the data set is 50 Hz, the audio to be verified is down-sampled to 1 kHz; band-pass filtering then removes the signal components outside the grid-frequency band, with the pass-band centred on the nominal (pivot) value and extending 0.5 Hz on each side, i.e. lower and upper bounds l1 = (50 − 0.5) Hz = 49.5 Hz and l2 = (50 + 0.5) Hz = 50.5 Hz.
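A minimal SciPy sketch of this step; the 4th-order Butterworth band-pass is an assumption, since the text does not state the filter order:

```python
from scipy import signal

ENF_NOMINAL = 50.0  # Hz, nominal grid frequency of the data set
FS_DOWN = 1000      # 20 x the nominal value, i.e. 1 kHz

def downsample_and_bandpass(audio, sr):
    """Resample the recording to 1 kHz and keep only the 49.5-50.5 Hz band."""
    x = signal.resample_poly(audio, FS_DOWN, sr)
    sos = signal.butter(4, [ENF_NOMINAL - 0.5, ENF_NOMINAL + 0.5],
                        btype="bandpass", fs=FS_DOWN, output="sos")
    return signal.sosfiltfilt(sos, x)
```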
Step 4, windowing and framing the data.
The speech signal is time-varying, but its characteristics remain essentially unchanged and relatively stable over a short time range, so short-time analysis is applied: the audio data are windowed and framed with a Hanning window function, using a frame length of 1 s and a frame shift of 0.1 s.
Let the audio sinusoidal signal be s(n); s(n) is passed through a smooth Hanning window function w(n):
w(n) = 0.5[1 − cos(2πn/(N − 1))],  0 ≤ n ≤ N − 1
where N is the total length of the audio and n is the index of the current sample. The windowed audio signal x(n) is then:
x(n) = s(n)·w(n)
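A sketch of the windowing and framing step (1 s frames, 0.1 s shift at 1 kHz), using NumPy's built-in Hanning window; the helper name frame_and_window is an assumption:

```python
import numpy as np

def frame_and_window(x, fs=1000, frame_s=1.0, hop_s=0.1):
    """Cut the filtered signal into 1 s frames with a 0.1 s shift and multiply
    each frame by the Hanning window w(n) = 0.5*(1 - cos(2*pi*n/(N-1)))."""
    frame_len = int(frame_s * fs)
    hop = int(hop_s * fs)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])
```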
and 5, denoising the audio based on Butterworth low-pass filtering.
To eliminate the influence of noise on the experimental result, Butterworth low-pass filtering is applied to the audio data to be verified for denoising, as shown in FIG. 5.
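A hedged sketch of the denoising step; the cut-off frequency and filter order are assumptions, since the text does not state them:

```python
from scipy import signal

def butterworth_denoise(frames, fs=1000, cutoff=100.0, order=4):
    """Zero-phase Butterworth low-pass filtering of each frame (along its last axis).
    cutoff=100 Hz and order=4 are illustrative assumptions."""
    sos = signal.butter(order, cutoff, btype="lowpass", fs=fs, output="sos")
    return signal.sosfiltfilt(sos, frames)
```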
Step 6, extracting the ENF signal data. An STFT (short-time Fourier transform) is applied to the audio to be verified and the frequency extrema of the frames are connected to obtain the grid-frequency time-frequency data, as shown in FIG. 2.
A short-time Fourier transform (STFT) is applied to the audio data denoised by the Butterworth low-pass filter; the discrete-time Fourier transform of the m-th frame of the speech signal x(n) gives the STFT:
X(m, ω) = Σ_n x(n)·w(m − n)·e^(−jωn)
where m denotes the current frame index and the window function w(m − n) is a sliding window that moves along the sequence x(n) as m varies; e^(−jωn) is the Fourier kernel applied to the current frame. Finally, the frequency extremum of each frame is connected across frames to obtain the grid-frequency data corresponding to the audio, i.e. the ENF signal data.
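The per-frame peak picking can be sketched as follows; the zero-padding factor is an assumption added to obtain sub-Hz frequency resolution, which the text implies but does not quantify:

```python
import numpy as np

def extract_enf(frames, fs=1000, zero_pad_factor=16):
    """For every windowed frame take a zero-padded DFT, pick the magnitude peak
    inside the 49.5-50.5 Hz band, and concatenate the peak frequencies into the
    ENF sequence of the recording."""
    n_fft = frames.shape[1] * zero_pad_factor
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    band = np.flatnonzero((freqs >= 49.5) & (freqs <= 50.5))
    enf = []
    for frame in frames:
        spectrum = np.abs(np.fft.rfft(frame, n=n_fft))
        enf.append(freqs[band[np.argmax(spectrum[band])]])
    return np.asarray(enf)
```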
Step 7, marking the extracted ENF signal data by class to obtain the marked ENF signal data set.
The expanded audio data, the insertion-tampered audio data, the deletion-tampered audio data and the exchange-tampered audio data in the audio data set to be verified are marked as samples: the class label is written at position 0 of the corresponding ENF signal data, with the expanded (untampered) audio marked '0', the deletion-tampered audio marked '1', the insertion-tampered audio marked '2' and the exchange-tampered audio marked '3', obtaining the marked ENF signal data set.
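A small sketch of this marking convention (class label written at position 0 of each ENF vector); the dictionary of class names is an illustrative assumption:

```python
import numpy as np

LABELS = {"original": 0, "deletion": 1, "insertion": 2, "swap": 3}

def label_enf(enf, tamper_type):
    """Prepend the class label at position 0 of the ENF signal vector."""
    return np.concatenate([[LABELS[tamper_type]], enf])
```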
Step 8, feeding the obtained data into the model.
The one-dimensional convolutional neural network model of the invention is shown in FIG. 3; compared with traditional signal-processing algorithms it is more robust to noise. As the comparison in FIG. 4 shows, the tampered position exhibits an obvious phase jump: deletion tampering causes 1 phase jump, insertion tampering causes 2 phase jumps, and exchange-sequence tampering causes 4 phase jumps. The CNN learns to capture these phase discontinuities, which expose a forged audio signal, and thereby classifies the tampering. The results show that the model accurately identifies the tampering type of the audio and improves the robustness of the classification model.
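A minimal Keras sketch of a 1D-CNN of this kind is shown below. Only the Conv1D structure (stride 1, kernel smaller than the input length N, n_filter filters) follows the text; the kernel size, filter count, pooling and dense head are assumptions, since FIG. 3 is not reproduced here:

```python
import tensorflow as tf

def build_1d_cnn(n, kernel_length=5, n_filter=32):
    """Illustrative 1D-CNN for 4-class tampering identification
    (original / deletion / insertion / exchange)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n, 1)),
        tf.keras.layers.Conv1D(n_filter, kernel_length, strides=1, activation="relu"),
        tf.keras.layers.MaxPooling1D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# marked_set: one row per sample, [label, enf_0, ..., enf_{N-1}] from step 7
# X, y = marked_set[:, 1:, None], marked_set[:, 0]
# model = build_1d_cnn(n=X.shape[1]); model.fit(X, y, epochs=30, validation_split=0.2)
```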
Step 9, completing audio tampering identification through the trained 1D-CNN model;
First, the audio data to be identified are processed with steps 3-6 to obtain the corresponding ENF signal data; the ENF signal data are then fed into the 1D-CNN model trained in step 8 to complete the audio tampering identification.
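Putting the pieces together, a hedged end-to-end sketch of the identification step, reusing the helper functions from the earlier sketches:

```python
import numpy as np

def identify_tampering(audio, sr, model):
    """Run steps 3-6 on a recording and classify it with the trained 1D-CNN."""
    x = downsample_and_bandpass(audio, sr)     # step 3: 1 kHz + 49.5-50.5 Hz band
    frames = frame_and_window(x)               # step 4: 1 s frames, 0.1 s shift
    frames = butterworth_denoise(frames)       # step 5: low-pass denoising
    enf = extract_enf(frames)                  # step 6: ENF sequence
    probs = model.predict(enf[None, :, None])  # trained model from step 8
    classes = ["original", "deletion", "insertion", "swap"]
    return classes[int(np.argmax(probs))]
```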

Claims (9)

1. An ENF-based audio tampering identification method is characterized by comprising the following steps:
step 1, preprocessing audio data;
step 2, tampering the audio data;
step 3, down-sampling and signal filtering the audio data;
step 4, windowing and framing the audio data obtained in the step 3;
step 5, denoising based on Butterworth low-pass filtering;
performing Butterworth low-pass filtering on the windowed and framed audio data to separate the interfering noise signal;
step 6, extracting ENF signal data;
step 7, carrying out classified data marking on the extracted ENF signal data to obtain a marked ENF signal data set;
step 8, training the model through the marked ENF signal data set;
and 9, completing audio tampering identification through the trained model.
2. The ENF-based audio tampering identification method according to claim 1, wherein the specific method in step 1 is as follows:
First, the initial data set Carioca 1 is expanded. The original data set is a database of 100 public telephone recordings, sampled at 44.1 kHz as 16-bit single-channel WAV files; the nominal (pivot) value of the ENF signal in these recordings is about 50 Hz and the audio is untampered. The data set is expanded to increase the number of samples.
3. The ENF-based audio tampering identification method according to claim 2, wherein the specific method in step 2 is as follows:
the expanded data set obtained in step 1 is subjected to the three tampering operations of deletion, insertion and exchange of sequence, yielding insertion-tampered audio data, deletion-tampered audio data and exchange-tampered audio data; the unexpanded original Carioca 1 data set is not included; the audio data expanded in step 1 and the tampered audio data are then merged to form the audio data set to be verified.
4. The ENF-based audio tampering identification method according to claim 3, wherein the specific method in step 3 is as follows:
to keep the number of sampling points in each ENF cycle the same, the down-sampling frequency is set to 20 times the nominal ENF frequency, so the audio data set to be verified is down-sampled to 1 kHz; band-pass filtering then removes the signal components outside the grid-frequency band, with the pass-band centred on the nominal (pivot) value and extending 0.5 Hz on each side.
5. The ENF-based audio tampering identification method according to claim 4, wherein the specific method in step 4 is as follows:
let the audio sinusoidal signal be s(n); s(n) is passed through a smooth Hanning window function w(n):
w(n) = 0.5[1 − cos(2πn/(N − 1))],  0 ≤ n ≤ N − 1
where N is the total length of the audio and n is the index of the current sample; the windowed audio signal x(n) is then:
x(n) = s(n)·w(n).
6. the ENF-based audio tampering identification method according to claim 5, wherein the specific method in step 6 is as follows:
a short-time Fourier transform (STFT) is applied to the audio data denoised by the Butterworth low-pass filter; the discrete-time Fourier transform of the m-th frame of the speech signal x(n) gives the STFT:
X(m, ω) = Σ_n x(n)·w(m − n)·e^(−jωn)
where m denotes the current frame index and the window function w(m − n) is a sliding window that moves along the sequence x(n) as m varies; e^(−jωn) is the Fourier kernel applied to the current frame; finally, the frequency extremum of each frame is connected across frames to obtain the grid-frequency data corresponding to the audio, i.e. the ENF signal data.
7. The ENF-based audio tampering identification method according to claim 6, wherein the specific method in step 7 is as follows:
and carrying out sample marking on the audio data after capacity expansion, the audio data after insertion and tampering, the audio data after deletion and the audio data after exchange sequence tampering in the audio data set to be verified, marking the 0 th bit of the obtained ENF signal data, marking the audio data after capacity expansion as '0', marking the audio data after deletion and tampering as '1', marking the audio data after insertion and tampering as '2', marking the audio data after exchange sequence tampering as '3', and obtaining the marked ENF signal data set.
8. The ENF-based audio tampering identification method according to claim 7, wherein the specific method in step 8 is as follows:
training the 1D-CNN model by using the marked ENF signal data set to obtain a trained 1D-CNN model;
the convolution kernel size in 1D-CNN is smaller than the audio length N, the step size is 1, the number of filters is N _ filter, and is expressed as:
Figure FDA0003364038550000031
where N represents the audio length, kernel _ length represents the convolution kernel size, stride represents the convolution step size, NoutRepresenting the output dimension.
9. The ENF-based audio tampering identification method according to claim 8, wherein the specific method in step 9 is as follows:
first, the audio data to be identified are processed with steps 3-6 to obtain the corresponding ENF signal data; the ENF signal data are then fed into the 1D-CNN model trained in step 8 to complete the audio tampering identification.
Application CN202111376438.6A, priority date 2021-11-19, filing date 2021-11-19: ENF-based audio tampering identification method, published as CN113990297A (pending).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111376438.6A CN113990297A (en) 2021-11-19 2021-11-19 ENF-based audio tampering identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111376438.6A CN113990297A (en) 2021-11-19 2021-11-19 ENF-based audio tampering identification method

Publications (1)

Publication Number Publication Date
CN113990297A true CN113990297A (en) 2022-01-28

Family

ID=79749509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111376438.6A Pending CN113990297A (en) 2021-11-19 2021-11-19 ENF-based audio tampering identification method

Country Status (1)

Country Link
CN (1) CN113990297A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155875A (en) * 2022-02-09 2022-03-08 中国科学院自动化研究所 Method and device for identifying voice scene tampering, electronic equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination