CN113990297A - ENF-based audio tampering identification method - Google Patents
- Publication number
- CN113990297A (application number CN202111376438.6A)
- Authority
- CN
- China
- Prior art keywords
- audio
- enf
- tampering
- data
- audio data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The invention discloses an ENF-based audio tampering identification method. The method first preprocesses the audio data and applies tampering to it; the audio is then down-sampled and filtered, windowed and framed, and ENF signal data are extracted after Butterworth low-pass denoising. The extracted ENF signal data are labeled by class to obtain a labeled ENF signal data set, a model is trained on this set, and the trained model finally performs audio tampering identification. The invention abandons the manual threshold setting of traditional signal processing and instead feeds the ENF features extracted from the audio data directly into the model, which simplifies the audio forensics process and avoids the heavy signal-processing computation of earlier methods. Results show that the model can accurately identify the tampering type of an audio recording and that the robustness of the classification model is improved.
Description
Technical Field
The invention belongs to the field of audio forensics, and in particular relates to an identification model that combines deep learning with the electric network frequency (ENF) signal features captured incidentally during digital audio recording.
Background
With the growing legal awareness of the public and the widespread use of mobile phones, digital recordings are increasingly used in civil and criminal cases. At the same time, the rapid development of audio signal processing and the steady emergence of powerful audio editing software make it very easy to edit and modify audio content and then present it as evidence or publish it online. This poses a serious challenge to audio forensics, whose core concern is the authenticity of recordings. Audio forensics, the study of the validity, authenticity and relevance of audio signals, is therefore an urgent problem in judicial evidence examination.
Frequency is a key parameter in the operation and control of a power system: all points on the same grid share the same frequency variation trend, and the dynamic balance between generation and load makes that variation unique within a given time window. In 2005, Dr. Grigoras made sound-card recordings in three cities more than 300 km apart and found, strikingly, that all three showed the same grid-frequency variation pattern. This was the first demonstration that recorded audio contains grid-frequency components, and that these ENF features can be used to establish the authenticity and recording time of the audio. The grid frequency is radiated by AC-powered appliances and equipment, can be picked up by nearby microphones and cameras, and is thus embedded in audio and video recordings, which makes it a key tool for audio forensics.
Current grid-frequency-based audio forensics schemes are limited by the inability to obtain a synchronized grid-frequency reference data set from the power grid. Most existing detection methods rely on conventional signal processing and classify audio by manually calibrated thresholds. This brings several drawbacks: manually calibrated thresholds are ambiguous and prone to inaccuracy, and conventional signal processing demands a high frequency resolution, which in turn incurs high computational complexity. A new technical solution is therefore needed.
To address these problems, the invention proposes an identification model that combines deep learning with the grid-frequency signal features captured incidentally during digital audio recording. In this automatic grid-frequency forensics scheme, the ENF phase is extracted from the audio as a feature, and tampering is identified by detecting the abrupt phase changes of tampered frames and the discontinuities in the grid-frequency variation. Three tampering modes (insertion, deletion and exchange) can be identified with low computational complexity, manual threshold calibration is avoided, and the efficiency of tampering detection is greatly improved.
Disclosure of Invention
The main aim of the invention is to solve the problems of existing audio tampering identification methods based on conventional digital signal processing, namely low efficiency and difficulty in judging the tampering type, by providing an identification model that combines deep learning with the grid-frequency signal features captured incidentally during digital audio recording.
An ENF-based audio tampering identification method comprises the following steps:
Step 1, preprocessing the audio data.
Step 2, tampering the audio data.
Step 3, down-sampling and filtering the audio data.
Step 4, windowing and framing the audio data obtained in step 3.
Step 5, denoising based on Butterworth low-pass filtering.
Apply a Butterworth low-pass filter to the windowed and framed audio data to separate out the interfering noise signal.
Step 6, extracting the ENF signal data.
Step 7, labeling the extracted ENF signal data by class to obtain a labeled ENF signal data set.
Step 8, training the model on the labeled ENF signal data set.
Step 9, completing audio tampering identification through the trained model;
the specific method of the step 1 is as follows:
First, the starting data set Carioca 1 is expanded. The original data set is a database of 100 public telephone recordings, sampled at 44.1 kHz as 16-bit single-channel WAV files. The ENF pivot value in these recordings is about 50 Hz and the audio is untampered. The data set is then expanded to increase its size.
The specific method of the step 2 is as follows:
and (3) respectively carrying out three tampering of deletion, insertion and exchange sequences on the expanded data set obtained in the step (1) to obtain audio data after insertion and tampering, audio data after deletion and tampering and audio data after exchange sequence tampering, wherein the original data set which is not expanded by the Carioca 1 is not included. And (3) merging the audio data expanded in the step (1) and the tampered audio data to serve as an audio data set to be verified.
The specific method in step 3 is as follows:
To ensure that each ENF period contains the same number of sampling points, the down-sampling frequency is set to 20 times the nominal ENF frequency, so the audio data set to be verified is down-sampled to 1 kHz. Signals outside the grid frequency band are then removed by band-pass filtering, with the pass band centered on the pivot value and a bandwidth of 0.5 Hz.
The specific method of the step 4 is as follows:
Let the audio sinusoidal signal be s(n) and pass it through a smooth Hanning window function w(n), which in its standard form is:
w(n) = 0.5 - 0.5·cos(2πn/(N-1)), 0 ≤ n ≤ N-1
where N is the total number of samples in the frame and n is the current sample index. The windowed audio signal x(n) is then:
x(n)=s(n)*w(n)
the specific method of step 6 is as follows:
The audio data denoised by the Butterworth low-pass filter are then analyzed with the short-time Fourier transform (STFT). For the m-th frame of the speech signal x(n), the discrete-time Fourier transform is:
X(m, e^jω) = Σ_n x(n)·w(m-n)·e^(-jωn)
where m is the current frame index, the window function w(m-n) is a sliding window that moves along the sequence x(n) as m varies, and e^(-jωn) is the Fourier kernel applied to the current frame. Finally, the frequency extreme of each frame is connected to obtain the grid-frequency data corresponding to the audio, i.e., the ENF signal data.
The specific method of step 7 is as follows:
and carrying out sample marking on the audio data after capacity expansion, the audio data after insertion and tampering, the audio data after deletion and the audio data after exchange sequence tampering in the audio data set to be verified, marking the 0 th bit of the obtained ENF signal data, marking the audio data after capacity expansion as '0', marking the audio data after deletion and tampering as '1', marking the audio data after insertion and tampering as '2', marking the audio data after exchange sequence tampering as '3', and obtaining the marked ENF signal data set.
The specific method of step 8 is as follows:
and training the 1D-CNN model by using the marked ENF signal data set to obtain the trained 1D-CNN model.
The convolution kernel size in 1D-CNN is smaller than the audio length N, the step size is 1, the number of filters is N _ filter, and is expressed as:
where N represents the audio length, kernel _ length represents the convolution kernel size, stride represents the convolution step size, NoutRepresenting the output dimension.
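The relation above is the standard output-length formula for a "valid" 1-D convolution. A minimal sketch (the function name and example values are illustrative, not taken from the patent):

```python
def conv1d_output_length(n, kernel_length, stride=1):
    """N_out = (N - kernel_length) / stride + 1 for a 'valid' 1-D convolution."""
    if kernel_length > n:
        raise ValueError("kernel_length must not exceed the audio length N")
    return (n - kernel_length) // stride + 1

# With the patent's stride of 1, a 9-point kernel over a 1000-sample
# ENF sequence yields 992 output points.
print(conv1d_output_length(1000, 9))  # -> 992
```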
The specific method of step 9 is as follows:
First, process the audio data to be identified with steps 3-6 to obtain the corresponding ENF signal data; then input the ENF signal data into the 1D-CNN model trained in step 8 to complete audio tampering identification;
compared with the prior art, the invention has the following beneficial effects:
The method combines the grid-frequency signal captured in the audio with a deep-learning one-dimensional convolutional neural network (1D-CNN). It abandons the manual threshold setting of traditional signal processing and instead feeds the ENF features extracted from the audio data directly into the model, which simplifies the audio forensics process and avoids the heavy signal-processing computation of earlier methods. Results show that the model can accurately identify the tampering type of an audio recording and that the robustness of the classification model is improved.
Drawings
FIG. 1 is an overall block diagram of an embodiment of the present invention;
FIG. 2 is an exemplary diagram of ENF extracted from audio;
FIG. 3 is a diagram of a network model of the present invention;
FIG. 4 is a comparison graph of an ENF phase jump after tampering;
FIG. 5 is a comparison graph of Butterworth low pass filtering;
FIG. 6 is a comparison graph of audio deletion tampering;
FIG. 7 is a graph comparing audio insertion tampering;
FIG. 8 is a comparison of audio exchange sequence tampering;
Detailed Description
The invention is further illustrated by the following examples.
As shown in FIG. 1, the implementation steps of the present invention are as follows:
and step 1, audio data capacity expansion.
The audio of the data set Carioca 1 is expanded. Each of the 100 recordings lasts about thirty seconds. With segment_size set to 10 x 1000 ms in PyCharm, the audio is cut into 10-second segments, and any remainder shorter than 10 seconds is merged into the preceding segment. The data are finally expanded to 126 male-voice and 132 female-voice recordings.
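The segmentation rule of step 1 (10-second cuts, with a short remainder folded into the previous segment) can be sketched as follows; the function name and toy values are illustrative:

```python
def segment_audio(samples, sample_rate, seg_seconds=10):
    """Cut audio into seg_seconds segments; a trailing remainder shorter than
    seg_seconds is merged into the preceding segment, as in step 1."""
    seg_len = seg_seconds * sample_rate
    segments = [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
    if len(segments) > 1 and len(segments[-1]) < seg_len:
        segments[-2] = segments[-2] + segments[-1]  # fold remainder into previous segment
        segments.pop()
    return segments

# A 25-"second" toy signal at 1 sample per second -> segments of 10 and 15.
parts = segment_audio(list(range(25)), sample_rate=1)
print([len(p) for p in parts])  # -> [10, 15]
```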
And 2, tampering the audio data.
Apply three types of tampering (deletion, insertion and sequence exchange) to the expanded data set obtained in step 1, producing insertion-tampered, deletion-tampered and exchange-tampered audio data; the unexpanded original Carioca 1 data set is not included. The audio data expanded in step 1 and the tampered audio data are merged to form the audio data set to be verified.
Applying the three tampering types (deletion, insertion and exchange) to the data set from step 1 expands the data to 1032 recordings.
Step 2-1, deletion: a segment of the original audio is deleted at a random position. A comparison of audio deletion tampering is shown in fig. 6.
Step 2-2, insertion: a fixed audio clip with the same sampling rate is inserted at a random position. A comparison of audio insertion tampering is shown in fig. 7.
Step 2-3, exchange: two segments of the audio are exchanged, with the positions chosen at random. A comparison of audio exchange tampering is shown in fig. 8.
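Steps 2-1 to 2-3 can be sketched as three sequence operations. The function names, segment lengths and RNG handling below are illustrative assumptions, not the patent's code:

```python
import random

def delete_tamper(x, seg_len, rng):
    """Step 2-1: remove one segment of seg_len samples at a random position."""
    i = rng.randrange(0, len(x) - seg_len)
    return x[:i] + x[i + seg_len:]

def insert_tamper(x, clip, rng):
    """Step 2-2: insert a fixed clip (same sampling rate assumed) at a random position."""
    i = rng.randrange(0, len(x))
    return x[:i] + clip + x[i:]

def swap_tamper(x, seg_len, rng):
    """Step 2-3: exchange two non-overlapping random segments of seg_len samples."""
    i = rng.randrange(0, len(x) // 2 - seg_len)
    j = rng.randrange(len(x) // 2, len(x) - seg_len)
    y = list(x)
    y[i:i + seg_len], y[j:j + seg_len] = x[j:j + seg_len], x[i:i + seg_len]
    return y
```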
And 3, performing down sampling and signal filtering on the audio data to be verified.
To ensure synchronous sampling, reduce the computation of the analysis, and give each ENF period an exact number of samples, the new sampling frequency is set to 20 times the nominal ENF frequency. Since the nominal ENF of the data set is 50 Hz, the audio to be verified is down-sampled to 1 kHz, and signals outside the grid frequency band are removed by band-pass filtering centered on the pivot value with a bandwidth of 0.5 Hz, giving lower and upper bounds l1 = (50 - 0.5) Hz = 49.5 Hz and l2 = (50 + 0.5) Hz = 50.5 Hz.
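Step 3 can be sketched with numpy. A real implementation would normally use scipy.signal for the band-pass filter and proper anti-aliased resampling, so treat the naive decimation and ideal FFT-domain mask below as a simplification under stated assumptions:

```python
import numpy as np

def downsample_and_bandpass(x, sr, enf_nominal=50.0, half_bw=0.5):
    """Down-sample to 20x the nominal ENF (1 kHz for a 50 Hz grid), then keep
    only the l1 = 49.5 Hz to l2 = 50.5 Hz band with an ideal FFT-domain mask.
    Assumes sr is an integer multiple of the target rate."""
    target_sr = int(20 * enf_nominal)
    y = x[::sr // target_sr]  # naive decimation (no anti-alias filter)
    spec = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(y), d=1.0 / target_sr)
    spec[(freqs < enf_nominal - half_bw) | (freqs > enf_nominal + half_bw)] = 0.0
    return np.fft.irfft(spec, n=len(y)), target_sr

# A 50 Hz tone plus a 200 Hz interferer: after filtering, 50 Hz dominates.
sr = 4000
t = np.arange(2 * sr) / sr
x = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 200 * t)
enf_band, fs = downsample_and_bandpass(x, sr)
```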
And 4, windowing and framing the data.
A speech signal is time-varying, but its characteristics remain essentially unchanged and relatively stationary over short time spans, so a short-time analysis technique is required. The audio data are therefore windowed and framed with a Hanning window function, using a frame length of 1 s and a frame shift of 0.1 s.
Let the audio sinusoidal signal be s(n) and pass it through a smooth Hanning window function w(n), which in its standard form is:
w(n) = 0.5 - 0.5·cos(2πn/(N-1)), 0 ≤ n ≤ N-1
where N is the total number of samples in the frame and n is the current sample index. The windowed audio signal x(n) is then:
x(n)=s(n)*w(n)
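Step 4 (1 s Hanning frames with a 0.1 s shift) can be sketched as follows; the function name is illustrative:

```python
import numpy as np

def window_frames(x, sr, frame_s=1.0, hop_s=0.1):
    """Apply x(n) = s(n) * w(n) frame by frame: 1 s Hanning frames,
    0.1 s frame shift, as in step 4."""
    frame_len = int(frame_s * sr)
    hop = int(hop_s * sr)
    w = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[m * hop:m * hop + frame_len] * w for m in range(n_frames)])

frames = window_frames(np.ones(3000), sr=1000)
print(frames.shape)  # -> (21, 1000)
```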
and 5, denoising the audio based on Butterworth low-pass filtering.
To eliminate the influence of noise on the experimental results, Butterworth low-pass denoising is applied to the audio data to be verified, as shown in fig. 5.
Step 6, extracting the ENF signal data. Apply the STFT (short-time Fourier transform) to the audio to be verified and connect the frequency extreme values of each frame to obtain the grid-frequency time-frequency data, as shown in fig. 2.
The audio data denoised by the Butterworth low-pass filter are then analyzed with the short-time Fourier transform (STFT). For the m-th frame of the speech signal x(n), the discrete-time Fourier transform is:
X(m, e^jω) = Σ_n x(n)·w(m-n)·e^(-jωn)
where m is the current frame index, the window function w(m-n) is a sliding window that moves along the sequence x(n) as m varies, and e^(-jωn) is the Fourier kernel applied to the current frame. Finally, the frequency extreme of each frame is connected to obtain the grid-frequency data corresponding to the audio, i.e., the ENF signal data.
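The per-frame "frequency extreme value" step can be sketched as taking the spectral-peak frequency of each windowed frame. This is a simplification: practical ENF extractors often add zero-padding or quadratic peak interpolation to get sub-bin resolution.

```python
import numpy as np

def extract_enf(frames, sr):
    """Estimate the ENF trajectory: DFT each windowed frame, keep the
    frequency of its magnitude peak, and concatenate across frames."""
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return freqs[np.argmax(spectra, axis=1)]
```

With 1 s frames at 1 kHz the raw bin spacing is 1 Hz, which is why interpolation or longer transforms are used in practice when sub-hertz ENF variation matters.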
And 7, carrying out classified data marking on the extracted ENF signal data to obtain a marked ENF signal data set.
Label the samples in the audio data set to be verified: the expanded (untampered) audio, the insertion-tampered audio, the deletion-tampered audio and the exchange-tampered audio. The label is written into bit 0 of the corresponding ENF signal data: '0' for expanded audio, '1' for deletion tampering, '2' for insertion tampering, and '3' for exchange tampering, yielding the labeled ENF signal data set.
Step 8, inputting the obtained data into the model.
The one-dimensional convolutional neural network model of the invention is shown in fig. 3; compared with traditional signal-processing algorithms it is more robust to noise. As shown in fig. 4, the tampered position exhibits an obvious abrupt phase change: deletion tampering causes 1 phase jump, insertion tampering causes 2 phase jumps, and exchange tampering causes 4 phase jumps. The CNN learns to capture the phase discontinuities that expose a forged audio signal and thereby performs tampering classification. Results show that the model can accurately identify the tampering type of the audio and that the robustness of the classification model is improved.
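The shape of the classifier can be illustrated with a toy numpy forward pass: convolution, global max pooling, then a linear layer over the four classes. This is only a structural sketch with random weights, not the patent's trained 1D-CNN:

```python
import numpy as np

def conv1d_valid(x, kernel, stride=1):
    """'Valid' 1-D cross-correlation; output length (N - kernel_length)//stride + 1."""
    n_out = (len(x) - len(kernel)) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + len(kernel)], kernel)
                     for i in range(n_out)])

def tiny_enf_classifier(enf, kernels, weights):
    """Toy forward pass: n_filter convolutions over the ENF sequence,
    global max pooling, then a linear layer over the 4 classes
    (0 original, 1 deletion, 2 insertion, 3 exchange)."""
    feats = np.array([conv1d_valid(enf, k).max() for k in kernels])
    return int(np.argmax(weights @ feats))

rng = np.random.default_rng(0)
enf = rng.normal(50.0, 0.01, size=200)   # a mock ENF trajectory around 50 Hz
kernels = rng.normal(size=(8, 9))        # n_filter = 8, kernel_length = 9
weights = rng.normal(size=(4, 8))        # 4 output classes
label = tiny_enf_classifier(enf, kernels, weights)
```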
Step 9, completing audio tampering identification through the trained 1D-CNN model;
First, process the audio data to be identified with steps 3-6 to obtain the corresponding ENF signal data; then input the ENF signal data into the 1D-CNN model trained in step 8 to complete audio tampering identification.
Claims (9)
1. An ENF-based audio tampering identification method is characterized by comprising the following steps:
step 1, preprocessing audio data;
step 2, tampering the audio data;
step 3, down-sampling and signal filtering the audio data;
step 4, windowing and framing the audio data obtained in the step 3;
step 5, denoising based on Butterworth low-pass filtering;
performing Butterworth low-pass filtering on the obtained audio data after windowing and framing, and separating an interference noise signal;
step 6, extracting ENF signal data;
step 7, carrying out classified data marking on the extracted ENF signal data to obtain a marked ENF signal data set;
step 8, training the model through the marked ENF signal data set;
and 9, completing audio tampering identification through the trained model.
2. The ENF-based audio tampering identification method according to claim 1, wherein the specific method in step 1 is as follows:
first, the starting data set Carioca 1 is expanded; the original data set is a database of 100 public telephone recordings, sampled at 44.1 kHz as 16-bit single-channel WAV files; the ENF pivot value in these recordings is about 50 Hz and the audio is untampered; the data set is then expanded to increase its size.
3. The ENF-based audio tampering identification method according to claim 2, wherein the specific method in step 2 is as follows:
apply three types of tampering (deletion, insertion and sequence exchange) to the expanded data set obtained in step 1, producing insertion-tampered, deletion-tampered and exchange-tampered audio data; the unexpanded original Carioca 1 data set is not included; the audio data expanded in step 1 and the tampered audio data are merged to form the audio data set to be verified.
4. The ENF-based audio tampering identification method according to claim 3, wherein the specific method in step 3 is as follows:
to ensure that each ENF period contains the same number of sampling points, the down-sampling frequency is set to 20 times the nominal ENF frequency, so the audio data set to be verified is down-sampled to 1 kHz; signals outside the grid frequency band are then removed by band-pass filtering, with the pass band centered on the pivot value and a bandwidth of 0.5 Hz.
5. The ENF-based audio tampering identification method according to claim 4, wherein the specific method in step 4 is as follows:
let the audio sinusoidal signal be s(n) and pass it through a smooth Hanning window function w(n), which in its standard form is:
w(n) = 0.5 - 0.5·cos(2πn/(N-1)), 0 ≤ n ≤ N-1
where N is the total number of samples in the frame and n is the current sample index; the windowed audio signal x(n) is then:
x(n)=s(n)*w(n)。
6. the ENF-based audio tampering identification method according to claim 5, wherein the specific method in step 6 is as follows:
the audio data denoised by the Butterworth low-pass filter are analyzed with the short-time Fourier transform (STFT); for the m-th frame of the speech signal x(n), the discrete-time Fourier transform is:
X(m, e^jω) = Σ_n x(n)·w(m-n)·e^(-jωn)
where m is the current frame index, the window function w(m-n) is a sliding window that moves along the sequence x(n) as m varies, and e^(-jωn) is the Fourier kernel applied to the current frame; finally, the frequency extreme of each frame is connected to obtain the grid-frequency data corresponding to the audio, i.e., the ENF signal data.
7. The ENF-based audio tampering identification method according to claim 6, wherein the specific method in step 7 is as follows:
and carrying out sample marking on the audio data after capacity expansion, the audio data after insertion and tampering, the audio data after deletion and the audio data after exchange sequence tampering in the audio data set to be verified, marking the 0 th bit of the obtained ENF signal data, marking the audio data after capacity expansion as '0', marking the audio data after deletion and tampering as '1', marking the audio data after insertion and tampering as '2', marking the audio data after exchange sequence tampering as '3', and obtaining the marked ENF signal data set.
8. The ENF-based audio tampering identification method according to claim 7, wherein the specific method in step 8 is as follows:
training the 1D-CNN model by using the marked ENF signal data set to obtain a trained 1D-CNN model;
the convolution kernel size in the 1D-CNN is smaller than the audio length N, the stride is 1, and the number of filters is n_filter; the output dimension is expressed as:
N_out = (N - kernel_length) / stride + 1
where N is the audio length, kernel_length is the convolution kernel size, stride is the convolution stride, and N_out is the output dimension.
9. The ENF-based audio tampering identification method according to claim 8, wherein the specific method in step 9 is as follows:
first, process the audio data to be identified with steps 3-6 to obtain the corresponding ENF signal data; then input the ENF signal data into the 1D-CNN model trained in step 8 to complete audio tampering identification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111376438.6A CN113990297A (en) | 2021-11-19 | 2021-11-19 | ENF-based audio tampering identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113990297A true CN113990297A (en) | 2022-01-28 |
Family
ID=79749509
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111376438.6A Pending CN113990297A (en) | 2021-11-19 | 2021-11-19 | ENF-based audio tampering identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113990297A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114155875A (en) * | 2022-02-09 | 2022-03-08 | 中国科学院自动化研究所 | Method and device for identifying voice scene tampering, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||