CN112394224B

CN112394224B - Audio file generation time tracing dynamic matching method and system

Info

Publication number: CN112394224B
Application number: CN202011217274.8A
Authority: CN
Inventors: 华光; 王清懿; 张海剑
Original assignee: Wuhan University WHU
Current assignee: WUHAN DASHENGJI TECHNOLOGY Co.,Ltd.
Priority date: 2020-11-04
Filing date: 2020-11-04
Publication date: 2021-08-10
Anticipated expiration: 2040-11-04
Also published as: CN112394224A

Abstract

The invention discloses a dynamic matching method and system for tracing the generation time of an audio file. The method comprises the following steps: extracting the time-domain power grid frequency signal from the audio to be detected by using a narrow-band band-pass filter; converting the time-domain power grid frequency signal into a time-frequency domain power grid frequency signal by using a short-time Fourier transform plus extremum method; at the same time, identifying regions of the audio to be detected that have a low signal-to-noise ratio and denoising them by using a short-time Fourier transform with a locally increased window length; acquiring a power grid frequency reference signal of the time-frequency domain; and matching the power grid frequency signal of the time-frequency domain with the power grid frequency reference signal of the time-frequency domain by adopting a dynamic matching algorithm with the short-time Fourier transform frequency resolution as a threshold value, wherein the optimal matching position corresponds to the timestamp information of the audio to be detected. The method and system can obtain the timestamp information of the audio file to be detected, which helps judge the authenticity of digital multimedia files and safeguard information security.

Description

Audio file generation time tracing dynamic matching method and system

Technical Field

The invention belongs to the technical field of multimedia digital forensics, and particularly relates to a dynamic matching method and system for tracing the generation time of an audio file.

Background

The rapid development of information technology means that people no longer record and transmit information only as text. Multimedia files stored in digital form have become an important part of information exchange because they are fast to transmit and convenient to store. At the same time, digital multimedia content is easily tampered with and attacked while it propagates online, which spreads false and erroneous information, affects people's daily lives, and can even threaten social stability. Judging the authenticity of digital multimedia files is therefore essential to information security. Multimedia digital forensics uses technical means to analyze the generation time (timestamp), generation place, tampering, and similar properties of a digital multimedia file in order to determine its authenticity and reliability, thereby exposing rumors and criminal acts involving multimedia files. It plays a vital role in judicial, criminal investigation, security, and other fields that involve multimedia content.

Multimedia forensics techniques fall into two categories: active and passive forensics. Active forensics embeds specific information, such as a digital watermark or a digital signature, into a file in advance, and later extracts and analyzes that information to verify copyright or detect tampering. Passive forensics identifies whether a multimedia file has been artificially modified from the characteristics of the related software, hardware, and file content, without any information having been embedded in advance. An unprocessed original multimedia file follows certain inherent regularities, and artificial tampering disturbs them and changes the file information. Such naturally occurring information, which requires no prior embedding, can therefore be regarded as a natural watermark of the multimedia file, and extracting it allows the authenticity and integrity of the file to be verified.

The power grid frequency (electric network frequency, ENF), i.e. the frequency at which power is transmitted over the grid, has a nominal value of 50 or 60 Hz. In recent years the ENF has become an important passive criterion for audio and video forensics. Existing research shows that, because of AC power supplies and the operation of electrical appliances, the power grid frequency and its harmonics can be picked up by audio and video recording equipment through the hum of electrical appliances and the flicker of electric lamps, so that the recordings contain the power grid frequency signal captured along with the content. Besides being easy to capture, the grid frequency signal has two important properties that make it usable as a natural forensic criterion: its instantaneous value fluctuates randomly within a small range around the nominal value, and the fluctuation pattern is consistent throughout the same power grid. Because of this random fluctuation and its consistency within a grid, the instantaneous frequency of a recorded reference grid signal forms a mapping to time, so that timestamp information can be obtained from the audio file to be detected.

Disclosure of Invention

The invention aims to provide a method and a system for tracing the source of the audio file generation time, which solve the problem of obtaining the timestamp information of the audio file to be detected.

The invention provides a dynamic source tracing matching method for audio file generation time, which comprises the following steps:

extracting a power grid frequency time domain signal in the audio to be detected by using a narrow-band-pass filter;

converting the time-domain power grid frequency signal into a time-frequency domain power grid frequency signal by using a short-time Fourier transform and extremum adding method; simultaneously, identifying a region with a smaller signal-to-noise ratio in the audio frequency to be detected, and performing noise reduction processing by using a short-time Fourier transform method for locally increasing the window length;

acquiring a power grid frequency reference signal, and converting the power grid frequency reference signal in a time domain into a power grid frequency reference signal in a time-frequency domain by using a short-time Fourier transform and extremum adding method;

and matching the power grid frequency signal of the time-frequency domain with the power grid frequency reference signal of the time-frequency domain by adopting a dynamic matching algorithm with the short-time Fourier transform frequency resolution as a threshold value, wherein the optimal matching position corresponds to the timestamp information of the audio to be detected.

Further, when a power grid frequency time domain signal in the audio to be detected is extracted, the band-pass filter is used for filtering noise and recording content in the audio to be detected.

Further, converting the time-domain power grid frequency signal into the time-frequency domain power grid frequency signal by using the short-time Fourier transform plus extremum method specifically includes:

dividing a long-time power grid frequency time domain signal into a plurality of equal-length short-time signals;

respectively calculating Fourier transform of each short-time signal, and drawing a transformed frequency spectrum into a function related to time;

and the maximum value of the frequency spectrum energy in each short-time signal is used as the instantaneous value of the power grid frequency at the moment, so that the power grid frequency signal of the time-frequency domain is obtained.

Further, an area with a small signal-to-noise ratio in the audio frequency to be detected is identified by setting a signal variance and a difference threshold.

Further, the corresponding grid frequency reference signal is determined by searching the time range.

Further, the threshold of the dynamic matching algorithm is 0.5 Δ f, and Δ f is the frequency resolution of the short-time fourier transform.

Further, matching the power grid frequency signal of the time-frequency domain with the power grid frequency reference signal of the time-frequency domain by adopting a dynamic matching algorithm with the short-time Fourier transform frequency resolution as a threshold value, the optimal matching position corresponding to the timestamp information of the audio to be detected, specifically comprises the following steps:

let x(n) be the power grid frequency signal of the time-frequency domain, with signal length N; r_k(n) is the grid frequency reference signal of the time-frequency domain, r_k(n) = r(n + k), with signal length L, L > N, where k is the optimal matching coefficient, i.e. the serial number of the matching region in the grid frequency reference signal; n = 0, 1, 2, ..., N-1;

S11, initializing n = 0, k = 0, P(k) = 0;

S12, setting a frequency resolution threshold η, and correcting the power grid frequency signal by using a correction signal c(n, k):

c(n, k) = r_k(n) - x(n), if |x(n) - r_k(n)| < η; c(n, k) = 0, otherwise;

the corrected grid frequency signal is: x(n) + c(n, k)

S13, judging whether n is equal to N-1;

if not, setting n = n + 1 and returning to step S12 to continue correcting the grid frequency signal in the matching region;

if yes, calculating the penalty coefficient P(k) = ∑_n c²(n, k) of the power grid frequency signal in the matching region and the corrected mean square error, i.e. the mean square error between x(n) + c(n, k) and r_k(n),

and proceeding to step S14;

S14, judging whether k is equal to L-N;

if not, setting k = k + 1 and n = 0, and returning to step S12 to process the next matching region;

if yes, go to step S15;

S15, judging whether the minimum of the corrected mean square error is unique;

if yes, outputting a k value which enables the mean square error to be minimum;

if not, outputting the k value which enables the penalty coefficient to be minimum.

The invention also provides a source tracing dynamic matching system for audio file generation time, which comprises:

the signal extraction module is used for extracting a power grid frequency time domain signal in the audio to be detected by using the narrow-band-pass filter;

the power grid signal module is used for converting the time-domain power grid frequency signal into a time-frequency domain power grid frequency signal by using a short-time Fourier transform plus extremum method, and at the same time identifying a region with a small signal-to-noise ratio in the audio to be detected and performing noise reduction processing by using a short-time Fourier transform with a locally increased window length;

the reference signal module is used for acquiring a power grid frequency reference signal, and converting the power grid frequency reference signal in a time domain into a power grid frequency reference signal in a time-frequency domain by using a short-time Fourier transform and extremum adding method;

and the dynamic matching module is used for matching the power grid frequency signal of the time-frequency domain with the power grid frequency reference signal of the time-frequency domain by adopting a dynamic matching algorithm with the short-time Fourier transform frequency resolution as a threshold value, the optimal matching position corresponding to the timestamp information of the audio to be detected.

The beneficial effects of the invention are as follows. In the audio file generation time tracing dynamic matching method and system, when the time-domain power grid frequency signal is converted into a time-frequency domain power grid frequency signal, regions of the audio to be detected with a low signal-to-noise ratio can be identified and denoised by a short-time Fourier transform with a locally increased window length, which improves the accuracy of timestamp acquisition. The dynamic matching algorithm makes audio timestamp identification more accurate and remains robust under strong noise and short recording durations, so that the timestamp information of the audio file to be detected can be obtained, the authenticity of the digital multimedia file judged, and information security safeguarded.

Further, when a power grid frequency time domain signal in the audio to be detected is extracted, the band-pass filter is used for filtering noise and recording content in the audio to be detected, so that the power grid frequency signal waveform is effectively extracted, and the timestamp information acquisition accuracy is improved; in the process of matching the audio to be detected, due to the influence of frequency resolution, the noise component is likely to cause deviation smaller than 0.5 delta f, and the peak value in the deviation amount is corrected to the center of a nominal main lobe to avoid noise interference; therefore, after the method is adopted for automatic correction, the influence of frequency resolution is reduced, and the audio time stamp identification becomes more accurate.

Drawings

FIG. 1 is a flowchart of an audio file generation time tracing dynamic matching method of the present invention;

FIG. 2 is a schematic diagram of a narrow band pass filter extracting a power grid frequency time domain signal in a to-be-detected audio frequency according to the present invention;

FIG. 3 is a schematic diagram of a short-time Fourier transform and extremum adding method of the present invention;

FIG. 4 is a time-frequency diagram of a main lobe of a grid frequency signal according to the present invention;

FIG. 5 is a flowchart of the ECM matching algorithm of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings in which:

The invention aims at dynamic matching for tracing the generation time of audio files: based on the power grid frequency signal, the timestamp information of the audio file to be detected is obtained through a locally increased window length and dynamic threshold matching. Its advantage is that the variation of the weak instantaneous power grid frequency in the audio recording is accurately compared with a reference signal database, so that the creation time of the recording is identified, and it can even be detected whether existing audio has been artificially tampered with, edited, or damaged.

The invention provides a dynamic source tracing matching method for audio file generation time, which comprises the following steps as shown in figure 1:

S1, extracting a power grid frequency time-domain signal from the audio to be detected by using a narrow-band band-pass filter;

S2, converting the time-domain power grid frequency signal into a time-frequency domain power grid frequency signal by using a short-time Fourier transform plus extremum method; at the same time, identifying a region with a small signal-to-noise ratio in the audio to be detected, and performing noise reduction processing by using a short-time Fourier transform with a locally increased window length;

S3, acquiring a power grid frequency reference signal, and converting the time-domain power grid frequency reference signal into a time-frequency domain power grid frequency reference signal by using the short-time Fourier transform plus extremum method;

S4, matching the power grid frequency signal of the time-frequency domain with the power grid frequency reference signal of the time-frequency domain by adopting a dynamic matching algorithm with the short-time Fourier transform frequency resolution as a threshold value, wherein the optimal matching position corresponds to the timestamp information of the audio to be detected.

Further, step S1 mainly uses a narrow-band band-pass filter to extract the power grid frequency signal from the audio to be detected. A band-pass filter with higher precision and lower computational cost can also be designed, as shown in FIG. 2, which, besides extracting the power grid frequency signal, effectively filters out the noise and the recorded content in the audio file, so that the subsequent timestamp identification can proceed smoothly.

Band-pass filters fall into two categories: finite impulse response (FIR) filters and infinite impulse response (IIR) filters. For an IIR filter, the distortion caused by its nonlinear phase response has some influence in application, although a speech signal is less sensitive to phase than an image signal. An FIR filter has a linear phase characteristic and can be designed by selecting different window functions or by frequency sampling. However, an FIR filter needs more coefficients and a higher order than an IIR filter, and too high an order reduces the efficiency of the filter. The respective performance indexes, such as the attenuation at the upper and lower pass-band cut-off frequencies, the stop-band attenuation, and the time delay, are analyzed and compared, and the optimal band-pass filter for each scenario is finally determined from these factors, so that power grid frequency forensics can be carried out more effectively.
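As an illustration of the extraction step, the following is a minimal Python sketch of a narrow band-pass stage. It assumes a single second-order IIR section in the widely used audio-EQ "cookbook" biquad form, centered on a 50 Hz nominal frequency. The function names, the 1000 Hz sampling rate, and the quality factor of 25 are illustrative assumptions, not values taken from the invention, and a practical design would compare FIR and IIR candidates as described above.

```python
import math

def bandpass_biquad(samples, fs, f0, q):
    """Second-order IIR band-pass with unity gain at f0 (assumed design)."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    # Coefficients normalised by a0; the b1 coefficient is zero for this form.
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def rms(values):
    """Root-mean-square level of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))

fs = 1000.0          # hypothetical sampling rate in Hz
n = 4000             # four seconds of samples
hum = [math.sin(2 * math.pi * 50 * i / fs) for i in range(n)]
content = [math.sin(2 * math.pi * 300 * i / fs) for i in range(n)]

hum_out = bandpass_biquad(hum, fs, 50.0, 25.0)
content_out = bandpass_biquad(content, fs, 50.0, 25.0)
# Compare steady-state levels after the start-up transient has decayed.
print(rms(hum_out[2000:]), rms(content_out[2000:]))
```

In this sketch the 50 Hz tone passes almost unchanged while the 300 Hz tone, standing in for recorded content, is strongly attenuated. A narrower pass band improves rejection but lengthens the filter's transient, which is one of the trade-offs the comparison of performance indexes has to weigh.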

Further, in step S2, the key point is to convert the time-domain grid frequency signal into a time-frequency domain grid frequency signal by a short-time fourier transform (STFT) plus extremum method. As shown in fig. 3, the short-time fourier transform can be considered as dividing a long time signal into equal length shorter time signals, then calculating the fourier transform of each shorter time signal, and plotting the transformed spectrum as a function of time. And the maximum value of the frequency spectrum energy in each section of signal is regarded as the instantaneous value of the power grid frequency at the moment, so that the power grid frequency signal of the time-frequency domain is obtained.
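The short-time Fourier transform plus extremum procedure can be sketched as follows. This is a minimal Python illustration under stated assumptions: non-overlapping rectangular frames, and a Fourier transform evaluated directly on a fine frequency grid around the nominal value (equivalent to zero-padding). The function name `enf_track`, the grid step, and the test signal are hypothetical and chosen only for illustration.

```python
import cmath
import math

def enf_track(samples, fs, frame_len, f_lo, f_hi, step_hz):
    """Per-frame frequency of maximum spectral energy within [f_lo, f_hi]."""
    n_bins = int(round((f_hi - f_lo) / step_hz)) + 1
    freqs = [f_lo + i * step_hz for i in range(n_bins)]
    track = []
    # Split the long signal into equal-length short-time frames.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        best_freq, best_energy = freqs[0], -1.0
        for f in freqs:
            # Fourier transform of the frame evaluated at frequency f.
            acc = 0j
            for i, x in enumerate(frame):
                acc += x * cmath.exp(-2j * math.pi * f * i / fs)
            energy = abs(acc) ** 2
            if energy > best_energy:
                best_freq, best_energy = f, energy
        # The spectral maximum is taken as the instantaneous grid frequency.
        track.append(best_freq)
    return track

fs = 400.0        # hypothetical sampling rate after band-pass filtering
frame_len = 400   # one-second frames
first = [math.sin(2 * math.pi * 50.1 * i / fs) for i in range(2 * frame_len)]
second = [math.sin(2 * math.pi * 49.9 * i / fs) for i in range(2 * frame_len)]
track = enf_track(first + second, fs, frame_len, 49.5, 50.5, 0.05)
print(track)
```

For this synthetic input the track should follow the frequency step from about 50.1 Hz to about 49.9 Hz, to within one grid step. Evaluating the transform only in a narrow band around the nominal frequency keeps the cost low; a production implementation would more likely use an FFT with a window function.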

Besides converting the time-domain power grid frequency signal into a time-frequency domain signal, when the signal-to-noise ratio of the recording to be detected is low it is also important to denoise the signal by using a short-time Fourier transform with a locally increased window length.

The signal-to-noise ratio in the audio file to be detected is expressed as:

in the formula, s(n) represents the power grid frequency signal component in the file to be detected that corresponds to the reference signal, and v(n) represents the noise in the file to be detected. Once the recording has been made, the SNR_ENF of the recording file is already determined. One criterion for describing whether the signal noise is large is the signal variance, which is about 4.56 × 10^-4 Hz² for the grid frequency reference signal and may increase by one to two orders of magnitude when the noise is large, so that the noise level can be effectively controlled by reducing the signal variance.

The Cramér-Rao lower bound of the grid frequency signal variance is:

in the formula, f_S represents the sampling frequency and N the window length of the short-time Fourier transform. With the signal-to-noise ratio and the sampling frequency unchanged, the Cramér-Rao lower bound is negatively correlated with the window length N: when the window length increases, the Cramér-Rao lower bound decreases, the signal variance decreases, and the noise is thereby controlled.
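The dependence on the window length can be illustrated numerically. The Python sketch below uses the standard textbook Cramér-Rao bound for the frequency of a single real sinusoid in white Gaussian noise, var(f) ≥ 12 f_S² / ((2π)² · SNR · N(N² - 1)), with SNR taken as the linear power ratio. This particular expression is an assumption brought in for illustration and may differ from the formula of the invention; the function name and the numbers are hypothetical.

```python
import math

def crlb_frequency_variance(snr_linear, fs, n):
    """Textbook Cramér-Rao bound (Hz^2) for a single sinusoid's frequency.

    Assumed form, not taken from the source: white Gaussian noise,
    linear SNR, window length n in samples, sampling rate fs in Hz.
    """
    return 12.0 * fs ** 2 / ((2.0 * math.pi) ** 2 * snr_linear * n * (n * n - 1))

fs = 400.0   # hypothetical sampling rate
snr = 0.1    # hypothetical linear SNR (-10 dB)
short_window = crlb_frequency_variance(snr, fs, 1000)
long_window = crlb_frequency_variance(snr, fs, 2000)
ratio = short_window / long_window
print(short_window, long_window, ratio)
```

Under this assumed bound, doubling the window length lowers the variance bound by roughly a factor of eight, which is why increasing the window only in noisy regions can reduce the variance there without giving up time resolution elsewhere.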

When the signal-to-noise ratio is low, both the signal variance and its difference become large, so whether noise control is needed for the audio file to be detected can be judged by setting thresholds on the signal variance and on the difference. The difference-based detection has no formula derivation; it is a detection criterion based on experimental results. After a region with high noise is identified, noise control is applied with a short-time Fourier transform whose window length is locally increased, so that the waveform of the power grid frequency signal is effectively extracted.
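One possible reading of the variance and difference thresholds is sketched below in Python. It assumes that the variance is computed over a short sliding window of the estimated frequency track and that the "difference" is the frame-to-frame first difference of that track; the source does not define either precisely. The function name, the window half-width, and both threshold values are hypothetical, with the variance threshold set one order of magnitude above the reference variance quoted above.

```python
def flag_noisy_frames(track, half_width, var_threshold, diff_threshold):
    """Mark frames whose local variance or first difference is too large."""
    flags = []
    for i in range(len(track)):
        lo = max(0, i - half_width)
        hi = min(len(track), i + half_width + 1)
        window = track[lo:hi]
        mean = sum(window) / len(window)
        variance = sum((v - mean) ** 2 for v in window) / len(window)
        # First difference with respect to the previous frame (0 for frame 0).
        diff = abs(track[i] - track[i - 1]) if i > 0 else 0.0
        flags.append(variance > var_threshold or diff > diff_threshold)
    return flags

# Hypothetical frequency track (Hz): quiet, then a noisy burst, then quiet.
track = [50.0, 50.01, 50.0, 49.99, 50.0, 50.4,
         49.6, 50.3, 50.0, 50.01, 50.0, 50.0]
flags = flag_noisy_frames(track, half_width=1,
                          var_threshold=4.56e-3, diff_threshold=0.1)
print(flags)
```

Frames flagged in this way would then be re-estimated with a longer analysis window, while unflagged frames keep the original window length.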

Further, step S3 mainly obtains the time-frequency domain grid frequency reference signal to be matched against the audio to be detected. The corresponding power grid frequency reference signal is determined by searching a time range, and the time-domain reference signal is converted into a time-frequency domain signal by the same STFT plus extremum method, so that it is consistent with the power grid frequency signal and convenient to match.
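Selecting the reference by time range, and mapping a matching position back to a timestamp, can be sketched as follows. This Python example assumes that the reference track is stored as one frequency value per fixed-length frame together with the wall-clock time of its first frame; the function names, the one-second frame period, and the dates are hypothetical.

```python
from datetime import datetime, timedelta

def slice_reference(ref_track, ref_start, frame_seconds, search_start, search_end):
    """Return the part of the reference inside the search range and its start time."""
    first = int((search_start - ref_start).total_seconds() // frame_seconds)
    last = int((search_end - ref_start).total_seconds() // frame_seconds)
    first = max(0, first)
    last = min(len(ref_track), last)
    segment_start = ref_start + timedelta(seconds=first * frame_seconds)
    return ref_track[first:last], segment_start

def match_to_timestamp(segment_start, frame_seconds, k):
    """Convert a matching position k (in frames) into a wall-clock time."""
    return segment_start + timedelta(seconds=k * frame_seconds)

ref_start = datetime(2020, 11, 4, 0, 0, 0)
ref_track = [50.0] * 3600                      # one hour of placeholder values
segment, segment_start = slice_reference(
    ref_track, ref_start, 1.0,
    datetime(2020, 11, 4, 0, 10, 0), datetime(2020, 11, 4, 0, 20, 0))
stamp = match_to_timestamp(segment_start, 1.0, 30)
print(len(segment), stamp)
```

Restricting the reference to a plausible time range shortens the search and reduces the chance of a spurious match far from the true recording time.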

Further, in step S4 a dynamic-threshold matching algorithm (ECM) is used, whose threshold is selected during matching according to the frequency resolution determined by the STFT window size. A penalty coefficient is introduced to monitor the automatic correction process, and the estimated timestamp is finally determined; the specific flow is shown in FIG. 5.

The uncertainty principle states that the time resolution and the frequency resolution of a signal cannot both be arbitrarily high, which forces a trade-off when selecting the STFT window size. Consequently, noise and interference components located near the ENF frequency bias the frequency estimate relative to the noise-free reference data, and this bias is unavoidable even when the noise and interference are weaker than the ENF signal.

The frequency resolution, and the offset it causes, is an important factor affecting the matching result. When the window size of the STFT is N_W samples and the sampling frequency is f_S Hz, the duration of the window segment is T_W = N_W/f_S seconds, and the corresponding frequency resolution of the segment is:

Δf = 1/T_W = f_S/N_W (Hz)

This means that any two frequency components closer than Δf cannot be resolved in this signal segment, and their peaks merge into a single peak. Without loss of generality, one of these frequency components may be regarded as the ENF signal to be extracted and the other as a noise component, which shifts the resulting peak within the frequency interval.

FIG. 4 shows the main lobe of the spectrum, whose width is approximately Δf. In this case a noise component is likely to cause a deviation of less than 0.5Δf, and a peak within this deviation should be corrected to the nominal main lobe center. If the noise component is so strong that the corresponding peak lies outside the main lobe region, the true peak is not recoverable. The threshold η is therefore chosen from the frequency resolution, i.e. η = 0.5Δf.
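As a small worked example of the threshold selection, the Python sketch below computes the frequency resolution of a window and the resulting threshold η = 0.5Δf. The function names and the numbers (a 400 Hz sampling rate and a 3200-sample window, i.e. 8 seconds) are hypothetical and serve only to show the arithmetic.

```python
def frequency_resolution(fs, window_len):
    """Frequency resolution in Hz of a window of window_len samples at rate fs."""
    return fs / window_len

def ecm_threshold(fs, window_len):
    """Correction threshold eta, taken as half the frequency resolution."""
    return 0.5 * frequency_resolution(fs, window_len)

fs = 400.0
window_len = 3200
delta_f = frequency_resolution(fs, window_len)
eta = ecm_threshold(fs, window_len)
print(delta_f, eta)
```

Because η follows the window length, a longer window gives a finer resolution and a tighter correction threshold, while a shorter window tolerates larger deviations.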

The ECM algorithm builds on the minimum mean square error (MMSE) criterion of the conventional algorithm. The mean square error between the reference signal and the power grid frequency signal to be detected is:

e(k) = (1/N) ∑_{n=0}^{N-1} [x(n) - r_k(n)]²

and the optimal matching index is the value of k, 0 ≤ k ≤ L-N, that minimizes e(k),

where x(n) is the power grid frequency signal extracted from the audio to be detected, x(n) = s(n) + v(n), of length N; r_k(n) = r(n + k) is the reference signal, of length L (L > N); and e(k) denotes the mean square error.

Because the uncertainty principle limits the frequency resolution in ENF estimation, the minimum mean square error criterion is improved as follows. In the matching process, a threshold η (in Hz) is defined to tolerate the difference between the ENF signal in the audio to be detected and the reference signal with which it is aligned, and the ENF signal is automatically corrected. The automatic correction rule is: for any given k, i.e. any matching region of the grid frequency reference signal, if |x(n) - r_k(n)| < η in the matching region, then x(n) is considered correctable and is corrected to r_k(n); otherwise x(n) is considered uncorrectable and still enters the MMSE computation unchanged. In this way all correctable x(n) within the matching region are corrected.

Thus, a correction signal c(n, k) and a penalty coefficient P(k) are introduced. The penalty coefficient can either be accumulated at each automatic correction, as shown in FIG. 5, or be computed after the automatic correction of a matching region is completed. The expressions for c(n, k) and P(k) are as follows:

c(n, k) = r_k(n) - x(n), if |x(n) - r_k(n)| < η; c(n, k) = 0, otherwise;

P(k) = ∑_n c²(n, k)

The corrected ENF signal of the audio to be detected is x(n) + c(n, k). The corrected mean square error is the mean square error computed with x(n) + c(n, k) in place of x(n), and the optimal matching coefficient is the value of k that minimizes it.

When the minimizing value of k is unique, it indicates the best-matching region. When it is not unique, i.e. several coefficients k₁, k₂, k₃, ... give the same minimum, the matching region is determined by the penalty coefficient: the optimal matching coefficient is the one among them with the smallest P(k).

This means that if several positions have the same MMSE, the one with the minimum penalty coefficient is selected as the estimated matching position.
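The matching flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the correction signal is taken as c(n, k) = r_k(n) - x(n) when |x(n) - r_k(n)| < η and 0 otherwise, the mean square error is normalised by N, and ties are detected by exact equality. The function name `ecm_match` and all data values are hypothetical.

```python
def ecm_match(x, r, eta):
    """Return (best k, corrected mean square errors, penalty coefficients)."""
    n_len = len(x)
    errors, penalties = [], []
    for k in range(len(r) - n_len + 1):       # k = 0 .. L-N
        err = 0.0
        pen = 0.0
        for n in range(n_len):
            d = x[n] - r[k + n]
            # Correctable samples are moved onto the reference value.
            c = -d if abs(d) < eta else 0.0
            pen += c * c                       # P(k) accumulates c^2(n, k)
            resid = d + c                      # x(n) + c(n, k) - r_k(n)
            err += resid * resid
        errors.append(err / n_len)
        penalties.append(pen)
    e_min = min(errors)
    candidates = [k for k, e in enumerate(errors) if e == e_min]
    if len(candidates) == 1:
        return candidates[0], errors, penalties
    # Several positions share the minimum error: use the smallest penalty.
    best = min(candidates, key=lambda k: penalties[k])
    return best, errors, penalties

# Hypothetical reference track (Hz) and a noisy copy of r[3:7].
r = [50.00, 50.02, 49.98, 50.05, 49.95, 50.01, 50.03, 49.97, 50.00, 50.04]
x = [50.06, 49.94, 50.02, 50.02]
k_best, errors, penalties = ecm_match(x, r, eta=0.0625)
print(k_best, errors[3], errors[6], penalties[3], penalties[6])
```

In this example every sample is correctable at both k = 3 and k = 6, so both positions reach a corrected error of zero, and the penalty coefficient selects k = 3 because it needed the smaller total correction. This shows why the penalty is required: correction alone can make several positions look equally good.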

Comprehensive performance analysis is carried out on the ECM algorithm in an experiment, and the method is found to have good anti-noise capability and good self-correcting capability. The use of the ECM algorithm allows audio time stamp identification to be more accurate, and the ECM algorithm also exhibits good robustness in the face of greater noise and shorter duration.


It will be understood by those skilled in the art that the foregoing is merely a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included within the scope of the present invention.

Claims

1. A dynamic source tracing matching method for audio file generation time is characterized by comprising the following steps:

matching the power grid frequency signal of the time-frequency domain with the power grid frequency reference signal of the time-frequency domain by adopting a dynamic matching algorithm with the short-time Fourier transform frequency resolution as a threshold value, wherein the optimal matching position corresponds to the timestamp information of the audio to be detected;

wherein matching the power grid frequency signal of the time-frequency domain with the power grid frequency reference signal of the time-frequency domain by adopting the dynamic matching algorithm with the short-time Fourier transform frequency resolution as a threshold value, the optimal matching position corresponding to the timestamp information of the audio to be detected, specifically comprises the following steps:

let x(n) be the power grid frequency signal of the time-frequency domain, with signal length N; r_k(n) is the grid frequency reference signal of the time-frequency domain, r_k(n) = r(n + k), with signal length L, L > N, where k is the optimal matching coefficient, i.e. the serial number of the matching region in the grid frequency reference signal; n = 0, 1, 2, ..., N-1;

S11, initializing n = 0, k = 0, P(k) = 0;

the corrected grid frequency signal is: x(n) + c(n, k)

S13, judging whether n is equal to N-1;

P(k) = ∑_n c²(n, k)

And proceeds to perform step S14;

S14, judging whether k is equal to L-N;

if yes, go to step S15;

S15, judging whether the minimum of the corrected mean square error is unique;

if yes, outputting a k value which enables the mean square error to be minimum;

2. The audio file generation time tracing dynamic matching method according to claim 1, wherein when extracting a power grid frequency time domain signal in the audio to be tested, a band-pass filter is used to filter out noise and recording contents in the audio to be tested.

3. The audio file generation time tracing dynamic matching method according to claim 1, wherein a method of short-time fourier transform plus extremum is used to convert a time domain power grid frequency signal into a time-frequency domain power grid frequency signal, and specifically comprises:

4. The audio file generation time tracing dynamic matching method according to claim 1, wherein a region with a small signal-to-noise ratio in the audio to be tested is identified by setting a signal variance and a difference threshold.

5. The audio file generation time tracing dynamic matching method of claim 1, wherein the corresponding grid frequency reference signal is determined by searching a time range.

6. The audio file generation time tracing dynamic matching method of claim 1, wherein the threshold of the dynamic matching algorithm is 0.5 Δ f, and Δ f is the frequency resolution of the short-time fourier transform.

7. An audio file generation time tracing dynamic matching system, comprising:

the dynamic matching module is used for matching the power grid frequency signal of the time-frequency domain with the power grid frequency reference signal of the time-frequency domain by adopting a dynamic matching algorithm with the short-time Fourier transform frequency resolution as a threshold value, the optimal matching position corresponding to the timestamp information of the audio to be detected, and specifically performs the following steps:

S11, initializing n = 0, k = 0, P(k) = 0;

the corrected grid frequency signal is: x(n) + c(n, k)

S13, judging whether n is equal to N-1;

if yes, calculating the penalty coefficient P(k) of the power grid frequency signal in the matching region and the corrected mean square error

P(k) = ∑_n c²(n, k)

And proceeds to perform step S14;

S14, judging whether k is equal to L-N;

if yes, go to step S15;

S15, judging whether the minimum of the corrected mean square error is unique;

if yes, outputting a k value which enables the mean square error to be minimum;