CN108510994A

CN108510994A - Method for detecting homologous audio tampering using inter-frame amplitude spectrum correlation between bytes

Info

Publication number: CN108510994A
Application number: CN201810072583.7A
Authority: CN
Inventors: 胡永健; 余颖娟; 刘琲贝; 贺前华
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2018-01-25
Filing date: 2018-01-25
Publication date: 2018-09-07
Anticipated expiration: 2038-01-25
Also published as: CN108510994B

Abstract

The invention discloses a method for detecting homologous audio tampering using inter-frame amplitude spectrum correlation between bytes. The method comprises pre-emphasizing the audio, framing and windowing, computing the zero-crossing rate of each frame, separating bytes, rejecting short bytes, computing the amplitude spectrum similarity between the frames of every two bytes, determining whether a copy-paste relationship exists between bytes, and locating the tampering. The method offers high detection accuracy, high localization accuracy, and low computational complexity.

Description

Method for detecting homologous audio tampering using inter-frame amplitude spectrum correlation between bytes

Technical field

The present invention relates to the technical field of audio forensics, and in particular to a method for detecting homologous audio tampering using inter-frame amplitude spectrum correlation between bytes.

Background art

With the widespread use and growing maturity of multimedia technology, information has become easier to obtain, which raises the question of how to verify whether multimedia information is complete and reliable. Effective tampering detection for multimedia data has become an important research topic in information security. Compared with images and video, tampering detection for digital audio has received less study. Among audio forgeries, homologous copy-paste tampering is the easiest to carry out and the most common: the tamperer copies segments of an audio recording and pastes them at other positions in the same recording, thereby changing its true meaning. If a tampered recording is presented as court evidence or passed off as confidential departmental information, the consequences can be serious. Because homologous copy-paste tampering operates within a single recording, it is both well concealed and easy to perform. Studying methods for detecting homologous copy-paste tampering is therefore important for safeguarding the originality, authenticity, and integrity of digital media information.

Summary of the invention

To overcome the shortcomings and deficiencies of the prior art, the present invention provides a method for detecting homologous audio tampering using inter-frame amplitude spectrum correlation between bytes.

The present invention adopts the following technical solution:

A method for detecting homologous audio tampering using inter-frame amplitude spectrum correlation between bytes includes the following steps:

S1: pre-emphasize the audio signal under test;

S2: frame and window the pre-emphasized audio, where the frame duration is m and the frame shift is n; the time-domain audio signal after framing and windowing is denoted y_l, where the frame number l = 1, 2, …, N_frame and N_frame is the number of audio frames;

S3: compute the zero-crossing rate zcr(l) of each framed and windowed audio frame;

S4: separate the bytes in the audio under test according to low-frequency spectral energy;

S5: reject invalid bytes, specifically: set a minimum byte duration threshold t_m and reject bytes whose duration is less than t_m, obtaining the set of valid bytes X = {x₁, x₂, x₃, …, x_M}, where x_i is the i-th byte and M is the number of valid bytes;

S6: compute, for every two bytes in the audio signal under test after invalid bytes have been rejected, the amplitude spectrum similarity between their frames;

S7: set a similarity threshold Th; if two or more frame pairs between two bytes x_i and x_j have an amplitude spectrum similarity greater than the threshold, determine that a copy-paste relationship exists between x_i and x_j;

S8: repeat steps S6 and S7 for all byte pairs i ≠ j ∈ {1, 2, …, M} to obtain all byte pairs with a copy-paste relationship, from which the copy-paste regions in the audio under test can be located.

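The zero-crossing rate is calculated as:

$$zcr(l) = \frac{1}{2} \sum_{k=2}^{K} \left| \operatorname{sgn}[y_l(k)] - \operatorname{sgn}[y_l(k-1)] \right|$$

where y_l(k) is the k-th data point of the l-th frame, K is the number of data points in each frame, and sgn[·] is the sign function:

$$\operatorname{sgn}[x] = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$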

In S4, the bytes in the audio under test are separated according to low-frequency spectral energy, specifically: apply an N_fft-point Fourier transform to each frame y_l of the audio signal under test to obtain the corresponding amplitude spectrum S(l, f), where f is the frequency bin index.

Then compute the mean low-frequency energy over all frames of the audio signal under test, and compute for each frame y_l the ratio NLFER of its low-frequency energy to that mean.

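The NLFER is:

$$NLFER(l) = \frac{\sum_{f=F_{0\_min}}^{F_{0\_max}} S(l,f)}{\frac{1}{N_{frame}} \sum_{l'=1}^{N_{frame}} \sum_{f=F_{0\_min}}^{F_{0\_max}} S(l',f)}$$

where, if the lower frequency limit of the low-frequency band is f_{0_min} Hz, the upper frequency limit is f_{0_max} Hz, and the sampling frequency is f_s, the corresponding lower and upper FFT bin limits are F_{0_min} = (f_{0_min} × 2 / f_s) × N_fft and F_{0_max} = (f_{0_max} / f_s) × N_fft respectively;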

An energy threshold is set: a frame whose NLFER value exceeds the threshold is classified as a speech frame, and otherwise as a noise frame. Consecutive speech frames make up a byte, which separates the bytes in the audio under test.

The window function in S2 is a Hamming window.

In S6, the amplitude spectrum similarity of two frames is computed only when the absolute difference of their zero-crossing rates is less than a given threshold T_zcr.

The frame duration m is chosen between 16 ms and 128 ms, and the frame shift n is 1/2 to 2/3 of the frame duration.

The amplitude spectrum similarity between two frames is measured with the Pearson correlation coefficient.

Beneficial effects of the present invention

(1) Existing algorithms do not distinguish speech segments from noise segments when detecting copy-paste regions. Since in practical applications only speech bytes normally carry semantic information, the present invention first extracts the valid bytes from the audio and then performs similarity matching on those bytes only, which substantially reduces computation time and also improves detection accuracy;

(2) Because correlation coefficients are expensive to compute, the present invention first uses the zero-crossing rate for a preliminary check of the similarity of two frames and computes the amplitude spectrum correlation coefficient only when their zero-crossing rates are close, which further reduces computation time.

Description of the drawings

Fig. 1 is the flow chart of the present invention;

Fig. 2 is the waveform of the original audio in the embodiment of the present invention;

Fig. 3 is the waveform of the copy-paste tampered audio in the embodiment of the present invention;

Fig. 4 is a plot of the zero-crossing rate of each frame of the tampered audio in the embodiment of the present invention;

Fig. 5 is the byte segmentation result in the embodiment of the present invention;

Fig. 6 is the tampering detection result in the embodiment of the present invention.

Detailed description

The present invention is described in further detail below with reference to an embodiment and the drawings, but the embodiments of the present invention are not limited thereto.

Embodiment

Fig. 1 shows the flow of the present invention, which comprises eight steps: audio pre-emphasis, framing and windowing, computing the zero-crossing rate of each frame, separating bytes, rejecting short bytes, computing the amplitude spectrum similarity between the frames of every two bytes, determining copy-paste relationships between bytes, and locating the tampering.

This embodiment takes a WAV audio recording as the object of analysis and walks through the detection process of the present invention. Fig. 2 shows the waveform of the original audio, whose content is a person saying "one two three four, 34". Fig. 3 shows the waveform of the tampered audio, whose content is "one two three four, 1"; the 5th and 6th bytes were copied and pasted from the 1st and 2nd bytes, so a copy relationship exists between the 1st and 5th bytes and between the 2nd and 6th bytes. Both recordings are sampled at 8 kHz. In this embodiment, the method of the present invention detects and locates the copy-paste positions in the tampered audio.

The method includes the following steps:

S1: Pre-emphasize the audio under test using a first-order high-pass digital filter with the response:

$$H(z) = 1 - u z^{-1}$$

Pre-emphasis boosts the high-frequency part of the signal to ease spectral analysis. It removes the effect of the vocal cords and lips during phonation, compensating for the high-frequency content that the articulatory system suppresses, and it also emphasizes the high-frequency formants. In this embodiment the pre-emphasis coefficient u is 0.97.
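The filter H(z) = 1 - u z^-1 corresponds to the time-domain difference equation y[n] = x[n] - u·x[n-1]. The following is a minimal sketch of that equation in plain Python. The function name `pre_emphasize` and the choice to pass the first sample through unchanged are illustrative assumptions, not details given in the text.

```python
def pre_emphasize(samples, u=0.97):
    """Apply y[n] = x[n] - u * x[n-1], the time-domain form of H(z) = 1 - u z^-1."""
    if not samples:
        return []
    # Assumption: the first sample has no predecessor and is passed through as-is.
    out = [float(samples[0])]
    for n in range(1, len(samples)):
        out.append(samples[n] - u * samples[n - 1])
    return out


# A constant (zero-frequency) input is almost entirely suppressed after the first sample.
emphasized = pre_emphasize([1.0, 1.0, 1.0, 1.0])
```

With u = 0.97, a constant input is attenuated to about 0.03 of its level, which illustrates the high-pass behaviour described above.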

S2: Frame and window the pre-emphasized audio, where the frame duration is m and the frame shift is n; a Hamming window may be used as the window function. The time-domain audio signal after framing and windowing is denoted y_l, where the frame number l = 1, 2, …, N_frame and N_frame is the number of audio frames.

The total number of frames N_frame of the pre-emphasized audio is given by:

$$N_{frame} = \left\lfloor \frac{t_s - m}{n} \right\rfloor + 1 \tag{3}$$

where ⌊·⌋ denotes rounding down to an integer, t_s is the duration of the audio under test, m is the frame duration with t_s > m > 0, and n is the frame shift with m > n > 0. The frame duration m is generally chosen between 16 ms and 128 ms. The frame shift n determines how much adjacent frames overlap and is generally 1/2 to 2/3 of the frame duration, so that the transition between frames is smooth. Trailing data at the end of the audio that does not fill a whole frame is discarded. In this embodiment the tampered audio lasts 5984 ms, the frame duration is 128 ms, and the frame shift is 1/2 of the frame length, so each frame contains 128 ms × 8 kHz = 1024 data points and, by formula (3), the audio has 92 frames. Each frame is windowed with a Hamming window.
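The frame count and the windowing step can be checked with a short sketch. It assumes the frame count is floor((t_s - m) / n) + 1, which reproduces the 92 frames stated for this embodiment, and it uses the standard Hamming coefficients 0.54 and 0.46. The helper names `frame_count`, `hamming` and `split_frames` are hypothetical.

```python
import math


def frame_count(total_ms, frame_ms, shift_ms):
    """N_frame = floor((t_s - m) / n) + 1; a trailing partial frame is dropped."""
    return (total_ms - frame_ms) // shift_ms + 1


def hamming(length):
    """Hamming window of the given length (length must be greater than 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (length - 1))
            for k in range(length)]


def split_frames(samples, frame_len, shift):
    """Cut samples into overlapping frames and window each one."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append([samples[start + k] * window[k] for k in range(frame_len)])
    return frames


fs_khz = 8                                  # sampling rate of the embodiment
n_frames = frame_count(5984, 128, 64)       # durations in milliseconds
samples_per_frame = 128 * fs_khz            # 128 ms at 8 kHz
silence = [0.0] * (5984 * fs_khz)           # placeholder signal of the same length
frames = split_frames(silence, samples_per_frame, samples_per_frame // 2)
```

Both the closed-form count and the actual slicing yield the same number of frames for the embodiment's parameters.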

S3: Compute the zero-crossing rate zcr(l) of each framed and windowed audio frame:

$$zcr(l) = \frac{1}{2} \sum_{k=2}^{K} \left| \operatorname{sgn}[y_l(k)] - \operatorname{sgn}[y_l(k-1)] \right| \tag{4}$$

where y_l(k) is the k-th data point of the l-th frame, K is the number of data points in each frame, and sgn[·] is the sign function given by formula (5):

$$\operatorname{sgn}[x] = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases} \tag{5}$$

Fig. 4 shows the zero-crossing rate of each frame of the tampered audio. The zero-crossing rates of the frames in the 1st and 5th bytes, and in the 2nd and 6th bytes, which have a copy relationship, are close to each other.
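A minimal sketch of the per-frame zero-crossing computation follows. It assumes the unnormalized form, that is, half the sum of absolute sign differences between neighbouring samples, so the result is a count of sign changes per frame. Treating a sample equal to zero as non-negative is also an assumption.

```python
def sgn(x):
    """Sign function; zero is treated as non-negative (an assumption)."""
    return 1 if x >= 0 else -1


def zero_crossing_rate(frame):
    """Number of sign changes between consecutive samples of one frame."""
    total = sum(abs(sgn(frame[k]) - sgn(frame[k - 1]))
                for k in range(1, len(frame)))
    return total // 2   # each sign change contributes 2 to the sum


alternating = zero_crossing_rate([1, -1, 1, -1])     # sign flips at every step
steady = zero_crossing_rate([1, 2, 3])               # no sign change
mixed = zero_crossing_rate([0.5, -0.5, -0.2, 0.3])   # two sign changes
```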

S4: Separate the bytes in the audio under test according to low-frequency spectral energy. Apply an N_fft-point Fourier transform to each frame y_l of the audio under test to obtain the corresponding amplitude spectrum S(l, f), where f is the frequency bin index. Compute the mean low-frequency energy over all frames of the audio, and for each frame y_l compute the ratio of its low-frequency energy to that mean, the NLFER (Normalized Low Frequency Energy Ratio):

$$NLFER(l) = \frac{\sum_{f=F_{0\_min}}^{F_{0\_max}} S(l,f)}{\frac{1}{N_{frame}} \sum_{l'=1}^{N_{frame}} \sum_{f=F_{0\_min}}^{F_{0\_max}} S(l',f)}$$

where, if the lower frequency limit of the low-frequency band is f_{0_min} Hz, the upper frequency limit is f_{0_max} Hz, and the sampling frequency is f_s, the corresponding lower and upper FFT bin limits in the NLFER formula are F_{0_min} = (f_{0_min} × 2 / f_s) × N_fft and F_{0_max} = (f_{0_max} / f_s) × N_fft respectively. Because silent segments are dominated by high-frequency noise, a suitable threshold can be applied to the NLFER values: a frame whose NLFER value exceeds the threshold is judged to be a voiced frame, and otherwise a silent frame. Consecutive voiced frames make up a byte, which separates the bytes in the audio under test.

In this embodiment, the total number of frames N_frame is 92, the lower frequency limit of the low-frequency band f_{0_min} is 60 Hz, the upper frequency limit f_{0_max} is 400 Hz, and the Fourier transform length N_fft is 8192. The lower FFT bin limit is therefore F_{0_min} = (f_{0_min} × 2 / f_s) × N_fft ≈ 123, and the upper FFT bin limit is F_{0_max} = (f_{0_max} / f_s) × N_fft ≈ 410.
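The two bin limits quoted for this embodiment can be reproduced with a few lines of arithmetic. The helper name `fft_bin` is hypothetical, and rounding to the nearest integer is an assumption consistent with the "approximately equal" values in the text.

```python
def fft_bin(freq_hz, fs_hz, n_fft):
    """Map a frequency in Hz to a (fractional) FFT bin index."""
    return freq_hz / fs_hz * n_fft


fs_hz, n_fft = 8000, 8192
f0_min_hz, f0_max_hz = 60, 400

# The lower limit uses twice f0_min, as in the expression for F_{0_min}.
f_low = round(fft_bin(2 * f0_min_hz, fs_hz, n_fft))    # 122.88 before rounding
f_high = round(fft_bin(f0_max_hz, fs_hz, n_fft))       # 409.6 before rounding
```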

An energy threshold is set: a frame whose NLFER value exceeds the threshold is classified as a speech frame, and otherwise as a noise frame. Consecutive speech frames make up a byte. In this embodiment the energy threshold is 0.75.
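The speech/noise decision can be sketched as follows, taking precomputed amplitude spectra as input so that no FFT routine is needed. This sketch assumes that the "low-frequency energy" of a frame is the plain sum of its amplitude spectrum over the low-frequency bins; a squared-magnitude sum would be an equally plausible reading. The names `nlfer` and `speech_mask` and the toy spectra are illustrative only.

```python
def nlfer(spectra, f_low, f_high):
    """Low-band sum of each frame divided by the mean low-band sum of all frames."""
    low = [sum(frame[f_low:f_high + 1]) for frame in spectra]
    mean_low = sum(low) / len(low)   # assumed non-zero for real audio
    return [value / mean_low for value in low]


def speech_mask(spectra, f_low, f_high, threshold=0.75):
    """True for frames whose NLFER exceeds the threshold (speech), else False."""
    return [value > threshold for value in nlfer(spectra, f_low, f_high)]


# Toy amplitude spectra with 4 bins per frame; bins 1..2 play the low-frequency band.
toy_spectra = [
    [0.1, 0.1, 0.1, 0.9],   # mostly high-frequency content
    [0.2, 2.0, 2.0, 0.1],   # strong low band
    [0.1, 1.8, 1.9, 0.2],   # strong low band
    [0.1, 0.2, 0.1, 0.8],   # mostly high-frequency content
]
mask = speech_mask(toy_spectra, 1, 2)
```

Because each value is divided by the mean over all frames, the NLFER values average to 1 by construction, so a threshold of 0.75 selects frames whose low-band content is at least three quarters of the average.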

S5: Reject overly short invalid bytes.

Environmental noise can produce overly short invalid bytes in the audio. Set a minimum byte duration threshold t_m and reject bytes whose duration is less than t_m. In this embodiment t_m is the duration of one frame, i.e. 128 ms, and 8 valid bytes are obtained; the byte set is denoted X = {x₁, x₂, x₃, …, x₈}. Fig. 5 shows the final byte segmentation of the audio under test in this embodiment; the portions where the plotted value is 1 indicate valid bytes.
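Grouping consecutive speech frames into bytes and discarding short ones can be sketched as below. The sketch assumes that a byte spanning frames start..end lasts m + (end - start)·n milliseconds, and it uses a hypothetical threshold of 192 ms purely so that the toy example rejects something; the embodiment itself uses 128 ms. All function names are illustrative.

```python
def group_bytes(mask):
    """Return (start, end) frame numbers, 1-based inclusive, of each run of speech frames."""
    runs, start = [], None
    for idx, is_speech in enumerate(mask, start=1):
        if is_speech and start is None:
            start = idx
        elif not is_speech and start is not None:
            runs.append((start, idx - 1))
            start = None
    if start is not None:          # a run that reaches the last frame
        runs.append((start, len(mask)))
    return runs


def byte_duration_ms(run, frame_ms, shift_ms):
    """Duration covered by a run of overlapping frames (assumed formula)."""
    start, end = run
    return frame_ms + (end - start) * shift_ms


def drop_short(runs, frame_ms, shift_ms, t_m_ms):
    """Keep only bytes whose duration is not less than t_m."""
    return [r for r in runs if byte_duration_ms(r, frame_ms, shift_ms) >= t_m_ms]


mask = [False, True, False, False, True, True, True, True, False, True, True]
runs = group_bytes(mask)
kept = drop_short(runs, frame_ms=128, shift_ms=64, t_m_ms=192)  # 192 ms is hypothetical
```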

S6: Compute, for every two bytes in the audio signal under test after invalid bytes have been rejected, the amplitude spectrum similarity between their frames.

The amplitude spectrum similarity of two frames is measured with the Pearson correlation coefficient:

$$\rho(l,k) = \frac{(S_l - \bar{S}_l) \cdot (S_k - \bar{S}_k)}{\lVert S_l - \bar{S}_l \rVert \, \lVert S_k - \bar{S}_k \rVert}$$

where y_l and y_k are frames in bytes x_i and x_j respectively, S_l and S_k are their amplitude spectrum vectors S(l, f) and S(k, f), · denotes the inner product, and the overbar denotes the vector mean.

Choose two bytes x_i and x_j from X, where byte x_i consists of the frame set I = {y_l | l = α_i … β_i} and byte x_j consists of the frame set J = {y_k | k = α_j … β_j}, and compute one by one the amplitude spectrum similarity between each frame in I and each frame in J. To reduce computation, first check whether the zero-crossing rates of the two frames are close: the amplitude spectrum similarity is computed only when the absolute difference of their zero-crossing rates is less than a given threshold T_zcr.

In this embodiment, the start and end frame numbers of the 8 bytes are listed in Table 1.

Table 1: Start frame number α_i and end frame number β_i of the 8 bytes

Byte	1	2	3	4	5	6	7	8
α_i	5	18	30	42	50	63	73	83
β_i	8	22	34	44	53	67	76	86
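The frame ranges in Table 1 determine how many frame-to-frame comparisons an exhaustive search needs. The short check below counts the byte pairs and the frame pairs across them; the totals agree with the 28 byte matches and the 504 correlation coefficient computations reported for this embodiment when no pre-screening is used. Variable names are illustrative.

```python
from itertools import combinations

alpha = [5, 18, 30, 42, 50, 63, 73, 83]   # start frame numbers from Table 1
beta = [8, 22, 34, 44, 53, 67, 76, 86]    # end frame numbers from Table 1

# Number of frames in each byte (both ends inclusive).
frames_per_byte = [b - a + 1 for a, b in zip(alpha, beta)]

# Every unordered pair of distinct bytes is compared once.
byte_pairs = list(combinations(range(len(alpha)), 2))

# Each pair of bytes requires one comparison per pair of frames.
frame_comparisons = sum(frames_per_byte[i] * frames_per_byte[j]
                        for i, j in byte_pairs)
```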

In this embodiment the threshold T_zcr is 60. As Table 2 shows, pre-screening with the short-time zero-crossing rate markedly reduces the number of amplitude spectrum correlation coefficients that must be computed, and thus the run time of the byte comparison stage of the detection algorithm.

Table 2: Computation with and without zero-crossing rate pre-screening

Condition	Correlation coefficient computations	Comparison stage run time (s)
With zero-crossing rate pre-screening	247	0.045
Without zero-crossing rate pre-screening	504	0.085
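The pre-screening idea can be sketched as a cheap gate in front of the Pearson correlation. In this sketch the amplitude spectra are plain lists, `gated_similarity` returns None when the gate rejects a frame pair, and a constant spectrum (zero variance) is mapped to a correlation of 0.0; all three choices are assumptions made for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den if den else 0.0   # assumption: flat spectra count as uncorrelated


def gated_similarity(spec_l, spec_k, zcr_l, zcr_k, t_zcr=60):
    """Compute the correlation only if the zero-crossing rates differ by less than t_zcr."""
    if abs(zcr_l - zcr_k) >= t_zcr:
        return None                    # skipped: no correlation is computed
    return pearson(spec_l, spec_k)


# Two spectra that differ only by a scale factor, with close zero-crossing rates.
same = gated_similarity([1.0, 3.0, 2.0, 5.0], [2.0, 6.0, 4.0, 10.0], 120, 131)
# Zero-crossing rates far apart: the expensive step is skipped.
skipped = gated_similarity([1.0, 3.0, 2.0, 5.0], [5.0, 2.0, 3.0, 1.0], 120, 260)
```

The Pearson coefficient is invariant to scaling and offset, so a pasted segment whose amplitude was changed uniformly would still correlate strongly with its source.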

Table 3 gives the amplitude spectrum correlation coefficients between the frames of the 1st and 2nd bytes in this embodiment, and Table 4 gives those between the frames of the 1st and 5th bytes.

Table 3: Amplitude spectrum correlation coefficients between the frames of the 1st and 2nd bytes

ρ(l,k)	l=5	l=6	l=7	l=8
k=18	-0.1714	-0.0982	-0.1675	-0.2620
k=19	-0.0258	-0.0604	-0.0635	0.0603
k=20	0.3999	0.1888	0.1817	0.1821
k=21	0.6535	0.1008	0.0198	0.2024
k=22	0.3120	0.0654	-0.0458	0.0818

Table 4: Amplitude spectrum correlation coefficients between the frames of the 1st and 5th bytes

ρ(l,k)	l=5	l=6	l=7	l=8
k=50	0.9090	0.3784	0.0654	0.2240
k=51	0.0979	0.9654	0.5834	0.0851
k=52	-0.0275	0.3679	0.9603	0.5527
k=53	0.3039	0.1110	0.2994	0.9417

Comparing Table 3 and Table 4 shows that the inter-frame correlation coefficients between two bytes without a copy relationship are small, whereas those between two bytes with a copy relationship are larger; in particular, the coefficients on the diagonal of Table 4 are close to 1.

S7: Set a similarity threshold Th. If two or more frame pairs compared in step S6 have an amplitude spectrum correlation coefficient greater than the threshold, determine that a copy-paste relationship exists between the bytes x_i and x_j to which they belong. In this embodiment the threshold Th is 0.94. Table 3 shows that all inter-frame amplitude spectrum correlation coefficients between the 1st and 2nd bytes are below the threshold, so it is determined that no copy-paste relationship exists between these two bytes. Table 4 shows that, when the 1st and 5th bytes are compared, 3 frame pairs have an amplitude spectrum correlation coefficient greater than Th, so it is determined that a copy-paste relationship exists between these two bytes.
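The decision rule can be applied directly to the coefficients listed in Tables 3 and 4. The sketch below counts the entries above Th = 0.94 and requires at least two of them; the function name `is_copy_paste` and its return shape are illustrative.

```python
def is_copy_paste(rho, th=0.94, min_pairs=2):
    """Return (decision, hits): hits is the number of coefficients above th."""
    hits = sum(1 for row in rho for value in row
               if value is not None and value > th)
    return hits >= min_pairs, hits


# Table 3: 1st byte (frames 5..8, columns) against 2nd byte (frames 18..22, rows).
table3 = [
    [-0.1714, -0.0982, -0.1675, -0.2620],
    [-0.0258, -0.0604, -0.0635, 0.0603],
    [0.3999, 0.1888, 0.1817, 0.1821],
    [0.6535, 0.1008, 0.0198, 0.2024],
    [0.3120, 0.0654, -0.0458, 0.0818],
]

# Table 4: 1st byte (frames 5..8, columns) against 5th byte (frames 50..53, rows).
table4 = [
    [0.9090, 0.3784, 0.0654, 0.2240],
    [0.0979, 0.9654, 0.5834, 0.0851],
    [-0.0275, 0.3679, 0.9603, 0.5527],
    [0.3039, 0.1110, 0.2994, 0.9417],
]

bytes_1_2 = is_copy_paste(table3)
bytes_1_5 = is_copy_paste(table4)
```

For Table 4, the entry 0.9090 falls below the threshold while the other three diagonal entries exceed it, which gives the 3 frame pairs mentioned above.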

S8: Repeat S6 and S7 for all byte pairs i ≠ j ∈ {1, 2, …, M} to obtain all byte pairs with a copy-paste relationship, from which the copy-paste regions in the audio under test can be located.

This embodiment has 8 bytes, so 28 pairwise matches are required. The result is that a copy-paste relationship exists between the 1st and 5th bytes and between the 2nd and 6th bytes, which locates the copy-paste regions in the audio under test. Fig. 6 shows the detection result of this embodiment, which agrees with the actual tampering and demonstrates the effectiveness of the present invention.
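Once the matching byte pairs are known, their frame ranges can be converted to time intervals to report where the copied and pasted regions lie. The sketch below assumes that frame 1 starts at time 0 and that consecutive frames advance by the frame shift; the text does not state this convention, so the resulting millisecond values are illustrative. The frame ranges come from Table 1.

```python
def byte_time_span_ms(alpha, beta, frame_ms=128, shift_ms=64):
    """Time interval (start, end) in ms covered by frames alpha..beta, 1-based."""
    start = (alpha - 1) * shift_ms            # assumption: frame 1 starts at t = 0
    end = (beta - 1) * shift_ms + frame_ms    # the last frame extends a full frame length
    return start, end


# Detected pairs in the embodiment: bytes 1 and 5, bytes 2 and 6 (frame ranges from Table 1).
detected = [((5, 8), (50, 53)), ((18, 22), (63, 67))]
regions = [(byte_time_span_ms(*first), byte_time_span_ms(*second))
           for first, second in detected]
```

Under these assumptions the two members of each detected pair span intervals of equal length, as expected for a segment that was copied and pasted unchanged.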

The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by it. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and falls within the scope of protection of the present invention.

Claims

1. A method for detecting homologous audio tampering using inter-frame amplitude spectrum correlation between bytes, characterized by comprising the following steps:

S1: pre-emphasize the audio signal under test;

S2: frame and window the pre-emphasized audio, where the frame duration is m and the frame shift is n; the time-domain audio signal after framing and windowing is denoted y_l, where the frame number l = 1, 2, …, N_frame and N_frame is the number of audio frames;

S3: compute the zero-crossing rate zcr(l) of each framed and windowed audio frame;

S4: separate the bytes in the audio under test according to low-frequency spectral energy;

S5: reject invalid bytes, specifically: set a minimum byte duration threshold t_m and reject bytes whose duration is less than t_m, obtaining the set of valid bytes X = {x₁, x₂, x₃, …, x_M}, where x_i is the i-th byte and M is the number of valid bytes;

S6: compute, for every two bytes in the audio signal under test after invalid bytes have been rejected, the amplitude spectrum similarity between their frames;

S7: set a similarity threshold Th; if two or more frame pairs between two bytes x_i and x_j have an amplitude spectrum similarity greater than the threshold, determine that a copy-paste relationship exists between x_i and x_j;

S8: repeat S6 and S7 for all byte pairs i ≠ j ∈ {1, 2, …, M} to obtain all byte pairs with a copy-paste relationship, from which the copy-paste regions in the audio under test can be located.

2. The method for detecting homologous audio tampering according to claim 1, characterized in that the zero-crossing rate is calculated as:

$$zcr(l) = \frac{1}{2} \sum_{k=2}^{K} \left| \operatorname{sgn}[y_l(k)] - \operatorname{sgn}[y_l(k-1)] \right|$$

where y_l(k) is the k-th data point of the l-th frame, K is the number of data points in each frame, and sgn[·] is the sign function:

$$\operatorname{sgn}[x] = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$

3. The method for detecting homologous audio tampering according to claim 1, characterized in that, in S4, the bytes in the audio under test are separated according to low-frequency spectral energy, specifically: apply an N_fft-point Fourier transform to each frame y_l of the audio signal under test to obtain the corresponding amplitude spectrum S(l, f), where f is the frequency bin index;

then compute the mean low-frequency energy over all frames of the audio signal under test, and compute for each frame y_l the ratio NLFER of its low-frequency energy to that mean.

4. The method for detecting homologous audio tampering according to claim 3, characterized in that the NLFER is:

$$NLFER(l) = \frac{\sum_{f=F_{0\_min}}^{F_{0\_max}} S(l,f)}{\frac{1}{N_{frame}} \sum_{l'=1}^{N_{frame}} \sum_{f=F_{0\_min}}^{F_{0\_max}} S(l',f)}$$

where, if the lower frequency limit of the low-frequency band is f_{0_min} Hz, the upper frequency limit is f_{0_max} Hz, and the sampling frequency is f_s, the corresponding lower and upper FFT bin limits are F_{0_min} = (f_{0_min} × 2 / f_s) × N_fft and F_{0_max} = (f_{0_max} / f_s) × N_fft respectively;

an energy threshold is set, a frame whose NLFER value exceeds the threshold is classified as a speech frame and otherwise as a noise frame, and consecutive speech frames make up a byte, which separates the bytes in the audio under test.

5. The method for detecting homologous audio tampering according to claim 1, characterized in that the window function in S2 is a Hamming window.

6. The method for detecting homologous audio tampering according to claim 1, characterized in that, in S6, the amplitude spectrum similarity of two frames is computed only when the absolute difference of their zero-crossing rates is less than a given threshold T_zcr.

7. The method for detecting homologous audio tampering according to claim 1, characterized in that the frame duration m is chosen between 16 ms and 128 ms, and the frame shift n is 1/2 to 2/3 of the frame duration.

8. The method for detecting homologous audio tampering according to claim 1, characterized in that the amplitude spectrum similarity between two frames is measured with the Pearson correlation coefficient.