CN103730128A - Audio clip authentication method based on frequency spectrum SIFT feature descriptor - Google Patents

Audio clip authentication method based on frequency spectrum SIFT feature descriptor

Info

Publication number
CN103730128A
CN103730128A (application CN201210389030.7A)
Authority
CN
China
Prior art keywords
audio
frequency
delta
sift
suspicious
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210389030.7A
Other languages
Chinese (zh)
Inventor
李伟
殷玥
董旭炯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201210389030.7A priority Critical patent/CN103730128A/en
Publication of CN103730128A publication Critical patent/CN103730128A/en
Pending legal-status Critical Current

Landscapes

  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The invention belongs to the technical field of information security and protection, and relates to an audio clip authentication method based on spectral SIFT feature descriptors, in particular a content authentication method that builds on computer vision techniques and takes audio clips as the detection objects. SIFT feature matching is used to extract feature descriptors and to align the suspicious audio clip under examination within a reference audio file; the suspicious clip is divided into several blocks using the time-stretching factors derived from matched SIFT key points; these time-domain blocks can be used directly to recognize malicious cropping, malicious insertion and similar operations; a pitch-shifting factor is also estimated and used to describe the corresponding time-frequency units, which makes robust hash computation straightforward. Through matching and hash detection, not only can the integrity and authenticity of the suspicious audio clip be authenticated accurately, but the positions of malicious tampering operations can also be precisely located and classified by type.

Description

An audio clip authentication method based on spectral SIFT feature descriptors
Technical field
The invention belongs to the field of information security and protection, specifically audio authentication; it relates to an audio clip authentication method based on spectral SIFT feature descriptors, and in particular to a content authentication method that builds on computer vision techniques and takes audio clips as the detection objects.
Background art
Audio content authentication is an effective technique for verifying and protecting the integrity and authenticity of audio data such as music and speech. Its main purpose is to guarantee that the data obtained by the receiver of an audio transmission has not been maliciously edited or tampered with by a third party during transport, i.e. that, from the point of view of the human perceptual system, it is identical to the original audio. Unlike traditional signature-based authentication, multimedia authentication for audio and similar media protects the content of the file rather than merely its bit stream. Audio authentication currently has important applications in many fields, such as national security, trade secrets, news recording, music recording and distribution, copyright protection, and military communications.
To date, only a small number of audio authentication methods have been published; they are summarized as follows.
Document [1] proposed a semi-fragile speech digital watermark for content integrity checking, namely an exponential-scale parity modulation technique. The method embeds the watermark in the DFT domain and needs no auxiliary data for its integrity check, and it can distinguish malicious tampering from content-preserving operations; however, it was tested against only a few admissible operations such as resampling, white-noise contamination and speech coding.
Document [2] proposes a feature-based authentication method built on the principle that two audio signals of similar acoustic quality also have highly similar masking curves. A hash value of the audio masking curve is first computed and then embedded into the audio signal as a watermark using a known data-hiding method. At detection time, the extracted watermark is compared with the hash value computed from the received signal, and their correlation coefficient is calculated. Because this coefficient degrades moderately as the acoustic quality degrades, a decision threshold can be set according to the acceptable quality standard. The method can distinguish audio signal processing such as MP3 compression from malicious tampering.
Document [3] has been introduced two kinds of methods for audio content authentication.The first has been discussed possible audio frequency characteristics, to allow several follow-up signals to process; The second, for obtaining highest security, detects the change of each bit, and carrys out reconstruct original audio by introducing the concept of reversible water mark; The method and then again in conjunction with digital signature and digital watermarking, and use key to produce the method that can openly verify and can rebuild original audio.
The method proposed in document [4] is based on audio fingerprints and verifies the integrity of an audio file by combining a robust hash function with robust watermarking. The experiments mainly targeted MP3 compression distortion; at higher bit rates, such as 128 kbps and above, the bit error rate stays below 7%, but at low bit rates such as 32 kbps the bit error rate is around 40%.
Documents [5] and [6] apply distributed source coding to preserving audio/video quality and detecting malicious attacks. Document [5] uses a reference audio and achieves robust audio authentication through Slepian-Wolf encoding and decoding, but it assumes in advance that the audio to be verified is already aligned with the original reference audio. Document [6] uses a compact hash signature, reducing the storage space of the reference database by 20%-70%.
All of the above audio authentication methods operate on audio to be verified whose length equals that of the reference audio, whereas in practical applications the audio to be verified is often only a fragment. The present invention therefore provides a new, previously unstudied technique: audio fragment content authentication. Building on conventional audio authentication, audio fragment authentication matches and aligns a short audio clip against a reference audio containing the original material, and then obtains the authentication result using hashing or watermarking algorithms.
References related to the present invention:
[1] C. P. Wu and C. C. Kuo, "Fragile speech watermarking based on exponential scale quantization for tamper detection," ICASSP 2002, pp. 3305-3308.
[2] R. Radhakrishnan and N. Memon, "Audio content authentication based on psycho-acoustic model," SPIE Security and Watermarking of Multimedia Contents, 4675:110-117, 2002.
[3] M. Steinebach and J. Dittmann, "Watermarking-based digital audio data authentication," EURASIP Journal on Applied Signal Processing, 10:1001-1015, 2003.
[4] S. Zmudzinski and M. Steinebach, "Perception-based audio authentication watermarking in the time-frequency domain," IH 2009, pp. 146-160.
[5] D. Varodayan, Y. C. Lin and B. Girod, "Audio authentication based on distributed source coding," ICASSP 2008, pp. 225-228.
[6] G. Valenzise, G. Prandi, M. Tagliasacchi and A. Sarti, "Identification of sparse audio tampering using distributed source coding and compressive sensing techniques," EURASIP Journal on Image and Video Processing, 2009:1-12.
Summary of the invention
The object of the invention is to propose a new audio content authentication method for the field of information protection and authentication, specifically an audio clip authentication method based on spectral SIFT feature descriptors, and in particular a content authentication method that builds on computer vision techniques and takes audio clips as the detection objects.
The audio content authentication method proposed by the present invention is based on computer vision techniques. The problem the invention solves is the detection of the integrity and authenticity of audio fragments in audio authentication. The invention uses SIFT (Scale Invariant Feature Transform) feature matching to extract feature descriptors and to align the suspicious audio fragment under examination within a reference audio file; the time-stretching factors extracted from the matched SIFT key points are then used to divide the suspicious fragment into several blocks. The time-domain blocks can be used directly to recognize malicious cropping, insertion and similar operations; in addition, an estimated pitch-shifting factor describes the corresponding time-frequency units, which makes robust hash computation straightforward. Through matching and hash detection, not only can the integrity and authenticity of the suspicious audio fragment be identified precisely, but the positions of malicious tampering can also be located accurately and classified by type.
Compared with traditional authentication methods that require the complete audio under test, the present invention needs only an audio fragment to judge its authenticity and integrity, which better fits practical applications. The method is divided into three parts: fragment alignment based on spectrogram SIFT local descriptors (steps 1-4), robust hash value computation (steps 5-6) and authentication decision (step 7). The concrete steps are as follows:
Step 1: Use the Short-Time Fourier Transform (STFT) to convert the one-dimensional audio signal into the corresponding two-dimensional time-frequency representation, and keep the low-to-mid frequency band of 100-3000 Hz so as to cover the frequency range of speech and most musical instruments; the STFT time window length is 4096 samples.
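The patent gives no reference code; the following Python sketch, using SciPy, illustrates how Step 1 could be realized under the stated parameters. The function name `spectrogram_band` and the hop size are illustrative assumptions.

```python
# Minimal sketch of Step 1 (illustrative, not from the patent): convert a mono
# signal to a 2-D time-frequency representation with a 4096-sample STFT window
# and keep only the 100-3000 Hz band.
import numpy as np
from scipy.signal import stft

def spectrogram_band(signal, sample_rate, f_lo=100.0, f_hi=3000.0, win_len=4096):
    # Magnitude STFT; the hop size (win_len // 2, SciPy's default overlap) is an
    # assumption, since the patent only fixes the window length.
    freqs, times, Z = stft(signal, fs=sample_rate, nperseg=win_len)
    mag = np.abs(Z)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return freqs[band], times, mag[band, :]
```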
Step 2: Compute the feature descriptors
Compute the 128-dimensional SIFT feature descriptors of the suspicious audio signal and of the reference audio signal, and obtain the matched key points (matched SIFT key points) by comparing the two sets of descriptors. Let the number of matched pairs be N, denoted $\{(P_i^D, P_i^R)\}_{i=1}^{N}$. The leftmost and rightmost match points of the suspicious audio and of the reference audio are then

$$T_0^D = \min\{P_1^D.x, \ldots, P_N^D.x\},\qquad T_1^D = \max\{P_1^D.x, \ldots, P_N^D.x\},$$
$$T_0^R = \min\{P_1^R.x, \ldots, P_N^R.x\},\qquad T_1^R = \max\{P_1^R.x, \ldots, P_N^R.x\}.$$
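As an illustration of Step 2, the sketch below matches SIFT key points between the two spectrograms; the patent does not prescribe a particular SIFT implementation, so the OpenCV calls, the log scaling of the spectrogram image and the Lowe ratio test (0.75) are all assumptions.

```python
# Illustrative sketch of Step 2: SIFT key-point matching on the two spectrograms
# rendered as 8-bit grayscale images.
import cv2
import numpy as np

def matched_keypoints(spec_d, spec_r):
    def to_img(spec):
        # Assumed log scaling before quantizing to 8 bits.
        s = np.log1p(spec)
        return np.uint8(255 * (s - s.min()) / (s.max() - s.min() + 1e-12))
    img_d, img_r = to_img(spec_d), to_img(spec_r)

    sift = cv2.SIFT_create()
    kp_d, des_d = sift.detectAndCompute(img_d, None)
    kp_r, des_r = sift.detectAndCompute(img_r, None)

    # Brute-force matching with Lowe's ratio test (ratio 0.75 is an assumption).
    matcher = cv2.BFMatcher()
    good = [m for m, n in matcher.knnMatch(des_d, des_r, k=2)
            if m.distance < 0.75 * n.distance]

    # x-coordinates (time axis) of the N matched pairs, i.e. P_i^D.x and P_i^R.x;
    # T0/T1 are then simply their min and max.
    xd = np.array([kp_d[m.queryIdx].pt[0] for m in good])
    xr = np.array([kp_r[m.trainIdx].pt[0] for m in good])
    return xd, xr
```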
Step 3: Let the length of the suspicious audio fragment be $L_D$ and the length of the reference audio be $L_R$. The distance between the left boundary of the fragment and the leftmost SIFT feature point is $T_0^D$, and the distance between the right boundary and the rightmost SIFT feature point is $L_D - T_1^D$; the corresponding mapped distances $\Delta_0^R$ and $\Delta_1^R$ in the reference audio are computed with formula (1):

$$\Delta_0^R = T_0^D\,\frac{\overline{\Delta^R}}{\overline{\Delta^D}},\qquad \Delta_1^R = (L_D - T_1^D)\,\frac{\overline{\Delta^R}}{\overline{\Delta^D}} \tag{1}$$

where $\overline{\Delta^D} = \frac{1}{N-1}\sum_{i=1}^{N-1}(P_{i+1}^D.x - P_i^D.x)$ and $\overline{\Delta^R} = \frac{1}{N-1}\sum_{i=1}^{N-1}(P_{i+1}^R.x - P_i^R.x)$.
Step 4: Sort the SIFT key points in ascending chronological order; the position of the suspicious audio fragment within the reference audio can then be located with formula (2):

$$T_{\mathrm{start}}^R = (T_0^R - \Delta_0^R)\times\frac{P_R \times SR_R}{L_R},\qquad T_{\mathrm{end}}^R = (T_1^R + \Delta_1^R)\times\frac{P_R \times SR_R}{L_R} \tag{2}$$

where $L_R$, $SR_R$ and $P_R$ are respectively the number of frames, the sampling rate and the duration (in seconds) of the reference audio.
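A minimal sketch of Steps 3-4 under the formulas above; the function `locate_fragment` and its argument names are illustrative.

```python
# Illustrative implementation of formulas (1) and (2).
import numpy as np

def locate_fragment(xd, xr, len_d, ref_frames, ref_sr, ref_seconds):
    # xd, xr: time-axis coordinates of matched key points in the suspicious and
    # reference spectrograms; len_d: fragment length L_D in frames.
    xd, xr = np.sort(xd), np.sort(xr)
    t0_d, t1_d, t0_r, t1_r = xd[0], xd[-1], xr[0], xr[-1]

    # Mean spacing between consecutive matched key points (formula (1) ratio).
    mean_d, mean_r = np.mean(np.diff(xd)), np.mean(np.diff(xr))

    # Formula (1): map the left/right margins of the fragment into the reference.
    delta0_r = t0_d * mean_r / mean_d
    delta1_r = (len_d - t1_d) * mean_r / mean_d

    # Formula (2): convert reference frame positions to time positions.
    scale = (ref_seconds * ref_sr) / ref_frames
    return (t0_r - delta0_r) * scale, (t1_r + delta1_r) * scale
```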
Step 5: Preprocessing against audio attacks (time scaling, pitch shifting, time-domain cropping and insertion, etc.)
(1) Time-domain blocking: divide the suspicious audio and the reference audio respectively into several small blocks so that the time-stretching factor within each block is approximately constant; this makes it easy to find maliciously cropped or inserted fragments and simplifies the robust hash computation that follows.
Concrete processing is as follows:
For the matched SIFT key-point pairs $\{(P_i^D, P_i^R)\}_{i=1}^{N}$, the distance $L_i^{\{D,R\}}$ between two consecutive points in the time domain and the corresponding time-stretching factor $R_i$ are defined as follows:

$$L_i^{\{D,R\}} = P_{i+1}^{\{D,R\}}.x - P_i^{\{D,R\}}.x,\qquad R_i = \frac{L_i^D}{L_i^R},\qquad i = 1, 2, \ldots, N-1 \tag{3}$$

The set $\{R_i\}$ is then divided into several subsets $\{C_i\}$ such that any two consecutive subsets, e.g. $C_i = \{R_j, \ldots, R_k\}$ and $C_{i+1} = \{R_{k+1}, \ldots, R_l\}$, have a significant gap between their mean time-stretching factors.
According to the SIFT key points assigned to each subset, and using the fragment matching method of step 2, the suspicious audio fragment and the reference audio fragment are divided into corresponding block pairs $\{B_1, B_2, \ldots, B_M\}$, as shown in Fig. 2.
Fig. 2 illustrates the details of the time-domain blocking. If the suspicious audio has only undergone timing-irrelevant signal processing such as lossy compression, all the obtained time-stretching factors $R_i$ are approximately equal ($\approx 1$), and the suspicious audio and the corresponding reference audio each form a single time block. If the suspicious audio has undergone time scaling, all the $R_i$ are still approximately equal (all $> 1$ or all $< 1$), and each side again forms a single block. If the middle third of the suspicious audio has been cut out, the $R_i$ values on either side of the cut become smaller and form a concave point; in this case both audio signals are divided into three blocks, and the left and right sections, unaffected by the cut, still align with each other.
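The sketch below illustrates formula (3) and one possible blocking rule for Step 5(1); the grouping threshold `rel_gap` is an assumption, since the patent only requires a significant gap between the mean factors of consecutive subsets.

```python
# Illustrative time-stretching factors and blocking for Step 5(1).
import numpy as np

def time_stretch_blocks(xd, xr, rel_gap=0.2):
    order = np.argsort(xd)            # keep matched pairs ordered in time
    xd, xr = xd[order], xr[order]
    l_d, l_r = np.diff(xd), np.diff(xr)
    ratios = l_d / l_r                # R_i of formula (3)

    blocks, start = [], 0
    for i in range(1, len(ratios)):
        # Start a new block when R_i jumps away from the running block mean
        # by more than the assumed relative gap.
        block_mean = np.mean(ratios[start:i])
        if abs(ratios[i] - block_mean) > rel_gap * block_mean:
            blocks.append((start, i))
            start = i
    blocks.append((start, len(ratios)))
    # A dip (concave point) in `ratios` signals a cut, a bump an insertion.
    return ratios, blocks
```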
(2) Frequency alignment: estimate the pitch-shifting factor from the matched SIFT key points, and from it compute the correspondence between the frequencies of the suspicious audio spectrum and those of the reference audio spectrum.
Concrete processing is as follows:
Under a pitch-shifting attack, the frequency values of the suspicious audio change proportionally with respect to the original values of the reference audio; therefore, before the frequency content can be described, the pitch-shifting factor (Pitch-shifting Factor) must first be estimated.
For a matched pair of SIFT key points $(P_i^D, P_i^R)$, the frequency contents can be expressed as $F_i^D$ and $F_i^R$, and the pitch-shifting factor is obtained from

$$\hat{R} = \operatorname{median}(\{\hat{R}_i \mid i = 1, 2, \ldots, N\}) \tag{4}$$

where $\hat{R}_i = F_i^R / F_i^D$ is the ratio of the fundamental frequencies of the reference audio and the suspicious audio for the $i$-th matched pair. The concrete correspondence is shown in Fig. 3.
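A minimal sketch of formula (4); `fd` and `fr` stand for the frequency values $F_i^D$ and $F_i^R$ of the matched pairs.

```python
# Illustrative pitch-shifting factor of formula (4): the median of the
# per-pair frequency ratios (reference over suspicious).
import numpy as np

def pitch_shift_factor(fd, fr):
    ratios = np.asarray(fr, dtype=float) / np.asarray(fd, dtype=float)
    return np.median(ratios)
```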
Step 6: Robust hash computation
The Philips algorithm is adopted to compute the hash codes. For each pair of corresponding blocks $(B_i^D, B_i^R)$, the fragment lengths $W_i^{\{D,R\}}$ and the frequency ranges $[F_{\mathrm{start}}^{\{D,R\}}, F_{\mathrm{end}}^{\{D,R\}}]$ are first adjusted with formula (5):

$$W_i^R = W_i^D\,\overline{R}_i,\quad i = 1, 2, \ldots, M;\qquad F_{\mathrm{start}}^R = F_{\mathrm{start}}^D\,\hat{R},\qquad F_{\mathrm{end}}^R = F_{\mathrm{end}}^D\,\hat{R} \tag{5}$$

where $\overline{R}_i$ and $\hat{R}$ denote respectively the average time-stretching factor of the corresponding block and the average pitch-shifting factor.

Let $E^{\{D,R\}}(k, n)$ denote the energy of the $k$-th frequency sub-band and the $n$-th time frame of the spectrum. The frequency band $[F_{\mathrm{start}}^{\{D,R\}}, F_{\mathrm{end}}^{\{D,R\}}]$ is divided into 33 non-overlapping sub-bands, and the 32-bit hash code of each region is computed with formula (6):

$$H^{\{D,R\}}(k, n) = \begin{cases} 1, & \text{if } E^{\{D,R\}}(k, n) > E^{\{D,R\}}(k+1, n) \\ 0, & \text{if } E^{\{D,R\}}(k, n) \le E^{\{D,R\}}(k+1, n) \end{cases} \tag{6}$$

where $k = 1, 2, \ldots, 32$ and $n = 1, 2, \ldots, N_f$; the frame boundaries of $n$ are determined by the block lengths $W_i^{\{D,R\}}$.
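The sketch below illustrates the sub-band energy comparison of formula (6), in the spirit of the Philips scheme the patent adopts; the linear placement of the 33 sub-band edges is an assumption.

```python
# Illustrative 32-bit hash per frame from 33 sub-band energies (formula (6)).
import numpy as np

def robust_hash(spec, n_subbands=33):
    # spec: magnitude spectrogram restricted to [F_start, F_end], shape (freq, frames).
    n_freq, n_frames = spec.shape
    edges = np.linspace(0, n_freq, n_subbands + 1).astype(int)
    energy = np.array([np.sum(spec[edges[k]:edges[k + 1], :] ** 2, axis=0)
                       for k in range(n_subbands)])        # shape (33, n_frames)

    codes = np.zeros(n_frames, dtype=np.uint32)
    for k in range(32):                                     # k = 1..32 in the patent
        bit = (energy[k, :] > energy[k + 1, :]).astype(np.uint32)
        codes |= bit << k
    return codes
```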
Step 7: Modification type detection
(1) Malicious cropping/insertion of fragments: by checking whether the time-stretching factor curve contains concave or convex points, judge whether the audio file has been maliciously cropped or had material inserted. If the detected time-stretching factor is approximately 1 or a fixed constant, the audio signal has undergone only timing-irrelevant signal processing or uniform time scaling of the whole signal; if malicious cropping/insertion has occurred, a concave/convex point appears at the corresponding position on the time-stretching factor curve, as shown in Fig. 4 and Fig. 5.
(2) Malicious frequency modification: construct histograms of the suspicious signal and the reference signal from the SIFT key points, and judge by comparing the histograms whether the suspicious file has undergone malicious frequency modification. For example, bandwidth truncation is a typical malicious frequency modification: at the spectral positions corresponding to the modified region, the number of SIFT key points matched with the reference audio drops markedly. Accordingly, the invention builds a histogram of the matched SIFT key points over frequency and compares the two histograms to decide whether such tampering has occurred. As shown in Fig. 6, the horizontal axis divides the 100-3000 Hz range into 30 bands and the vertical axis is the number of matched SIFT key points per band; comparing the left and right plots, the right plot has almost no match points in the 800-900 Hz band, so this band has very likely been subjected to frequency tampering.
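As an illustration of this histogram comparison, the following sketch bins the matched key-point frequencies into 30 bands (cf. Fig. 6) and flags bands where the match count collapses; the `min_ref` support threshold is an assumption, not a value from the patent.

```python
# Illustrative histogram check for Step 7(2).
import numpy as np

def keypoint_histogram(match_freqs, f_lo=100.0, f_hi=3000.0, n_bins=30):
    hist, _ = np.histogram(match_freqs, bins=n_bins, range=(f_lo, f_hi))
    return hist

def suspicious_bands(hist_before, hist_after, min_ref=3):
    # Bands that had matches before the operation but (almost) none afterwards
    # are candidates for frequency tampering such as bandwidth truncation.
    return [k for k, (b, a) in enumerate(zip(hist_before, hist_after))
            if b >= min_ref and a == 0]
```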
(3) Content modification: use the bit error rate (BER) to judge whether the file has been subjected to malicious content modification; the threshold T and the decision rule are defined with formula (7),
$$\mathrm{BER} = \frac{1}{N_f}\sum_{n=1}^{N_f} H_D(n) \oplus H_R(n) \tag{7}$$
If BER ≤ T, authentication passes, indicating that the examined audio has not been maliciously tampered with and the content integrity of the file is intact; if BER > T, authentication fails, indicating that the audio has been maliciously tampered with. In one embodiment of the invention, a given suspicious audio fragment $x_D(\cdot)$ is compared with a longer reference audio $x_R(\cdot)$ to detect whether $x_D(\cdot)$ has been attacked or tampered with: time-domain cropping/insertion and frequency-domain bandwidth truncation are judged respectively from the time-stretching factor and from the histogram of matched SIFT key points over frequency; if neither is detected, the robust hash values of $x_D(\cdot)$ and $x_R(\cdot)$ are compared. If the BER (bit error rate) is below the threshold, authentication passes, meaning the suspicious audio has not been subjected to semantic tampering and may only have undergone content-preserving operations such as lossy compression, TSM and pitch shifting; if the BER is above the threshold, authentication fails.
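A sketch of the BER decision of formula (7); the bit error rate here is normalized over all 32 bits of every frame, and the threshold value is an assumed placeholder, since the patent leaves T as a design parameter.

```python
# Illustrative BER computation and threshold decision for Step 7(3).
import numpy as np

def authenticate(h_d, h_r, threshold=0.25):
    # h_d, h_r: aligned 32-bit hash streams of equal length.
    diff = np.bitwise_xor(h_d.astype(np.uint32), h_r.astype(np.uint32))
    ber = np.unpackbits(diff.view(np.uint8)).sum() / (32 * len(h_d))
    return ber, ber <= threshold   # True: authentication passes
```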
Brief description of the drawings
Fig. 1: SIFT key-point matching between the spectrum of the suspicious audio fragment and the spectrum of the reference audio file.
Fig. 2: Illustration of time-domain blocking detection based on the time-stretching factor when the suspicious audio has been cropped.
Fig. 3: Frequency alignment between the block spectra of the suspicious audio and the reference audio.
Fig. 4: Change of the time-stretching factor curve caused by cropping.
Fig. 5: Change of the time-stretching factor curve caused by insertion.
Fig. 6: Histograms of the number of matched SIFT key points per frequency band before and after bandwidth truncation; the left plot corresponds to before truncation, the right plot to after truncation.
Embodiment
To verify the validity of the above method, the following experiments were carried out.
Embodiment 1
The test database comprised 1030 audio files covering a variety of speech and music signals. Each file is 2 minutes long, in WAVE format, with a 44.1 kHz sampling rate, mono. The suspicious audio fragments to be authenticated are 10 s long and were cut at random from the original audio files above. In this embodiment, the audio content authentication system built according to the method described above was first tested for its authentication pass rate on content-preserving operations (True Positive Rate, TPR) and its authentication failure rate on malicious tampering operations (True Negative Rate, TNR); both measure the accuracy of the authentication system, and the larger the two values, the higher the system's accuracy.
Table 1 lists the modification types and their corresponding TPR/TNR.
The results show that, for the content-preserving operations MP3 compression (32 kbps), TSM within ±10%, pitch shifting within ±20%, and low-pass filtering at 4 kHz and 8 kHz, the authentication pass rate of the system remains at an accuracy level of at least 81%; when TSM is increased to ±20%, the accuracy drops somewhat but still stays around 80%. It should be pointed out that even without any attack, the authentication system built on SIFT feature matching cannot guarantee 100% authentication accuracy: of the 1030 unmodified audio fragments tested, only about 968 pass authentication. The reason is that matching cannot always locate the fragment to be authenticated at its exact position in the reference audio signal, so authentication can fail even in the absence of any tampering.
For the detection of the three kinds of time-domain tampering (replacement, cropping and insertion), the authentication failure rates are 99.4%, 99.6% and 100% respectively, demonstrating the effectiveness of the time-stretching factor for identifying time-domain tampering. For detecting malicious bandwidth truncation in the frequency domain (blocked bands of 800-900 Hz and 1500-1600 Hz respectively), the accuracies of 82.9% and 76.7% are relatively lower than in the time domain.
Two error statistics important for a practical authentication system were also measured: one is the FPR (False Positive Rate), the proportion of tampered audio mistakenly judged as content-preserving, i.e. type I error; the other is the FNR (False Negative Rate), the proportion of content-preserving audio mistakenly judged as tampered, i.e. type II error. The confusion matrix of Table 2 illustrates these system metrics.
Table 2: Confusion matrix of the authentication error rates.
Table 2 shows that, out of 14420 content-preserving operations in total, 2340 were mistakenly judged as tampering, so the FNR is 0.1623. The reason is that content-preserving operations can still change the time-frequency representation of the fragment and thus affect the accuracy of fragment alignment; in addition, to meet the requirements of authentication, the threshold is set so that borderline content-preserving cases tend to be judged as tampering. Out of 3090 tampering operations in total, 10 were mistakenly judged as content-preserving, so the FPR is 0.00324. The matrix of Table 2 shows that an audio authentication system using this method can effectively discriminate tampering from content-preserving operations.
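The quoted rates follow directly from the counts above, as this small check shows:

```python
# Reproducing the error rates from the counts quoted in the text.
fnr = 2340 / 14420   # content-preserving trials judged as tampering
fpr = 10 / 3090      # tampering trials judged as content-preserving
print(round(fnr, 4), round(fpr, 5))   # -> 0.1623 0.00324
```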

Claims (2)

1. An audio fragment authentication method based on spectral SIFT feature descriptors, characterized in that it comprises: fragment alignment steps (1-4) based on spectrogram SIFT local descriptors, robust hash value computation steps (5-6) and an authentication decision step (7):
Step 1: Use the Short-Time Fourier Transform (STFT) to convert the one-dimensional audio signal into the corresponding two-dimensional time-frequency representation, and keep the low-to-mid frequency band of 100-3000 Hz;
Step 2: Compute the feature descriptors
Compute the 128-dimensional SIFT feature descriptors of the suspicious audio signal and of the reference audio signal, and obtain the matched key points by comparing the two sets of descriptors; let the number of matched pairs be N, denoted $\{(P_i^D, P_i^R)\}_{i=1}^{N}$; the leftmost and rightmost match points of the suspicious audio and of the reference audio are expressed as

$$T_0^D = \min\{P_1^D.x, \ldots, P_N^D.x\},\qquad T_1^D = \max\{P_1^D.x, \ldots, P_N^D.x\},$$
$$T_0^R = \min\{P_1^R.x, \ldots, P_N^R.x\},\qquad T_1^R = \max\{P_1^R.x, \ldots, P_N^R.x\};$$
Step 3: Let the length of the suspicious audio fragment be $L_D$ and the length of the reference audio be $L_R$; the distance between the left boundary of the fragment and the leftmost SIFT feature point is $T_0^D$, and the distance between the right boundary and the rightmost SIFT feature point is $L_D - T_1^D$; the corresponding mapped distances $\Delta_0^R$ and $\Delta_1^R$ in the reference audio are obtained with formula (1):

$$\Delta_0^R = T_0^D\,\frac{\overline{\Delta^R}}{\overline{\Delta^D}},\qquad \Delta_1^R = (L_D - T_1^D)\,\frac{\overline{\Delta^R}}{\overline{\Delta^D}} \tag{1}$$

where $\overline{\Delta^D} = \frac{1}{N-1}\sum_{i=1}^{N-1}(P_{i+1}^D.x - P_i^D.x)$ and $\overline{\Delta^R} = \frac{1}{N-1}\sum_{i=1}^{N-1}(P_{i+1}^R.x - P_i^R.x)$;
Step 4: Sort the SIFT key points in ascending chronological order and locate the position of the suspicious audio fragment within the reference audio with formula (2):

$$T_{\mathrm{start}}^R = (T_0^R - \Delta_0^R)\times\frac{P_R \times SR_R}{L_R},\qquad T_{\mathrm{end}}^R = (T_1^R + \Delta_1^R)\times\frac{P_R \times SR_R}{L_R} \tag{2}$$

where $L_R$, $SR_R$ and $P_R$ are respectively the number of frames, the sampling rate and the duration (in seconds) of the reference audio;
Step 5: Preprocessing against audio attacks:
Time-domain blocking: divide the suspicious audio and the reference audio respectively into several small blocks so that the time-stretching factor within each block is approximately constant, which makes it easy to find maliciously cropped or inserted fragments and simplifies the subsequent robust hash computation;
Frequency alignment: estimate the pitch-shifting factor from the matched SIFT key points, and from it compute the correspondence between the frequencies of the suspicious audio spectrum and those of the reference audio spectrum;
Step 6: Robust hash computation
The Philips method is adopted to compute the hash codes; for each pair of corresponding blocks $(B_i^D, B_i^R)$, the fragment lengths $W_i^{\{D,R\}}$ and the frequency ranges $[F_{\mathrm{start}}^{\{D,R\}}, F_{\mathrm{end}}^{\{D,R\}}]$ are first adjusted with formula (3):

$$W_i^R = W_i^D\,\overline{R}_i,\quad i = 1, 2, \ldots, M;\qquad F_{\mathrm{start}}^R = F_{\mathrm{start}}^D\,\hat{R},\qquad F_{\mathrm{end}}^R = F_{\mathrm{end}}^D\,\hat{R} \tag{3}$$

where $\overline{R}_i$ and $\hat{R}$ denote respectively the average time-stretching factor of the corresponding block $(B_i^D, B_i^R)$ and the average pitch-shifting factor;
let $E^{\{D,R\}}(k, n)$ denote the energy of the $k$-th frequency sub-band and the $n$-th time frame of the spectrum; the frequency band $[F_{\mathrm{start}}^{\{D,R\}}, F_{\mathrm{end}}^{\{D,R\}}]$ is divided into 33 non-overlapping sub-bands, and the 32-bit hash code of each region is computed with formula (4):

$$H^{\{D,R\}}(k, n) = \begin{cases} 1, & \text{if } E^{\{D,R\}}(k, n) > E^{\{D,R\}}(k+1, n) \\ 0, & \text{if } E^{\{D,R\}}(k, n) \le E^{\{D,R\}}(k+1, n) \end{cases} \tag{4}$$

where $k = 1, 2, \ldots, 32$ and $n = 1, 2, \ldots, N_f$, and the frame boundaries of $n$ are determined by the block lengths $W_i^{\{D,R\}}$;
Step 7: Modification type detection
Malicious cropping/insertion of fragments: by checking whether the time-stretching factor curve contains concave or convex points, judge whether the audio file has been maliciously cropped or had material inserted;
Malicious frequency modification: construct histograms of the suspicious signal and the reference signal from the SIFT key points, and judge by comparing the histograms whether the suspicious file has undergone malicious frequency modification;
Content modification: use the bit error rate to judge whether the file has been subjected to malicious content modification; the threshold T and the decision rule are defined with formula (5),
$$\mathrm{BER} = \frac{1}{N_f}\sum_{n=1}^{N_f} H_D(n) \oplus H_R(n) \tag{5}$$
if BER ≤ T, authentication passes, indicating that the examined audio has not been maliciously tampered with and the content integrity of the file is intact; if BER > T, authentication fails, indicating that the audio has been maliciously tampered with.
2. The method according to claim 1, characterized in that the attacks of said step 5 comprise: time scaling, pitch shifting, and time-domain cropping or insertion.
CN201210389030.7A 2012-10-13 2012-10-13 Audio clip authentication method based on frequency spectrum SIFT feature descriptor Pending CN103730128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210389030.7A CN103730128A (en) 2012-10-13 2012-10-13 Audio clip authentication method based on frequency spectrum SIFT feature descriptor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210389030.7A CN103730128A (en) 2012-10-13 2012-10-13 Audio clip authentication method based on frequency spectrum SIFT feature descriptor

Publications (1)

Publication Number Publication Date
CN103730128A true CN103730128A (en) 2014-04-16

Family

ID=50454174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210389030.7A Pending CN103730128A (en) 2012-10-13 2012-10-13 Audio clip authentication method based on frequency spectrum SIFT feature descriptor

Country Status (1)

Country Link
CN (1) CN103730128A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1169199A (en) * 1995-01-26 1997-12-31 苹果电脑公司 System and method for generating and using context dependent subsyllable models to recognize a tonal language
US20120132056A1 (en) * 2010-11-29 2012-05-31 Wang Wen-Nan Method and apparatus for melody recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIANGYANG XUE ET AL.: "Towards content-based audio fragment authentication", MM '11: Proceedings of the 19th ACM International Conference on Multimedia, 31 December 2011 (2011-12-31), pages 1249-1252 *
TANG CHAOWEI ET AL.: "An improved SIFT descriptor and its performance analysis", Geomatics and Information Science of Wuhan University, vol. 37, no. 1, 31 January 2012 (2012-01-31), pages 11-16 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105227311A (en) * 2014-07-01 2016-01-06 腾讯科技(深圳)有限公司 Verification method and system
CN104134443A (en) * 2014-08-14 2014-11-05 兰州理工大学 Symmetrical ternary string represented voice perception Hash sequence constructing and authenticating method
CN104134443B (en) * 2014-08-14 2017-02-08 兰州理工大学 Symmetrical ternary string represented voice perception Hash sequence constructing and authenticating method
CN104361889A (en) * 2014-10-28 2015-02-18 百度在线网络技术(北京)有限公司 Audio file processing method and device
CN104361889B (en) * 2014-10-28 2018-03-16 北京音之邦文化科技有限公司 Method and device for processing audio file
CN107785023A (en) * 2016-08-25 2018-03-09 财团法人资讯工业策进会 Voiceprint identification device and voiceprint identification method thereof
CN108665905A (en) * 2018-05-18 2018-10-16 宁波大学 A kind of digital speech re-sampling detection method based on band bandwidth inconsistency
CN108766464A (en) * 2018-06-06 2018-11-06 华中师范大学 Digital audio based on mains frequency fluctuation super vector distorts automatic testing method
CN108766464B (en) * 2018-06-06 2021-01-26 华中师范大学 Digital audio tampering automatic detection method based on power grid frequency fluctuation super vector
CN109284717A (en) * 2018-09-25 2019-01-29 华中师范大学 Detection method and system for copy-paste tampering operations on digital audio
CN115798490A (en) * 2023-02-07 2023-03-14 西华大学 Audio watermark implantation method and device based on SIFT
CN115798490B (en) * 2023-02-07 2023-04-21 西华大学 Audio watermark implantation method and device based on SIFT transformation

Similar Documents

Publication Publication Date Title
CN103730128A (en) Audio clip authentication method based on frequency spectrum SIFT feature descriptor
Renza et al. Authenticity verification of audio signals based on fragile watermarking for audio forensics
US6674861B1 (en) Digital audio watermarking using content-adaptive, multiple echo hopping
US6738744B2 (en) Watermark detection via cardinality-scaled correlation
KR100492743B1 (en) Method for inserting and detecting watermark by a quantization of a characteristic value of a signal
Özer et al. Perceptual audio hashing functions
Dhar et al. Audio watermarking in transform domain based on singular value decomposition and Cartesian-polar transformation
Chen et al. Perceptual audio hashing algorithm based on Zernike moment and maximum-likelihood watermark detection
Dhar A blind audio watermarking method based on lifting wavelet transform and QR decomposition
Li et al. Audio-lossless robust watermarking against desynchronization attacks
Huang et al. A reversible acoustic steganography for integrity verification
Huang et al. A new approach of reversible acoustic steganography for tampering detection
CN104091104B (en) Multi-format audio perceives the characteristics extraction of Hash certification and authentication method
Wang et al. Tampering Detection Scheme for Speech Signals using Formant Enhancement based Watermarking.
CN101609675B (en) Fragile audio frequency watermark method based on mass center
Su et al. Window switching strategy based semi-fragile watermarking for MP3 tamper detection
Li et al. Music content authentication based on beat segmentation and fuzzy classification
Zmudzinski et al. Perception-based audio authentication watermarking in the time-frequency domain
Li et al. Content based JPEG fragmentation point detection
Zhang et al. An encrypted speech authentication method based on uniform subband spectrumvariance and perceptual hashing
Karnjana et al. Audio Watermarking Scheme Based on Singular Spectrum Analysis and Psychoacoustic Model with Self‐Synchronization
Masmoudi et al. MP3 Audio watermarking using calibrated side information features for tamper detection and localization
Xue et al. Towards content-based audio fragment authentication
Zmudzinski et al. Digital audio authentication by robust feature embedding
Jiao et al. Key-dependent compressed domain audio hashing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140416