CN112364386A - Audio tampering detection and recovery method combining compressed sensing and DWT - Google Patents

Audio tampering detection and recovery method combining compressed sensing and DWT

Info

Publication number
CN112364386A
CN112364386A (application CN202011132924.9A)
Authority
CN
China
Prior art keywords
formula
tampered
information
audio
watermark information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011132924.9A
Other languages
Chinese (zh)
Other versions
CN112364386B (en)
Inventor
胡洋霞
魏建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011132924.9A priority Critical patent/CN112364386B/en
Publication of CN112364386A publication Critical patent/CN112364386A/en
Application granted granted Critical
Publication of CN112364386B publication Critical patent/CN112364386B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/64Protecting data integrity, e.g. using checksums, certificates or signatures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
    • G06F21/16Program or content traceability, e.g. by watermarking

Abstract

The invention discloses an audio tampering detection and recovery method combining compressed sensing and DWT, which comprises the following steps: the original audio is framed; the DCT coefficients of each frame are extracted; the compressed coefficients of each frame are connected; a linear transformation yields the unquantized reference values; a matrix A is calculated and the floating-point watermark information is converted to integer form to obtain the watermark information to be embedded; the watermark information is embedded into the high-frequency region of the discrete wavelet transform of the original audio; the watermark information is extracted, it is judged whether the audio has been tampered with, and the tampered area is located; the watermark information of the tampered area is discarded and the effective watermark information of the untampered area is extracted to obtain an approximation of the quantized reference values; the compressed information of the damaged area is obtained and decompressed, an inverse discrete wavelet transform is performed, and the undamaged area is connected to obtain the recovered speech signal. The invention improves the peak signal-to-noise ratio of the embedded watermark information and the intelligibility of the audio after tamper recovery; the calculation process is simpler and tamper localization is more accurate.

Description

Audio tampering detection and recovery method combining compressed sensing and DWT
Technical Field
The invention belongs to the field of multimedia and signal processing, relates in particular to the problem of authenticating and protecting the integrity of speech signals, and more particularly relates to an audio tampering detection and recovery method combining compressed sensing and DWT.
Background
For the problems of integrity authentication and protection of multimedia signals, many watermark-based methods exist, but most of them are designed for images or video; their effect on speech is not ideal, and the peak signal-to-noise ratio after recovery still needs to be improved. Because the human auditory perception system is more sensitive than the visual perception system, the signal into which the watermark is embedded must have better imperceptibility, and at the same time the recovered speech should be easier for listeners to understand.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides an audio tampering detection and recovery method combining compressed sensing and DWT. The method improves the peak signal-to-noise ratio of the embedded watermark information and the intelligibility of the audio after tamper recovery; the watermark information is generated by calculating quantized reference values, so the calculation process is simpler and tamper localization is more accurate.
The purpose of the invention is realized by the following technical scheme.
The invention discloses an audio tampering detection and recovery method combining compressed sensing and DWT, which comprises the following processes:
firstly, framing original audio;
step two, extracting DCT coefficients of each frame of the original audio;
step three, connecting the compressed coefficients of each frame to obtain formula (1);
[Formula (1) appears only as an image in the original and is not reproduced here.]
wherein the quantities connected in formula (1) are the compressed coefficients of each frame of the original signal, and v is the vector obtained by connecting the coefficients of each frame after grouping and rearrangement;
step four, performing linear transformation on the vector in formula (1), and calculating according to formula (2) to obtain the unquantized reference values;
[Formula (2) appears only as an image in the original and is not reproduced here.]
wherein r is an unquantized reference value, k is the number of the reference values, and the dimension of the matrix A is determined according to the random number seed and the number of the groups;
step five, calculating a matrix A according to a formula (3), and changing floating-point type watermark information into integer type according to a formula (4) and a formula (5) to obtain watermark information to be embedded;
[Formula (3) appears only as an image in the original and is not reproduced here.]
wherein A0 is generated from a random number seed, A(i, j) and A0(i, j) are the elements of matrices A and A0 respectively, and each compressed frame group contains n × m elements;
[Formula (4) appears only as an image in the original and is not reproduced here.]
wherein
f(t) = (q / Rmax) · t    (5)
the integer information quantized by formulas (4) and (5) is used as the watermark information, where Rmax is the maximum value after quantization, the function value corresponding to that maximum is the coefficient value of the sampling point at the corresponding position in the audio signal, and q is a quantization parameter (the symbols for the quantized integer information and this function value appear only as images in the original);
step six, according to an adaptive embedding algorithm, embedding the watermark information into the high-frequency region of the discrete wavelet transform of the original audio to complete the watermark embedding, wherein w is the watermark information and α is the embedding strength;
step seven, extracting the watermark information by the reverse of the embedding process; for Speech-type audio, if the number of tampered frames in a group does not affect the understanding of the semantics, the audio is judged not to be tampered, so a judgment threshold δ is set according to the speaking rate and the chosen frame duration, the number of mismatching frames in each group is compared with δ to decide whether the speech information has been tampered with, and the tampered area is located; for other types of audio, any tampering within a frame group affects the auditory effect, so the tampered area is located directly; the locating process is shown in formula (6);
[Formula (6) appears only as an image in the original and is not reproduced here.]
wherein p(i, j) represents the ith frame of the jth group, m is the number of frames, n is the number of groups, w′(i, j) is the extracted watermark, and w(i, j) is the generated watermark; for Speech-type audio, if the number of damaged frames in the jth group is smaller than the judgment threshold δ, the next group is examined, and if it is larger than δ, the group is judged to be a tampered group and the tampered area is located according to the value of i; for other types of audio the tampered area is located directly from the values of i and j;
step eight, discarding the watermark information of the tampered area, extracting effective watermark information of the area which is not tampered, and obtaining an approximate value of the quantized reference value according to a formula (7) and a formula (8);
[Formula (7) appears only as an image in the original and is not reproduced here.]
wherein the sequence α1, α2, ..., αM consists of the reference values extracted from the uncorrupted frames, and A(E) is the matrix obtained from A after deleting the rows corresponding to the reference values of the damaged area;
[Formula (8) appears only as an image in the original and is not reproduced here.]
wherein vR corresponds to the uncorrupted information in the compressed vector v and vT corresponds to the information of the damaged area in v; A(E,R) and A(E,T) are the parts of A(E) corresponding to vR and vT, respectively;
step nine, according to formula (4), the embedding side embeds quantized reference values into the original signal, so the reference values extracted by the extracting side are all quantized results rather than the sequence α1, α2, ..., αM; the quantized values are therefore processed to obtain an approximation of the sequence α1, α2, ..., αM, calculated according to formulas (9), (10) and (11);
[Formula (9) appears only as an image in the original and is not reproduced here.]
wherein formula (9) is the inverse of formula (4); Rmax, fx and the quantized value (whose symbol appears only as an image in the original) are obtained from formulas (4) and (5);
[Formulas (10) and (11) appear only as images in the original and are not reproduced here.]
wherein the vector is obtained by calculating r′(α1), r′(α2), ..., r′(αM), and α′1, α′2, ..., α′M are the processed extracted reference values, which can be taken as approximations of the original, unquantized reference values;
step ten, obtaining the compressed information of the damaged area according to the formula (12) and the formula (13), decompressing, performing inverse discrete wavelet transform on the compressed information, and connecting the undamaged area to obtain a recovered voice signal;
[Formula (12) appears only as an image in the original and is not reproduced here.]
the approximation of the information at the tampered location is written as:
[Formula (13) appears only as an image in the original and is not reproduced here.]
wherein formulas (12) and (13) constitute the decompression process: S1, S2, ..., SM are obtained from formula (12), the compressed quantity vT is obtained from formula (13), vT is decompressed to obtain the restored signal sequence, and the restored signal is finally obtained by inverse discrete wavelet transform and splicing of the signal sequences.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
the method is based on the compressed sensing model and the DWT, so that the peak signal-to-noise ratio of the embedded watermark information and the intelligibility of the audio frequency after the tampering recovery are improved, the watermark information is generated by calculating the quantized reference value, the calculation process is simpler, and the tampering positioning is more accurate; the sensitivity of the auditory system and the characteristics of fragile watermarks are fully considered, and the method has better practicability.
Drawings
Fig. 1 shows the watermark embedding process,
wherein diagram (a) is the overall embedding process and diagram (b) is a schematic of the adaptive embedding process;
Fig. 2 shows the tamper detection, localization and recovery process;
Fig. 3 is an example of spectrograms before and after watermark embedding,
wherein graph (a) is the spectrogram of the original audio and graph (b) is the spectrogram of the watermarked audio;
Fig. 4 is an example of tamper detection and repair, wherein
graph (a) is the spectrogram of the watermarked original audio, graph (b) is the spectrogram of the audio with 20% of its content replaced,
and graph (c) is the spectrogram of the restored audio.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The audio tampering detection and recovery method of the invention, combining compressed sensing and DWT, adopts the principle of compressed sensing: the Discrete Cosine Transform (DCT) coefficients of the original audio information are extracted, reference values are formed by calculation, and the quantized reference values are used as the watermark information, which is embedded into the high-frequency region of the discrete wavelet transform of the original audio. This increases the embedding capacity and improves the peak signal-to-noise ratio of the watermark information while keeping the watermark fragile, locates the tampered area more accurately, and makes the recovered speech closer to the original audio. The specific technical scheme is as follows:
step one, framing an original audio.
And step two, extracting the DCT coefficient of each frame of the original audio.
And step three, connecting the compressed coefficients of each frame to obtain formula (1).
[Formula (1) appears only as an image in the original and is not reproduced here.]
Wherein v is the vector obtained by connecting the compressed coefficients of each frame after grouping and rearrangement.
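As a rough illustration of steps one to three, a minimal Python sketch (not part of the patent text) is given below. The 200-sample non-overlapping frames and 20-frame groups follow the example later in the description; the number of retained DCT coefficients (`keep`) and the simple row-wise concatenation order are assumptions.

import numpy as np
from scipy.fft import dct

def frames_of(signal, frame_len=200):
    """Split the audio into non-overlapping frames of frame_len samples."""
    n_frames = len(signal) // frame_len
    return np.asarray(signal, dtype=float)[:n_frames * frame_len].reshape(n_frames, frame_len)

def compressed_coefficients(frames, keep=50):
    """Per-frame DCT, keeping only the first `keep` coefficients as the
    compressed representation (the retained count is an assumption)."""
    coeffs = dct(frames, type=2, norm='ortho', axis=1)
    return coeffs[:, :keep]

def concat_group(group_coeffs):
    """Formula (1): connect the compressed coefficients of the frames in one group into one vector v."""
    return np.asarray(group_coeffs).reshape(-1)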
And step four, performing linear transformation on the vector in the formula (1), and calculating according to the formula (2) to obtain an unquantized reference value.
[Formula (2) appears only as an image in the original and is not reproduced here.]
Where r is an unquantized reference value, k is the number of reference values, and the dimension of matrix a is determined by the random number seed and the number of groups.
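A minimal sketch of step four, assuming from the description that the unquantized reference values are linear, compressed-sensing style measurements r = A v, with A0 generated from the shared random number seed. The Gaussian entries and the normalisation standing in for formula (3), which appears only as an image, are assumptions rather than the patented rule.

import numpy as np

def measurement_matrix(seed, k, group_len):
    """A0 from the shared seed (same on embedder and extractor); the scaling is a stand-in for formula (3)."""
    rng = np.random.default_rng(seed)
    A0 = rng.standard_normal((k, group_len))   # k reference values per frame group
    return A0 / np.sqrt(group_len)

def reference_values(A, v):
    """Assumed form of formula (2): unquantized reference values r = A v."""
    return A @ v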
And step five, calculating the matrix A according to the formula (3), and changing the floating-point watermark information into integer according to the formulas (4) and (5) to obtain the watermark information to be embedded.
[Formula (3) appears only as an image in the original and is not reproduced here.]
Wherein A0 is generated from a random number seed; A(i, j) and A0(i, j) are the elements of matrices A and A0, respectively, and each compressed frame group contains n × m elements.
[Formula (4) appears only as an image in the original and is not reproduced here.]
Wherein
f(t) = (q / Rmax) · t    (5)
The integer information quantized by formulas (4) and (5) is used as the watermark information, where Rmax is the maximum value after quantization, the function value corresponding to that maximum is the coefficient value of the sampling point at the corresponding position in the audio signal, and q is a quantization parameter (the symbols for the quantized integer information and this function value appear only as images in the original).
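The quantisation of step five can be sketched as follows. Only formula (5), f(t) = (q/Rmax)·t, is given in the text, so the rounding assumed here for formula (4) and the inverse scaling used later in step nine (formula (9)) are assumptions.

import numpy as np

def quantize(r, q, r_max):
    """Scale by q / r_max as in formula (5), then round to integers (our assumption
    for formula (4)) to obtain the integer watermark information."""
    return np.rint(q / r_max * np.asarray(r, dtype=float)).astype(int)

def dequantize(w, q, r_max):
    """Assumed inverse used in step nine: undo the scaling to approximate the
    unquantized reference values."""
    return r_max / q * np.asarray(w, dtype=float)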
Step six, according to the adaptive embedding algorithm shown in Fig. 1(b), the watermark information is embedded into the high-frequency region of the discrete wavelet transform of the original audio, completing the watermark embedding, where w is the watermark information and α is the embedding strength.
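A sketch of step six under the assumption of a one-level 'db1' DWT per frame and a plain additive rule cD + α·w in the detail (high-frequency) band. The actual adaptive embedding algorithm of Fig. 1(b) is not reproduced in the text, so this is a stand-in, not the patented embedding rule.

import numpy as np
import pywt

def embed_frame(frame, w_values, alpha=0.01, wavelet='db1'):
    """Embed watermark values into the high-frequency DWT band of one frame."""
    cA, cD = pywt.dwt(frame, wavelet)                      # low- and high-frequency coefficients
    w = np.resize(np.asarray(w_values, dtype=float), cD.shape)
    cD_marked = cD + alpha * w                             # additive embedding in the detail band
    return pywt.idwt(cA, cD_marked, wavelet)               # watermarked frame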
Step seven, the watermark information is extracted by reversing the embedding process. For Speech-type audio, if the number of tampered frames in a group does not affect the understanding of the semantics, the audio is judged not to be tampered; a judgment threshold δ is therefore set according to the speaking rate and the chosen frame duration, the number of mismatching frames in each group is compared with δ to decide whether the speech information has been tampered with, and the tampered area is located. For other types of audio (Pop, Jazz, Rock, Blues, Classic), any tampering within a frame group affects the auditory effect, so the tampered area is located directly. The locating process is shown in formula (6):
[Formula (6) appears only as an image in the original and is not reproduced here.]
Where p(i, j) denotes the ith frame of the jth group, m is the number of frames, n is the number of groups, w′(i, j) is the extracted watermark, and w(i, j) is the watermark generated as in Fig. 2. For Speech-type audio, if the number of damaged frames in the jth group is smaller than the judgment threshold δ, the next group is examined; if it is larger than δ, the group is judged to be a tampered group and the tampered area is located according to the value of i. Other types of audio locate the tampered area directly from the values of i and j.
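A sketch of the group-wise tamper localization of step seven. Formula (6) appears only as an image, so the comparison below (count of frames whose extracted watermark differs from the regenerated one, thresholded by δ for Speech) follows the surrounding description rather than the exact formula.

import numpy as np

def locate_tampered_groups(w_extracted, w_generated, delta, is_speech=True):
    """w_* have shape (n_groups, frames_per_group); returns indices of tampered groups."""
    damaged = (w_extracted != w_generated)        # p(i, j): frame-level mismatch
    per_group = damaged.sum(axis=1)               # number of damaged frames in each group
    if is_speech:
        return np.where(per_group > delta)[0]     # tolerate damage below the threshold
    return np.where(per_group > 0)[0]             # other genres: any mismatch counts as tampering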
And step eight, discarding the watermark information of the tampered area, extracting effective watermark information of the area which is not tampered, and obtaining an approximate value of the quantized reference value according to the formula (7) and the formula (8).
[Formula (7) appears only as an image in the original and is not reproduced here.]
Wherein the sequence α1, α2, ..., αM consists of the reference values extracted from the uncorrupted frames, and A(E) is the matrix obtained from A after deleting the rows corresponding to the reference values of the damaged area.
[Formula (8) appears only as an image in the original and is not reproduced here.]
Wherein vR corresponds to the uncorrupted information in the compressed vector v and vT corresponds to the information of the damaged area in v; A(E,R) and A(E,T) are the parts of A(E) corresponding to vR and vT, respectively.
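Steps eight and nine leave a linear system relating the surviving (de-quantized) reference values to the unknown compressed coefficients of the damaged frames. Formula (8) appears only as an image, so a least-squares solve is used here as a stand-in, under the assumption α′ ≈ A(E,R)·vR + A(E,T)·vT implied by the definitions above.

import numpy as np

def recover_damaged_coefficients(alpha_prime, A_E, damaged_cols, v_R):
    """alpha_prime: de-quantized reference values from undamaged frames;
    A_E: A with the rows of the damaged reference values deleted;
    damaged_cols: column indices of A belonging to the damaged frames;
    v_R: compressed coefficients recomputed from the undamaged frames."""
    all_cols = np.arange(A_E.shape[1])
    kept_cols = np.setdiff1d(all_cols, damaged_cols)
    A_ER, A_ET = A_E[:, kept_cols], A_E[:, damaged_cols]
    rhs = alpha_prime - A_ER @ v_R                       # remove the known contribution
    v_T, *_ = np.linalg.lstsq(A_ET, rhs, rcond=None)     # stand-in for formula (8)
    return v_T                                           # approximate coefficients of the damaged area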
Step nine, according to formula (4), the embedding side embeds the quantized reference values into the original signal, so the reference values extracted by the extracting side are all quantized results rather than the sequence α1, α2, ..., αM. After the quantized values are processed, an approximation of the sequence α1, α2, ..., αM can be obtained, calculated according to formulas (9), (10) and (11);
[Formula (9) appears only as an image in the original and is not reproduced here.]
Wherein formula (9) is the inverse of formula (4); Rmax, fx and the quantized value (whose symbol appears only as an image in the original) are obtained from formulas (4) and (5).
[Formulas (10) and (11) appear only as images in the original and are not reproduced here.]
Wherein the vector is obtained by calculating r′(α1), r′(α2), ..., r′(αM), and α′1, α′2, ..., α′M are the processed extracted reference values, which can be taken as approximations of the original, unquantized reference values.
Step ten, obtaining the compressed information of the damaged area according to the formula (12) and the formula (13), decompressing, performing inverse discrete wavelet transform on the compressed information, and connecting the undamaged area to obtain a recovered voice signal;
[Formula (12) appears only as an image in the original and is not reproduced here.]
The approximation of the information at the tampered location is written as:
[Formula (13) appears only as an image in the original and is not reproduced here.]
Wherein formulas (12) and (13) constitute the decompression process: S1, S2, ..., SM are obtained from formula (12), the compressed quantity vT is obtained from formula (13), vT is decompressed to obtain the restored signal sequence, and the restored signal is finally obtained by inverse discrete wavelet transform and splicing of the signal sequences.
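A sketch of step ten under the same assumption as above, namely that the per-frame compression kept the first `keep` DCT coefficients; decompression is then zero-padding followed by an inverse DCT, and the recovered frames are spliced back between the undamaged ones. The per-frame inverse DWT that reassembles the low- and high-frequency bands in the full method is omitted here for brevity.

import numpy as np
from scipy.fft import idct

def decompress_frames(v_T, n_damaged, keep, frame_len=200):
    """Stand-in for the decompression of formulas (12)-(13): re-pad and invert the DCT."""
    coeffs = np.zeros((n_damaged, frame_len))
    coeffs[:, :keep] = np.asarray(v_T, dtype=float).reshape(n_damaged, keep)
    return idct(coeffs, type=2, norm='ortho', axis=1)

def splice(received_frames, restored_frames, damaged_idx):
    """Replace only the damaged frames and flatten back into one speech signal."""
    out = received_frames.copy()
    out[damaged_idx] = restored_frames
    return out.reshape(-1)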
Example:
as shown in fig. 1, the first step of the present invention is to divide the original audio into frames, in the experiment, the sampling frequency is 8000Hz, the sampling precision is 16bits, 200 adjacent sampling points are divided into one frame, there is no overlap between frames, and each frame group contains 20 frames by taking the frame group as a unit. DWT is carried out on each frame of audio to obtain a high-frequency wavelet coefficient and a low-frequency wavelet coefficient, each frame is compressed, a DCT coefficient of each frame is extracted in an experiment, and a quantized reference value is calculated according to the formulas (1) - (5) to obtain watermark information. According to the illustration of fig. 1, watermark information is embedded in a high frequency region, the embedding strength α is 0.01, and IDWT is performed simultaneously with a low frequency region to obtain an audio signal containing a watermark.
Analyzing the watermarked audio signal with the method shows, as in Fig. 3, that the spectrograms before and after embedding are essentially the same. Subjective auditory identification experiments were performed and the signal-to-noise ratio was calculated: the identification accuracy ranged from 47% to 54%, close to the 50% chance level, and the average signal-to-noise ratio was 45.76 dB, at least 5 dB higher than other algorithms.
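For reference, a small helper for the SNR figures quoted above, assuming the usual definition of signal-to-noise ratio between the original and the watermarked (or recovered) audio; it is not taken from the patent itself.

import numpy as np

def snr_db(original, processed):
    """SNR in dB between a reference signal and its processed version."""
    x = np.asarray(original, dtype=float)
    noise = x - np.asarray(processed, dtype=float)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))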
As shown in Fig. 2, the watermark is extracted and tamper detection, localization and recovery are performed. The received audio signal is first framed and processed as in embedding; DWT then yields the low-frequency and high-frequency coefficients, the watermark is extracted as shown in Fig. 2, and an XOR operation with the watermark generated during embedding is performed as in formula (6). The sum of the XOR values is compared with a detection threshold (20 in the experiment) to judge whether the Speech has been tampered with, and the tampered frames are located in units of groups. The tampered area is then recovered, again in units of frame groups: if no speech frame in a group is damaged, the method skips to the next group. Suppose each group contains m frames, each frame contains n sampling points, and k is the number of reference values generated from the random number seed; if z speech frames in a group are damaged, the reference values in the damaged frames are discarded, and the number of reference values that can still be extracted from the group is given by the expression shown only as an image in the original. vR is obtained from formulas (7) and (8); since the extracted reference values have been quantized, they are processed according to formulas (9), (10) and (11), and the resulting reference value sequence α′1, α′2, ..., α′M is approximately equal to the original reference values. vT is then obtained from formulas (12) and (13) and decompressed, the information of the damaged area is selected, IDWT is performed, and the undamaged speech is connected to obtain the recovered speech.
Using the method, 20% of the content of a piece of audio was replaced and the audio was then detected and restored; as shown in Fig. 4, the restored spectrogram is close to the undamaged spectrogram. Five sentences from the CASIA-863 database were selected and 100 random destruction experiments were carried out: after restoration with the method, the content of the damaged area could be clearly heard in 98% of the sentences, with an average SNR of 23.9 dB; when the destruction rate is as high as 50%, the average SNR is 10.32 dB and 80% of the destroyed content of a sentence remains intelligible after restoration. The SNR of the recovered speech was calculated in the experiments and found to decrease as the destruction rate increases, but it is on average 3-4 dB higher than that of other methods.
While the present invention has been described in terms of its functions and operations with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise functions and operations described above, and that the above-described embodiments are illustrative rather than restrictive, and that various changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention as defined by the appended claims.

Claims (1)

1. An audio tamper detection and recovery method combining compressed sensing and DWT, characterized by comprising the following processes:
firstly, framing original audio;
step two, extracting DCT coefficients of each frame of the original audio;
step three, connecting the compressed coefficients of each frame to obtain formula (1);
[Formula (1) appears only as an image in the original and is not reproduced here.]
wherein the quantities connected in formula (1) are the compressed coefficients of each frame of the original signal, and v is the vector obtained by connecting the coefficients of each frame after grouping and rearrangement;
step four, performing linear transformation on the vector in formula (1), and calculating according to formula (2) to obtain the unquantized reference values;
[Formula (2) appears only as an image in the original and is not reproduced here.]
wherein r is an unquantized reference value, k is the number of the reference values, and the dimension of the matrix A is determined according to the random number seed and the number of the groups;
step five, calculating a matrix A according to a formula (3), and changing floating-point type watermark information into integer type according to a formula (4) and a formula (5) to obtain watermark information to be embedded;
[Formula (3) appears only as an image in the original and is not reproduced here.]
wherein A0 is generated from a random number seed, A(i, j) and A0(i, j) are the elements of matrices A and A0 respectively, and each compressed frame group contains n × m elements;
[Formula (4) appears only as an image in the original and is not reproduced here.]
wherein
f(t) = (q / Rmax) · t    (5)
the integer information quantized by formulas (4) and (5) is used as the watermark information, where Rmax is the maximum value after quantization, the function value corresponding to that maximum is the coefficient value of the sampling point at the corresponding position in the audio signal, and q is a quantization parameter (the symbols for the quantized integer information and this function value appear only as images in the original);
step six, according to an adaptive embedding algorithm, embedding the watermark information into the high-frequency region of the discrete wavelet transform of the original audio to complete the watermark embedding, wherein w is the watermark information and α is the embedding strength;
step seven, extracting the watermark information by the reverse of the embedding process; for Speech-type audio, if the number of tampered frames in a group does not affect the understanding of the semantics, the audio is judged not to be tampered, so a judgment threshold δ is set according to the speaking rate and the chosen frame duration, the number of mismatching frames in each group is compared with δ to decide whether the speech information has been tampered with, and the tampered area is located; for other types of audio, any tampering within a frame group affects the auditory effect, so the tampered area is located directly; the locating process is shown in formula (6);
[Formula (6) appears only as an image in the original and is not reproduced here.]
wherein p(i, j) represents the ith frame of the jth group, m is the number of frames, n is the number of groups, w′(i, j) is the extracted watermark, and w(i, j) is the generated watermark; for Speech-type audio, if the number of damaged frames in the jth group is smaller than the judgment threshold δ, the next group is examined, and if it is larger than δ, the group is judged to be a tampered group and the tampered area is located according to the value of i; for other types of audio the tampered area is located directly from the values of i and j;
step eight, discarding the watermark information of the tampered area, extracting effective watermark information of the area which is not tampered, and obtaining an approximate value of the quantized reference value according to a formula (7) and a formula (8);
[Formula (7) appears only as an image in the original and is not reproduced here.]
wherein the sequence α1, α2, ..., αM consists of the reference values extracted from the uncorrupted frames, and A(E) is the matrix obtained from A after deleting the rows corresponding to the reference values of the damaged area;
[Formula (8) appears only as an image in the original and is not reproduced here.]
wherein vR corresponds to the uncorrupted information in the compressed vector v and vT corresponds to the information of the damaged area in v; A(E,R) and A(E,T) are the parts of A(E) corresponding to vR and vT, respectively;
step nine, according to formula (4), the embedding side embeds the quantized reference values into the original signal, so the reference values extracted by the extracting side are all quantized results rather than the sequence α1, α2, ..., αM; the quantized values are therefore processed to obtain an approximation of the sequence α1, α2, ..., αM, calculated according to formulas (9), (10) and (11);
[Formula (9) appears only as an image in the original and is not reproduced here.]
wherein formula (9) is the inverse of formula (4); Rmax, fx and the quantized value (whose symbol appears only as an image in the original) are obtained from formulas (4) and (5);
[Formulas (10) and (11) appear only as images in the original and are not reproduced here.]
wherein the vector is obtained by calculating r′(α1), r′(α2), ..., r′(αM), and α′1, α′2, ..., α′M are the processed extracted reference values, which can be taken as approximations of the original, unquantized reference values;
step ten, obtaining the compressed information of the damaged area according to the formula (12) and the formula (13), decompressing, performing inverse discrete wavelet transform on the compressed information, and connecting the undamaged area to obtain a recovered voice signal;
[Formula (12) appears only as an image in the original and is not reproduced here.]
the approximation of the information at the tampered location is written as:
[Formula (13) appears only as an image in the original and is not reproduced here.]
wherein formulas (12) and (13) constitute the decompression process: S1, S2, ..., SM are obtained from formula (12), the compressed quantity vT is obtained from formula (13), vT is decompressed to obtain the restored signal sequence, and the restored signal is finally obtained by inverse discrete wavelet transform and splicing of the signal sequences.
CN202011132924.9A 2020-10-21 2020-10-21 Audio tampering detection and recovery method combining compressed sensing and DWT Active CN112364386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011132924.9A CN112364386B (en) 2020-10-21 2020-10-21 Audio tampering detection and recovery method combining compressed sensing and DWT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011132924.9A CN112364386B (en) 2020-10-21 2020-10-21 Audio tampering detection and recovery method combining compressed sensing and DWT

Publications (2)

Publication Number Publication Date
CN112364386A true CN112364386A (en) 2021-02-12
CN112364386B CN112364386B (en) 2022-04-26

Family

ID=74511437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011132924.9A Active CN112364386B (en) 2020-10-21 2020-10-21 Audio tampering detection and recovery method combining compressed sensing and DWT

Country Status (1)

Country Link
CN (1) CN112364386B (en)


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090125310A1 (en) * 2006-06-21 2009-05-14 Seungjae Lee Apparatus and method for inserting/extracting capturing resistant audio watermark based on discrete wavelet transform, audio rights protection system using the same
CN102142258A (en) * 2011-03-31 2011-08-03 上海第二工业大学 Wavelet transform and Arnold based adaptive gray-scale watermark embedded method
CN102419979A (en) * 2011-11-23 2012-04-18 北京邮电大学 Audio semi-fragile watermarking algorithm for realizing precise positioning of altered area
CN103050120A (en) * 2012-12-28 2013-04-17 暨南大学 High-capacity digital audio reversible watermark processing method
CN104795071A (en) * 2015-04-18 2015-07-22 广东石油化工学院 Blind audio watermark embedding and watermark extraction processing method
CN106531176A (en) * 2016-10-27 2017-03-22 天津大学 Digital watermarking algorithm of audio signal tampering detection and recovery
CN106504757A (en) * 2016-11-09 2017-03-15 天津大学 A kind of adaptive audio blind watermark method based on auditory model
CN109119086A (en) * 2017-06-24 2019-01-01 天津大学 A kind of breakable watermark voice self-restoring technology of multilayer least significant bit
CN107993669A (en) * 2017-11-20 2018-05-04 西南交通大学 Voice content certification and tamper recovery method based on modification least significant digit weight
CN108198563A (en) * 2017-12-14 2018-06-22 安徽新华传媒股份有限公司 A kind of Multifunctional audio guard method of digital copyright protection and content authentication
CN110211016A (en) * 2018-02-28 2019-09-06 佛山科学技术学院 A kind of watermark embedding method based on convolution feature
CN110010142A (en) * 2019-03-28 2019-07-12 武汉大学 A kind of method of large capacity audio information hiding
CN111091841A (en) * 2019-12-12 2020-05-01 天津大学 Identity authentication audio watermarking algorithm based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
N. Vinay Kumar et al.: "Invisible watermarking in printed images", ResearchGate *
刘海燕 et al.: "Research on a fragile audio watermarking algorithm in the DWT domain" (基于DWT域的脆弱性音频水印算法研究), 《电子制作》 *
李建锋 et al.: "An audio watermarking algorithm based on DWT" (一种基于DWT的音频水印算法), 《电脑开发与应用》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421592A (en) * 2021-08-25 2021-09-21 中国科学院自动化研究所 Method and device for detecting tampered audio and storage medium

Also Published As

Publication number Publication date
CN112364386B (en) 2022-04-26

Similar Documents

Publication Publication Date Title
Zhang et al. Robust Reversible Audio Watermarking Scheme for Telemedicine and Privacy Protection.
CN108648748B (en) Acoustic event detection method under hospital noise environment
CN111564163B (en) RNN-based multiple fake operation voice detection method
CN106531176B (en) The digital watermarking algorithm of audio signal tampering detection and recovery
CN112364386B (en) Audio tampering detection and recovery method combining compressed sensing and DWT
CN110968845A (en) Detection method for LSB steganography based on convolutional neural network generation
CN102063907A (en) Steganalysis method for audio spread-spectrum steganography
CN102664013A (en) Audio digital watermark method of discrete cosine transform domain based on energy selection
Luo et al. Audio postprocessing detection based on amplitude cooccurrence vector feature
CN114842034B (en) Picture true and false detection method based on amplified fuzzy operation trace
Rangding et al. Digital audio watermarking algorithm based on linear predictive coding in wavelet domain
Thanki Advanced techniques for audio watermarking
CN105895109A (en) Digital voice evidence collection and tamper recovery method based on DWT (Discrete Wavelet Transform) and DCT (Discrete Cosine Transform)
Li et al. Adaptive audio watermarking algorithm based on SNR in wavelet domain
CN108877816B (en) QMDCT coefficient-based AAC audio frequency recompression detection method
CN114171057A (en) Transformer event detection method and system based on voiceprint
Chetan et al. Audio watermarking using modified least significant bit technique
Wu et al. Adaptive audio watermarking based on SNR in localized regions
Hubballi et al. Novel DCT based watermarking scheme for digital images
Li et al. Spread-spectrum audio watermark robust against pitch-scale modification
Jalil et al. An Efficient Tamper Detection and Recovery Scheme for Attacked Speech Signal
Yang et al. An audio watermarking based on discrete cosine transform and complex cepstrum transform
CN114548221B (en) Method and system for enhancing generated data of small sample unbalanced voice database
Lan et al. A digital watermarking algorithm based on dual-tree complex wavelet transform
Deng Blind watermarking algorithm based on redistributed invariant integer wavelet transform and BP network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant