CN113704829B - Method for preventing sound image file from being tampered - Google Patents
- Publication number: CN113704829B
- Application number: CN202110533209.4A
- Authority
- CN
- China
- Prior art keywords
- frame
- text
- image file
- sound image
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F21/64 — Protecting data integrity, e.g. using checksums, certificates or signatures
- G06F21/6218 — Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. a local or distributed file system or database
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08 — Learning methods
- G10L15/26 — Speech to text systems
- Y02T10/40 — Engine management systems
Abstract
The invention relates to the technical field of deep learning, and in particular to a tamper-proofing method for sound image (audio-visual) files, comprising five steps designed to prevent tampering with sound image files. The method, based on deep learning technology, effectively handles the strong association between the video and audio components of a sound image file; in addition, a core frame extraction technique effectively improves the efficiency of video feature extraction; finally, the sound image file features are solidified on a blockchain, effectively preventing sound image file tampering events.
Description
Technical Field
The invention relates to the technical field of deep learning, in particular to a tamper-proof method for an audio-video file.
Background
With the growing trend toward digital file management, the tamper resistance of sound image files has received increasing attention. Existing tamper-proofing techniques typically separate a sound image file into an independent video file and an independent audio file and apply a separate tamper-proofing technique to each.
However, this separation ignores the strong correlation between the video and audio components of the file. The present invention therefore provides a tamper-proofing method for sound image files that addresses this shortcoming of the background art.
Disclosure of Invention
The invention aims to provide a tamper-proofing method for sound image files that solves the problems described in the background section.
In order to achieve the above purpose, the present invention provides the following technical solutions: a tamper-proof method for an audio-visual file comprises the following steps:
Step 1: acquire a sound image file dataset F_t, and divide each sound image file, according to whether it contains audio data, into a video set V1 with an associated audio set A1, and a video set V2 (video only);
Step 2: extract the core frames of the video sets using a core frame extraction algorithm;
Step 3: extract video features and text features with a feature extraction network, combine the text features and video features into sound image file features, reconstruct the features with a decoding network, and construct a reconstruction loss function;
Step 4: formulate a reference sound image file set, obtain the reference sound image file features with the feature extraction network, quantize both the sound image file features and the reference features, construct a quantization loss function, combine it with the reconstruction loss function into a model joint loss function, and minimize the joint loss function until the model converges, finally extracting the sound image file quantization features;
Step 5: generate a key from the extracted sound image file quantization features with the MD5 hash algorithm and solidify it on a blockchain, effectively preventing sound image file tampering events.
As a further aspect of the present invention: in step 2, based on the video sets V1 and V2, frame image datasets F_1 = [f_1^1, f_1^2, ..., f_1^t] and F_2 = [f_2^1, f_2^2, ..., f_2^t] are acquired, where f_1^i, i ∈ [1, t], denotes the i-th frame image of video set V1 and f_2^i, i ∈ [1, t], denotes the i-th frame image of video set V2. To preserve the correlation between frame images while removing the temporal redundancy between them, the core frames of the frame image datasets are extracted as follows:
A1: the cumulative histogram of each frame image is calculated as follows:

h_i = (1/N) Σ_{k=0}^{i} n_k, i = 0, 1, ..., l − 1

where h_i denotes the cumulative histogram value at pixel value i, l is the number of pixel value levels of the image, N is the total number of image pixels, and n_k denotes the number of pixels with pixel value k;
A2: based on the cumulative histograms, the video file is segmented into scenes using the difference between adjacent frames and the difference between each frame and the initial frame of the scene:

d_j = τ_1 d_j′ + τ_2 d_j″

where d_j′ denotes the difference between the j-th and (j+1)-th frame images, d_j″ denotes the difference between the j-th frame image and the first frame of the scene, d_j denotes the scene boundary frame weight, and τ_1, τ_2 are weight coefficients;
A3: if the scene boundary frame weight d_j exceeds a specified threshold σ, the frame is defined as a boundary frame of a scene. Based on the segmented scene frame differences, the core frames of each scene are extracted dynamically:

n = R(μ · q), V_m = [v_m^1, v_m^2, ..., v_m^n]

where V_m denotes the core frame set of the m-th scene, v_m^i denotes the i-th core frame of the m-th scene, n denotes the number of scene core frames, R(·) denotes the round-up (ceiling) function, q denotes the total number of frames in the m-th scene, and μ denotes the frame weight of the scene. Finally the core frame set V_1^k of video set V1 and the core frame set V_2^k of video set V2 are obtained.
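The histogram and scene-segmentation procedure of steps A1–A3 can be sketched as follows. The L1 distance between cumulative histograms and the parameter values (τ_1 = τ_2 = 0.5, σ = 0.2) are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def cumulative_hist(frame, levels=256):
    # Step A1: h_i = (1/N) * sum_{k<=i} n_k, the normalized cumulative histogram.
    counts = np.bincount(frame.ravel(), minlength=levels)
    return np.cumsum(counts) / frame.size

def scene_boundaries(frames, tau1=0.5, tau2=0.5, sigma=0.2):
    """Steps A2-A3: weighted frame-difference scene segmentation.

    d_j = tau1 * |H_j - H_{j+1}| + tau2 * |H_j - H_scene_start|; a frame
    is a scene boundary when d_j exceeds sigma. The L1 histogram distance
    is an assumed choice of difference measure.
    """
    hists = [cumulative_hist(f) for f in frames]
    boundaries, first = [0], 0
    for j in range(len(frames) - 1):
        d_adj = np.abs(hists[j] - hists[j + 1]).sum()   # adjacent-frame difference
        d_init = np.abs(hists[j] - hists[first]).sum()  # difference to scene start
        if tau1 * d_adj + tau2 * d_init > sigma:
            boundaries.append(j + 1)
            first = j + 1
    return boundaries
```

From each segmented scene, R(μ·q) core frames would then be sampled per step A3.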
As a further aspect of the present invention: step 3 comprises the following specific steps:
B1: the spatial similarity within each image of the core frame sets is extracted with a convolutional neural network:

a^l = σ(W^l a^{l−1} + b^l)

where a^l is the output of the l-th network layer and W^l, b^l are the weight and bias parameters of the l-th layer of the network model;
B2: the spatial similarity between images in the core frame set is extracted based on the frame difference time sequence network, and the formula is as follows:
h′t-1=σ(Wdht-1+bd)
σt=f(Δti)*h′t-1
yt=σ(Wo·ht)
wherein h t-1 represents the image characteristic of the previous frame, h' t-1 represents partial image characteristic information influenced by a frame difference control gate, k t represents a frame difference input gate, and the influence of a frame difference interval on the image characteristic of the frame is controlled; f (·) is the frame difference function; sigma (·), tan h (·) represent activation functions; An output representing the image characteristics of the previous frame; x t denotes the current frame image feature, r t is a reset gate, indicating how much of the previous frame image feature information remains in the current frame; /(I) The state information of the current frame image is memorized, z t represents an update gate, the retention condition of the characteristic information of the current frame image is determined, h t represents the hidden output of the current frame image information, y t represents the characteristic output of the current frame image, and W r,/>W z,Wo,Wd,bd represents frame difference timing network parameters;
Finally, a set of video core frames And/>Obtaining low-dimensional characteristic representation/>, through a frame difference network
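The frame-difference timing network of step B2 is only partially specified in the text. The sketch below assumes a standard GRU-style cell augmented with an exponential frame-difference decay f(Δt) = exp(−Δt); the decay form, parameter shapes, and any gate wiring beyond the quoted equations are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_diff_gru_step(h_prev, x_t, dt, params):
    """One step of a GRU-style cell with a frame-difference gate (step B2).

    f(dt) = exp(-dt) is an assumed frame-difference function: larger frame
    intervals let less of the previous frame's feature carry over.
    """
    Wd, bd, Wr, Wz, Wh, Wo = (params[k] for k in ("Wd", "bd", "Wr", "Wz", "Wh", "Wo"))
    h_ctrl = sigmoid(Wd @ h_prev + bd)      # h'_{t-1}: gated previous-frame feature
    h_ctrl = np.exp(-dt) * h_ctrl           # k_t: frame-difference input gate
    hx = np.concatenate([h_ctrl, x_t])
    r = sigmoid(Wr @ hx)                    # reset gate r_t
    z = sigmoid(Wz @ hx)                    # update gate z_t
    h_cand = np.tanh(Wh @ np.concatenate([r * h_ctrl, x_t]))  # candidate state
    h_t = (1 - z) * h_ctrl + z * h_cand     # hidden output h_t
    y_t = sigmoid(Wo @ h_t)                 # feature output y_t
    return h_t, y_t
```

Iterating this step over a scene's core frames yields the low-dimensional video feature sequence.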
B3: for the audio set A1, convert it to speech text using existing speech recognition technology and preprocess the speech text according to the core frame extraction result of video set V1, obtaining the text set T_1^k corresponding to the core frame set V_1^k;
B4: for the text set T_1^k, extract the word vectors of the text to obtain its vector representation, then extract the correlations within the text with a recurrent neural network to obtain the feature representation of the text set:

f′_t = σ(W′_f [h″_{t−1}, x′_t] + b′_f)
i′_t = σ(W′_i [h″_{t−1}, x′_t] + b′_i)
C̃′_t = tanh(W′_c [h″_{t−1}, x′_t] + b′_c)
C′_t = f′_t * C′_{t−1} + i′_t * C̃′_t
o′_t = σ(W′_o [h″_{t−1}, x′_t] + b′_o)
h′_t = o′_t * tanh(C′_t)

where x′_t denotes the current text feature; σ(·) and tanh(·) denote activation functions; h″_{t−1} denotes the output of the previous text feature; f′_t denotes the recurrent neural network forget gate, which determines whether previous text feature information is retained for the current text; i′_t and C̃′_t form the input gate, which determines how much of the current text feature is input into the model; C′_t denotes the cell state; o′_t and h′_t denote the output gate and the current text feature output; and W′_f, W′_i, W′_c, W′_o and b′_f, b′_i, b′_c, b′_o are model parameters.
Finally, the text set T_1^k is passed through the recurrent neural network to obtain the low-dimensional text feature representation, denoted here t_1^k.
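The gate equations of step B4 match a standard LSTM cell; a minimal sketch follows, where the parameter packing (per-gate matrices in a dict) and dimensions are implementation choices, not fixed by the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One step of the step-B4 recurrent text network (standard LSTM)."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ hx + b["f"])        # forget gate f'_t
    i = sigmoid(W["i"] @ hx + b["i"])        # input gate i'_t
    c_tilde = np.tanh(W["c"] @ hx + b["c"])  # candidate cell state
    c = f * c_prev + i * c_tilde             # cell state C'_t
    o = sigmoid(W["o"] @ hx + b["o"])        # output gate o'_t
    h = o * np.tanh(c)                       # hidden output h'_t
    return h, c
```

Running this step over a core frame's word-vector sequence and taking the final hidden state would give that frame's low-dimensional text feature.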
B5: the low-dimensional video set features v_1^k and v_2^k obtained from the frame-difference timing network and the convolutional neural network are reconstructed:

r_1^k = h′(v_1^k; θ), r_2^k = h′(v_2^k; θ)

where h′(·) denotes the reconstruction network model (its layer formulas follow steps B1 and B2), θ denotes the reconstruction network parameters, and r_1^k, r_2^k denote the reconstructed video features;
B6: the text features t_1^k obtained from the recurrent neural network are reconstructed so that they approach the text set features T_1^k:

r_d^k = h″(t_1^k; η)

where h″(·) denotes the reconstruction network model (its formulas follow step B4), η denotes the reconstruction network parameters, and r_d^k denotes the reconstructed text features.
As a further aspect of the present invention: in step B5, corresponding loss functions are constructed from the reconstructed features:

l′_re = ||r_1^k − v_1^k||², l″_re = ||r_2^k − v_2^k||²

where l′_re and l″_re denote the reconstruction losses of video sets V1 and V2, and v_1^k, v_2^k denote the original core frame features of video sets V1 and V2, respectively.
As a further aspect of the present invention: step B6 also comprises constructing a corresponding reconstruction loss:

l′_d = ||r_d^k − t_1^k||²

where l′_d denotes the reconstruction loss of the text features.
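The reconstruction losses of steps B5 and B6 appear in the source only as missing formula images; assuming the usual squared-error form, l′_re, l″_re and l′_d can each be computed as:

```python
import numpy as np

def reconstruction_loss(original, reconstructed):
    """Mean squared error between an original feature vector and its
    decoder reconstruction -- the assumed form of l'_re, l''_re and l'_d."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return float(np.mean((original - reconstructed) ** 2))
```

A loss of zero means the decoding network reproduced the feature exactly; during training it pushes the encoder to keep information the decoder needs.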
As a further aspect of the present invention: step 4 comprises the following specific steps:
C1: based on the compressed model features, the sound image file feature representation is obtained:

f^k = [β_1 · v^k, β_2 · t_1^k], β_1 + β_2 = 1

where β_1, β_2 denote the feature weighting parameters, v^k denotes the grouped audio and video features, [·, ·] denotes the dimension-wise concatenation of features, and f^k denotes the sound image file features extracted by the model;
C2: the features f^k extracted from the original sound image file and the features f_d^k extracted from the reference sound image file are quantized, and a corresponding loss function is constructed:

q_t = Q[f^k]
q_d = Q[f_d^k]
l_f = (q_t − q_d)²
L_all = λ_1 (l′_re + l′_d) + λ_2 l″_re − l_f, λ_1 + λ_2 = 1, λ_1, λ_2 > 0

where Q[·] denotes the feature quantization function, q_t and q_d denote the quantized features of f^k and f_d^k, l_f denotes the quantization loss function, and λ_1, λ_2 denote the weight coefficients of the loss terms. The model joint loss function is minimized with the Adam optimization algorithm until the model converges, finally yielding the extracted sound image file quantization features q_t.
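The joint loss of step C2 combines the three reconstruction losses with the quantization loss; note that l_f enters with a negative sign, which pushes the quantized features of the original and perturbed reference files apart. A direct sketch:

```python
def joint_loss(l_re_v1, l_re_v2, l_d, l_f, lam1=0.5, lam2=0.5):
    """L_all = lam1 * (l're + l'd) + lam2 * l''re - l_f  (step C2),
    with lam1 + lam2 = 1 and both positive. The default 0.5/0.5 split
    is an illustrative assumption."""
    assert abs(lam1 + lam2 - 1.0) < 1e-9 and lam1 > 0 and lam2 > 0
    return lam1 * (l_re_v1 + l_d) + lam2 * l_re_v2 - l_f
```

In training, this scalar would be minimized with Adam until convergence, as the patent states.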
As a further aspect of the present invention: in step C1, because of the uncertainty of deep learning and to avoid nearly identical sound image files producing the same features, a corresponding disturbance quantity is added to the original sound image file to generate a reference sound image dataset F_d, and the features of the disturbed dataset are extracted with the previously described feature extraction network:

f_d^k = κ(F_d; ψ)

where f_d^k denotes the interference features generated from the reference sound image dataset, κ(·) denotes the feature extraction model (with the same structure as the feature extraction network above), and ψ denotes its network model parameters.
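The reference dataset F_d is produced by perturbing the original file; the patent does not specify the disturbance, so the Gaussian pixel-noise model and its scale below are assumptions:

```python
import numpy as np

def make_reference_file(frames, epsilon=2.0, seed=0):
    """Generate a reference (perturbed) copy of a sound image file's frames
    by adding a small disturbance (step C1). Gaussian noise of standard
    deviation `epsilon` is an illustrative choice of disturbance quantity."""
    rng = np.random.default_rng(seed)
    noisy = frames.astype(float) + rng.normal(0.0, epsilon, frames.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Feeding both the original and the perturbed frames through the same feature extractor κ(·; ψ) yields f^k and f_d^k for the quantization loss.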
Compared with the prior art, the invention has the following beneficial effects: it provides a sound image file tamper-proofing method based on deep learning that effectively handles the strong association between the video and audio components of a sound image file; in addition, the core frame extraction technique effectively improves the efficiency of video feature extraction; finally, the sound image file features are solidified on a blockchain, effectively preventing sound image file tampering events.
Drawings
Fig. 1 is a schematic flow chart of a tamper-proof method for an audio/video file.
Detailed Description
Referring to fig. 1, in an embodiment of the present invention, a tamper-proof method for an audio/video file includes the following steps:
Step 1: acquire a sound image file dataset F_t, and divide each sound image file, according to whether it contains audio data, into a video set V1 with an associated audio set A1, and a video set V2 (video only);
Step 2: extract the core frames of the video sets using a core frame extraction algorithm;
Step 3: extract video features and text features with a feature extraction network, combine the text features and video features into sound image file features, reconstruct the features with a decoding network, and construct a reconstruction loss function;
Step 4: formulate a reference sound image file set, obtain the reference sound image file features with the feature extraction network, quantize both the sound image file features and the reference features, construct a quantization loss function, combine it with the reconstruction loss function into a model joint loss function, and minimize the joint loss function until the model converges, finally extracting the sound image file quantization features;
Step 5: generate a key from the extracted sound image file quantization features with the MD5 hash algorithm and solidify it on a blockchain, effectively preventing sound image file tampering events.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical solution and inventive concept of the present invention, within the scope disclosed herein, shall be covered by the scope of protection of the present invention.
Claims (7)
1. A tamper-proof method for an audio-visual file is characterized by comprising the following steps:
Step 1: acquiring a sound image file dataset F_t, and dividing the sound image file into a video set V1, an audio set A1 and a video set V2 according to whether the sound image file contains audio data or not;
Step 2: extracting core frames of the video sets V1 and V2 by adopting a core frame extraction algorithm, to obtain a core frame set V_1^k of the video set V1 and a core frame set V_2^k of the video set V2;
Step 3: extracting video features and text features based on a feature extraction network, combining the text features and the video features to obtain original sound image file features, and respectively reconstructing the video features and the text features based on a decoding network to construct a reconstruction loss function;
wherein the extracting of video features and text features based on the feature extraction network comprises:
processing the core frame sets V_1^k and V_2^k to obtain the video features;
converting the audio set A1 into a speech text, and preprocessing the speech text according to the core frame extraction result of the video set V1 to obtain a text set T_1^k corresponding to the core frame set V_1^k; and processing the text set T_1^k to obtain the text features;
Wherein constructing the reconstruction loss function comprises:
constructing a reconstruction loss function l′_re corresponding to the video set V1, a reconstruction loss function l″_re corresponding to the video set V2, and a reconstruction loss function l′_d corresponding to the text features;
Step 4: formulating a reference sound image file set, acquiring reference sound image file characteristics based on a characteristic extraction network, quantizing the original sound image file characteristics and the reference sound image file characteristics, constructing a quantized loss function, constructing a model joint loss function by combining the quantized loss function and the reconstructed loss function, minimizing the model joint loss function until the model converges, and finally extracting to obtain sound image file quantized characteristics;
Wherein formulating the reference sound image file set comprises:
adding a corresponding disturbance to the original sound image files to generate the reference sound image file set;
Wherein quantizing the original and reference sound image file features, and constructing the model joint loss function from the quantization and reconstruction loss functions, comprises the following steps:
Obtaining, through a feature quantization function, a quantized feature q_t corresponding to the original sound image file features and a quantized feature q_d corresponding to the reference sound image file features;
Constructing a quantization loss function l_f from the quantized features q_t and q_d;
constructing a model joint loss function from l′_re, l″_re, l′_d and l_f;
Step 5: applying the MD5 hash algorithm to the extracted quantization features of the sound image file to generate a key, and solidifying the key on a blockchain, thereby effectively preventing tampering of the sound image file.
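As an illustration of Step 5, here is a minimal sketch of deriving a key by hashing a quantized feature vector with MD5, the algorithm named in the claim; the byte packing and the binary feature format are assumptions, and the blockchain solidification step is omitted:

```python
import hashlib

def file_key(quantized_features):
    # Pack each quantized feature into one byte and hash with MD5
    # (the hash named in Step 5). The packing scheme is illustrative.
    payload = bytes(int(q) & 0xFF for q in quantized_features)
    return hashlib.md5(payload).hexdigest()

# Any change to the quantized features changes the key.
key = file_key([1, 0, 1, 1, 0, 0, 1, 0])
```

In a full system this hex digest, not the file itself, would be written to the chain; verifying a file then means recomputing its quantized features and comparing digests.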
2. The method of claim 1, wherein in step 2, frame image datasets F1 = [f1^1, f1^2, ..., f1^t] and F2 = [f2^1, f2^2, ..., f2^t] are obtained from the video sets V1 and V2, wherein f1^i, i ∈ [1, t], represents the i-th frame image of video set V1, and f2^i, i ∈ [1, t], represents the i-th frame image of video set V2;
In order to preserve the correlation between frame images while removing temporal redundancy, the core frames of the frame image datasets are extracted as follows:
A1: the cumulative histogram of each frame image is calculated as follows:

h_i = (1/N) · Σ_{k=0}^{i} n_k, i = 0, 1, ..., l-1

wherein h_i represents the cumulative histogram value at the i-th pixel level, l is the number of pixel levels of the image, N is the total number of image pixels, and n_k represents the number of pixels with pixel value k;
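A minimal sketch of the cumulative histogram of step A1, assuming the frame is given as a flat list of integer pixel values (the `levels` parameter stands in for the number of pixel levels l):

```python
def cumulative_histogram(pixels, levels=256):
    # h_i = (1/N) * sum_{k<=i} n_k: fraction of pixels with value <= i.
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    total = len(pixels)
    h, running = [], 0
    for c in counts:
        running += c
        h.append(running / total)
    return h

h = cumulative_histogram([0, 0, 1, 3], levels=4)
```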
A2: based on the cumulative histograms, the video file is segmented into scenes using the difference between adjacent frames and the difference between each frame and the scene's initial frame, with the formula:
d_j = τ1·d_j′ + τ2·d_j″

wherein d_j′ represents the difference between the j-th and (j+1)-th frame images, d_j″ represents the difference between the j-th frame image and the first frame of the current scene, d_j represents the scene boundary frame weight, and τ1, τ2 are weight coefficients;
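A sketch of step A2, under the assumption that frame differences are L1 distances between cumulative histograms (the claim does not fix the distance metric, and the τ values here are illustrative):

```python
def hist_distance(h_a, h_b):
    # L1 distance between two cumulative histograms (assumed metric).
    return sum(abs(a - b) for a, b in zip(h_a, h_b))

def boundary_weight(h_j, h_next, h_first, tau1=0.5, tau2=0.5):
    # d_j = tau1*d'_j + tau2*d''_j from step A2.
    d_prime = hist_distance(h_j, h_next)    # adjacent-frame difference d'_j
    d_second = hist_distance(h_j, h_first)  # difference to scene's first frame d''_j
    return tau1 * d_prime + tau2 * d_second

d = boundary_weight([0.0, 1.0], [0.5, 1.0], [0.0, 1.0], tau1=1.0, tau2=0.0)
```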
A3: if the scene boundary frame weight d_j is greater than a prescribed threshold σ, the j-th frame is defined as a boundary frame of a scene; core frames are then extracted dynamically within each segmented scene as follows:

n = R(μ·q), v_m = {v_m^1, v_m^2, ..., v_m^n}

where v_m denotes the core frame set of the m-th scene, v_m^i the i-th core frame of the m-th scene, n the number of core frames of the scene, R(·) the round-up (ceiling) function, q the total number of frames of the m-th scene, and μ the frame weight of the scene; finally the core frame set V1^k of video set V1 and the core frame set V2^k of video set V2 are obtained.
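A sketch of the dynamic core frame extraction of step A3, reading n = R(μ·q) as "take ⌈μq⌉ frames per scene"; the even-spacing policy within a scene is an assumption:

```python
import math

def sample_core_frames(scene_frames, mu=0.1):
    # n = ceil(mu * q) core frames per scene, evenly spaced (assumed policy).
    q = len(scene_frames)
    n = max(1, math.ceil(mu * q))
    step = q / n
    return [scene_frames[min(q - 1, int(i * step))] for i in range(n)]

frames = sample_core_frames(list(range(20)), mu=0.1)
```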
3. The method for preventing a sound image file from being tampered according to claim 1, wherein step 3 comprises the following specific steps:
B1: the spatial similarity within each image of the core frame set is extracted by a convolutional neural network, with the formula:

a^l = σ(W^l · a^{l-1} + b^l)

where a^l is the output of the l-th network layer, and W^l, b^l are the weight and bias parameters of the l-th layer;
B2: the spatial similarity between images in the core frame set is extracted based on the frame difference time sequence network, and the formula is as follows:
h′t-1=σ(Wdht-1+bd)
σ1=f(Δti)*h′t-1
yt=σ(Wo·ht)
wherein h t-1 represents the image characteristic of the previous frame, h' t-1 represents partial image characteristic information influenced by a frame difference control gate, k t represents a frame difference input gate, and the influence of a frame difference interval on the image characteristic of the frame is controlled; f (·) is the frame difference function; sigma (·), tan h (·) represent activation functions; An output representing the image characteristics of the previous frame; x t denotes the current frame image feature, r t is a reset gate, indicating how much of the previous frame image feature information remains in the current frame; /(I) The state information of the current frame image is memorized, z t represents an update gate, the retention condition of the characteristic information of the current frame image is determined, h t represents the hidden output of the current frame image information, y t represents the characteristic output of the current frame image, and W r,/>W z,Wo,Wd,bd represents frame difference timing network parameters;
Finally, the video core frame sets V 1 k and V 2 k pass through a frame difference network to obtain low-dimensional video features
B3: for the audio set A1, converting the audio set A1 into a voice text based on the existing voice recognition technology, and preprocessing the voice text according to the core frame extraction result of the video set V1 to obtain a text set T 1 k corresponding to the core frame set V 1 k;
B4: for the text set T 1 k, extracting word vectors of the text to obtain vector representation of the text, and extracting correlation among the text based on a cyclic neural network to finally obtain feature representation of the text set, wherein the formula is as follows:
f′t=σ(W′f[h″t-1,x′t]+b′f)
o′t=σ(W′o[h″t-1,x′t]+b′o)
h′t=o′t*tanh(C′t)
Wherein x' t represents the current text feature, σ (·) and tanh (·) represent the activation function; h 't-1 denotes the output of the last text feature, f' t denotes the recurrent neural network forget gate, and denotes whether the last text feature information is retained to the current text; i' t, C' t represents the input gate of the recurrent neural network, and represents how many current text features are input into the model; o 't and h' t represent recurrent neural network output gates, representing current text feature outputs, W 'f、W′i、W′c、W′o and b' f、b′i、b′c、b′o are model parameters;
Finally, the text set T 1 k passes through a cyclic neural network to obtain low-dimensional text characteristics of the text
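The gate equations of step B4 are those of a standard LSTM cell. A minimal scalar sketch follows, in which scalar weights stand in for the matrices W′_f, W′_i, W′_c, W′_o and the concatenation [h″_{t-1}, x′_t] is reduced to a sum; both simplifications are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, w):
    s = h_prev + x_t                          # stand-in for [h''_{t-1}, x'_t]
    f_t = sigmoid(w["f"] * s + w["bf"])       # forget gate f'_t
    i_t = sigmoid(w["i"] * s + w["bi"])       # input gate i'_t
    c_hat = math.tanh(w["c"] * s + w["bc"])   # candidate state
    c_t = f_t * c_prev + i_t * c_hat          # cell state C'_t
    o_t = sigmoid(w["o"] * s + w["bo"])       # output gate o'_t
    h_t = o_t * math.tanh(c_t)                # hidden output h'_t
    return h_t, c_t

w = {"f": 1.0, "bf": 0.0, "i": 1.0, "bi": 0.0,
     "c": 1.0, "bc": 0.0, "o": 1.0, "bo": 0.0}
h1, c1 = lstm_step(0.5, 0.0, 0.0, w)
```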
B5: video feature for low dimension based on frame difference time sequence network and convolutional neural networkAnd (3) reconstructing, wherein the formula is as follows:
Where h' (. Cndot.) represents the reconstructed network model, θ represents the reconstructed network parameters, and r 1 k represents the video features Features after reconstruction,/>Representing video features/>The reconstructed features;
b6: text feature based on cyclic neural network Reconstructing to make it approach to the text set T 1 k, and the formula:
Where h "(. Cndot.) represents the reconstructed network model, η represents the reconstructed network parameters, Representing the reconstructed text features.
4. The sound image file tamper-proofing method according to claim 3, wherein in step B5, corresponding reconstruction loss functions are constructed from the reconstructed features, with the formulas:

l′_re = ||v1^k - r1^k||², l″_re = ||v2^k - r2^k||²

wherein l′_re and l″_re represent the reconstruction loss functions of the video sets V1 and V2, and v1^k, v2^k represent the original core frame features of the video sets V1 and V2 respectively.
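A sketch of the reconstruction losses, assuming the squared-error reading ||v - r||² (the same quadratic form the claims use for the quantization loss l_f):

```python
def recon_loss(original, reconstructed):
    # ||v - r||^2: sum of squared per-dimension errors.
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed))

l_re1 = recon_loss([1.0, 2.0], [1.0, 1.0])  # plays the role of l'_re
```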
5. The sound image file tamper-proofing method according to claim 3, wherein step B6 further comprises constructing a corresponding reconstruction loss function, with the formula:

l′_d = ||t1^k - t̃1^k||²

where l′_d denotes the reconstruction loss function of the text features.
6. The method of claim 1, wherein step 3 further comprises the following step:
C1: based on the text features t^k and the video features v^k, the original sound image file features are obtained, with the formulas:

f^k = [β1·t^k, β2·v^k], β1 + β2 = 1

wherein β1, β2 are weight parameters, t^k and v^k denote the text and video features being combined, [·,·] represents the dimension-wise splice (concatenation) of the features, and f^k denotes the original sound image file features;
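A sketch of the weighted dimension splice of step C1; the form f^k = [β1·t^k, β2·v^k] and the β values used here are illustrative readings of the claim:

```python
def fuse_features(text_feat, video_feat, beta1=0.5, beta2=0.5):
    # Weighted concatenation with beta1 + beta2 = 1.
    assert abs(beta1 + beta2 - 1.0) < 1e-9
    return [beta1 * t for t in text_feat] + [beta2 * v for v in video_feat]

fk = fuse_features([1.0, 2.0], [4.0], beta1=0.25, beta2=0.75)
```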
Step 4 comprises the following specific steps:
C2: quantizing the original sound image file features f^k and the reference sound image file features f_d^k, and constructing the corresponding quantization and joint loss functions, with the formulas:

q_t = Q[f^k]
q_d = Q[f_d^k]
l_f = (q_t - q_d)²
L_all = λ1·(l′_re + l′_d) + λ2·l″_re - l_f
λ1 + λ2 = 1, λ1, λ2 > 0

wherein Q[·] represents the feature quantization function, q_t and q_d represent the quantized features of f^k and f_d^k respectively, l_f represents the quantization loss function, and λ1, λ2 are weight coefficients; the model joint loss function L_all is minimized with the Adam optimization algorithm until the model converges, finally yielding the quantization features of the sound image file.
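A sketch of step C2. The quantizer Q[·] is unspecified in the claim, so a simple sign-style binarizer stands in, and the per-dimension squared differences are summed for l_f:

```python
def quantize(features):
    # Stand-in for Q[.]: binarize each dimension by sign.
    return [1.0 if f >= 0.0 else 0.0 for f in features]

def joint_loss(l_re1, l_re2, l_d, q_t, q_d, lam1=0.5, lam2=0.5):
    # L_all = lam1*(l'_re + l'_d) + lam2*l''_re - l_f,
    # with l_f = sum over dimensions of (q_t - q_d)^2.
    l_f = sum((a - b) ** 2 for a, b in zip(q_t, q_d))
    return lam1 * (l_re1 + l_d) + lam2 * l_re2 - l_f

q = quantize([-0.2, 0.3, 0.0])
L = joint_loss(1.0, 1.0, 1.0, [1.0, 0.0], [1.0, 1.0])
```

Since l_f enters with a minus sign, minimizing L_all pushes the quantized codes of original and perturbed files apart, which is what makes the final key sensitive to tampering.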
7. The method of claim 6, wherein a corresponding disturbance is added to the original sound image files to generate a reference sound image file set F_d, and the features of the reference set are extracted by the feature extraction network, with the formula:

f_d^k = κ(F_d; ψ)

wherein f_d^k represents the reference sound image file features, κ(·) represents the network feature extraction model, whose structure is identical to that of the feature extraction network, and ψ represents the network feature extraction model parameters.
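A sketch of the disturbance step of claim 7, assuming features are given as a list of per-file feature vectors and the disturbance is small Gaussian noise (the claim fixes neither the noise type nor its scale):

```python
import random

def make_reference_set(file_features, noise_scale=0.01, seed=0):
    # Add a small perturbation to every feature of every file to build F_d.
    rng = random.Random(seed)
    return [[f + rng.gauss(0.0, noise_scale) for f in feats]
            for feats in file_features]

refs = make_reference_set([[1.0, 2.0]])
```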
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110533209.4A CN113704829B (en) | 2021-05-19 | 2021-05-19 | Method for preventing sound image file from being tampered |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113704829A CN113704829A (en) | 2021-11-26 |
CN113704829B true CN113704829B (en) | 2024-06-11 |
Family
ID=78648176
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110533209.4A Active CN113704829B (en) | 2021-05-19 | 2021-05-19 | Method for preventing sound image file from being tampered |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113704829B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110880172A (en) * | 2019-11-12 | 2020-03-13 | 中山大学 | Video face tampering detection method and system based on cyclic convolution neural network |
CN111479112A (en) * | 2020-06-23 | 2020-07-31 | 腾讯科技(深圳)有限公司 | Video coding method, device, equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10932009B2 (en) * | 2019-03-08 | 2021-02-23 | Fcb Worldwide, Inc. | Technologies for analyzing and searching for features in image data |
Non-Patent Citations (1)
Title |
---|
Multi-bit reversible data hiding algorithm in the encrypted domain based on R-LWE; Ke Yan; Zhang Minqing; Su Tingting; Journal of Computer Research and Development (Issue 10); 178-193 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Amerini et al. | Exploiting prediction error inconsistencies through LSTM-based classifiers to detect deepfake videos | |
CN108960063B (en) | Multi-event natural language description method in video facing event relation coding | |
CN109948721B (en) | Video scene classification method based on video description | |
CN108287904A (en) | A kind of document context perception recommendation method decomposed based on socialization convolution matrix | |
CN113627266B (en) | Video pedestrian re-recognition method based on transform space-time modeling | |
CN114723760B (en) | Portrait segmentation model training method and device and portrait segmentation method and device | |
CN112464179A (en) | Short video copyright storage algorithm based on block chain and expression recognition | |
CN114724060A (en) | Method and device for unsupervised video anomaly detection based on mask self-encoder | |
CN113961736A (en) | Method and device for generating image by text, computer equipment and storage medium | |
CN112804558A (en) | Video splitting method, device and equipment | |
CN113421185B (en) | StyleGAN-based mobile terminal face age editing method | |
CN115037926A (en) | Video quality evaluation method, device, equipment and medium thereof | |
CN115393949A (en) | Continuous sign language recognition method and device | |
CN113420179B (en) | Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution | |
CN112528077B (en) | Video face retrieval method and system based on video embedding | |
CN117115718B (en) | Government affair video data processing method, system and computer readable storage medium | |
CN113704829B (en) | Method for preventing sound image file from being tampered | |
Dastbaravardeh et al. | Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames | |
Li et al. | Face Recognition Based on the Combination of Enhanced Local Texture Feature and DBN under Complex Illumination Conditions. | |
CN113554569B (en) | Face image restoration system based on double memory dictionaries | |
CN112131429A (en) | Video classification method and system based on depth prediction coding network | |
CN112258707A (en) | Intelligent access control system based on face recognition | |
CN113689527A (en) | Training method of face conversion model and face image conversion method | |
CN116883900A (en) | Video authenticity identification method and system based on multidimensional biological characteristics | |
CN116704433A (en) | Self-supervision group behavior recognition method based on context-aware relationship predictive coding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||