CN113704829B - Method for preventing sound image file from being tampered - Google Patents
- Publication number: CN113704829B
- Application number: CN202110533209.4A
- Authority
- CN
- China
- Prior art keywords
- frame
- text
- image file
- sound image
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F21/64 — Protecting data integrity, e.g. using checksums, certificates or signatures
- G06F21/6218 — Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. a local or distributed file system or database
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08 — Learning methods
- G10L15/26 — Speech to text systems
- Y02T10/40 — Engine management systems
Abstract
The invention relates to the technical field of deep learning, and in particular to a tamper-proofing method for sound image (audio-visual) files, comprising five steps designed to prevent tampering with sound image files. The method, based on deep learning technology, effectively handles the strong association between the video and audio components of a sound image file; in addition, a core frame extraction technique effectively improves the efficiency of video feature extraction; finally, the sound image file features are solidified on a blockchain, effectively preventing sound image file tampering events.
Description
Technical Field
The invention relates to the technical field of deep learning, in particular to a tamper-proof method for an audio-video file.
Background
With the growing trend toward digital file management, the tamper resistance of sound image files has received increasing attention. Existing tamper-proofing techniques typically separate a sound image file into an independent video file and an independent audio file and apply a separate tamper-proofing technique to each.
However, this separation ignores the strong correlation between the video and audio components of the file. The present invention therefore provides a tamper-proofing method for sound image files that addresses this shortcoming of the background art.
Disclosure of Invention
The invention aims to provide a tamper-proofing method for sound image files that solves the problems described in the background section.
In order to achieve the above purpose, the present invention provides the following technical solutions: a tamper-proof method for an audio-visual file comprises the following steps:
Step 1: acquire a sound image file dataset F_t, and divide each sound image file, according to whether it contains audio data, into a video set V1 with an associated audio set A1, and a video set V2 (video only);
Step 2: extract the core frames of the video sets using a core frame extraction algorithm;
Step 3: extract video features and text features with a feature extraction network, combine the text features and video features into sound image file features, reconstruct the features with a decoding network, and construct a reconstruction loss function;
Step 4: formulate a reference sound image file set, obtain the reference sound image file features with the feature extraction network, quantize both the sound image file features and the reference features, construct a quantization loss function, combine it with the reconstruction loss function into a model joint loss function, and minimize the joint loss function until the model converges, finally extracting the sound image file quantization features;
Step 5: generate a key from the extracted sound image file quantization features with the MD5 hash algorithm and solidify it on a blockchain, effectively preventing sound image file tampering events.
As a further aspect of the present invention: in step 2, based on the video sets V1 and V2, frame image datasets F_1 = [f_1^1, f_1^2, ..., f_1^t] and F_2 = [f_2^1, f_2^2, ..., f_2^t] are acquired, where f_1^i, i ∈ [1, t], denotes the i-th frame image of video set V1 and f_2^i, i ∈ [1, t], denotes the i-th frame image of video set V2. To preserve the correlation between frame images while removing the temporal redundancy between them, the core frames of the frame image datasets are extracted as follows:
A1: the cumulative histogram of each frame image is calculated as follows:

h_i = (1/N) Σ_{k=0}^{i} n_k, i = 0, 1, ..., l − 1

where h_i denotes the cumulative histogram value at pixel value i, l is the number of pixel value levels of the image, N is the total number of image pixels, and n_k denotes the number of pixels with pixel value k;
A2: based on the cumulative histograms, the video file is segmented into scenes using the difference between adjacent frames and the difference between each frame and the initial frame of the scene:

d_j = τ_1 d_j′ + τ_2 d_j″

where d_j′ denotes the difference between the j-th and (j+1)-th frame images, d_j″ denotes the difference between the j-th frame image and the first frame of the scene, d_j denotes the scene boundary frame weight, and τ_1, τ_2 are weight coefficients;
A3: if the scene boundary frame weight d_j exceeds a specified threshold σ, the frame is defined as a boundary frame of a scene. Based on the segmented scene frame differences, the core frames of each scene are extracted dynamically:

n = R(μ · q), V_m = [v_m^1, v_m^2, ..., v_m^n]

where V_m denotes the core frame set of the m-th scene, v_m^i denotes the i-th core frame of the m-th scene, n denotes the number of scene core frames, R(·) denotes the round-up (ceiling) function, q denotes the total number of frames in the m-th scene, and μ denotes the frame weight of the scene. Finally the core frame set V_1^k of video set V1 and the core frame set V_2^k of video set V2 are obtained.
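The histogram and scene-segmentation procedure of steps A1–A3 can be sketched as follows. The L1 distance between cumulative histograms and the parameter values (τ_1 = τ_2 = 0.5, σ = 0.2) are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def cumulative_hist(frame, levels=256):
    # Step A1: h_i = (1/N) * sum_{k<=i} n_k, the normalized cumulative histogram.
    counts = np.bincount(frame.ravel(), minlength=levels)
    return np.cumsum(counts) / frame.size

def scene_boundaries(frames, tau1=0.5, tau2=0.5, sigma=0.2):
    """Steps A2-A3: weighted frame-difference scene segmentation.

    d_j = tau1 * |H_j - H_{j+1}| + tau2 * |H_j - H_scene_start|; a frame
    is a scene boundary when d_j exceeds sigma. The L1 histogram distance
    is an assumed choice of difference measure.
    """
    hists = [cumulative_hist(f) for f in frames]
    boundaries, first = [0], 0
    for j in range(len(frames) - 1):
        d_adj = np.abs(hists[j] - hists[j + 1]).sum()   # adjacent-frame difference
        d_init = np.abs(hists[j] - hists[first]).sum()  # difference to scene start
        if tau1 * d_adj + tau2 * d_init > sigma:
            boundaries.append(j + 1)
            first = j + 1
    return boundaries
```

From each segmented scene, R(μ·q) core frames would then be sampled per step A3.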
As a further aspect of the present invention: step 3 comprises the following specific steps:
B1: the spatial similarity within each image of the core frame sets is extracted with a convolutional neural network:

a^l = σ(W^l a^{l−1} + b^l)

where a^l is the output of the l-th network layer and W^l, b^l are the weight and bias parameters of the l-th layer of the network model;
B2: the spatial similarity between images in the core frame set is extracted based on the frame difference time sequence network, and the formula is as follows:
h′t-1=σ(Wdht-1+bd)
σt=f(Δti)*h′t-1
yt=σ(Wo·ht)
wherein h t-1 represents the image characteristic of the previous frame, h' t-1 represents partial image characteristic information influenced by a frame difference control gate, k t represents a frame difference input gate, and the influence of a frame difference interval on the image characteristic of the frame is controlled; f (·) is the frame difference function; sigma (·), tan h (·) represent activation functions; An output representing the image characteristics of the previous frame; x t denotes the current frame image feature, r t is a reset gate, indicating how much of the previous frame image feature information remains in the current frame; /(I) The state information of the current frame image is memorized, z t represents an update gate, the retention condition of the characteristic information of the current frame image is determined, h t represents the hidden output of the current frame image information, y t represents the characteristic output of the current frame image, and W r,/>W z,Wo,Wd,bd represents frame difference timing network parameters;
Finally, a set of video core frames And/>Obtaining low-dimensional characteristic representation/>, through a frame difference network
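The frame-difference timing network of step B2 is only partially specified in the text. The sketch below assumes a standard GRU-style cell augmented with an exponential frame-difference decay f(Δt) = exp(−Δt); the decay form, parameter shapes, and any gate wiring beyond the quoted equations are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_diff_gru_step(h_prev, x_t, dt, params):
    """One step of a GRU-style cell with a frame-difference gate (step B2).

    f(dt) = exp(-dt) is an assumed frame-difference function: larger frame
    intervals let less of the previous frame's feature carry over.
    """
    Wd, bd, Wr, Wz, Wh, Wo = (params[k] for k in ("Wd", "bd", "Wr", "Wz", "Wh", "Wo"))
    h_ctrl = sigmoid(Wd @ h_prev + bd)      # h'_{t-1}: gated previous-frame feature
    h_ctrl = np.exp(-dt) * h_ctrl           # k_t: frame-difference input gate
    hx = np.concatenate([h_ctrl, x_t])
    r = sigmoid(Wr @ hx)                    # reset gate r_t
    z = sigmoid(Wz @ hx)                    # update gate z_t
    h_cand = np.tanh(Wh @ np.concatenate([r * h_ctrl, x_t]))  # candidate state
    h_t = (1 - z) * h_ctrl + z * h_cand     # hidden output h_t
    y_t = sigmoid(Wo @ h_t)                 # feature output y_t
    return h_t, y_t
```

Iterating this step over a scene's core frames yields the low-dimensional video feature sequence.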
B3: for the audio set A1, convert it to speech text using existing speech recognition technology and preprocess the speech text according to the core frame extraction result of video set V1, obtaining the text set T_1^k corresponding to the core frame set V_1^k;
B4: for the text set T_1^k, extract the word vectors of the text to obtain its vector representation, then extract the correlations within the text with a recurrent neural network to obtain the feature representation of the text set:

f′_t = σ(W′_f [h″_{t−1}, x′_t] + b′_f)
i′_t = σ(W′_i [h″_{t−1}, x′_t] + b′_i)
C̃′_t = tanh(W′_c [h″_{t−1}, x′_t] + b′_c)
C′_t = f′_t * C′_{t−1} + i′_t * C̃′_t
o′_t = σ(W′_o [h″_{t−1}, x′_t] + b′_o)
h′_t = o′_t * tanh(C′_t)

where x′_t denotes the current text feature; σ(·) and tanh(·) denote activation functions; h″_{t−1} denotes the output of the previous text feature; f′_t denotes the recurrent neural network forget gate, which determines whether previous text feature information is retained for the current text; i′_t and C̃′_t form the input gate, which determines how much of the current text feature is input into the model; C′_t denotes the cell state; o′_t and h′_t denote the output gate and the current text feature output; and W′_f, W′_i, W′_c, W′_o and b′_f, b′_i, b′_c, b′_o are model parameters.
Finally, the text set T_1^k is passed through the recurrent neural network to obtain the low-dimensional text feature representation, denoted here t_1^k.
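The gate equations of step B4 match a standard LSTM cell; a minimal sketch follows, where the parameter packing (per-gate matrices in a dict) and dimensions are implementation choices, not fixed by the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One step of the step-B4 recurrent text network (standard LSTM)."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ hx + b["f"])        # forget gate f'_t
    i = sigmoid(W["i"] @ hx + b["i"])        # input gate i'_t
    c_tilde = np.tanh(W["c"] @ hx + b["c"])  # candidate cell state
    c = f * c_prev + i * c_tilde             # cell state C'_t
    o = sigmoid(W["o"] @ hx + b["o"])        # output gate o'_t
    h = o * np.tanh(c)                       # hidden output h'_t
    return h, c
```

Running this step over a core frame's word-vector sequence and taking the final hidden state would give that frame's low-dimensional text feature.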
B5: the low-dimensional video set features v_1^k and v_2^k obtained from the frame-difference timing network and the convolutional neural network are reconstructed:

r_1^k = h′(v_1^k; θ), r_2^k = h′(v_2^k; θ)

where h′(·) denotes the reconstruction network model (its layer formulas follow steps B1 and B2), θ denotes the reconstruction network parameters, and r_1^k, r_2^k denote the reconstructed video features;
B6: the text features t_1^k obtained from the recurrent neural network are reconstructed so that they approach the text set features T_1^k:

r_d^k = h″(t_1^k; η)

where h″(·) denotes the reconstruction network model (its formulas follow step B4), η denotes the reconstruction network parameters, and r_d^k denotes the reconstructed text features.
As a further aspect of the present invention: in step B5, corresponding loss functions are constructed from the reconstructed features:

l′_re = ||r_1^k − v_1^k||², l″_re = ||r_2^k − v_2^k||²

where l′_re and l″_re denote the reconstruction losses of video sets V1 and V2, and v_1^k, v_2^k denote the original core frame features of video sets V1 and V2, respectively.
As a further aspect of the present invention: step B6 also comprises constructing a corresponding reconstruction loss:

l′_d = ||r_d^k − t_1^k||²

where l′_d denotes the reconstruction loss of the text features.
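The reconstruction losses of steps B5 and B6 appear in the source only as missing formula images; assuming the usual squared-error form, l′_re, l″_re and l′_d can each be computed as:

```python
import numpy as np

def reconstruction_loss(original, reconstructed):
    """Mean squared error between an original feature vector and its
    decoder reconstruction -- the assumed form of l'_re, l''_re and l'_d."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return float(np.mean((original - reconstructed) ** 2))
```

A loss of zero means the decoding network reproduced the feature exactly; during training it pushes the encoder to keep information the decoder needs.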
As a further aspect of the present invention: step 4 comprises the following specific steps:
C1: based on the compressed model features, the sound image file feature representation is obtained:

f^k = [β_1 · v^k, β_2 · t_1^k], β_1 + β_2 = 1

where β_1, β_2 denote the feature weighting parameters, v^k denotes the grouped audio and video features, [·, ·] denotes the dimension-wise concatenation of features, and f^k denotes the sound image file features extracted by the model;
C2: the features f^k extracted from the original sound image file and the features f_d^k extracted from the reference sound image file are quantized, and a corresponding loss function is constructed:

q_t = Q[f^k]
q_d = Q[f_d^k]
l_f = (q_t − q_d)²
L_all = λ_1 (l′_re + l′_d) + λ_2 l″_re − l_f, λ_1 + λ_2 = 1, λ_1, λ_2 > 0

where Q[·] denotes the feature quantization function, q_t and q_d denote the quantized features of f^k and f_d^k, l_f denotes the quantization loss function, and λ_1, λ_2 denote the weight coefficients of the loss terms. The model joint loss function is minimized with the Adam optimization algorithm until the model converges, finally yielding the extracted sound image file quantization features q_t.
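The joint loss of step C2 combines the three reconstruction losses with the quantization loss; note that l_f enters with a negative sign, which pushes the quantized features of the original and perturbed reference files apart. A direct sketch:

```python
def joint_loss(l_re_v1, l_re_v2, l_d, l_f, lam1=0.5, lam2=0.5):
    """L_all = lam1 * (l're + l'd) + lam2 * l''re - l_f  (step C2),
    with lam1 + lam2 = 1 and both positive. The default 0.5/0.5 split
    is an illustrative assumption."""
    assert abs(lam1 + lam2 - 1.0) < 1e-9 and lam1 > 0 and lam2 > 0
    return lam1 * (l_re_v1 + l_d) + lam2 * l_re_v2 - l_f
```

In training, this scalar would be minimized with Adam until convergence, as the patent states.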
As a further aspect of the present invention: in step C1, because of the uncertainty of deep learning and to avoid nearly identical sound image files producing the same features, a corresponding disturbance quantity is added to the original sound image file to generate a reference sound image dataset F_d, and the features of the disturbed dataset are extracted with the previously described feature extraction network:

f_d^k = κ(F_d; ψ)

where f_d^k denotes the interference features generated from the reference sound image dataset, κ(·) denotes the feature extraction model (with the same structure as the feature extraction network above), and ψ denotes its network model parameters.
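The reference dataset F_d is produced by perturbing the original file; the patent does not specify the disturbance, so the Gaussian pixel-noise model and its scale below are assumptions:

```python
import numpy as np

def make_reference_file(frames, epsilon=2.0, seed=0):
    """Generate a reference (perturbed) copy of a sound image file's frames
    by adding a small disturbance (step C1). Gaussian noise of standard
    deviation `epsilon` is an illustrative choice of disturbance quantity."""
    rng = np.random.default_rng(seed)
    noisy = frames.astype(float) + rng.normal(0.0, epsilon, frames.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Feeding both the original and the perturbed frames through the same feature extractor κ(·; ψ) yields f^k and f_d^k for the quantization loss.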
Compared with the prior art, the invention has the following beneficial effects: it provides a sound image file tamper-proofing method based on deep learning that effectively handles the strong association between the video and audio components of a sound image file; in addition, the core frame extraction technique effectively improves the efficiency of video feature extraction; finally, the sound image file features are solidified on a blockchain, effectively preventing sound image file tampering events.
Drawings
Fig. 1 is a schematic flow chart of a tamper-proof method for an audio/video file.
Detailed Description
Referring to fig. 1, in an embodiment of the present invention, a tamper-proof method for an audio/video file includes the following steps:
Step 1: acquire a sound image file dataset F_t, and divide each sound image file, according to whether it contains audio data, into a video set V1 with an associated audio set A1, and a video set V2 (video only);
Step 2: extract the core frames of the video sets using a core frame extraction algorithm;
Step 3: extract video features and text features with a feature extraction network, combine the text features and video features into sound image file features, reconstruct the features with a decoding network, and construct a reconstruction loss function;
Step 4: formulate a reference sound image file set, obtain the reference sound image file features with the feature extraction network, quantize both the sound image file features and the reference features, construct a quantization loss function, combine it with the reconstruction loss function into a model joint loss function, and minimize the joint loss function until the model converges, finally extracting the sound image file quantization features;
Step 5: generate a key from the extracted sound image file quantization features with the MD5 hash algorithm and solidify it on a blockchain, effectively preventing sound image file tampering events.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical solution and inventive concept of the present invention, within the scope disclosed herein, shall be covered by the scope of protection of the present invention.
Claims (7)
1. A tamper-proof method for an audio-visual file is characterized by comprising the following steps:
Step 1: acquiring a sound image file dataset F_t, and dividing the sound image file into a video set V1, an audio set A1 and a video set V2 according to whether the sound image file contains audio data or not;
Step 2: extracting core frames of the video sets V1 and V2 by adopting a core frame extraction algorithm, to obtain a core frame set V_1^k of the video set V1 and a core frame set V_2^k of the video set V2;
Step 3: extracting video features and text features based on a feature extraction network, combining the text features and the video features to obtain original sound image file features, and respectively reconstructing the video features and the text features based on a decoding network to construct a reconstruction loss function;
wherein the extracting of video features and text features based on the feature extraction network comprises:
processing the core frame sets V_1^k and V_2^k to obtain the video features;
converting the audio set A1 into a speech text, and preprocessing the speech text according to the core frame extraction result of the video set V1 to obtain a text set T_1^k corresponding to the core frame set V_1^k; and processing the text set T_1^k to obtain the text features;
Wherein constructing the reconstruction loss function comprises:
constructing a reconstruction loss function l′_re corresponding to the video set V1, a reconstruction loss function l″_re corresponding to the video set V2, and a reconstruction loss function l′_d corresponding to the text features;
Step 4: formulating a reference sound image file set, acquiring reference sound image file characteristics based on a characteristic extraction network, quantizing the original sound image file characteristics and the reference sound image file characteristics, constructing a quantized loss function, constructing a model joint loss function by combining the quantized loss function and the reconstructed loss function, minimizing the model joint loss function until the model converges, and finally extracting to obtain sound image file quantized characteristics;
Wherein formulating the reference sound image file set comprises:
adding a corresponding disturbance to the original sound image files to generate the reference sound image file set;
Wherein quantizing the original and reference sound image file features, and constructing the model joint loss function from the quantization and reconstruction loss functions, comprises the following steps:
Obtaining, through a feature quantization function, a quantized feature q_t corresponding to the original sound image file features and a quantized feature q_d corresponding to the reference sound image file features;
Constructing a quantization loss function l_f from the quantized features q_t and q_d;
constructing a model joint loss function from l′_re, l″_re, l′_d and l_f;
Step 5: applying the MD5 hash algorithm to the extracted quantization features of the sound image file to generate a key, and solidifying the key on a blockchain, thereby effectively preventing tampering of the sound image file.
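As an illustration of Step 5, here is a minimal sketch of deriving a key by hashing a quantized feature vector with MD5, the algorithm named in the claim; the byte packing and the binary feature format are assumptions, and the blockchain solidification step is omitted:

```python
import hashlib

def file_key(quantized_features):
    # Pack each quantized feature into one byte and hash with MD5
    # (the hash named in Step 5). The packing scheme is illustrative.
    payload = bytes(int(q) & 0xFF for q in quantized_features)
    return hashlib.md5(payload).hexdigest()

# Any change to the quantized features changes the key.
key = file_key([1, 0, 1, 1, 0, 0, 1, 0])
```

In a full system this hex digest, not the file itself, would be written to the chain; verifying a file then means recomputing its quantized features and comparing digests.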
2. The method of claim 1, wherein in step 2, frame image datasets F1 = [f1^1, f1^2, ..., f1^t] and F2 = [f2^1, f2^2, ..., f2^t] are obtained from the video sets V1 and V2, wherein f1^i, i ∈ [1, t], represents the i-th frame image of video set V1, and f2^i, i ∈ [1, t], represents the i-th frame image of video set V2;
In order to preserve the correlation between frame images while removing temporal redundancy, the core frames of the frame image datasets are extracted as follows:
A1: the cumulative histogram of each frame image is calculated as follows:

h_i = (1/N) · Σ_{k=0}^{i} n_k, i = 0, 1, ..., l-1

wherein h_i represents the cumulative histogram value at the i-th pixel level, l is the number of pixel levels of the image, N is the total number of image pixels, and n_k represents the number of pixels with pixel value k;
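A minimal sketch of the cumulative histogram of step A1, assuming the frame is given as a flat list of integer pixel values (the `levels` parameter stands in for the number of pixel levels l):

```python
def cumulative_histogram(pixels, levels=256):
    # h_i = (1/N) * sum_{k<=i} n_k: fraction of pixels with value <= i.
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    total = len(pixels)
    h, running = [], 0
    for c in counts:
        running += c
        h.append(running / total)
    return h

h = cumulative_histogram([0, 0, 1, 3], levels=4)
```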
A2: based on the cumulative histograms, the video file is segmented into scenes using the difference between adjacent frames and the difference between each frame and the scene's initial frame, with the formula:
d_j = τ1·d_j′ + τ2·d_j″

wherein d_j′ represents the difference between the j-th and (j+1)-th frame images, d_j″ represents the difference between the j-th frame image and the first frame of the current scene, d_j represents the scene boundary frame weight, and τ1, τ2 are weight coefficients;
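A sketch of step A2, under the assumption that frame differences are L1 distances between cumulative histograms (the claim does not fix the distance metric, and the τ values here are illustrative):

```python
def hist_distance(h_a, h_b):
    # L1 distance between two cumulative histograms (assumed metric).
    return sum(abs(a - b) for a, b in zip(h_a, h_b))

def boundary_weight(h_j, h_next, h_first, tau1=0.5, tau2=0.5):
    # d_j = tau1*d'_j + tau2*d''_j from step A2.
    d_prime = hist_distance(h_j, h_next)    # adjacent-frame difference d'_j
    d_second = hist_distance(h_j, h_first)  # difference to scene's first frame d''_j
    return tau1 * d_prime + tau2 * d_second

d = boundary_weight([0.0, 1.0], [0.5, 1.0], [0.0, 1.0], tau1=1.0, tau2=0.0)
```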
A3: if the scene boundary frame weight d_j is greater than a prescribed threshold σ, the j-th frame is defined as a boundary frame of a scene; core frames are then extracted dynamically within each segmented scene as follows:

n = R(μ·q), v_m = {v_m^1, v_m^2, ..., v_m^n}

where v_m denotes the core frame set of the m-th scene, v_m^i the i-th core frame of the m-th scene, n the number of core frames of the scene, R(·) the round-up (ceiling) function, q the total number of frames of the m-th scene, and μ the frame weight of the scene; finally the core frame set V1^k of video set V1 and the core frame set V2^k of video set V2 are obtained.
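A sketch of the dynamic core frame extraction of step A3, reading n = R(μ·q) as "take ⌈μq⌉ frames per scene"; the even-spacing policy within a scene is an assumption:

```python
import math

def sample_core_frames(scene_frames, mu=0.1):
    # n = ceil(mu * q) core frames per scene, evenly spaced (assumed policy).
    q = len(scene_frames)
    n = max(1, math.ceil(mu * q))
    step = q / n
    return [scene_frames[min(q - 1, int(i * step))] for i in range(n)]

frames = sample_core_frames(list(range(20)), mu=0.1)
```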
3. The method for preventing a sound image file from being tampered according to claim 1, wherein step 3 comprises the following specific steps:
B1: the spatial similarity within each image of the core frame set is extracted by a convolutional neural network, with the formula:

a^l = σ(W^l · a^{l-1} + b^l)

where a^l is the output of the l-th network layer, and W^l, b^l are the weight and bias parameters of the l-th layer;
B2: the spatial similarity between images in the core frame set is extracted based on the frame difference time sequence network, and the formula is as follows:
h′t-1=σ(Wdht-1+bd)
σ1=f(Δti)*h′t-1
yt=σ(Wo·ht)
wherein h t-1 represents the image characteristic of the previous frame, h' t-1 represents partial image characteristic information influenced by a frame difference control gate, k t represents a frame difference input gate, and the influence of a frame difference interval on the image characteristic of the frame is controlled; f (·) is the frame difference function; sigma (·), tan h (·) represent activation functions; An output representing the image characteristics of the previous frame; x t denotes the current frame image feature, r t is a reset gate, indicating how much of the previous frame image feature information remains in the current frame; /(I) The state information of the current frame image is memorized, z t represents an update gate, the retention condition of the characteristic information of the current frame image is determined, h t represents the hidden output of the current frame image information, y t represents the characteristic output of the current frame image, and W r,/>W z,Wo,Wd,bd represents frame difference timing network parameters;
Finally, the video core frame sets V 1 k and V 2 k pass through a frame difference network to obtain low-dimensional video features
B3: for the audio set A1, converting the audio set A1 into a voice text based on the existing voice recognition technology, and preprocessing the voice text according to the core frame extraction result of the video set V1 to obtain a text set T 1 k corresponding to the core frame set V 1 k;
B4: for the text set T 1 k, extracting word vectors of the text to obtain vector representation of the text, and extracting correlation among the text based on a cyclic neural network to finally obtain feature representation of the text set, wherein the formula is as follows:
f′t=σ(W′f[h″t-1,x′t]+b′f)
o′t=σ(W′o[h″t-1,x′t]+b′o)
h′t=o′t*tanh(C′t)
Wherein x' t represents the current text feature, σ (·) and tanh (·) represent the activation function; h 't-1 denotes the output of the last text feature, f' t denotes the recurrent neural network forget gate, and denotes whether the last text feature information is retained to the current text; i' t, C' t represents the input gate of the recurrent neural network, and represents how many current text features are input into the model; o 't and h' t represent recurrent neural network output gates, representing current text feature outputs, W 'f、W′i、W′c、W′o and b' f、b′i、b′c、b′o are model parameters;
Finally, the text set T 1 k passes through a cyclic neural network to obtain low-dimensional text characteristics of the text
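The gate equations of step B4 are those of a standard LSTM cell. A minimal scalar sketch follows, in which scalar weights stand in for the matrices W′_f, W′_i, W′_c, W′_o and the concatenation [h″_{t-1}, x′_t] is reduced to a sum; both simplifications are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, w):
    s = h_prev + x_t                          # stand-in for [h''_{t-1}, x'_t]
    f_t = sigmoid(w["f"] * s + w["bf"])       # forget gate f'_t
    i_t = sigmoid(w["i"] * s + w["bi"])       # input gate i'_t
    c_hat = math.tanh(w["c"] * s + w["bc"])   # candidate state
    c_t = f_t * c_prev + i_t * c_hat          # cell state C'_t
    o_t = sigmoid(w["o"] * s + w["bo"])       # output gate o'_t
    h_t = o_t * math.tanh(c_t)                # hidden output h'_t
    return h_t, c_t

w = {"f": 1.0, "bf": 0.0, "i": 1.0, "bi": 0.0,
     "c": 1.0, "bc": 0.0, "o": 1.0, "bo": 0.0}
h1, c1 = lstm_step(0.5, 0.0, 0.0, w)
```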
B5: video feature for low dimension based on frame difference time sequence network and convolutional neural networkAnd (3) reconstructing, wherein the formula is as follows:
Where h' (. Cndot.) represents the reconstructed network model, θ represents the reconstructed network parameters, and r 1 k represents the video features Features after reconstruction,/>Representing video features/>The reconstructed features;
b6: text feature based on cyclic neural network Reconstructing to make it approach to the text set T 1 k, and the formula:
Where h "(. Cndot.) represents the reconstructed network model, η represents the reconstructed network parameters, Representing the reconstructed text features.
4. The sound image file tamper-proofing method according to claim 3, wherein in step B5, corresponding reconstruction loss functions are constructed from the reconstructed features, with the formulas:

l′_re = ||v1^k - r1^k||², l″_re = ||v2^k - r2^k||²

wherein l′_re and l″_re represent the reconstruction loss functions of the video sets V1 and V2, and v1^k, v2^k represent the original core frame features of the video sets V1 and V2 respectively.
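A sketch of the reconstruction losses, assuming the squared-error reading ||v - r||² (the same quadratic form the claims use for the quantization loss l_f):

```python
def recon_loss(original, reconstructed):
    # ||v - r||^2: sum of squared per-dimension errors.
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed))

l_re1 = recon_loss([1.0, 2.0], [1.0, 1.0])  # plays the role of l'_re
```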
5. The sound image file tamper-proofing method according to claim 3, wherein step B6 further comprises constructing a corresponding reconstruction loss function, with the formula:

l′_d = ||t1^k - t̃1^k||²

where l′_d denotes the reconstruction loss function of the text features.
6. The method of claim 1, wherein step 3 further comprises the following step:
C1: based on the text features t^k and the video features v^k, the original sound image file features are obtained, with the formulas:

f^k = [β1·t^k, β2·v^k], β1 + β2 = 1

wherein β1, β2 are weight parameters, t^k and v^k denote the text and video features being combined, [·,·] represents the dimension-wise splice (concatenation) of the features, and f^k denotes the original sound image file features;
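A sketch of the weighted dimension splice of step C1; the form f^k = [β1·t^k, β2·v^k] and the β values used here are illustrative readings of the claim:

```python
def fuse_features(text_feat, video_feat, beta1=0.5, beta2=0.5):
    # Weighted concatenation with beta1 + beta2 = 1.
    assert abs(beta1 + beta2 - 1.0) < 1e-9
    return [beta1 * t for t in text_feat] + [beta2 * v for v in video_feat]

fk = fuse_features([1.0, 2.0], [4.0], beta1=0.25, beta2=0.75)
```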
Step 4 comprises the following specific steps:
C2: quantizing the original sound image file features f^k and the reference sound image file features f_d^k, and constructing the corresponding quantization and joint loss functions, with the formulas:

q_t = Q[f^k]
q_d = Q[f_d^k]
l_f = (q_t - q_d)²
L_all = λ1·(l′_re + l′_d) + λ2·l″_re - l_f
λ1 + λ2 = 1, λ1, λ2 > 0

wherein Q[·] represents the feature quantization function, q_t and q_d represent the quantized features of f^k and f_d^k respectively, l_f represents the quantization loss function, and λ1, λ2 are weight coefficients; the model joint loss function L_all is minimized with the Adam optimization algorithm until the model converges, finally yielding the quantization features of the sound image file.
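A sketch of step C2. The quantizer Q[·] is unspecified in the claim, so a simple sign-style binarizer stands in, and the per-dimension squared differences are summed for l_f:

```python
def quantize(features):
    # Stand-in for Q[.]: binarize each dimension by sign.
    return [1.0 if f >= 0.0 else 0.0 for f in features]

def joint_loss(l_re1, l_re2, l_d, q_t, q_d, lam1=0.5, lam2=0.5):
    # L_all = lam1*(l'_re + l'_d) + lam2*l''_re - l_f,
    # with l_f = sum over dimensions of (q_t - q_d)^2.
    l_f = sum((a - b) ** 2 for a, b in zip(q_t, q_d))
    return lam1 * (l_re1 + l_d) + lam2 * l_re2 - l_f

q = quantize([-0.2, 0.3, 0.0])
L = joint_loss(1.0, 1.0, 1.0, [1.0, 0.0], [1.0, 1.0])
```

Since l_f enters with a minus sign, minimizing L_all pushes the quantized codes of original and perturbed files apart, which is what makes the final key sensitive to tampering.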
7. The method of claim 6, wherein a corresponding disturbance is added to the original sound image files to generate a reference sound image file set F_d, and the features of the reference set are extracted by the feature extraction network, with the formula:

f_d^k = κ(F_d; ψ)

wherein f_d^k represents the reference sound image file features, κ(·) represents the network feature extraction model, whose structure is identical to that of the feature extraction network, and ψ represents the network feature extraction model parameters.
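A sketch of the disturbance step of claim 7, assuming features are given as a list of per-file feature vectors and the disturbance is small Gaussian noise (the claim fixes neither the noise type nor its scale):

```python
import random

def make_reference_set(file_features, noise_scale=0.01, seed=0):
    # Add a small perturbation to every feature of every file to build F_d.
    rng = random.Random(seed)
    return [[f + rng.gauss(0.0, noise_scale) for f in feats]
            for feats in file_features]

refs = make_reference_set([[1.0, 2.0]])
```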
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110533209.4A CN113704829B (en) | 2021-05-19 | 2021-05-19 | Method for preventing sound image file from being tampered |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113704829A CN113704829A (en) | 2021-11-26 |
CN113704829B true CN113704829B (en) | 2024-06-11 |
Family
ID=78648176
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110533209.4A Active CN113704829B (en) | 2021-05-19 | 2021-05-19 | Method for preventing sound image file from being tampered |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113704829B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110880172A (en) * | 2019-11-12 | 2020-03-13 | 中山大学 | Video face tampering detection method and system based on cyclic convolution neural network |
CN111479112A (en) * | 2020-06-23 | 2020-07-31 | 腾讯科技(深圳)有限公司 | Video coding method, device, equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10932009B2 (en) * | 2019-03-08 | 2021-02-23 | Fcb Worldwide, Inc. | Technologies for analyzing and searching for features in image data |
Non-Patent Citations (1)
Title |
---|
Multi-bit reversible data hiding algorithm in the encrypted domain based on R-LWE; Ke Yan; Zhang Minqing; Su Tingting; Journal of Computer Research and Development (Issue 10); 178-193 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Amerini et al. | Exploiting prediction error inconsistencies through LSTM-based classifiers to detect deepfake videos | |
CN108960063B (en) | Multi-event natural language description method in video facing event relation coding | |
CN109948721B (en) | Video scene classification method based on video description | |
CN108287904A (en) | A kind of document context perception recommendation method decomposed based on socialization convolution matrix | |
CN113627266B (en) | Video pedestrian re-recognition method based on transform space-time modeling | |
CN114723760B (en) | Portrait segmentation model training method and device and portrait segmentation method and device | |
CN112464179A (en) | Short video copyright storage algorithm based on block chain and expression recognition | |
CN114724060A (en) | Method and device for unsupervised video anomaly detection based on mask self-encoder | |
CN113961736A (en) | Method and device for generating image by text, computer equipment and storage medium | |
CN112804558A (en) | Video splitting method, device and equipment | |
CN113421185B (en) | StyleGAN-based mobile terminal face age editing method | |
CN115037926A (en) | Video quality evaluation method, device, equipment and medium thereof | |
CN115393949A (en) | Continuous sign language recognition method and device | |
CN113420179B (en) | Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution | |
CN112528077B (en) | Video face retrieval method and system based on video embedding | |
CN117115718B (en) | Government affair video data processing method, system and computer readable storage medium | |
CN113704829B (en) | Method for preventing sound image file from being tampered | |
Dastbaravardeh et al. | Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames | |
Li et al. | Face Recognition Based on the Combination of Enhanced Local Texture Feature and DBN under Complex Illumination Conditions. | |
CN113554569B (en) | Face image restoration system based on double memory dictionaries | |
CN112131429A (en) | Video classification method and system based on depth prediction coding network | |
CN112258707A (en) | Intelligent access control system based on face recognition | |
CN113689527A (en) | Training method of face conversion model and face image conversion method | |
CN116883900A (en) | Video authenticity identification method and system based on multidimensional biological characteristics | |
CN116704433A (en) | Self-supervision group behavior recognition method based on context-aware relationship predictive coding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||