CN113704829B - Method for preventing sound image file from being tampered - Google Patents

Method for preventing sound image file from being tampered

Info

Publication number
CN113704829B
CN113704829B (application CN202110533209.4A)
Authority
CN
China
Prior art keywords
frame
text
image file
sound image
video
Prior art date
Legal status
Active
Application number
CN202110533209.4A
Other languages
Chinese (zh)
Other versions
CN113704829A (en)
Inventor
李喆
邱杰峰
陈莹
程莉红
施千里
袁雯
Current Assignee
CNNC Fujian Nuclear Power Co Ltd
Original Assignee
CNNC Fujian Nuclear Power Co Ltd
Priority date
Filing date
Publication date
Application filed by CNNC Fujian Nuclear Power Co Ltd filed Critical CNNC Fujian Nuclear Power Co Ltd
Priority to CN202110533209.4A priority Critical patent/CN113704829B/en
Publication of CN113704829A publication Critical patent/CN113704829A/en
Application granted granted Critical
Publication of CN113704829B publication Critical patent/CN113704829B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/64Protecting data integrity, e.g. using checksums, certificates or signatures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The invention relates to the technical field of deep learning, and in particular to an audio-visual file tamper-proof method comprising five steps designed to prevent audio-visual file tampering events. The method, based on deep learning technology, effectively addresses the strong association between the video file and the audio file within an audio-visual file; the core frame extraction technique effectively improves the efficiency of video feature extraction; and the audio-visual file features are finally solidified based on blockchain technology, effectively preventing audio-visual file tampering events.

Description

Method for preventing sound image file from being tampered
Technical Field
The invention relates to the technical field of deep learning, in particular to a tamper-proof method for an audio-video file.
Background
With the growing trend toward digital file management, increasing attention is being paid to the tamper resistance of audio-visual files. Existing tamper-proof techniques typically separate an audio-visual file into an independent video file and an independent audio file, and apply a corresponding tamper-proof technique to each.
However, such separation cannot handle the strong correlation between the video file and the audio file. Therefore, a tamper-proof method for audio-visual files is provided to solve the above-mentioned problems in the background art.
Disclosure of Invention
The invention aims to provide a tamper-proof method for an audio-visual file so as to solve the problems set forth in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions: a tamper-proof method for an audio-visual file comprises the following steps:
Step 1: acquiring an audio-visual file dataset F_t and, according to whether each file contains audio data, dividing the audio-visual files into a video set V1 with an associated audio set A1 (files containing audio) and a video set V2 (files without audio);
step 2: extracting a core frame of the video set by adopting a core frame extraction algorithm;
Step 3: extracting video features and text features based on a feature extraction network, combining the text features and the video features to obtain sound image file features, reconstructing the features based on a decoding network, and constructing a reconstruction loss function;
Step 4: formulating a reference sound image file set, acquiring reference sound image file characteristics based on a characteristic extraction network, quantizing the sound image file characteristics and the reference sound image file characteristics, constructing a quantization loss function, constructing a model joint loss function by combining the quantization loss function and the reconstruction loss function, minimizing the model loss function until the model converges, and finally extracting to obtain sound image file quantization characteristics;
Step 5: generating a key from the extracted audio-visual file quantization features by using the MD5 hash algorithm, and solidifying the key based on blockchain technology, thereby effectively preventing audio-visual file tampering events.
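For illustration only, the following minimal sketch shows how the key generation of Step 5 might look in practice, assuming the quantized audio-visual file feature is available as an integer vector; the function name generate_key and the byte serialization are illustrative assumptions, and anchoring the resulting digest on a blockchain is outside the sketch.

import hashlib
import numpy as np

def generate_key(quantized_features: np.ndarray) -> str:
    """Derive an MD5 digest (the 'key') from the quantized audio-visual file features.
    The serialization to bytes is an illustrative assumption; any fixed,
    reproducible encoding of the quantized feature vector would work."""
    payload = quantized_features.astype(np.int32).tobytes()
    return hashlib.md5(payload).hexdigest()

# Example: a toy quantized feature vector stands in for the model output q_t.
q_t = np.array([3, 0, 7, 1, 5], dtype=np.int32)
key = generate_key(q_t)
print(key)  # 32-character hex digest to be solidified on the blockchain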
As a further aspect of the present invention: in step 2, based on the video sets V1 and V2, frame image datasets F_1 = [f_1^1, f_1^2, ..., f_1^t] and F_2 = [f_2^1, f_2^2, ..., f_2^t] are acquired, wherein f_1^i, i ∈ [1, t], represents the i-th frame image of video set V1 and f_2^i, i ∈ [1, t], represents the i-th frame image of video set V2; in order to preserve the correlation between frame images while removing the temporal redundancy between them, the core frames of the frame image datasets are extracted in the following specific steps:
A1: calculating the cumulative histogram of each frame image as follows:
h_i = (1/N) · Σ_{k=0}^{i} n_k, i = 0, 1, ..., l−1
wherein h_i represents the cumulative histogram value at the i-th pixel level, l is the number of pixel-value levels of the image, N is the total number of image pixels, and n_k represents the number of pixels with pixel value k;
A2: based on the cumulative histograms, performing scene segmentation of the video file using the difference between adjacent frames and the difference between the current frame and the initial frame of the scene, with the formula:
d_j = τ_1·d_j′ + τ_2·d_j″
wherein d_j′ represents the difference between the j-th frame image and the (j+1)-th frame image, d_j″ represents the difference between the j-th frame image and the initial frame of the scene, d_j represents the scene boundary frame weight, and τ_1, τ_2 are weight coefficients;
A3: based on the scene boundary frame weight d_j, if d_j is larger than a prescribed threshold σ, the frame is defined as a boundary frame of a scene; based on the segmented scenes, the core frames of each scene are extracted dynamically, with the formula:
v_m = [v_m^1, v_m^2, ..., v_m^n], n = R⌈μ·q⌉
where v_m denotes the core frame set of the m-th scene, v_m^i denotes the i-th core frame of the m-th scene, n denotes the number of core frames of the scene, R⌈·⌉ denotes the upward rounding function, q denotes the total number of frames of the m-th scene, and μ denotes the frame weight of the scene; the core frame set V_1^k of the video set V1 and the core frame set V_2^k of the video set V2 are finally obtained.
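The following sketch illustrates one possible reading of steps A1 to A3, assuming 8-bit grayscale frames, an L1 distance between cumulative histograms as the frame difference, and n = R⌈μ·q⌉ core frames sampled evenly per scene; the threshold σ, the weights τ_1, τ_2 and the frame weight μ are placeholder values.

import numpy as np

def cumulative_histogram(frame: np.ndarray, levels: int = 256) -> np.ndarray:
    """A1: normalized cumulative histogram of an 8-bit grayscale frame."""
    counts = np.bincount(frame.ravel(), minlength=levels)
    return np.cumsum(counts) / frame.size

def extract_core_frames(frames, tau1=0.5, tau2=0.5, sigma=0.3, mu=0.1):
    """A2/A3: split the video into scenes by histogram differences, then
    sample core frames evenly inside each scene (illustrative sampling rule)."""
    hists = [cumulative_histogram(f) for f in frames]
    scenes, start = [], 0
    for j in range(len(frames) - 1):
        d_adj = np.abs(hists[j] - hists[j + 1]).sum()    # difference to the next frame
        d_first = np.abs(hists[j] - hists[start]).sum()  # difference to the scene's initial frame
        d_j = tau1 * d_adj + tau2 * d_first              # scene boundary frame weight
        if d_j > sigma:                                  # boundary frame found
            scenes.append((start, j + 1))
            start = j + 1
    scenes.append((start, len(frames)))

    core = []
    for s, e in scenes:
        q = e - s                                        # total frames in the scene
        n = max(1, int(np.ceil(mu * q)))                 # assumed core-frame count per scene
        idx = np.linspace(s, e - 1, n).astype(int)       # evenly spaced core frames
        core.extend(frames[i] for i in idx)
    return core

# Example with random frames standing in for a decoded video
frames = [np.random.randint(0, 256, (64, 64), dtype=np.uint8) for _ in range(40)]
print(len(extract_core_frames(frames)))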
As a further aspect of the present invention: the step 3 comprises the following specific steps:
B1: extracting the spatial similarity within each image in the core frame sets based on a convolutional neural network, with the formula:
a^l = σ(W^l · a^{l−1} + b^l)
where a^l is the output of the l-th layer of the network, and W^l and b^l are the weight and bias parameters of the l-th layer of the network model;
B2: extracting the temporal similarity between images in the core frame sets based on a frame difference timing network, with the formulas:
h′_{t−1} = σ(W_d·h_{t−1} + b_d)
k_t = f(Δt_i) * h′_{t−1}
y_t = σ(W_o·h_t)
wherein h_{t−1} represents the image feature of the previous frame; h′_{t−1} represents the part of the previous frame's feature information passed by the frame difference control gate; k_t represents the frame difference input gate, which controls the influence of the frame interval on the current frame's image feature; f(·) is the frame difference function; σ(·) and tanh(·) represent activation functions; x_t denotes the current frame image feature; r_t is the reset gate, indicating how much of the previous frame's feature information is retained in the current frame; h̃_t is the memorized state information of the current frame image; z_t represents the update gate, which determines how much of the current frame's feature information is retained; h_t represents the hidden output of the current frame image information; y_t represents the feature output of the current frame image; and W_r, W_h̃, W_z, W_o, W_d, b_d represent the frame difference timing network parameters;
finally, the video core frame sets V_1^k and V_2^k are passed through the frame difference network to obtain the low-dimensional video feature representations v̂_1^k and v̂_2^k.
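Because the gate equations of the frame difference timing network are only partly recoverable from the text, the following sketch assumes a GRU-style cell whose previous hidden state is first attenuated by a frame difference gate f(Δt) = exp(−Δt); the weight shapes, the decay function and the class name FrameDiffGRUCell are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class FrameDiffGRUCell:
    """Sketch of the frame difference timing cell: a GRU-style update whose
    previous hidden state is gated by the frame interval Δt (assumption)."""

    def __init__(self, input_dim, hidden_dim, rng=np.random.default_rng(0)):
        d = input_dim + hidden_dim
        self.Wd = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
        self.bd = np.zeros(hidden_dim)
        self.Wr = rng.normal(0, 0.1, (hidden_dim, d))
        self.Wz = rng.normal(0, 0.1, (hidden_dim, d))
        self.Wh = rng.normal(0, 0.1, (hidden_dim, d))
        self.Wo = rng.normal(0, 0.1, (hidden_dim, hidden_dim))

    def step(self, x_t, h_prev, delta_t):
        h_ctrl = sigmoid(self.Wd @ h_prev + self.bd)        # h'_{t-1}: controlled previous state
        k_t = np.exp(-delta_t) * h_ctrl                      # frame difference input gate, decay f(Δt) assumed
        xk = np.concatenate([x_t, k_t])
        r_t = sigmoid(self.Wr @ xk)                          # reset gate
        z_t = sigmoid(self.Wz @ xk)                          # update gate
        h_cand = np.tanh(self.Wh @ np.concatenate([x_t, r_t * k_t]))  # candidate state
        h_t = (1.0 - z_t) * k_t + z_t * h_cand               # hidden output of the current frame
        y_t = sigmoid(self.Wo @ h_t)                         # feature output of the current frame
        return h_t, y_t

cell = FrameDiffGRUCell(input_dim=8, hidden_dim=4)
h = np.zeros(4)
for t, dt in enumerate([1.0, 3.0, 2.0]):                    # toy core-frame intervals
    h, y = cell.step(np.random.default_rng(t).normal(size=8), h, dt)
print(y.shape)  # (4,)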
B3: for the audio set A1, converting the audio set A1 into a voice text based on the existing voice recognition technology, and preprocessing the voice text according to the core frame extraction result of the video set V1 to obtain a text set T 1 k corresponding to the core frame V 1 k;
B4: for the text set T_1^k, extracting the word vectors of the text to obtain a vector representation of the text, and extracting the correlations within the text based on a recurrent neural network to finally obtain the feature representation of the text set, with the formulas:
f′_t = σ(W′_f·[h″_{t−1}, x′_t] + b′_f)
i′_t = σ(W′_i·[h″_{t−1}, x′_t] + b′_i)
C̃′_t = tanh(W′_c·[h″_{t−1}, x′_t] + b′_c)
C′_t = f′_t * C′_{t−1} + i′_t * C̃′_t
o′_t = σ(W′_o·[h″_{t−1}, x′_t] + b′_o)
h′_t = o′_t * tanh(C′_t)
wherein x′_t represents the current text feature; σ(·) and tanh(·) represent activation functions; h″_{t−1} denotes the output of the previous text feature; f′_t denotes the forget gate of the recurrent neural network, indicating whether the previous text feature information is retained for the current text; i′_t and C′_t form the input gate of the recurrent neural network, indicating how much of the current text feature is input into the model; o′_t and h′_t form the output gate of the recurrent neural network, representing the current text feature output; and W′_f, W′_i, W′_c, W′_o and b′_f, b′_i, b′_c, b′_o are model parameters;
finally, the text set T_1^k is passed through the recurrent neural network to obtain the low-dimensional text feature representation t̂_1^k.
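A possible realization of step B4 is sketched below, assuming the word vectors of a core-frame text are already available and that the gates follow the standard LSTM form named by the parameters above; the dimensions and initialization are illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(word_vectors, hidden_dim=16, rng=np.random.default_rng(1)):
    """Sketch of step B4: run standard LSTM gates over the word vectors of a
    core-frame text and return the final hidden state as the text feature."""
    input_dim = word_vectors.shape[1]
    d = input_dim + hidden_dim
    Wf, Wi, Wc, Wo = (rng.normal(0, 0.1, (hidden_dim, d)) for _ in range(4))
    bf = bi = bc = bo = np.zeros(hidden_dim)
    h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
    for x in word_vectors:
        xh = np.concatenate([h, x])
        f = sigmoid(Wf @ xh + bf)            # forget gate: retain previous text information?
        i = sigmoid(Wi @ xh + bi)            # input gate
        C_tilde = np.tanh(Wc @ xh + bc)      # candidate cell state
        C = f * C + i * C_tilde              # updated cell state
        o = sigmoid(Wo @ xh + bo)            # output gate
        h = o * np.tanh(C)                   # current text feature output
    return h

# Toy example: 5 "words" with 32-dimensional word vectors assumed available
text_feature = lstm_encode(np.random.default_rng(2).normal(size=(5, 32)))
print(text_feature.shape)  # (16,)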
B5: reconstructing the low-dimensional video features v̂_1^k and v̂_2^k obtained from the frame difference timing network and the convolutional neural network, with the formulas:
r_1^k = h′(v̂_1^k; θ), r_2^k = h′(v̂_2^k; θ)
wherein h′(·) represents the reconstruction network model, whose form is as shown in steps B1 and B2, θ represents the reconstruction network parameters, r_1^k represents the feature reconstructed from the video feature v̂_1^k, and r_2^k represents the feature reconstructed from the video feature v̂_2^k;
B6: reconstructing the low-dimensional text feature t̂_1^k obtained from the recurrent neural network so that it approaches the text set feature T_1^k, with the formula:
r_1^k′ = h″(t̂_1^k; η)
wherein h″(·) represents the reconstruction network model, whose form is as shown in step B4, η represents the reconstruction network parameters, and r_1^k′ represents the reconstructed text feature.
As a further aspect of the present invention: in step B5, corresponding loss functions are constructed based on the reconstructed features, with the formulas:
l′_re = (r_1^k − v_1^k)², l″_re = (r_2^k − v_2^k)²
wherein l′_re and l″_re represent the reconstruction losses of the video sets V1 and V2, and v_1^k and v_2^k represent the original core frame features of the video sets V1 and V2, respectively.
As a further aspect of the present invention: step B6 also comprises constructing the corresponding reconstruction loss, with the formula:
l′_d = (r_1^k′ − T_1^k)²
wherein l′_d denotes the reconstruction loss of the text feature.
As a further aspect of the present invention: step 4 comprises the following specific steps:
C1: based on the compressed features v̂_1^k and t̂_1^k extracted by the model, the audio-visual file feature representation is obtained, with the formula:
f^k = [β_1·v̂_1^k, β_2·t̂_1^k], β_1 + β_2 = 1
wherein β_1 and β_2 denote the weighting parameters of the features, [β_1·v̂_1^k, β_2·t̂_1^k] represents the combination of the video feature and the audio-derived text feature, [·] represents the dimensional splicing of the features, and f^k represents the audio-visual file feature extracted by the model;
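The dimensional splicing of step C1 could be realized as follows, assuming the weighting is applied before concatenation; the function name fuse_features and the feature dimensions are illustrative.

import numpy as np

def fuse_features(video_feat, text_feat, beta1=0.5, beta2=0.5):
    """Sketch of step C1: weight the video and text features and splice them
    along the feature dimension to form the audio-visual file feature f^k.
    The weighting-before-concatenation order is an assumption; β1 + β2 = 1."""
    assert abs(beta1 + beta2 - 1.0) < 1e-9
    return np.concatenate([beta1 * video_feat, beta2 * text_feat])

f_k = fuse_features(np.ones(4), np.zeros(16))
print(f_k.shape)  # (20,) dimensional splice of the two branches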
C2: quantizing the extracted feature f^k of the original audio-visual file and the extracted feature f_d^k of the reference audio-visual file, and constructing the corresponding loss function, with the formulas:
q_t = Q[f^k]
q_d = Q[f_d^k]
l_f = (q_t − q_d)²
L_all = λ_1·(l′_re + l′_d) + λ_2·l″_re − l_f, λ_1 + λ_2 = 1, λ_1, λ_2 > 0
wherein Q[·] represents the feature quantization function, q_t and q_d represent the quantization features of f^k and f_d^k respectively, l_f represents the quantization loss function, and λ_1, λ_2 represent the weight coefficients of the loss function; the model joint loss function is minimized based on the Adam optimization algorithm until the model converges, and the quantization feature q_t of the audio-visual file is finally extracted.
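The joint loss of step C2 is illustrated below, assuming squared-error forms for the individual terms and a simple uniform quantizer standing in for Q[·]; the quantization step and the λ values are placeholders.

import numpy as np

def quantize(features: np.ndarray, step: float = 0.25) -> np.ndarray:
    """Q[·]: a simple uniform quantizer stands in for the (unspecified)
    feature quantization function of step C2."""
    return np.round(features / step)

def joint_loss(r1, v1, r2, v2, rd, t1, f_k, f_k_ref, lam1=0.6, lam2=0.4):
    """Joint loss as stated in the patent: L_all = λ1(l're + l'd) + λ2·l''re − l_f,
    with λ1 + λ2 = 1. Squared-error forms of the individual terms are assumed."""
    l_re1 = np.mean((r1 - v1) ** 2)              # reconstruction loss, video set V1
    l_re2 = np.mean((r2 - v2) ** 2)              # reconstruction loss, video set V2
    l_d = np.mean((rd - t1) ** 2)                # reconstruction loss, text feature
    q_t, q_d = quantize(f_k), quantize(f_k_ref)
    l_f = np.mean((q_t - q_d) ** 2)              # quantization loss between original and reference
    return lam1 * (l_re1 + l_d) + lam2 * l_re2 - l_f

rng = np.random.default_rng(3)
args = [rng.normal(size=32) for _ in range(8)]
print(joint_loss(*args))  # scalar to be minimized (e.g., with Adam) until convergence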
As a further aspect of the present invention: in step C1, because of the uncertainty of deep learning, and in order to avoid nearly identical audio-visual files producing the same features, a corresponding perturbation is added to the original audio-visual file to generate a reference audio-visual dataset F_d, and the features of the perturbed audio-visual dataset are extracted with the audio-visual file feature extraction network described above, with the formula:
f_d^k = κ(F_d; ψ)
wherein f_d^k represents the interference features generated from the reference audio-visual dataset, κ(·) represents the network feature extraction model, whose structure is the same as the feature extraction network, and ψ represents the network model parameters.
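One way to generate the reference (perturbed) dataset F_d is sketched below, assuming a small Gaussian perturbation added to each original frame; the noise model and its scale are illustrative assumptions.

import numpy as np

def make_reference_dataset(frames, noise_scale=2.0, rng=np.random.default_rng(4)):
    """Sketch of building the reference set F_d: add a small perturbation to each
    original frame so that near-identical files do not map to identical features.
    The Gaussian perturbation and its scale are illustrative assumptions."""
    perturbed = []
    for f in frames:
        noise = rng.normal(0.0, noise_scale, size=f.shape)
        perturbed.append(np.clip(f.astype(float) + noise, 0, 255).astype(np.uint8))
    return perturbed

original = [np.random.randint(0, 256, (64, 64), dtype=np.uint8) for _ in range(3)]
reference = make_reference_dataset(original)
print(len(reference), reference[0].dtype)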
Compared with the prior art, the invention has the following beneficial effects: the invention provides an audio-visual file tamper-proof method based on deep learning technology that effectively addresses the strong association between the video file and the audio file within an audio-visual file; in addition, the core frame extraction technique effectively improves the efficiency of video feature extraction; finally, the audio-visual file features are solidified based on blockchain technology, effectively preventing audio-visual file tampering events.
Drawings
Fig. 1 is a schematic flow chart of a tamper-proof method for an audio/video file.
Detailed Description
Referring to fig. 1, in an embodiment of the present invention, a tamper-proof method for an audio/video file includes the following steps:
Step 1: acquiring an audio-visual file dataset F_t and, according to whether each file contains audio data, dividing the audio-visual files into a video set V1 with an associated audio set A1 (files containing audio) and a video set V2 (files without audio);
step 2: extracting a core frame of the video set by adopting a core frame extraction algorithm;
Step 3: extracting video features and text features based on a feature extraction network, combining the text features and the video features to obtain sound image file features, reconstructing the features based on a decoding network, and constructing a reconstruction loss function;
Step 4: formulating a reference sound image file set, acquiring reference sound image file characteristics based on a characteristic extraction network, quantizing the sound image file characteristics and the reference sound image file characteristics, constructing a quantization loss function, constructing a model joint loss function by combining the quantization loss function and the reconstruction loss function, minimizing the model loss function until the model converges, and finally extracting to obtain sound image file quantization characteristics;
Step 5: generating a key from the extracted audio-visual file quantization features by using the MD5 hash algorithm, and solidifying the key based on blockchain technology, thereby effectively preventing audio-visual file tampering events.
Further, in step 2, based on the video sets V1 and V2, frame image datasets F_1 = [f_1^1, f_1^2, ..., f_1^t] and F_2 = [f_2^1, f_2^2, ..., f_2^t] are acquired, wherein f_1^i, i ∈ [1, t], represents the i-th frame image of video set V1 and f_2^i, i ∈ [1, t], represents the i-th frame image of video set V2; the core frames of the frame image datasets are extracted in the following specific steps:
A1: calculating the cumulative histogram of each frame image as follows:
h_i = (1/N) · Σ_{k=0}^{i} n_k, i = 0, 1, ..., l−1
wherein h_i represents the cumulative histogram value at the i-th pixel level, l is the number of pixel-value levels of the image, N is the total number of image pixels, and n_k represents the number of pixels with pixel value k;
A2: based on the cumulative histograms, performing scene segmentation of the video file using the difference between adjacent frames and the difference between the current frame and the initial frame of the scene, with the formula:
d_j = τ_1·d_j′ + τ_2·d_j″
wherein d_j′ represents the difference between the j-th frame image and the (j+1)-th frame image, d_j″ represents the difference between the j-th frame image and the initial frame of the scene, d_j represents the scene boundary frame weight, and τ_1, τ_2 are weight coefficients;
A3: based on the scene boundary frame weight d_j, if d_j is larger than a prescribed threshold σ, the frame is defined as a boundary frame of a scene; based on the segmented scenes, the core frames of each scene are extracted dynamically, with the formula:
v_m = [v_m^1, v_m^2, ..., v_m^n], n = R⌈μ·q⌉
where v_m denotes the core frame set of the m-th scene, v_m^i denotes the i-th core frame of the m-th scene, n denotes the number of core frames of the scene, R⌈·⌉ denotes the upward rounding function, q denotes the total number of frames of the m-th scene, and μ denotes the frame weight of the scene; the core frame set V_1^k of the video set V1 and the core frame set V_2^k of the video set V2 are finally obtained.
Further, the step 3 includes the following specific steps:
B1: extracting the spatial similarity within each image in the core frame sets based on a convolutional neural network, with the formula:
a^l = σ(W^l · a^{l−1} + b^l)
where a^l is the output of the l-th layer of the network, and W^l and b^l are the weight and bias parameters of the l-th layer of the network model;
B2: extracting the temporal similarity between images in the core frame sets based on a frame difference timing network, with the formulas:
h′_{t−1} = σ(W_d·h_{t−1} + b_d)
k_t = f(Δt_i) * h′_{t−1}
y_t = σ(W_o·h_t)
wherein h_{t−1} represents the image feature of the previous frame; h′_{t−1} represents the part of the previous frame's feature information passed by the frame difference control gate; k_t represents the frame difference input gate, which controls the influence of the frame interval on the current frame's image feature; f(·) is the frame difference function; σ(·) and tanh(·) represent activation functions; x_t denotes the current frame image feature; r_t is the reset gate, indicating how much of the previous frame's feature information is retained in the current frame; h̃_t is the memorized state information of the current frame image; z_t represents the update gate, which determines how much of the current frame's feature information is retained; h_t represents the hidden output of the current frame image information; y_t represents the feature output of the current frame image; and W_r, W_h̃, W_z, W_o, W_d, b_d represent the frame difference timing network parameters;
finally, the video core frame sets V_1^k and V_2^k are passed through the frame difference network to obtain the low-dimensional video feature representations v̂_1^k and v̂_2^k.
B3: for the audio set A1, converting the audio set A1 into a voice text based on the existing voice recognition technology, and preprocessing the voice text according to the core frame extraction result of the video set V1 to obtain a text set T 1 k corresponding to the core frame V 1 k;
B4: for the text set T_1^k, extracting the word vectors of the text to obtain a vector representation of the text, and extracting the correlations within the text based on a recurrent neural network to finally obtain the feature representation of the text set, with the formulas:
f′_t = σ(W′_f·[h″_{t−1}, x′_t] + b′_f)
i′_t = σ(W′_i·[h″_{t−1}, x′_t] + b′_i)
C̃′_t = tanh(W′_c·[h″_{t−1}, x′_t] + b′_c)
C′_t = f′_t * C′_{t−1} + i′_t * C̃′_t
o′_t = σ(W′_o·[h″_{t−1}, x′_t] + b′_o)
h′_t = o′_t * tanh(C′_t)
wherein x′_t represents the current text feature; σ(·) and tanh(·) represent activation functions; h″_{t−1} denotes the output of the previous text feature; f′_t denotes the forget gate of the recurrent neural network, indicating whether the previous text feature information is retained for the current text; i′_t and C′_t form the input gate of the recurrent neural network, indicating how much of the current text feature is input into the model; o′_t and h′_t form the output gate of the recurrent neural network, representing the current text feature output; and W′_f, W′_i, W′_c, W′_o and b′_f, b′_i, b′_c, b′_o are model parameters;
finally, the text set T_1^k is passed through the recurrent neural network to obtain the low-dimensional text feature representation t̂_1^k.
B5: reconstructing the low-dimensional video features v̂_1^k and v̂_2^k obtained from the frame difference timing network and the convolutional neural network, with the formulas:
r_1^k = h′(v̂_1^k; θ), r_2^k = h′(v̂_2^k; θ)
wherein h′(·) represents the reconstruction network model, whose form is as shown in steps B1 and B2, θ represents the reconstruction network parameters, r_1^k represents the feature reconstructed from the video feature v̂_1^k, and r_2^k represents the feature reconstructed from the video feature v̂_2^k;
B6: reconstructing the low-dimensional text feature t̂_1^k obtained from the recurrent neural network so that it approaches the text set feature T_1^k, with the formula:
r_1^k′ = h″(t̂_1^k; η)
wherein h″(·) represents the reconstruction network model, whose form is as shown in step B4, η represents the reconstruction network parameters, and r_1^k′ represents the reconstructed text feature.
Further, in step B5, corresponding loss functions are constructed based on the reconstructed features, with the formulas:
l′_re = (r_1^k − v_1^k)², l″_re = (r_2^k − v_2^k)²
wherein l′_re and l″_re represent the reconstruction losses of the video sets V1 and V2, and v_1^k and v_2^k represent the original core frame features of the video sets V1 and V2, respectively.
Further, step B6 also comprises constructing the corresponding reconstruction loss, with the formula:
l′_d = (r_1^k′ − T_1^k)²
wherein l′_d denotes the reconstruction loss of the text feature.
Further, the step 4 comprises the following specific steps:
C1: based on the compressed features v̂_1^k and t̂_1^k extracted by the model, the audio-visual file feature representation is obtained, with the formula:
f^k = [β_1·v̂_1^k, β_2·t̂_1^k], β_1 + β_2 = 1
wherein β_1 and β_2 denote the weighting parameters of the features, [β_1·v̂_1^k, β_2·t̂_1^k] represents the combination of the video feature and the audio-derived text feature, [·] represents the dimensional splicing of the features, and f^k represents the audio-visual file feature extracted by the model;
C2: quantizing the extracted feature f^k of the original audio-visual file and the extracted feature f_d^k of the reference audio-visual file, and constructing the corresponding loss function, with the formulas:
q_t = Q[f^k]
q_d = Q[f_d^k]
l_f = (q_t − q_d)²
L_all = λ_1·(l′_re + l′_d) + λ_2·l″_re − l_f, λ_1 + λ_2 = 1, λ_1, λ_2 > 0
wherein Q[·] represents the feature quantization function, q_t and q_d represent the quantization features of f^k and f_d^k respectively, l_f represents the quantization loss function, and λ_1, λ_2 represent the weight coefficients of the loss function; the model joint loss function is minimized based on the Adam optimization algorithm until the model converges, and the quantization feature q_t of the audio-visual file is finally extracted.
Further, in step C1, because of the uncertainty of deep learning, and in order to avoid nearly identical audio-visual files producing the same features, a corresponding perturbation is added to the original audio-visual file to generate a reference audio-visual dataset F_d, and the features of the perturbed audio-visual dataset are extracted with the audio-visual file feature extraction network described above, with the formula:
f_d^k = κ(F_d; ψ)
wherein f_d^k represents the interference features generated from the reference audio-visual dataset, κ(·) represents the network feature extraction model, whose structure is the same as the feature extraction network, and ψ represents the network model parameters.
To sum up: the invention provides an audio-visual file tamper-proof method based on deep learning technology that effectively addresses the strong association between the video file and the audio file within an audio-visual file; in addition, the core frame extraction technique effectively improves the efficiency of video feature extraction; finally, the audio-visual file features are solidified based on blockchain technology, effectively preventing audio-visual file tampering events.
The foregoing description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical solution of the present invention and its inventive concept, within the scope disclosed herein, shall fall within the scope of protection of the present invention.

Claims (7)

1. A tamper-proof method for an audio-visual file is characterized by comprising the following steps:
Step 1: acquiring a sound image file data set F t, and dividing the sound image file into a video set V1, an audio set A1 and a video set V2 according to whether the sound image file contains audio data or not;
Step 2: extracting core frames of the video sets V1 and V2 by adopting a core frame extraction algorithm to obtain a core frame set V 1 k of the video set V1 and a core frame set V 2 k of the video set V2;
Step 3: extracting video features and text features based on a feature extraction network, combining the text features and the video features to obtain original sound image file features, and respectively reconstructing the video features and the text features based on a decoding network to construct a reconstruction loss function;
The extracting video features and text features based on the feature extraction network comprises the following steps:
the core frame sets V_1^k and V_2^k are processed to obtain the video features;
converting the audio set A1 into a speech text, and preprocessing the speech text according to the core frame extraction result of the video set V1 to obtain a text set T_1^k corresponding to the core frame set V_1^k; the text set T_1^k is processed to obtain the text features;
Wherein constructing the reconstruction loss function comprises:
constructing a reconstruction loss function l′_re corresponding to the video set V1, a reconstruction loss function l″_re corresponding to the video set V2, and a reconstruction loss function l′_d corresponding to the text features;
Step 4: formulating a reference sound image file set, acquiring reference sound image file characteristics based on a characteristic extraction network, quantizing the original sound image file characteristics and the reference sound image file characteristics, constructing a quantized loss function, constructing a model joint loss function by combining the quantized loss function and the reconstructed loss function, minimizing the model joint loss function until the model converges, and finally extracting to obtain sound image file quantized characteristics;
Wherein, the making the reference sound image file set comprises:
adding corresponding disturbance quantity to an original sound image file to generate a reference sound image file set;
The method for constructing the model joint loss function by quantizing the original sound image file characteristics and the reference sound image file characteristics and combining the quantized loss function and the reconstructed loss function comprises the following steps:
Obtaining a quantized feature q t corresponding to the original sound image file feature and a quantized feature q d corresponding to the reference sound image file feature through a feature quantization function;
Constructing a quantization loss function l f according to the quantization characteristics q t and q d;
constructing a model joint loss function according to l′_re, l″_re, l′_d and l_f;
Step 5: generating a key from the extracted audio-visual file quantization features by using the MD5 hash algorithm, and solidifying the key based on blockchain technology, thereby effectively preventing audio-visual file tampering events.
2. The method of claim 1, wherein in step 2, based on the video sets V1 and V2, frame image datasets F_1 = [f_1^1, f_1^2, ..., f_1^t] and F_2 = [f_2^1, f_2^2, ..., f_2^t] are obtained, wherein f_1^i, i ∈ [1, t], represents the i-th frame image of video set V1 and f_2^i, i ∈ [1, t], represents the i-th frame image of video set V2;
In order to preserve the correlation between frame images while removing temporal redundancy between frame images, a core frame of a frame image dataset is extracted as follows:
A1: the cumulative histogram of each frame image is calculated as follows:
h_i = (1/N) · Σ_{k=0}^{i} n_k, i = 0, 1, ..., l−1
wherein h_i represents the cumulative histogram value at the i-th pixel level, l is the number of pixel-value levels of the image, N is the total number of image pixels, and n_k represents the number of pixels with pixel value k;
A2: based on the cumulative histogram, the video file is subjected to scene segmentation by the difference between adjacent frames and the difference between the adjacent frames and the scene initial frame, and the formula is as follows:
d_j = τ_1·d_j′ + τ_2·d_j″
Wherein d j′ represents the difference between the j-th frame image and the j+1st frame image, d j″ represents the difference between the j-th frame image and the first frame image, d j represents the scene boundary frame weight, and τ 1、τ2 is a weight coefficient;
A3: based on the scene boundary frame weight d_j, if d_j is greater than a prescribed threshold σ, the frame is defined as a boundary frame of a scene; based on the segmented scenes, the core frames of each scene are extracted dynamically, with the formula:
v_m = [v_m^1, v_m^2, ..., v_m^n], n = R⌈μ·q⌉
where v_m denotes the core frame set of the m-th scene, v_m^i denotes the i-th core frame of the m-th scene, n denotes the number of core frames of the scene, R⌈·⌉ denotes the upward rounding function, q denotes the total number of frames of the m-th scene, and μ denotes the frame weight of the scene; the core frame set V_1^k of the video set V1 and the core frame set V_2^k of the video set V2 are finally obtained.
3. The method for preventing tampering of an audio-visual file according to claim 1, wherein the step 3 comprises the following specific steps:
B1: extracting the spatial similarity within each image in the core frame sets based on a convolutional neural network, with the formula:
a^l = σ(W^l · a^{l−1} + b^l)
where a^l is the output of the l-th layer of the network, and W^l and b^l are the weight and bias parameters of the l-th layer of the network;
B2: the temporal similarity between images in the core frame sets is extracted based on a frame difference timing network, with the formulas:
h′_{t−1} = σ(W_d·h_{t−1} + b_d)
k_t = f(Δt_i) * h′_{t−1}
y_t = σ(W_o·h_t)
wherein h_{t−1} represents the image feature of the previous frame; h′_{t−1} represents the part of the previous frame's feature information passed by the frame difference control gate; k_t represents the frame difference input gate, which controls the influence of the frame interval on the current frame's image feature; f(·) is the frame difference function; σ(·) and tanh(·) represent activation functions; x_t denotes the current frame image feature; r_t is the reset gate, indicating how much of the previous frame's feature information is retained in the current frame; h̃_t is the memorized state information of the current frame image; z_t represents the update gate, which determines how much of the current frame's feature information is retained; h_t represents the hidden output of the current frame image information; y_t represents the feature output of the current frame image; and W_r, W_h̃, W_z, W_o, W_d, b_d represent the frame difference timing network parameters;
finally, the video core frame sets V_1^k and V_2^k pass through the frame difference network to obtain the low-dimensional video features v̂_1^k and v̂_2^k;
B3: for the audio set A1, converting the audio set A1 into a voice text based on the existing voice recognition technology, and preprocessing the voice text according to the core frame extraction result of the video set V1 to obtain a text set T 1 k corresponding to the core frame set V 1 k;
B4: for the text set T_1^k, extracting the word vectors of the text to obtain a vector representation of the text, and extracting the correlations within the text based on a recurrent neural network to finally obtain the feature representation of the text set, with the formulas:
f′_t = σ(W′_f·[h″_{t−1}, x′_t] + b′_f)
i′_t = σ(W′_i·[h″_{t−1}, x′_t] + b′_i)
C̃′_t = tanh(W′_c·[h″_{t−1}, x′_t] + b′_c)
C′_t = f′_t * C′_{t−1} + i′_t * C̃′_t
o′_t = σ(W′_o·[h″_{t−1}, x′_t] + b′_o)
h′_t = o′_t * tanh(C′_t)
wherein x′_t represents the current text feature; σ(·) and tanh(·) represent activation functions; h″_{t−1} denotes the output of the previous text feature; f′_t denotes the forget gate of the recurrent neural network, indicating whether the previous text feature information is retained for the current text; i′_t and C′_t form the input gate of the recurrent neural network, indicating how much of the current text feature is input into the model; o′_t and h′_t form the output gate of the recurrent neural network, representing the current text feature output; and W′_f, W′_i, W′_c, W′_o and b′_f, b′_i, b′_c, b′_o are model parameters;
finally, the text set T_1^k passes through the recurrent neural network to obtain the low-dimensional text feature t̂_1^k;
B5: reconstructing the low-dimensional video features v̂_1^k and v̂_2^k based on the frame difference timing network and the convolutional neural network, with the formulas:
r_1^k = h′(v̂_1^k; θ), r_2^k = h′(v̂_2^k; θ)
where h′(·) represents the reconstruction network model, θ represents the reconstruction network parameters, r_1^k represents the feature reconstructed from the video feature v̂_1^k, and r_2^k represents the feature reconstructed from the video feature v̂_2^k;
B6: reconstructing the text feature t̂_1^k based on the recurrent neural network so that it approaches the text set T_1^k, with the formula:
r_1^k′ = h″(t̂_1^k; η)
where h″(·) represents the reconstruction network model, η represents the reconstruction network parameters, and r_1^k′ represents the reconstructed text feature.
4. The audio-visual file tamper-proofing method according to claim 3, wherein in step B5, corresponding reconstruction loss functions are constructed based on the reconstructed features, with the formulas:
l′_re = (r_1^k − v_1^k)², l″_re = (r_2^k − v_2^k)²
wherein l′_re and l″_re represent the reconstruction loss functions of the video sets V1 and V2, and v_1^k and v_2^k represent the original core frame features of the video sets V1 and V2, respectively.
5. The audio-visual file tamper-proofing method according to claim 3, wherein step B6 further comprises constructing a corresponding reconstruction loss function, with the formula:
l′_d = (r_1^k′ − T_1^k)²
wherein l′_d denotes the reconstruction loss function of the text feature.
6. The method of claim 1, wherein step 3 comprises the following steps:
C1: based on the low-dimensional video feature v̂_1^k and the low-dimensional text feature t̂_1^k, the original audio-visual file feature is obtained, with the formula:
f^k = [β_1·v̂_1^k, β_2·t̂_1^k], β_1 + β_2 = 1
wherein β_1 and β_2 represent weight parameters, [β_1·v̂_1^k, β_2·t̂_1^k] represents the combination of the text feature and the video feature, [·] represents the dimensional splicing of the features, and f^k represents the original audio-visual file feature;
Step 4 comprises the following specific steps:
C2: quantizing the original audio-visual file feature f^k and the reference audio-visual file feature f_d^k, and constructing the corresponding quantization loss function, with the formulas:
q_t = Q[f^k]
q_d = Q[f_d^k]
l_f = (q_t − q_d)²
L_all = λ_1·(l′_re + l′_d) + λ_2·l″_re − l_f
λ_1 + λ_2 = 1, λ_1, λ_2 > 0
wherein Q[·] represents the feature quantization function, q_t and q_d represent the quantization features of f^k and f_d^k respectively, l_f represents the quantization loss function, and λ_1, λ_2 represent the weight coefficients; the model joint loss function is minimized based on the Adam optimization algorithm until the model converges, and the quantization feature of the audio-visual file is finally obtained.
7. The method of claim 6, wherein for the original audio-visual file, a corresponding perturbation is added to generate a reference audio-visual file set F_d, and the features of the reference audio-visual file set are extracted with the feature extraction network, with the formula:
f_d^k = κ(F_d; ψ)
wherein f_d^k represents the features of the reference audio-visual file, κ(·) represents the network feature extraction model, whose structure is the same as the feature extraction network, and ψ represents the network feature extraction model parameters.
CN202110533209.4A 2021-05-19 2021-05-19 Method for preventing sound image file from being tampered Active CN113704829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110533209.4A CN113704829B (en) 2021-05-19 2021-05-19 Method for preventing sound image file from being tampered

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110533209.4A CN113704829B (en) 2021-05-19 2021-05-19 Method for preventing sound image file from being tampered

Publications (2)

Publication Number Publication Date
CN113704829A CN113704829A (en) 2021-11-26
CN113704829B true CN113704829B (en) 2024-06-11

Family

ID=78648176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110533209.4A Active CN113704829B (en) 2021-05-19 2021-05-19 Method for preventing sound image file from being tampered

Country Status (1)

Country Link
CN (1) CN113704829B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110880172A (en) * 2019-11-12 2020-03-13 中山大学 Video face tampering detection method and system based on cyclic convolution neural network
CN111479112A (en) * 2020-06-23 2020-07-31 腾讯科技(深圳)有限公司 Video coding method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10932009B2 (en) * 2019-03-08 2021-02-23 Fcb Worldwide, Inc. Technologies for analyzing and searching for features in image data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110880172A (en) * 2019-11-12 2020-03-13 中山大学 Video face tampering detection method and system based on cyclic convolution neural network
CN111479112A (en) * 2020-06-23 2020-07-31 腾讯科技(深圳)有限公司 Video coding method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-bit reversible data hiding algorithm in encrypted domain based on R-LWE; Ke Yan; Zhang Minqing; Su Tingting; Journal of Computer Research and Development (Issue 10); 178-193 *

Also Published As

Publication number Publication date
CN113704829A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
Amerini et al. Exploiting prediction error inconsistencies through LSTM-based classifiers to detect deepfake videos
CN108960063B (en) Multi-event natural language description method in video facing event relation coding
CN109948721B (en) Video scene classification method based on video description
CN108287904A (en) A kind of document context perception recommendation method decomposed based on socialization convolution matrix
CN113627266B (en) Video pedestrian re-recognition method based on transform space-time modeling
CN114723760B (en) Portrait segmentation model training method and device and portrait segmentation method and device
CN112464179A (en) Short video copyright storage algorithm based on block chain and expression recognition
CN114724060A (en) Method and device for unsupervised video anomaly detection based on mask self-encoder
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN112804558A (en) Video splitting method, device and equipment
CN113421185B (en) StyleGAN-based mobile terminal face age editing method
CN115037926A (en) Video quality evaluation method, device, equipment and medium thereof
CN115393949A (en) Continuous sign language recognition method and device
CN113420179B (en) Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution
CN112528077B (en) Video face retrieval method and system based on video embedding
CN117115718B (en) Government affair video data processing method, system and computer readable storage medium
CN113704829B (en) Method for preventing sound image file from being tampered
Dastbaravardeh et al. Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames
Li et al. Face Recognition Based on the Combination of Enhanced Local Texture Feature and DBN under Complex Illumination Conditions.
CN113554569B (en) Face image restoration system based on double memory dictionaries
CN112131429A (en) Video classification method and system based on depth prediction coding network
CN112258707A (en) Intelligent access control system based on face recognition
CN113689527A (en) Training method of face conversion model and face image conversion method
CN116883900A (en) Video authenticity identification method and system based on multidimensional biological characteristics
CN116704433A (en) Self-supervision group behavior recognition method based on context-aware relationship predictive coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant