CN105989000A - Audio/video (AV) copy detection method and device - Google Patents
- Publication number: CN105989000A
- Application number: CN201510041044.3A
- Authority
- CN
- China
- Prior art keywords
- video
- audio
- frame
- audiovisual presentation
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Television Signal Processing For Recording (AREA)
Abstract
The invention relates to an audio/video (AV) copy detection method and device. The AV copy detection method includes: acquiring an AV image, decoding and preprocessing the AV image, and performing feature extraction on the resulting audio part and video frames to obtain the corresponding audio features and the image features of the video frames; fusing the audio features and the image features of the video frames corresponding to the AV image to obtain AV fusion features; matching the AV fusion features against a preset feature library of a reference video to obtain a frame set matching result; and performing copy judgment and localization on the AV image based on the frame set matching result and the reference video. By combining audio and video, the robustness of the video copy detection system is enhanced; by fusing the audio features and the video features, the execution efficiency of the copy detection system is greatly improved; and by analyzing audio and video jointly, the localization accuracy of copied segments is improved.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to an audio/video copy detection method and device.
Background art
When performing copy detection on video images, existing schemes mainly use content-based video copy detection methods. The two prevailing approaches are video copy detection based on image features of video key frames, and video copy detection that combines audio and video feature detection results.
In the key-frame-based scheme, the main process includes: video decoding and preprocessing, video image feature extraction, feature indexing and retrieval, and copy judgment and localization. The scheme ultimately judges whether a query video constitutes a copy and, for videos judged to be copies, determines the start and end of the copied segment so that the segment can be marked. However, this approach does not incorporate audio information into the copy detection scheme, even though audio is an important complement to the visual content of a video. It therefore not only weakens the robustness of the video copy detection system, but also yields poor localization accuracy for copied segments, particularly when the video picture changes little.
Compared with the key-frame-based scheme, the scheme that combines audio and video feature detection results includes audio features, and can therefore take full advantage of the fact that audio queries are fast and relatively accurate. However, because audio and video features differ substantially, existing copy detection schemes usually run copy detection on audio and video separately and fuse only at the result level to judge whether the query video is a copy. Fusing at the result level requires extracting more features and running most of them through the complete copy detection pipeline, which incurs a large time overhead and increases algorithm complexity.
Summary of the invention
Embodiments of the present invention provide an audio/video copy detection method and device, intended to improve the efficiency and precision of video copy detection.
An embodiment of the present invention proposes an audio/video copy detection method, including:
acquiring an audio/video image, and decoding and preprocessing the audio/video image to obtain its audio part and video frames;
performing feature extraction on the audio part and the video frames of the audio/video image to obtain the corresponding audio features and the image features of the video frames;
fusing the audio features and the image features of the video frames corresponding to the audio/video image to obtain the audio/video fusion features of the audio/video image;
matching the audio/video fusion features against a preset feature library of a reference video to obtain the frame set matching result of the audio/video image;
performing copy judgment and localization on the audio/video image based on the frame set matching result and the reference video.
An embodiment of the present invention also proposes an audio/video copy detection device, including:
a decoding and preprocessing module, configured to acquire an audio/video image, and to decode and preprocess the audio/video image to obtain its audio part and video frames;
a feature extraction module, configured to perform feature extraction on the audio part and the video frames of the audio/video image to obtain the corresponding audio features and the image features of the video frames;
a fusion module, configured to fuse the audio features and the image features of the video frames corresponding to the audio/video image to obtain the audio/video fusion features of the audio/video image;
a matching module, configured to match the audio/video fusion features against a preset feature library of a reference video to obtain the frame set matching result of the audio/video image;
a copy judgment module, configured to perform copy judgment and localization on the audio/video image based on the frame set matching result and the reference video.
In the audio/video copy detection method and device proposed by the embodiments of the present invention, an audio/video image is acquired, decoded, and preprocessed to obtain its audio part and video frames; feature extraction is performed on the audio part and the video frames to obtain the corresponding audio features and the image features of the video frames; the audio features and image features are fused to obtain audio/video fusion features; the fusion features are matched against a preset feature library of a reference video to obtain a frame set matching result; and copy judgment and localization are performed based on the frame set matching result and the reference video. By combining audio and video in this way, the robustness of the video copy detection system is improved; by fusing the audio and video features, the execution efficiency of the copy detection system is greatly accelerated; and by analyzing audio and video jointly, the localization precision for copied segments is improved.
Brief description of the drawings
Fig. 1 is a hardware architecture diagram of the audio/video copy detection device of the present invention;
Fig. 2 is a flow diagram of the first embodiment of the audio/video copy detection method of the present invention;
Fig. 3 is a flow diagram of audio subband energy difference feature extraction in an embodiment of the present invention;
Fig. 4 is a flow diagram of extracting the image DCT features of the video frames of an audio/video image in an embodiment of the present invention;
Fig. 5 is a diagram of fusing image features and audio features in an embodiment of the present invention;
Fig. 6 is an example diagram of the simhash matching algorithm involved in an embodiment of the present invention;
Fig. 7 is a design diagram of the matching algorithm involved in an embodiment of the present invention;
Fig. 8 is a diagram of copy localization and extension involved in an embodiment of the present invention;
Fig. 9 is a flow diagram of the second embodiment of the audio/video copy detection method of the present invention;
Fig. 10 is a functional block diagram of the first embodiment of the audio/video copy detection device of the present invention;
Fig. 11 is a functional block diagram of the second embodiment of the audio/video copy detection device of the present invention.
To make the technical scheme of the present invention clearer, it is described in detail below in conjunction with the accompanying drawings.
Detailed description of the invention
It should be understood that the specific embodiments described herein are intended only to explain the present invention, and are not intended to limit it.
The primary solution of the embodiments of the present invention is to incorporate the audio information of a video into the video copy detection scheme. By combining audio and video, it is possible not only to strengthen the robustness of the video copy detection system, but also, by fusing the audio and video features, to greatly accelerate the execution efficiency of the copy detection system, and, by analyzing audio and video jointly, to improve the localization precision for copied segments.
Specifically, the embodiments of the present invention consider that existing video copy detection schemes either use only image features of video key frames, which weakens the robustness of the video copy detection system and yields poor localization accuracy for copied segments, or combine audio and video feature detection results, in which case fusing at the result level requires extracting more features and running most of them through the complete copy detection pipeline, increasing the time overhead; moreover, the corresponding algorithm complexity grows linearly with the amount of data being fused, further increasing complexity.
The scheme of this embodiment incorporates the audio information of the video into the copy detection scheme and combines audio and video through processing stages such as audio/video decoding and preprocessing, audio and video feature extraction, audio/video feature fusion, and copy judgment and localization. This not only strengthens the robustness of the video copy detection system, but also greatly accelerates the execution efficiency of the copy detection system through feature fusion, and improves copied-segment localization precision through joint audio/video analysis.
Specifically, the hardware architecture of the audio/video copy detection device involved in the audio/video copy detection scheme of the embodiments of the present invention may be as shown in Fig. 1. The detection device may be hosted on a PC, on a mobile terminal such as a mobile phone, tablet computer, or portable handheld device, or on other electronic equipment with audio/video copy detection functionality, such as a media playback apparatus.
As shown in Fig. 1, the detection device may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002, and a camera 1006. The communication bus 1002 realizes the connections and communication between these components of the detection device. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory, or a stable non-volatile memory such as a disk memory; optionally, it may also be a storage device independent of the aforementioned processor 1001.
Optionally, when hosted on a mobile terminal, the detection device may also include an RF (Radio Frequency) circuit, sensors, an audio circuit, a Wi-Fi module, and so on. The sensors include, for example, a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display according to the ambient light, and the proximity sensor can turn off the display and/or backlight when the mobile terminal is moved close to the user's ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes) and, when stationary, the magnitude and direction of gravity; it can be used for applications that recognize the attitude of the mobile terminal (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer or tap detection). The detection device may of course also be equipped with other sensors such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, which are not detailed here.
Those skilled in the art will understand that the device structure shown in Fig. 1 does not constitute a limitation on the detection device, which may include more or fewer components than illustrated, combine some components, or use a different arrangement of components.
As shown in Fig. 1, the memory 1005, as a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and an audio/video copy detection application program.
In the detection device shown in Fig. 1, the network interface 1004 is mainly used to connect to a back-end management platform and exchange data with it; the user interface 1003 is mainly used to connect to a client and exchange data with it; and the processor 1001 may be used to call the audio/video copy detection application program stored in the memory 1005 and perform the following operations:
acquiring an audio/video image, and decoding and preprocessing the audio/video image to obtain its audio part and video frames;
performing feature extraction on the audio part and the video frames of the audio/video image to obtain the corresponding audio features and the image features of the video frames;
fusing the audio features and the image features of the video frames corresponding to the audio/video image to obtain the audio/video fusion features of the audio/video image;
matching the audio/video fusion features against a preset feature library of a reference video to obtain the frame set matching result of the audio/video image;
performing copy judgment and localization on the audio/video image based on the frame set matching result and the reference video.
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operations:
filtering the audio frames of the audio part of the audio/video image, and converting them into frequency-domain energy by Fourier transform;
dividing the obtained frequency-domain energy, according to a logarithmic relationship, into several subbands within a predetermined frequency range;
computing the difference of the absolute values of the energies between adjacent subbands to obtain the audio subband energy difference feature of each audio frame;
sampling audio frames at a predetermined interval to obtain the audio subband energy difference features of the audio part of the audio/video image.
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operations:
for each video frame of the audio/video image, converting its image into a grayscale image and compressing it;
dividing the compressed grayscale image into several sub-blocks;
computing the DCT energy value of each sub-block;
comparing the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame;
repeating this process to obtain the image DCT features of the video frames of the audio/video image.
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operations:
taking the audio features to be M 32-bit features per second, and the image features of the video frames to be n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60;
concatenating features in such a way that one video frame corresponds to several audio frames, producing M 64-bit audio/video fusion features per second, where each fusion feature corresponds to the audio feature of a single audio frame, and every M/n adjacent fusion features correspond to the image feature of the same video frame.
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operations:
obtaining a matching table from the preset feature library of the reference video;
for each audio/video fusion feature, querying the matching table for features whose Hamming distance to the fusion feature is less than a predetermined threshold, as the similar features of that fusion feature;
collecting the similar features of the fusion features to obtain the frame set matching result of the audio/video image.
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operations:
temporally extending the audio/video frames of the reference video corresponding to the similar features, to obtain the similar segments formed by the corresponding audio/video frames of the audio/video image relative to the reference video;
based on the similar segments, computing the similarity between the corresponding audio/video frames of the audio/video image and the reference video;
if the similarity is greater than a set threshold, judging that the audio/video image constitutes a copy, and recording the start position and end position of the similar segment of the audio/video image.
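The temporal extension and judgment steps above can be sketched in Python. This is an illustrative reconstruction under stated assumptions, not the patent's exact procedure (Fig. 8 is not reproduced here): grouping matched reference-frame indices into segments with a `max_gap` tolerance, and using matched-frame density within a segment as the similarity score, are both assumptions.

```python
def locate_copy(matched_idx, sim_threshold=0.8, max_gap=5):
    """Group matched reference-frame indices into candidate copy segments.

    matched_idx: sorted frame indices of the reference video that matched
    the query. Returns (start, end) index pairs whose matched-frame density
    exceeds sim_threshold (a stand-in for the patent's similarity score).
    """
    segments, start, prev = [], None, None
    for i in matched_idx:
        if start is None:
            start = prev = i
        elif i - prev <= max_gap:        # temporal extension of the segment
            prev = i
        else:                            # gap too large: close the segment
            segments.append((start, prev))
            start = prev = i
    if start is not None:
        segments.append((start, prev))
    # keep only segments dense enough in matches to count as copies
    return [(s, e) for s, e in segments
            if sum(1 for i in matched_idx if s <= i <= e) / (e - s + 1)
            >= sim_threshold]
```

A sparse run of matches (density below the threshold) is discarded, while tight runs are reported with their start and end positions, mirroring the recorded start/end of the similar segment described above.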
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operation:
creating the matching table in the feature library of the reference video.
Through the above scheme, this embodiment acquires an audio/video image, decodes and preprocesses it to obtain its audio part and video frames; performs feature extraction on the audio part and video frames to obtain the corresponding audio features and the image features of the video frames; fuses the audio features and image features to obtain the audio/video fusion features; matches the fusion features against a preset feature library of a reference video to obtain a frame set matching result; and performs copy judgment and localization on the audio/video image based on the frame set matching result and the reference video. By combining audio and video, it not only improves the robustness of the video copy detection system, but also greatly accelerates the execution efficiency of the copy detection system through feature fusion, and improves copied-segment localization precision through joint audio/video analysis.
Based on the above hardware architecture, embodiments of the audio/video copy detection method of the present invention are proposed.
As shown in Fig. 2, the first embodiment of the present invention proposes an audio/video copy detection method, including:
Step S101: acquiring an audio/video image, and decoding and preprocessing the audio/video image to obtain its audio part and video frames.
Specifically, first, the audio/video image on which copy detection is to be performed is acquired; it may be obtained locally, or obtained from an external source over a network.
The acquired audio/video image is decoded and preprocessed: the audio of the video is extracted and downsampled to mono at 5512.5 Hz, and each video frame is extracted frame by frame, thereby obtaining the audio part of the audio/video image and its video frames.
Step S102: performing feature extraction on the audio part and the video frames of the audio/video image to obtain the corresponding audio features and the image features of the video frames.
This part performs feature extraction mainly on the audio corresponding to the video and on all video frames. Because audio features lend themselves to binary-bit representations, binary indexes or LSH (locality-sensitive hashing) are often used to accelerate queries. The audio feature extracted by the present invention is the audio subband energy difference feature; the image feature extracted for the video frames is the DCT (Discrete Cosine Transform) feature.
The process of performing feature extraction on the audio part of the audio/video image to obtain the corresponding audio features includes: filtering each audio frame of the audio part and converting it into frequency-domain energy by Fourier transform; dividing the obtained frequency-domain energy, according to a logarithmic relationship, into several subbands within a predetermined frequency range; computing the difference of the absolute values of the energies between adjacent subbands to obtain the audio subband energy difference feature of each audio frame; and sampling audio frames at a predetermined interval to obtain the audio subband energy difference features of the audio part of the audio/video image.
More specifically, the extraction flow of the audio subband energy difference feature in this embodiment is shown in Fig. 3. The main steps of the algorithm are:
First, the time-domain audio waveform of every 0.37 seconds (an audio frame) is filtered through a Hanning window and transformed into frequency-domain energy by Fourier transform.
Second, the obtained frequency-domain energy is divided, according to a logarithmic relationship (Bark scale), into 33 subbands within the most sensitive range of human hearing (300 Hz to 2000 Hz), and the differences of the absolute values of the energies between adjacent subbands of consecutive frames (spaced 11 milliseconds apart) are computed, so that each audio frame yields a 32-bit audio feature.
A bit of "1" indicates that the energy difference of two adjacent subbands in the current audio frame is greater than the energy difference of the corresponding adjacent subbands in the next audio frame; otherwise the bit is 0.
The detailed process is as follows. In Fig. 3, the input is a segment of audio; the output is the several (n) audio features corresponding to that segment.
Framing: the audio segment is cut into several (n) audio frames. In this example, M = 2048 audio frames are collected per second (in other examples M may be set to other values), and each audio frame contains 0.37 seconds of audio content (adjacent audio frames overlap by 2047/2048).
Fourier Transform: converts the time-domain waveform (the original audio) into the energy of the different frequency bands in the frequency domain, which is convenient for analysis.
ABS: takes the absolute value of the wave energy information (i.e., considers only amplitude, not direction of vibration).
Band Division: divides the whole frequency domain between 300 Hz and 2000 Hz into 33 mutually non-overlapping bands (divided according to a logarithmic relationship, i.e., the lower the frequency, the narrower the band it belongs to). In this way, the energy of the original audio in each of these bands is obtained.
Energy Computation: computes the energy value of each audio frame on each of the 33 bands (each audio frame yields 33 energy values).
Bit Derivation: the 33 energy values are compared in sequence (the energy of the i-th subband with the energy of the (i+1)-th subband) to obtain 32 energy-value differences. These 32 differences are then compared between the current audio frame a and the next audio frame b: if the j-th difference of a is greater than the j-th difference of b, the j-th bit of a's feature is 1; otherwise it is 0. This yields the ordering relationship of the 32 energy-value differences between a and b, which is the 32-bit feature of audio frame a.
The present invention adopts this audio feature and samples audio frames at intervals of 1/2048 second, so that each second of audio generates 2048 32-bit audio features.
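The pipeline above can be sketched in Python. This is a hedged reconstruction, not the patent's implementation: the band-energy definition (sum of spectral magnitudes), the geometric band spacing used in place of true Bark-scale edges, and the `hop` value are all assumptions, since the stated figures (0.37 s frames, 2048 frames per second, 11 ms spacing) do not pin down a single parameterization at a 5512.5 Hz sample rate.

```python
import numpy as np

def band_edges(n_bands=33, f_lo=300.0, f_hi=2000.0):
    # 34 logarithmically spaced edges delimiting 33 non-overlapping subbands
    return np.geomspace(f_lo, f_hi, n_bands + 1)

def frame_band_energies(frame, sr, edges):
    # Hanning window -> Fourier transform -> absolute value -> per-band energy
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

def subband_diff_bits(e_cur, e_next):
    # bit j is 1 iff the j-th adjacent-subband energy difference of the
    # current frame exceeds the corresponding difference of the next frame
    d_cur, d_next = np.diff(e_cur), np.diff(e_next)
    bits = 0
    for j in range(32):
        bits = (bits << 1) | int(d_cur[j] > d_next[j])
    return bits

def audio_features(signal, sr=5512.5, frame_len=2048, hop=64):
    # hop controls frame spacing; 2048 frames/s would need a sub-3-sample
    # hop at this rate, so a coarser hop is used here for illustration
    edges = band_edges()
    energies = [frame_band_energies(signal[i:i + frame_len], sr, edges)
                for i in range(0, len(signal) - frame_len + 1, hop)]
    return [subband_diff_bits(a, b) for a, b in zip(energies, energies[1:])]
```

Each returned value is one 32-bit integer fingerprint for one audio frame, which is what the fusion step below consumes.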
Performing feature extraction on the video frames of the audio/video image to obtain the image features of the video frames may include: converting the image of each video frame into a grayscale image and compressing it; dividing the compressed grayscale image into several sub-blocks; computing the DCT energy value of each sub-block; comparing the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame; and repeating this process to obtain the image DCT features of the video frames of the audio/video image.
More specifically, the flow of extracting the image DCT features of the video frames of the audio/video image in this embodiment is shown in Fig. 4.
Given that the overall picture of internet video tends to change little, the embodiment of the present invention selects an efficient global image feature as the image feature of the video frames: the DCT feature.
The idea of the DCT feature is to divide the image into several sub-blocks and compare the energy levels of adjacent sub-blocks, thereby capturing the energy distribution of the entire image. The concrete steps of the algorithm are:
First, the color image is converted into a grayscale image and compressed (changing the aspect ratio) to 64 pixels wide by 32 pixels high.
Then, the grayscale image is divided into 32 sub-blocks (numbered 0 to 31 as shown in Fig. 4), each containing 8x8 pixels.
For each sub-block, the DCT energy value of that sub-block is computed; the absolute values of the energy coefficients are used to represent the energy of the sub-block.
Finally, the relative sizes of the energy values of adjacent sub-blocks are computed to obtain a 32-bit feature: if the energy of the i-th sub-block is greater than that of the (i+1)-th sub-block, the i-th bit is 1, otherwise 0. In particular, the 31st sub-block is compared with the 0th sub-block.
Through this process, each video frame yields a 32-bit image DCT feature.
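As a concrete illustration, the block-DCT feature above can be sketched as follows. This is a minimal sketch under stated assumptions: the row-major block numbering and the definition of "DCT energy" as the sum of absolute DCT coefficients are not fixed by the description and are assumptions here.

```python
import numpy as np

def dct2(block):
    # naive orthonormal 2-D DCT-II of a square block (8x8 here)
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m @ block @ m.T

def frame_dct_feature(gray):
    # gray: 32 (rows) x 64 (cols) grayscale frame, already resized
    assert gray.shape == (32, 64)
    blocks = [gray[r:r + 8, c:c + 8]
              for r in range(0, 32, 8) for c in range(0, 64, 8)]  # 32 blocks
    # per-block "DCT energy": sum of absolute DCT coefficients (assumption)
    energy = [np.abs(dct2(b)).sum() for b in blocks]
    bits = 0
    for i in range(32):
        nxt = energy[(i + 1) % 32]      # block 31 wraps around to block 0
        bits = (bits << 1) | int(energy[i] > nxt)
    return bits
```

Note that a perfectly uniform frame gives equal block energies, so every strict comparison fails and the feature is 0, which matches the "greater than" rule stated above.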
Step S103, fusing the audio features corresponding to the audio/video image and the image features of the video frames to obtain the audio/video fusion features of the audio/video image;
After the audio features corresponding to the video and the image features of the video frames have been obtained by the above process, the obtained image features and audio features are fused. The concrete fusion method is shown in Figure 5 (where the vertical axis is the time axis).
As shown in Figure 5, in the present embodiment, the audio features are set to M=2048 (this value can be configured) 32-bit features per second, while the image features of the video frames are n 32-bit features per second (n is the frame rate of the video, usually no more than 60).
Thus, the present embodiment performs feature concatenation by mapping one video frame to several audio frames, i.e. 2048 64-bit audio/video fusion features are generated per second, where each fusion feature corresponds to the feature of a single audio frame, and every 2048/n adjacent fusion features correspond to the image DCT feature of the same video frame.
By fusing the audio features corresponding to the audio/video image and the image features of the video frames as above, the audio/video fusion features of the audio/video image are obtained.
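The concatenation step can be sketched as follows. This is a minimal illustration; placing the audio feature in the high 32 bits and repeating the last video frame's feature at the tail are assumptions, and the function name is hypothetical.

```python
def fuse_features(audio_feats, video_feats, m=2048, fps=32):
    """Concatenate each 32-bit audio feature with the 32-bit DCT feature
    of the video frame covering the same instant -> 64-bit fusion features.

    audio_feats: list of 32-bit ints, m per second of audio.
    video_feats: list of 32-bit ints, fps per second of video.
    """
    step = m // fps                  # audio frames per video frame (2048/n)
    fused = []
    for i, a in enumerate(audio_feats):
        # every `step` adjacent fusion features share one video frame's feature
        v = video_feats[min(i // step, len(video_feats) - 1)]
        fused.append((a << 32) | v)  # assumption: audio in the high 32 bits
    return fused
```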
Step S104, matching the audio/video fusion features based on a preset feature library of a reference video to obtain a frame set matching result of the audio/video image;
The present embodiment presets a feature library of the reference video, and a matching table is created in the feature library of the reference video so that the corresponding features of the video to be detected can be retrieved quickly.
When matching the audio/video fusion features, first, the matching table is obtained from the preset feature library of the reference video; for each audio/video fusion feature, the features satisfying a preset condition are queried from the matching table as the similar features of that audio/video fusion feature. For example, the features whose Hamming distance to the audio/video fusion feature is less than a predetermined threshold (such as 3) are queried from the matching table as the similar features of the audio/video fusion feature. The similar features of all the audio/video fusion features are obtained, yielding the frame set matching result of the audio/video image.
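The matching criterion can be illustrated with a brute-force scan. The patent accelerates this with the matching table described next; the linear scan below only shows the Hamming-distance test itself.

```python
def hamming(a, b):
    """Number of differing bits between two integer features."""
    return bin(a ^ b).count("1")

def match_features(query_feats, reference_feats, max_dist=3):
    """For every 64-bit query fusion feature, collect the reference
    features within max_dist as its similar features."""
    result = []
    for q in query_feats:
        similar = [r for r in reference_feats if hamming(q, r) <= max_dist]
        result.append(similar)
    return result
```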
More specifically, the present embodiment considers the following:
For a query video (the video on which copy detection is to be performed) and a reference video, if the similarity of their features is compared frame by frame, the required time complexity is proportional to the lengths of both videos, which is unfavorable for scaling to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query matching strategy based on the audio/video fusion features.
The basic objective of the simhash index is: in a library of numerous 64-bit features, for a queried 64-bit feature, to quickly find all features whose Hamming distance to this feature is less than or equal to 3 (i.e. at most 3 of the 64 bits differ from the query feature). The principle of this algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into 4 blocks of 16 bits, there must exist one 16-bit block that is completely consistent with the query feature. Similarly, among the remaining 48 bits, there must exist one 12-bit block that is completely consistent with the query feature. After two rounds of index lookup and matching, at most 3 differing positions need to be enumerated within the remaining 36 bits, so the complexity of the original algorithm can be substantially reduced.
The 64-bit audio/video fusion features used in the present invention have the same query characteristic as simhash, i.e. all features differing from a given 64-bit feature by at most 3 bits need to be found (the two features are then considered correlated). In addition, the following restriction applies: the first 32 bits of two correlated features differ by at most 2 bits, and the last 32 bits of the two features likewise differ by at most 2 bits. Based on this, the present embodiment follows the simhash approach but expands the number of index tables to 24; the concrete expansion method is shown in Figure 7:
In the matching algorithm design shown in Figure 7, consider the case where the last 32 bits differ by at most 1 bit, so that the first 32 bits differ by at most 2 bits. Then, for Figure 7, at least 2 of the blocks A, B, C, D are completely consistent, and at least one of the blocks E, F is completely consistent, so a matching table keyed on 32 fully consistent bits can be built. There are C(4,2)*C(2,1)*2 such lookup tables in total, the factor of 2 arising because it may instead be the first 32 bits that differ by at most 1 bit. Therefore, 24 sub-tables can be constructed in total as the created matching table, used for fast lookup of the audio/video fusion features.
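One plausible sketch of the 24-sub-table scheme is given below. It assumes 8-bit blocks A-D for the first half, 16-bit blocks E and F for the second half, and a mirrored split for the symmetric case; these splits are an interpretation of Figure 7, not a confirmed layout.

```python
from itertools import combinations

def subkeys(f):
    """Generate the 24 sub-table keys for a 64-bit fusion feature.
    One orientation keys on 2 of {A,B,C,D} plus 1 of {E,F}
    (C(4,2)*C(2,1)=12 tables); the mirrored orientation swaps the
    roles of the two halves, giving 24 tables in total."""
    hi, lo = f >> 32, f & 0xFFFFFFFF
    keys = []
    for tag, half8, half16 in (("hi", hi, lo), ("lo", lo, hi)):
        abcd = [(half8 >> (8 * i)) & 0xFF for i in range(4)]      # 4 x 8 bits
        ef = [(half16 >> (16 * i)) & 0xFFFF for i in range(2)]    # 2 x 16 bits
        for (i, j) in combinations(range(4), 2):
            for k in range(2):
                keys.append((tag, i, j, k, abcd[i], abcd[j], ef[k]))
    return keys

def build_index(reference_feats):
    """Insert every reference feature into all 24 sub-tables."""
    index = {}
    for f in reference_feats:
        for key in subkeys(f):
            index.setdefault(key, set()).add(f)
    return index

def query(index, f, per_half=2):
    """Candidate lookup plus exact verification: at most 3 differing bits
    overall and at most 2 in each 32-bit half, per the patent's restriction."""
    cands = set()
    for key in subkeys(f):
        cands |= index.get(key, set())
    def ok(r):
        d_hi = bin((f ^ r) >> 32).count("1")
        d_lo = bin((f ^ r) & 0xFFFFFFFF).count("1")
        return d_hi <= per_half and d_lo <= per_half and d_hi + d_lo <= 3
    return {r for r in cands if ok(r)}
```

A feature differing by one bit in each half shares a sub-table key with the stored feature in at least one orientation, so it is always recalled; far-away features are discarded by the verification step.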
Then, by querying the matching tables constructed above, the similar features of the audio/video fusion features are obtained, yielding the result of the feature retrieval.
Step S105, performing copy determination and positioning on the audio/video image based on the frame set matching result of the audio/video image and the reference video.
According to the result of the feature retrieval obtained in the above process, combined with a video copy segment positioning method, it is determined whether the query video is a copy video. If the query video is determined to be a copy video, the corresponding copy segment location is given.
The present embodiment considers: for two videos, if the similarity between every pair of frames of the two videos is calculated, the similarity matrix shown at the far right of Figure 8 is obtained. The goal of finding similar segments of two videos is thus converted into finding line segments in the similarity matrix whose similarity is higher than a certain threshold; however, this processing approach has a large time overhead.
The principle of copy determination and positioning for the audio/video image in the present embodiment is: through the above matching algorithm, the brightest points in the similarity matrix (representing very high similarities) can be found, such as the bright spots shown at the far left of Figure 8; these points are then extended in time, yielding the similar segments (i.e. possible copy segments) shown in the middle of Figure 8, which are afterwards screened by a threshold. In this way it can be determined whether two videos constitute a copy, and if they do, the start and end moments of the similar segment can be recorded.
Specifically, when performing copy determination and positioning on the audio/video image, the audio/video frames of the reference video corresponding to the similar features obtained by the above process (the bright spots in the leftmost diagram of Figure 8) are first extended in time to obtain a reference video segment of the reference video; the audio/video frames in the audio/video image corresponding to the similar features are likewise extended in time to obtain the similar segment that the audio/video image forms relative to the reference video (as shown in the middle diagram of Figure 8). The similarity between the similar segment in the audio/video image and the reference video segment is then calculated, i.e. the similarity between the audio/video frames corresponding to the similar segment in the audio/video image and the corresponding audio/video frames of the reference video segment, and the similarities obtained for the individual audio/video frames are averaged. If the similarity is greater than a set threshold, it is determined that the audio/video image constitutes a copy, and the start position and end position of the similar segment of the audio/video image are recorded.
That is to say, when calculating the similarity of the corresponding audio/video frames of the similar segment in the audio/video image and in the reference video, each frame in the similar segment (including its 64-bit feature) is compared against the corresponding frame in the reference video segment, the per-frame similarity is calculated, and the mean is then taken. This mean value is compared with the predetermined threshold; if the similarity is greater than the set threshold, it is determined that the audio/video image constitutes a copy, and the start position and end position of the similar segment of the audio/video image are recorded.
An example follows:
Suppose the similar segment contains 100 frames (i.e. one audio/video sequence), with the 100 frames between seconds 10-20 of the query video corresponding to the 100 frames between seconds 30-40 of the reference video. Each of the 100 frames between seconds 10-20 of the query video is then compared with the corresponding frame among the 100 frames between seconds 30-40 of the reference video, and the similarity of each frame is calculated separately. For example, if 50 of the 64 bits of the first frame's feature are identical to those of the reference video frame, the similarity of this first frame is S1 = 50/64 = 0.78125. On the same principle, the similarity S2 of the second frame, ..., and the similarity S100 of the 100th frame are obtained. The similarities are averaged to obtain the similarity of the query video and the reference video within the similar segment, assumed here to be 0.95; this is compared with the set threshold (set to 0.9), whereby it can be determined that the query video constitutes a copy, and the start position and end position of this similar segment are recorded.
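The per-frame averaging in this example can be sketched as follows (64-bit features; a frame's similarity is the fraction of identical bits):

```python
def segment_similarity(query_feats, ref_feats):
    """Average per-frame similarity over an aligned similar segment."""
    assert len(query_feats) == len(ref_feats)
    sims = [1 - bin(q ^ r).count("1") / 64
            for q, r in zip(query_feats, ref_feats)]
    return sum(sims) / len(sims)

def is_copy(query_feats, ref_feats, threshold=0.9):
    """Copy determination: mean similarity must exceed the set threshold."""
    return segment_similarity(query_feats, ref_feats) > threshold
```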
In the above copy determination and positioning process, a query video may contain multiple similar segments; in that case, the multiple similar segments can be strung together and recorded.
It should be noted that, in the above process of the present embodiment, when determining whether the query video is a copy of a certain video in the reference video library according to the frame set matching result, other algorithms can also be used, for example: the Hough transform, the Smith-Waterman algorithm, the BLAST algorithm, the temporal pyramid algorithm, and the like. Through these algorithms, the sequence of the query video most similar to a certain reference video is found, and whether a copy is constituted is determined by a threshold. For a video determined to be a copy, the start and end of the copy segment are determined, and this partial segment is marked as a copy segment.
Through the above scheme, the present embodiment uses the method of combining audio and video, which not only enhances the robustness of the video copy detection system, but also, by fusing the audio and video features, greatly accelerates the execution efficiency of the copy detection system; by analyzing the audio and video jointly, the positioning precision of copy segments is improved.
As shown in Figure 9, the second embodiment of the present invention proposes an audio/video copy detection method; based on the above embodiment, before the step of obtaining the audio/video image, the method further includes:
Step S100, creating the matching table in the feature library of the reference video.
Specifically, the matching table is created so that the corresponding features of the video to be detected can later be retrieved quickly.
The matching table is created based on the reference video; the concrete creation process is as follows:
First, reference video segments are collected, and audio/video decoding and preprocessing are performed on the reference video segments to obtain the audio part and the video frames of the reference video.
Then, feature extraction is performed on the audio part and the video frames of the reference video to obtain the audio features of the reference video and the image features of its video frames.
Afterwards, audio/video feature fusion is performed on the reference video to obtain the audio/video fusion features of the reference video.
Finally, the matching table is created based on the audio/video fusion features of the reference video, for the feature index retrieval and matching of subsequent query videos.
When creating the matching table based on the audio/video fusion features of the reference video, the following principle applies:
Consider a query video (the video on which copy detection is to be performed) and a reference video: if the similarity of their features is compared frame by frame, the required time complexity is proportional to the lengths of both videos, which is unfavorable for scaling to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query matching strategy based on the audio/video fusion features.
The basic objective of the simhash index is: in a library of numerous 64-bit features, for a queried 64-bit feature, to quickly find all features whose Hamming distance to this feature is less than or equal to 3 (i.e. at most 3 of the 64 bits differ from the query feature). The principle of this algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into 4 blocks of 16 bits, there must exist one 16-bit block that is completely consistent with the query feature. Similarly, among the remaining 48 bits, there must exist one 12-bit block that is completely consistent with the query feature. After two rounds of index lookup and matching, at most 3 differing positions need to be enumerated within the remaining 36 bits, so the complexity of the original algorithm can be substantially reduced.
The 64-bit audio/video fusion features used in the present invention have the same query characteristic as simhash, i.e. all features differing from a given 64-bit feature by at most 3 bits need to be found (the two features are then considered correlated). In addition, the following restriction applies: the first 32 bits of two correlated features differ by at most 2 bits, and the last 32 bits of the two features likewise differ by at most 2 bits. Based on this, the present embodiment follows the simhash approach but expands the number of index tables to 24; the concrete expansion method is shown in Figure 7:
In the matching algorithm design shown in Figure 7, consider the case where the last 32 bits differ by at most 1 bit, so that the first 32 bits differ by at most 2 bits. Then, for Figure 7, at least 2 of the blocks A, B, C, D are completely consistent, and at least one of the blocks E, F is completely consistent, so a matching table keyed on 32 fully consistent bits can be built. There are C(4,2)*C(2,1)*2 such lookup tables in total, the factor of 2 arising because it may instead be the first 32 bits that differ by at most 1 bit. Therefore, 24 sub-tables can be constructed in total as the created matching table, used for fast lookup of the audio/video fusion features.
Correspondingly, a functional module embodiment of the audio/video copy detection device of the embodiments of the present invention is proposed.
As shown in Figure 10, the first embodiment of the present invention proposes an audio/video copy detection device, including: a decoding and preprocessing module 201, a feature extraction module 202, a fusion module 203, a matching module 204 and a copy determination module 205, wherein:
the decoding and preprocessing module 201 is used to obtain an audio/video image, and to decode and preprocess the audio/video image to obtain the audio part and the video frames of the audio/video image;
the feature extraction module 202 is used to perform feature extraction on the audio part and the video frames of the audio/video image to obtain the audio features corresponding to the audio/video image and the image features of the video frames;
the fusion module 203 is used to fuse the audio features corresponding to the audio/video image and the image features of the video frames to obtain the audio/video fusion features of the audio/video image;
the matching module 204 is used to match the audio/video fusion features based on a preset feature library of a reference video to obtain a frame set matching result of the audio/video image;
the copy determination module 205 is used to perform copy determination and positioning on the audio/video image based on the frame set matching result of the audio/video image and the reference video.
Specifically, first, the audio/video image on which copy detection is to be performed is obtained; this audio/video image can be obtained locally, or obtained externally via a network.
The obtained audio/video image is decoded and preprocessed: the audio of the video is extracted and downsampled to monophonic 5512.5 Hz, and each video frame is extracted frame by frame, thereby obtaining the audio part and the individual video frames of the audio/video image.
Afterwards, feature extraction is performed on the audio part and the video frames of the audio/video image to obtain the audio features corresponding to the audio/video image and the image features of the video frames.
This part performs feature extraction mainly on the audio corresponding to the video and on all the video frames. Because audio features lend themselves to binary bit representations, binary indexing or LSH is often used to accelerate queries. The audio feature extracted by the present invention is the audio sub-band energy difference feature, and the image feature extracted for the video frames is the DCT (Discrete Cosine Transform) feature.
Wherein, the process of performing feature extraction on the audio part of the audio/video image to obtain the audio features corresponding to the audio/video image includes:
filtering each audio frame of the audio part of the audio/video image and transforming it into frequency-domain energy via the Fourier transform; dividing the obtained frequency-domain energy into several sub-bands of predetermined frequency ranges according to a logarithmic relationship; calculating the differences of the absolute values of the energies between adjacent sub-bands to obtain the audio sub-band energy difference feature of each audio frame; and sampling the audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio part of the audio/video image.
More specifically, the extraction flow of the audio sub-band energy difference feature in the present embodiment is shown in Figure 3:
The main steps of the algorithm involved in the extraction of this audio sub-band energy difference feature are:
First, the time-domain audio waveform information (audio frame) of every 0.37 seconds is filtered through a Hanning window and transformed into frequency-domain energy via the Fourier transform;
Secondly, the obtained frequency-domain energy is divided according to a logarithmic relationship (the Bark scale) into 33 sub-bands lying within the human auditory range (300 Hz - 2000 Hz), and the differences of the absolute values of the energies between adjacent sub-bands of consecutive frames (spaced 11 milliseconds apart) are calculated, so that a 32-bit audio feature is obtained for each audio frame.
A '1' therein indicates that the energy difference of two adjacent sub-bands of the current audio frame is greater than the energy difference of the corresponding adjacent sub-bands of the next audio frame; otherwise the bit is 0.
The detailed process is as follows:
In Figure 3, the input content is a segment of audio, and the output content is the several (n) audio features corresponding to this audio segment.
Framing: cutting the audio segment into several (n) audio frames. In the example, 2048 audio frames are collected per second, and each audio frame comprises 0.37 seconds of audio content (adjacent audio frames overlap by 2047/2048).
Fourier Transform: converting the time-domain waveform information (the original audio) into the frequency-domain energy information of waves in different frequency ranges, to facilitate analysis and processing.
ABS: taking the absolute value of the wave energy information (i.e. only the amplitude is considered, not the direction of vibration).
Band Division: dividing the whole frequency domain between 300 Hz and 2000 Hz into 33 mutually non-overlapping frequency bands (divided according to a logarithmic relationship, i.e. the lower the frequency, the narrower the range of the band it belongs to). In this way, the energies of the original audio on these different bands can be obtained.
Energy Computation: calculating the energy value of each audio frame on each of these 33 bands (each audio frame yields 33 energy values).
Bit Derivation: the above 33 energy values are compared in turn (the energy of the i-th sub-band is compared with the energy of the (i+1)-th sub-band) to obtain 32 energy value differences. The sizes of these 32 energy differences are compared between the current audio frame a and the next audio frame b: if the j-th energy difference of a is larger than the j-th energy difference of b, the j-th bit of a's feature is 1; otherwise, the j-th bit of a's feature is 0. In this way, the size relationship of the 32 energy differences between a and b is obtained, which constitutes the 32-bit feature of audio frame a.
The present invention adopts this audio feature and samples the audio frames at an interval of 1/2048 second, so that 2048 32-bit audio features are generated for each second of audio.
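The extraction steps above can be sketched as follows. This is a simplified illustration: plain logarithmic spacing stands in for the Bark-scale division, and the FFT length and band-edge handling are assumptions.

```python
import numpy as np

def band_edges(n_bands=33, f_lo=300.0, f_hi=2000.0):
    """33 log-spaced bands between 300 Hz and 2000 Hz (lower bands narrower)."""
    return np.geomspace(f_lo, f_hi, n_bands + 1)

def frame_band_energies(frame, sr=5512.5):
    """33 per-band energies of one Hanning-windowed audio frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))  # ABS step
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    edges = band_edges()
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

def audio_feature(frame_a, frame_b, sr=5512.5):
    """32-bit feature of frame a: bit j is 1 iff the j-th adjacent-band
    energy difference of a exceeds that of the next frame b."""
    da = np.diff(frame_band_energies(frame_a, sr))   # 32 differences
    db = np.diff(frame_band_energies(frame_b, sr))
    bits = 0
    for j in range(32):
        if da[j] > db[j]:
            bits |= 1 << j
    return bits
```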
The process of performing feature extraction on the video frames of the audio/video image to obtain the image features of the video frames corresponding to the audio/video image may include:
for each video frame of the audio/video image, converting its image into a gray-level image and compressing it; dividing the compressed gray-level image into several sub-blocks; calculating the DCT energy value of each sub-block; comparing the DCT energy values between adjacent sub-blocks to obtain the image DCT feature of the video frame; and, according to the above processing, obtaining the image DCT features of the video frames of the audio/video image.
More specifically, the flow of extracting the image DCT features of the video frames of the audio/video image in the present embodiment is shown in Figure 4:
In view of the characteristic that the overall variation amplitude of internet video pictures is small, the embodiment of the present invention selects an efficient global image feature as the image feature of the video frames: the DCT feature.
The idea of the DCT feature algorithm is: divide the image into several sub-blocks, and compare the energy levels of adjacent sub-blocks, thereby obtaining the energy distribution of the entire image. The concrete algorithm steps are:
First, the color image is converted into a gray-level image and compressed (changing the aspect ratio) to 64 pixels wide and 32 pixels high.
Then, the gray-level image is divided into 32 sub-blocks (numbered 0 to 31, as shown in Figure 4), each block comprising an 8x8-pixel image.
For each sub-block, the DCT energy value of the sub-block is calculated, and its absolute value is taken to represent the energy of the sub-block.
Finally, the relative sizes of the energy values of adjacent sub-blocks are compared to obtain a 32-bit feature: if the energy of the i-th sub-block is greater than the energy of the (i+1)-th sub-block, the i-th bit is set to 1, and otherwise to 0. In particular, the 31st sub-block is compared with the 0th sub-block.
Through the above process, each video frame yields a 32-bit image DCT feature.
After the audio features corresponding to the video and the image features of the video frames have been obtained by the above process, the obtained image features and audio features are fused. The concrete fusion method is shown in Figure 5 (where the vertical axis is the time axis).
As shown in Figure 5, in the present embodiment, the audio features are set to M=2048 (this value can be configured) 32-bit features per second, while the image features of the video frames are n 32-bit features per second (n is the frame rate of the video, usually no more than 60).
Thus, the present embodiment performs feature concatenation by mapping one video frame to several audio frames, i.e. 2048 64-bit audio/video fusion features are generated per second, where each fusion feature corresponds to the feature of a single audio frame, and every 2048/n adjacent fusion features correspond to the image DCT feature of the same video frame.
By fusing the audio features corresponding to the audio/video image and the image features of the video frames as above, the audio/video fusion features of the audio/video image are obtained.
Afterwards, the audio/video fusion features are matched based on the preset feature library of the reference video to obtain the frame set matching result of the audio/video image.
The present embodiment presets a feature library of the reference video, and a matching table is created in the feature library of the reference video so that the corresponding features of the video to be detected can be retrieved quickly.
When matching the audio/video fusion features, first, the matching table is obtained from the preset feature library of the reference video; for each audio/video fusion feature, the features satisfying a preset condition are queried from the matching table as the similar features of that audio/video fusion feature. For example, the features whose Hamming distance to the audio/video fusion feature is less than a predetermined threshold (such as 3) are queried from the matching table as the similar features of the audio/video fusion feature. The similar features of the audio/video fusion features are obtained, yielding the frame set matching result of the audio/video image.
More specifically, the present embodiment considers the following:
For a query video (the video on which copy detection is to be performed) and a reference video, if the similarity of their features is compared frame by frame, the required time complexity is proportional to the lengths of both videos, which is unfavorable for scaling to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query matching strategy based on the audio/video fusion features.
The basic objective of the simhash index is: in a library of numerous 64-bit features, for a queried 64-bit feature, to quickly find all features whose Hamming distance to this feature is less than or equal to 3 (i.e. at most 3 of the 64 bits differ from the query feature). The principle of this algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into 4 blocks of 16 bits, there must exist one 16-bit block that is completely consistent with the query feature. Similarly, among the remaining 48 bits, there must exist one 12-bit block that is completely consistent with the query feature. After two rounds of index lookup, at most 3 differing positions need to be enumerated within the remaining 36 bits, so the complexity of the original algorithm can be substantially reduced.
The 64-bit audio/video fusion features used in the present invention have the same query characteristic as simhash, i.e. all features differing from a given 64-bit feature by at most 3 bits need to be found (the two features are then considered correlated). In addition, the following restriction applies: the first 32 bits of two correlated features differ by at most 2 bits, and the last 32 bits of the two features likewise differ by at most 2 bits. Based on this, the present embodiment follows the simhash approach but expands the number of index tables to 24; the concrete expansion method is shown in Figure 7:
In the matching algorithm design shown in Figure 7, consider the case where the last 32 bits differ by at most 1 bit, so that the first 32 bits differ by at most 2 bits. Then, for Figure 7, at least 2 of the blocks A, B, C, D are completely consistent, and at least one of the blocks E, F is completely consistent, so a matching table keyed on 32 fully consistent bits can be built. There are C(4,2)*C(2,1)*2 such lookup tables in total, the factor of 2 arising because it may instead be the first 32 bits that differ by at most 1 bit. Therefore, 24 sub-tables can be constructed in total as the created matching table, used for fast lookup of the audio/video fusion features.
Then, by querying the matching tables constructed above, the similar features of the audio/video fusion features are obtained, yielding the result of the feature retrieval.
According to the result of the feature retrieval obtained in the above process, combined with a video copy segment positioning method, it is determined whether the query video is a copy video. If the query video is determined to be a copy video, the corresponding copy segment location is given.
The present embodiment considers: for two videos, if the similarity between every pair of frames of the two videos is calculated, the similarity matrix shown at the far right of Figure 8 is obtained. The goal of finding similar segments of two videos is thus converted into finding line segments in the similarity matrix whose similarity is higher than a certain threshold; however, this processing approach has a large time overhead.
The principle of copy determination and positioning for the audio/video image in the present embodiment is: through the above indexing algorithm, the brightest points in the similarity matrix (representing very high similarities) can be found, such as the bright spots shown at the far left of Figure 8; these points are then extended in time, yielding the similar segments (i.e. possible copy segments) shown in the middle of Figure 8, which are afterwards screened by a threshold. In this way it can be determined whether two videos constitute a copy, and if they do, the start and end moments of the similar segment can be recorded.
Specifically, when performing copy judgment and localization on the audio/video image, the audio/video frames of the reference video corresponding to the similar features obtained by the above process (corresponding to the bright points in the leftmost diagram of Fig. 8) are first extended in time, yielding a reference video segment of the reference video; the audio/video frames of the audio/video image corresponding to the similar features are likewise extended in time, yielding the similar segment of the audio/video image relative to the reference video (as shown in the middle diagram of Fig. 8). The similarity between the similar segment of the audio/video image and the reference video segment is then calculated: the similarity between each audio/video frame of the similar segment and the corresponding audio/video frame of the reference video segment is computed, and the per-frame similarities are averaged. If the resulting similarity exceeds a set threshold, the audio/video image is judged to constitute a copy, and the start position and end position of the similar segment of the audio/video image are recorded.
That is, when calculating the similarity between the corresponding audio/video frames of the similar segment and the reference video, each frame of the similar segment (comprising a 64-bit feature) is compared bit-by-bit against the corresponding frame of the reference video segment, a per-frame similarity is computed, and the results are averaged. The average is compared with a predetermined threshold; if it exceeds the set threshold, the audio/video image is judged to constitute a copy, and the start position and end position of the similar segment of the audio/video image are recorded.
An example follows:

Suppose the similar segment contains 100 frames (i.e., one audio/video sequence), with the 100 frames between seconds 10-20 of the query video corresponding to the 100 frames between seconds 30-40 of the reference video. Each of the 100 query frames is compared with its counterpart among the 100 reference frames, and a similarity is computed for each pair. For instance, if 50 of the 64 bits of the first frame's feature are identical to those of the corresponding reference frame, the similarity of the first frame is S1 = 50/64 ≈ 0.78125. Proceeding in the same way yields the similarity S2 of the second frame, ..., up to the similarity S100 of the 100th frame. Averaging these similarities gives the similarity between the query video and the reference video over the similar segment; supposing it is 0.95, it is compared with the set threshold (assumed to be 0.9), whereby the query video is determined to constitute a copy, and the start position and end position of the similar segment are recorded.
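The worked example above can be reproduced with a short sketch, assuming each frame's fingerprint is held as a 64-bit integer (the representation is an assumption; the patent only specifies 64-bit features):

```python
# Per-frame similarity = fraction of the 64 fingerprint bits two frames share.

def frame_similarity(a: int, b: int) -> float:
    differing = bin((a ^ b) & (2**64 - 1)).count("1")
    return (64 - differing) / 64

def segment_similarity(query, ref) -> float:
    """Average per-frame similarity of two aligned, equal-length segments."""
    return sum(frame_similarity(q, r) for q, r in zip(query, ref)) / len(query)

def is_copy(query, ref, threshold=0.9) -> bool:
    return segment_similarity(query, ref) > threshold

# As in the example: a frame agreeing on 50 of 64 bits scores 50/64 = 0.78125.
```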
In the above copy judgment and localization process, a query video may contain multiple similar segments; in that case, the multiple similar segments can be recorded in series.
It should be noted that, in the above process of this embodiment, when judging from the frame-set matching result whether the query video is a copy of some video in the reference video library, other algorithms may also be used, for example: the Hough transform, the Smith-Waterman algorithm, the BLAST algorithm, or a temporal pyramid algorithm. These algorithms find the sequence of the query video most similar to a given reference video, and a threshold determines whether a copy is constituted. For a video judged to be a copy, the start and end of the copied segment are determined, so that this partial segment is marked as a copy segment.
Through the above scheme, the present embodiment uses a method combining audio and video, which not only enhances the robustness of the video copy detection system but also, by fusing the audio and video features, greatly improves the execution efficiency of the copy detection system; and by analyzing audio and video jointly, it improves the positioning accuracy of copy segments.
As shown in Fig. 11, a second embodiment of the invention proposes an audio/video copy detection apparatus which, based on the above embodiment, further includes:

a creation module 200 for creating the matching tables in the feature library of the reference video.

Specifically, the matching tables are created so that the corresponding features of a video to be detected can be retrieved quickly.
The matching tables are created on the basis of the reference videos; the specific creation process is as follows:

First, reference video segments are collected, and audio/video decoding and preprocessing are performed on them to obtain the audio portion and the video frames of each reference video.

Next, feature extraction is performed on the audio portion and the video frames of the reference video to obtain the audio features and the video-frame image features of the reference video.

Then, audio/video feature fusion is performed on the reference video to obtain the audio/video fusion features of the reference video.

Finally, the matching tables are created based on the audio/video fusion features of the reference video, for use in subsequent feature-index retrieval of query videos.
The creation of the matching tables from the audio/video fusion features of the reference video is based on the following principle:

Consider a query video (a video to be subjected to copy detection) and a reference video. If the similarity of their features were compared frame by frame, the required time complexity would be proportional to the lengths of both videos, which does not scale to large databases. The present invention therefore proposes an index and query strategy for audio/video fusion features based on the simhash technique.

The basic goal of the simhash index is: in a library of many 64-bit features, for a queried 64-bit feature, quickly find all features whose Hamming distance from it is less than or equal to 3 (i.e., at most 3 of the 64 bits differ from the query feature). A diagram of this algorithm is shown in Fig. 6. For 64-bit data with the Hamming distance limited to 3, if the 64 bits are divided into four 16-bit blocks, there must exist one 16-bit block completely identical to the corresponding block of the query feature. Similarly, within the remaining 48 bits, there must exist a 12-bit block completely identical to the query feature. After this two-level index lookup, the at most 3 differing positions can be enumerated within the remaining 36 bits, so the complexity of the naive algorithm is substantially reduced.
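A minimal sketch of the first level of the simhash lookup described above (the pigeonhole idea of Fig. 6: at Hamming radius 3, at least one of four 16-bit blocks must match exactly); class and method names are illustrative, not taken from the patent:

```python
from collections import defaultdict

BLOCKS, BLOCK_BITS = 4, 16  # 4 x 16 = 64 bits

def blocks_of(code: int):
    """The four 16-bit blocks of a 64-bit feature, low block first."""
    return [(code >> (BLOCK_BITS * i)) & 0xFFFF for i in range(BLOCKS)]

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class SimhashIndex:
    """One inverted table per block position; a query probes each table
    with its own block value, then verifies candidates exactly."""

    def __init__(self):
        self.tables = [defaultdict(list) for _ in range(BLOCKS)]

    def add(self, code: int):
        for i, block in enumerate(blocks_of(code)):
            self.tables[i][block].append(code)

    def query(self, code: int, radius: int = 3):
        candidates = set()
        for i, block in enumerate(blocks_of(code)):
            candidates.update(self.tables[i].get(block, ()))
        return [c for c in candidates if hamming(c, code) <= radius]
```

The second-level 12-bit split in the description refines this further; the sketch shows only the first level.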
The 64-bit audio/video fusion features used in the present invention have the same query property as simhash, namely: all features differing from a given 64-bit feature in at most 3 bits must be found (the two features are then considered related). In addition, the following constraint is imposed: the first 32 bits of two related features differ in at most 2 bits, and the last 32 bits differ in at most 2 bits. On this basis, the present embodiment follows the approach of simhash but expands the number of lookup tables to 24; the specific expansion is shown in Fig. 7:
In the matching-algorithm design shown in Fig. 7, consider the case where the last 32 bits differ in at most 1 bit; the first 32 bits then differ in at most 2 bits. In terms of Fig. 7, at least 2 of the blocks A, B, C, D are then completely identical, and at least one of the blocks E, F is completely identical, so a matching table keyed on 32 fully identical bits can be built. There are C(4,2) × C(2,1) × 2 such lookup tables in all, the factor 2 arising because it may instead be the first 32 bits that differ in at most 2 bits. Altogether, therefore, 24 sub-tables can be constructed as the created matching tables, used for fast lookup of audio/video fusion features.
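The table count can be checked arithmetically. The reading of the final factor 2 below (swapping which half carries the larger error budget) is an interpretation of the passage, not stated verbatim in the patent:

```python
from math import comb
from itertools import combinations

# Front 32 bits split into A, B, C, D (8 bits each): with at most 2
# differing bits in that half, at least 2 of the 4 blocks match exactly.
front_choices = list(combinations("ABCD", 2))   # C(4,2) = 6

# Rear 32 bits split into E, F (16 bits each): with at most 1 differing
# bit in that half, at least 1 of the 2 blocks matches exactly.
rear_choices = list(combinations("EF", 1))      # C(2,1) = 2

# Assumed reading of the factor 2: the half with the larger error budget
# may be either the front or the rear half.
n_tables = len(front_choices) * len(rear_choices) * 2
```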
In the audio/video copy detection method and apparatus of the embodiments of the present invention, an audio/video image is acquired, then decoded and preprocessed to obtain the audio portion and the video frames of the audio/video image; feature extraction is performed on the audio portion and the video frames of the audio/video image to obtain the corresponding audio features and video-frame image features; the audio features and video-frame image features corresponding to the audio/video image are fused to obtain the audio/video fusion features of the audio/video image; based on a preset feature library of reference videos, the audio/video fusion features are matched to obtain a frame-set matching result of the audio/video image; and, based on the frame-set matching result of the audio/video image and the reference video, copy judgment and localization are performed on the audio/video image. By combining audio and video, the method not only enhances the robustness of the video copy detection system but also, through the fusion of audio and video features, greatly improves the execution efficiency of the copy detection system; and by analyzing audio and video jointly, it improves the positioning accuracy of copy segments.
It should also be noted that, herein, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the preferable implementation. Based on such an understanding, the technical solution of the present invention, in essence or in the part contributing over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The foregoing are only preferred embodiments of the present invention and do not thereby limit the scope of the claims of the present invention. Any equivalent structural or flow transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application thereof in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (14)
1. An audio/video copy detection method, characterized by comprising:
acquiring an audio/video image, and decoding and preprocessing the audio/video image to obtain an audio portion and video frames of the audio/video image;
performing feature extraction on the audio portion and the video frames of the audio/video image to obtain audio features and video-frame image features corresponding to the audio/video image;
fusing the audio features and the video-frame image features corresponding to the audio/video image to obtain audio/video fusion features of the audio/video image;
matching the audio/video fusion features based on a preset feature library of reference videos to obtain a frame-set matching result of the audio/video image;
performing copy judgment and localization on the audio/video image based on the frame-set matching result of the audio/video image and the reference video.
2. The method according to claim 1, characterized in that the step of performing feature extraction on the audio portion of the audio/video image to obtain the audio features corresponding to the audio/video image comprises:
filtering audio frames of the audio portion of the audio/video image, and converting them to frequency-domain energy by Fourier transform;
dividing the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship;
calculating the difference of the absolute values of the energies between adjacent sub-bands to obtain an audio sub-band energy difference feature of each audio frame;
sampling the audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio portion of the audio/video image.
3. The method according to claim 1, characterized in that the step of performing feature extraction on the video frames of the audio/video image to obtain the video-frame image features corresponding to the audio/video image comprises:
for each video frame of the audio/video image, converting its image to a grayscale image and compressing it;
dividing the compressed grayscale image into several sub-blocks;
calculating a DCT energy value of each sub-block;
comparing the DCT energy values of adjacent sub-blocks to obtain an image DCT feature of the video frame;
obtaining the image DCT features of the video frames of the audio/video image according to the above process.
4. The method according to claim 1, 2 or 3, characterized in that the step of fusing the audio features and the video-frame image features corresponding to the audio/video image to obtain the audio/video fusion features of the audio/video image comprises:
setting the audio features to M 32-bit features per second and the video-frame image features to n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60;
concatenating the features in such a way that one video frame corresponds to several audio frames, producing M 64-bit audio/video fusion features per second, wherein each audio/video fusion feature corresponds to the audio feature of a single audio frame, and M/n adjacent audio/video fusion features correspond to the image feature of a same video frame.
5. The method according to claim 1, characterized in that the step of matching the audio/video fusion features based on the preset feature library of reference videos to obtain the frame-set matching result of the audio/video image comprises:
obtaining matching tables from the preset feature library of reference videos;
for each audio/video fusion feature, querying the matching tables for features whose Hamming distance from the audio/video fusion feature is less than a predetermined threshold, as similar features of the audio/video fusion feature;
obtaining the similar features of the audio/video fusion features to obtain the frame-set matching result of the audio/video image.
6. The method according to claim 5, characterized in that the step of performing copy judgment and localization on the audio/video image based on the frame-set matching result of the audio/video image and the reference video comprises:
performing temporal extension on the audio/video frames of the reference video corresponding to the similar features to obtain a reference video segment of the reference video, and performing temporal extension on the audio/video frames of the audio/video image corresponding to the similar features to obtain a similar segment of the audio/video image relative to the reference video;
calculating the similarity between the similar segment of the audio/video image and the reference video segment;
if the similarity is greater than a set threshold, judging that the audio/video image constitutes a copy, and recording a start position and an end position of the similar segment of the audio/video image.
7. The method according to claim 5, characterized in that, before the step of acquiring the audio/video image, the method further comprises:
creating the matching tables in the feature library of the reference video.
8. An audio/video copy detection apparatus, characterized by comprising:
a decoding and preprocessing module for acquiring an audio/video image, and decoding and preprocessing the audio/video image to obtain an audio portion and video frames of the audio/video image;
a feature extraction module for performing feature extraction on the audio portion and the video frames of the audio/video image to obtain audio features and video-frame image features corresponding to the audio/video image;
a fusion module for fusing the audio features and the video-frame image features corresponding to the audio/video image to obtain audio/video fusion features of the audio/video image;
a matching module for matching the audio/video fusion features based on a preset feature library of reference videos to obtain a frame-set matching result of the audio/video image;
a copy determination module for performing copy judgment and localization on the audio/video image based on the frame-set matching result of the audio/video image and the reference video.
9. The apparatus according to claim 8, characterized in that:
the feature extraction module is further configured to filter audio frames of the audio portion of the audio/video image and convert them to frequency-domain energy by Fourier transform; divide the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship; calculate the difference of the absolute values of the energies between adjacent sub-bands to obtain an audio sub-band energy difference feature of each audio frame; and sample the audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio portion of the audio/video image.
10. The apparatus according to claim 8, characterized in that:
the feature extraction module is further configured to, for each video frame of the audio/video image, convert its image to a grayscale image and compress it; divide the compressed grayscale image into several sub-blocks; calculate a DCT energy value of each sub-block; compare the DCT energy values of adjacent sub-blocks to obtain an image DCT feature of the video frame; and obtain the image DCT features of the video frames of the audio/video image according to the above process.
11. The apparatus according to claim 8, 9 or 10, characterized in that:
the fusion module is further configured to set the audio features to M 32-bit features per second and the video-frame image features to n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60; and to concatenate the features in such a way that one video frame corresponds to several audio frames, producing M 64-bit audio/video fusion features per second, wherein each audio/video fusion feature corresponds to the audio feature of a single audio frame, and M/n adjacent audio/video fusion features correspond to the image feature of a same video frame.
12. The apparatus according to claim 8, characterized in that:
the matching module is further configured to obtain matching tables from the preset feature library of reference videos; for each audio/video fusion feature, query the matching tables for features whose Hamming distance from the audio/video fusion feature is less than a predetermined threshold, as similar features of the audio/video fusion feature; and obtain the similar features of the audio/video fusion features to obtain the frame-set matching result of the audio/video image.
13. The apparatus according to claim 12, characterized in that:
the copy determination module is further configured to perform temporal extension on the audio/video frames of the reference video corresponding to the similar features to obtain a reference video segment of the reference video, and perform temporal extension on the audio/video frames of the audio/video image corresponding to the similar features to obtain a similar segment of the audio/video image relative to the reference video; calculate the similarity between the similar segment of the audio/video image and the reference video segment; and, if the similarity is greater than a set threshold, judge that the audio/video image constitutes a copy and record a start position and an end position of the similar segment of the audio/video image.
14. The apparatus according to claim 12, characterized by further comprising:
a creation module for creating the matching tables in the feature library of the reference video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510041044.3A CN105989000B (en) | 2015-01-27 | 2015-01-27 | Audio-video copy detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105989000A true CN105989000A (en) | 2016-10-05 |
CN105989000B CN105989000B (en) | 2019-11-19 |
Family
ID=57034765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510041044.3A Active CN105989000B (en) | 2015-01-27 | 2015-01-27 | Audio-video copy detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105989000B (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737135A (en) * | 2012-07-10 | 2012-10-17 | 北京大学 | Video copy detection method and system based on soft cascade model sensitive to deformation |
Non-Patent Citations (2)
Title |
---|
WU LIU 等: "Listen, Look, and Gotcha: Instant Video Search with Mobile Phones by Layered Audio-Video Indexing", 《PROCEEDINGS OF THE 21ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA》 * |
吴思远: "基于内容的重复音视频检测", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110019895A (en) * | 2017-07-27 | 2019-07-16 | 杭州海康威视数字技术股份有限公司 | A kind of image search method, device and electronic equipment |
CN110110502A (en) * | 2019-04-28 | 2019-08-09 | 深圳市得一微电子有限责任公司 | Anti-copy method, device and the removable storage device of audio file |
CN110110502B (en) * | 2019-04-28 | 2023-07-14 | 得一微电子股份有限公司 | Anti-copy method and device for audio files and mobile storage device |
CN110222719A (en) * | 2019-05-10 | 2019-09-10 | 中国科学院计算技术研究所 | A kind of character recognition method and system based on multiframe audio-video converged network |
CN110222719B (en) * | 2019-05-10 | 2021-09-24 | 中国科学院计算技术研究所 | Figure identification method and system based on multi-frame audio and video fusion network |
CN111274449A (en) * | 2020-02-18 | 2020-06-12 | 腾讯科技(深圳)有限公司 | Video playing method and device, electronic equipment and storage medium |
CN111274449B (en) * | 2020-02-18 | 2023-08-29 | 腾讯科技(深圳)有限公司 | Video playing method, device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN105989000B (en) | 2019-11-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right |
Effective date of registration: 20211012 Address after: 518000 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 Floors Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd. Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. |