CN105989000B - Audio-video copy detection method and device - Google Patents
- Publication number
- CN105989000B (application CN201510041044.3A)
- Authority
- CN
- China
- Prior art keywords
- audio
- video
- feature
- frame
- audio-video content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The present invention relates to an audio-video copy detection method and device. The method includes: obtaining audio-video content, decoding and pre-processing it, and performing feature extraction on the resulting audio portion and video frames to obtain the corresponding audio features and the image features of the video frames; fusing the audio features and the image features into audio-video fusion features; matching the audio-video fusion features against a feature database of preset reference videos to obtain a frame-set matching result; and performing copy determination and localization on the audio-video content based on the frame-set matching result and the reference videos. By combining audio and video, the present invention not only improves the robustness of the video copy detection system, but also, by fusing the audio and video features, greatly accelerates the execution of the copy detection system, and the joint audio-video analysis improves the localization accuracy of copied segments.
Description
Technical field
The present invention relates to the field of computer technology, and more particularly, to an audio-video copy detection method and device.
Background technique
When performing copy detection on video, existing schemes mainly use content-based video copy detection. At present there are chiefly two kinds: schemes based on the image features of video key frames, and schemes that combine audio and video detection results.
In the key-frame scheme, the main process includes video decoding and pre-processing, video image feature extraction, feature indexing and retrieval, and copy determination and localization, finally deciding whether the query video constitutes a copy; for a video judged to be a copy, the start and end of the copied segment are determined so that the segment can be marked. However, this implementation does not incorporate audio information into the copy detection scheme, even though audio is an important complement to the visual content of a video. As a result, it weakens the robustness of the video copy detection system, and its localization accuracy for copied segments is poor, especially when the video pictures change little.
Compared with the key-frame image-feature scheme, the scheme that combines audio and video detection results includes audio features and can thus exploit the fast query speed and relatively high accuracy of audio. However, because audio and video features are fundamentally different, existing copy detection schemes typically run audio and video copy detection separately and fuse only at the result level to decide whether the query video is a copy. Fusing at the result level requires extracting more features and carrying most of them through the entire copy detection process, so the time overhead is large and the corresponding algorithm complexity increases.
Summary of the invention
The embodiments of the present invention provide an audio-video copy detection method and device, aiming to improve the efficiency and accuracy of video copy detection.
An embodiment of the present invention proposes an audio-video copy detection method, comprising:
obtaining audio-video content, and decoding and pre-processing the audio-video content to obtain its audio portion and video frames;
performing feature extraction on the audio portion and the video frames to obtain the corresponding audio features and the image features of the video frames;
fusing the audio features and the image features of the video frames to obtain audio-video fusion features of the audio-video content;
matching the audio-video fusion features against a feature database of preset reference videos to obtain a frame-set matching result of the audio-video content;
performing copy determination and localization on the audio-video content based on the frame-set matching result and the reference videos.
An embodiment of the present invention further proposes an audio-video copy detection device, comprising:
a decoding and pre-processing module, configured to obtain audio-video content, and to decode and pre-process it to obtain its audio portion and video frames;
a feature extraction module, configured to perform feature extraction on the audio portion and the video frames to obtain the corresponding audio features and the image features of the video frames;
a fusion module, configured to fuse the audio features and the image features of the video frames to obtain the audio-video fusion features of the audio-video content;
a matching module, configured to match the audio-video fusion features against a feature database of preset reference videos to obtain the frame-set matching result of the audio-video content;
a copy determination module, configured to perform copy determination and localization on the audio-video content based on the frame-set matching result and the reference videos.
In the audio-video copy detection method and device proposed by the embodiments of the present invention, audio-video content is obtained, decoded and pre-processed to obtain its audio portion and video frames; feature extraction is performed on the audio portion and the video frames to obtain the corresponding audio features and image features; the two are fused into audio-video fusion features; the fusion features are matched against a feature database of preset reference videos to obtain a frame-set matching result; and copy determination and localization are performed based on the matching result and the reference videos. By combining audio and video, the robustness of the video copy detection system is enhanced; by fusing the audio and video features, the execution of the copy detection system is greatly accelerated; and the joint audio-video analysis improves the localization accuracy of copied segments.
Detailed description of the invention
Fig. 1 is a schematic diagram of the hardware structure of the audio-video copy detection device of the present invention;
Fig. 2 is a flow diagram of a first embodiment of the audio-video copy detection method of the present invention;
Fig. 3 is a flow diagram of audio sub-band energy difference feature extraction in an embodiment of the present invention;
Fig. 4 is a flow diagram of extracting the image DCT features of the video frames in an embodiment of the present invention;
Fig. 5 is a schematic diagram of fusing image features and audio features in an embodiment of the present invention;
Fig. 6 is an example diagram of the simhash matching algorithm involved in an embodiment of the present invention;
Fig. 7 is a design diagram of the matching algorithm involved in an embodiment of the present invention;
Fig. 8 is a schematic diagram of copy localization and extension involved in an embodiment of the present invention;
Fig. 9 is a flow diagram of a second embodiment of the audio-video copy detection method of the present invention;
Fig. 10 is a functional module diagram of a first embodiment of the audio-video copy detection device of the present invention;
Fig. 11 is a functional module diagram of a second embodiment of the audio-video copy detection device of the present invention.
In order to make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.
Specific embodiment
It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The main solution of the embodiments of the present invention is to incorporate the audio information of a video into the video copy detection scheme. Combining audio and video not only enhances the robustness of the video copy detection system; fusing the audio and video features also greatly accelerates the execution of the copy detection system, and the joint audio-video analysis improves the localization accuracy of copied segments.
Specifically, the embodiments of the present invention consider that existing video copy detection schemes either use only the image features of video key frames, which weakens the robustness of the video copy detection system and yields poor localization accuracy for copied segments, or combine audio and video detection results, which requires extracting more features and carrying most of them through the entire copy detection process, so that the time overhead grows and the algorithm complexity scales linearly with the amount of data to be fused, increasing the overall complexity.
This embodiment incorporates the audio information of the video into the copy detection scheme and combines audio and video through a processing flow of audio/video decoding and pre-processing, audio and video feature extraction, audio-video feature fusion, and copy determination and localization. This not only enhances the robustness of the video copy detection system, but also, by fusing the audio and video features, greatly accelerates the execution of the copy detection system, and the joint audio-video analysis improves the localization accuracy of copied segments.
Specifically, the hardware structure of the audio-video copy detection device involved in the audio-video copy detection scheme of the embodiment of the present invention may be as shown in Fig. 1. The detection device may be carried on a PC, on a mobile terminal such as a mobile phone, tablet computer or portable handheld device, or on other electronic equipment with an audio-video copy detection function, such as a media playback device.
As shown in Fig. 1, the detection device may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002 and a camera 1006. The communication bus 1002 implements connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory, or a stable non-volatile memory such as a magnetic disk memory, and may optionally be a storage device independent of the aforementioned processor 1001.
Optionally, when carried on a mobile terminal, the detection device may also include an RF (Radio Frequency) circuit, sensors, an audio circuit, a Wi-Fi module and the like. The sensors include, for example, an optical sensor, a motion sensor and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display screen according to the ambient light, and the proximity sensor can turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the posture of the mobile terminal (such as landscape/portrait switching, related games and magnetometer pose calibration) and for vibration-recognition functions (such as a pedometer or tap detection). Of course, the mobile terminal may also be equipped with other sensors such as a gyroscope, barometer, hygrometer, thermometer and infrared sensor, which are not described in detail here.
Those skilled in the art will understand that the device structure shown in Fig. 1 does not limit the detection device, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
As shown in Fig. 1, the memory 1005, as a kind of computer storage medium, may include an operating system, a network communication module, a user interface module and an audio-video copy detection application program.
In the detection device shown in Fig. 1, the network interface 1004 is mainly used to connect to a background management platform for data communication; the user interface 1003 is mainly used to connect to a client for data communication; and the processor 1001 can be used to call the audio-video copy detection application program stored in the memory 1005 and perform the following operations:
obtaining audio-video content, and decoding and pre-processing the audio-video content to obtain its audio portion and video frames;
performing feature extraction on the audio portion and the video frames to obtain the corresponding audio features and the image features of the video frames;
fusing the audio features and the image features of the video frames to obtain audio-video fusion features of the audio-video content;
matching the audio-video fusion features against a feature database of preset reference videos to obtain a frame-set matching result of the audio-video content;
performing copy determination and localization on the audio-video content based on the frame-set matching result and the reference videos.
In one embodiment, the processor 1001 can call the audio-video copy detection application program stored in the memory 1005 to perform the following operations:
filtering each audio frame of the audio portion of the audio-video content and transforming it into frequency-domain energy by Fourier transform;
dividing the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship;
calculating the difference of the absolute values of the energies between adjacent sub-bands to obtain the audio sub-band energy difference feature of the audio frame;
sampling audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio portion of the audio-video content.
In one embodiment, the processor 1001 can call the audio-video copy detection application program stored in the memory 1005 to perform the following operations:
converting the image of each video frame of the audio-video content into a gray-scale image and compressing it;
dividing the compressed gray-scale image into several sub-blocks;
calculating the DCT energy value of each sub-block;
comparing the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame;
according to the above process, obtaining the image DCT features of the video frames of the audio-video content.
In one embodiment, the processor 1001 can call the audio-video copy detection application program stored in the memory 1005 to perform the following operations:
setting the audio features as M 32-bit features per second and the image features of the video frames as n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60;
performing feature concatenation in such a way that one video frame corresponds to several audio frames, generating M 64-bit audio-video fusion features per second, where each audio-video fusion feature corresponds to the audio feature of a single audio frame and every M/n adjacent audio-video fusion features correspond to the image feature of the same video frame.
In one embodiment, the processor 1001 can call the audio-video copy detection application program stored in the memory 1005 to perform the following operations:
obtaining a matching table from the feature database of the preset reference videos;
for each audio-video fusion feature, querying the matching table for features whose Hamming distance to the audio-video fusion feature does not exceed a preset threshold, as the similar features of that audio-video fusion feature;
collecting the similar features of the audio-video fusion features to obtain the frame-set matching result of the audio-video content.
In one embodiment, the processor 1001 can call the audio-video copy detection application program stored in the memory 1005 to perform the following operations:
extending in time the audio/video frames of the reference video corresponding to the similar features, to obtain the similar segments that the corresponding audio/video frames in the audio-video content form with respect to the reference video;
calculating, based on the similar segments, the similarity between the corresponding audio/video frames in the audio-video content and the reference video;
if the similarity is greater than a set threshold, judging that the audio-video content constitutes a copy, and recording the start position and end position of the similar segments of the audio-video content.
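A rough sketch of such a copy decision follows. Grouping matches by a constant temporal offset and scoring the extended segment by its fraction of matched frames are illustrative assumptions here; the patent's exact extension and similarity computation are described later with reference to Fig. 8.

```python
from collections import Counter

def judge_copy(matched_pairs, sim_threshold=0.5):
    """matched_pairs: (query_frame, ref_frame) hits from frame-set matching.
    Hits belonging to a real copy share a near-constant offset
    query_frame - ref_frame; this grouping heuristic is an assumption."""
    if not matched_pairs:
        return False, None, None
    # the dominant offset defines the candidate copied segment
    offsets = Counter(q - r for q, r in matched_pairs)
    best_offset, _ = offsets.most_common(1)[0]
    seg = sorted({q for q, r in matched_pairs if q - r == best_offset})
    start, end = seg[0], seg[-1]              # temporal extension of the hits
    similarity = len(seg) / (end - start + 1)  # matched fraction of the segment
    return similarity > sim_threshold, start, end
```

A query whose dominant-offset segment is densely matched is declared a copy, and the recorded start and end positions localize the copied segment.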
In one embodiment, the processor 1001 can call the audio-video copy detection application program stored in the memory 1005 to perform the following operation:
creating the matching table in the feature database of the reference videos.
Through the above scheme, the present embodiment obtains audio-video content, decodes and pre-processes it to obtain its audio portion and video frames; performs feature extraction on the audio portion and the video frames to obtain the corresponding audio features and image features; fuses them into audio-video fusion features; matches the fusion features against the feature database of preset reference videos to obtain the frame-set matching result; and performs copy determination and localization based on the matching result and the reference videos. By combining audio and video, the robustness of the video copy detection system is improved; by fusing the audio and video features, the execution of the copy detection system is greatly accelerated; and the joint audio-video analysis improves the localization accuracy of copied segments.
Based on the above hardware structure, embodiments of the audio-video copy detection method of the present invention are proposed.
As shown in Fig. 2, a first embodiment of the present invention proposes an audio-video copy detection method, comprising:
Step S101: obtaining audio-video content, and decoding and pre-processing it to obtain its audio portion and video frames.
Specifically, first, the audio-video content that needs copy detection is obtained; it can be obtained locally or from outside over a network.
The obtained audio-video content is decoded and pre-processed: the audio of the video is extracted and down-sampled to mono 5512.5 Hz, and each video frame is extracted frame by frame, thereby obtaining the audio portion of the audio-video content and the individual video frames.
Step S102: performing feature extraction on the audio portion and the video frames of the audio-video content to obtain the corresponding audio features and the image features of the video frames.
This part extracts features from the audio corresponding to a video and from all of its video frames. Because audio features are naturally represented as binary bits, binary indexing or LSH is often used to accelerate queries. The audio features extracted by the present invention are audio sub-band energy difference features, and the image features of the video frames are DCT (Discrete Cosine Transform) features.
The process of performing feature extraction on the audio portion of the audio-video content to obtain the corresponding audio features includes: filtering each audio frame of the audio portion and transforming it into frequency-domain energy by Fourier transform; dividing the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship; calculating the difference of the absolute values of the energies between adjacent sub-bands to obtain the audio sub-band energy difference feature of each audio frame; and sampling audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio portion.
More specifically, the extraction of the audio sub-band energy difference feature in this embodiment is shown in Fig. 3. The main steps of the algorithm are as follows.
First, every 0.37 seconds of time-domain audio waveform (an audio frame) is filtered by a Hanning window and then transformed into frequency-domain energy by Fourier transform.
Second, the obtained frequency-domain energy is divided, according to a logarithmic (Bark-scale) relationship, into 33 sub-bands within the range of the human auditory model (300 Hz to 2000 Hz), and the difference of the absolute values of the energies of adjacent sub-bands is calculated between consecutive frames (11 ms apart), so that a 32-bit audio feature is obtained for each audio frame.
A "1" bit means that the energy difference of two adjacent sub-bands of the current audio frame is greater than the energy difference of the corresponding adjacent sub-bands of the next audio frame; otherwise the bit is 0.
The detailed process is as follows. In Fig. 3, the input is a segment of audio; the output is the several (n) audio features corresponding to that segment.
Framing: the audio segment is cut into several (n) audio frames. In this example M = 2048 audio frames are taken per second (M can be another value in other examples), and each audio frame contains 0.37 seconds of audio (adjacent audio frames overlap by 2047/2048).
Fourier Transform: converts the time-domain waveform (the original audio) into the energies of the different frequency bands in the frequency domain, which is convenient for analysis.
ABS: takes the absolute value of the wave energy information (that is, only the amplitude is considered, not the direction of vibration).
Band Division: the frequency range between 300 Hz and 2000 Hz is divided into 33 non-overlapping frequency bands according to a logarithmic relationship (that is, the lower the frequency, the narrower the band). In this way the energy of the original audio on each of these bands is obtained.
Energy Computation: the energy value of each audio frame on these 33 bands is calculated (each audio frame yields 33 energy values).
Bit Derivation: the 33 energy values are compared in turn (the energy of the i-th sub-band is compared with the energy of the (i+1)-th sub-band) to obtain 32 energy differences. The 32 energy differences of the current audio frame a are then compared with those of the next audio frame b: if the j-th energy difference of a is greater than the j-th energy difference of b, the j-th bit of the feature of a is 1; otherwise it is 0. The resulting 32 comparison results form the 32-bit feature of audio frame a.
The present invention adopts this audio feature and samples audio frames at intervals of 1/2048 second, so that 2048 32-bit audio features are generated for every second of audio.
The process of performing feature extraction on the video frames of the audio-video content to obtain the image features of the video frames may include: converting the image of each video frame into a gray-scale image and compressing it; dividing the compressed gray-scale image into several sub-blocks; calculating the DCT energy value of each sub-block; comparing the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame; and, according to the above process, obtaining the image DCT features of all the video frames.
More specifically, the process of extracting the image DCT feature of a video frame in this embodiment is shown in Fig. 4.
Considering that the overall picture of an internet video usually changes little, the embodiment of the present invention selects an efficient global image feature as the image feature of the video frame: the DCT feature.
The idea of the DCT feature is to divide the image into several sub-blocks and compare the energies of adjacent sub-blocks, thereby obtaining the energy distribution of the whole image. The specific steps are as follows.
First, the color image is converted into a gray-scale image and compressed (changing the aspect ratio) to 64 pixels wide and 32 pixels high.
Then, the gray-scale image is divided into 32 sub-blocks (numbered 0 to 31 as shown in Fig. 4), each containing 8x8 pixels.
For each sub-block, its DCT energy value is calculated; the absolute values of the DCT band energies are chosen to represent the energy of the sub-block.
Finally, the relative sizes of the energies of adjacent sub-blocks are computed to obtain a 32-bit feature: if the energy of the i-th sub-block is greater than that of the (i+1)-th sub-block, the i-th bit is 1, otherwise 0; in particular, the 31st sub-block is compared with the 0th sub-block.
Through the above process, each video frame yields a 32-bit image DCT feature.
Step S103: fusing the audio features and the image features of the video frames to obtain the audio-video fusion features of the audio-video content.
After the audio features and the image features of the video frames have been obtained by the above process, they are fused. The specific fusion method is shown in Fig. 5 (where the vertical axis is the time axis).
As shown in Fig. 5, in this embodiment the audio features are M = 2048 (a configurable value) 32-bit features per second, and the image features of the video frames are n 32-bit features per second (n is the frame rate of the video and is usually no more than 60).
This embodiment therefore concatenates features in such a way that one video frame corresponds to several audio frames: 2048 64-bit audio-video fusion features are generated per second, where each fusion feature corresponds to the feature of a single audio frame, and every 2048/n adjacent fusion features correspond to the image DCT feature of the same video frame.
By fusing the audio features and the image features of the video frames in this way, the audio-video fusion features of the audio-video content are obtained.
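The fusion described above amounts to pairing each 32-bit audio feature with the 32-bit image feature of the video frame covering the same instant and packing the pair into 64 bits. A minimal sketch, assuming the image feature occupies the high 32 bits (the text does not specify the bit order):

```python
def fuse_features(audio_feats, image_feats, audio_rate=2048, fps=25):
    """audio_feats: audio_rate 32-bit features per second;
    image_feats: fps 32-bit features per second. Each group of
    audio_rate // fps consecutive fusion features shares one image feature."""
    per_frame = audio_rate // fps            # audio features per video frame
    fused = []
    for i, a in enumerate(audio_feats):
        img = image_feats[min(i // per_frame, len(image_feats) - 1)]
        fused.append((img << 32) | (a & 0xFFFFFFFF))  # high: image, low: audio
    return fused
```

Packing both modalities into one 64-bit key is what lets a single index lookup (Step S104) match audio and video jointly instead of running two separate detections.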
Step S104: matching the audio-video fusion features based on the feature database of the preset reference videos to obtain the frame-set matching result of the audio-video content.
This embodiment presets a feature database of reference videos, in which a matching table is created so that the corresponding features of a video to be detected can be retrieved quickly.
When matching the audio-video fusion features, first the matching table is obtained from the feature database of the preset reference videos; then, for each audio-video fusion feature, the matching table is queried for features that meet a preset condition, which serve as the similar features of that fusion feature — for example, features whose Hamming distance to the fusion feature does not exceed a preset threshold (such as 3). The similar features of all the audio-video fusion features are collected to obtain the frame-set matching result of the audio-video content.
More specifically, this embodiment takes the following into account: for a query video (a video to be checked for copying) and a reference video, if the similarity of their features is compared frame by frame, the required time complexity is proportional to the lengths of both videos, which does not scale to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query strategy for audio-video fusion features.
The basic goal of a simhash index is: given a database of many 64-bit features and a 64-bit query feature, quickly find all features whose Hamming distance to the query is at most 3 (i.e., at most 3 of the 64 bits differ from the query feature). The principle of the algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into four 16-bit blocks, then by the pigeonhole principle at least one 16-bit block must be identical to the corresponding block of the query feature. Similarly, within the remaining 48 bits there must be a 12-bit block that is identical to the query feature. After matching through these two levels of index lookup, at most 3 differing positions remain to be enumerated in the remaining 36 bits, which greatly reduces the complexity of the original algorithm.
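The first level of the block-index idea can be sketched as follows. This is a simplified one-level illustration (four 16-bit tables only, without the second 12-bit level); the class and method names are hypothetical.

```python
from collections import defaultdict

def blocks16(x):
    """Split a 64-bit value into four 16-bit blocks."""
    return [(x >> s) & 0xFFFF for s in (48, 32, 16, 0)]

class SimhashIndex:
    """If two 64-bit values differ in at most 3 bits, at least one of
    their four 16-bit blocks is identical (pigeonhole principle), so
    indexing every feature under each of its blocks guarantees that
    all candidates within distance 3 are retrieved."""
    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(4)]

    def add(self, f):
        for t, b in zip(self.tables, blocks16(f)):
            t[b].add(f)

    def query(self, q, max_dist=3):
        cands = set()
        for t, b in zip(self.tables, blocks16(q)):
            cands |= t[b]  # exact match on one 16-bit block
        return {f for f in cands if bin(f ^ q).count('1') <= max_dist}
```

Only features sharing at least one 16-bit block with the query are ever examined, which is what replaces the exhaustive scan over the whole feature database.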
The 64-bit audio-video fusion features used in the present invention have the same query requirement as simhash: for a given 64-bit feature, all features differing from it in at most 3 bits must be found (two such features are considered related). In addition, there is the following restriction: the first 32 bits of two related features differ in at most 2 bits, and the last 32 bits likewise differ in at most 2 bits. Based on this, this embodiment follows the simhash approach but expands the number of index tables to 24. The specific expansion is shown in Figure 7:
In the matching algorithm design shown in Figure 7, consider first the case where the last 32 bits differ in at most 1 bit; the first 32 bits then differ in at most 2 bits. Referring to Figure 7, at least 2 of the blocks A, B, C, D are completely identical, and at least one of the blocks E, F is completely identical, so a matching table keyed on those 32 identical bits can be constructed. There are C(4,2)*C(2,1)*2 such query tables in total, the final factor of 2 arising because the roles of the two halves may be swapped. Therefore 24 sub-tables can be constructed in total as the created matching table, which is used for fast lookup of audio-video fusion features.
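The count of 24 sub-tables can be checked by direct enumeration. The sketch below only verifies the combinatorics of the table layout described above (blocks A-D of 8 bits, E-F of 16 bits, roles of the halves swappable); the labels and the half-designation strings are illustrative assumptions.

```python
from itertools import combinations
from math import comb

# One half is split into four 8-bit blocks (A, B, C, D), the other
# into two 16-bit blocks (E, F).  With at most 2 differing bits in
# one half and at most 1 in the other, at least 2 of A-D and at
# least 1 of E-F are guaranteed clean; each choice of (block pair,
# block, which half is which) yields one sub-table keyed on those
# 32 guaranteed-identical bits.
pairs_abcd = list(combinations('ABCD', 2))  # C(4,2) = 6 choices
blocks_ef = list('EF')                      # C(2,1) = 2 choices
tables = [(p, e, half)
          for p in pairs_abcd
          for e in blocks_ef
          for half in ('first-half', 'second-half')]
assert len(tables) == comb(4, 2) * comb(2, 1) * 2 == 24
```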
Then, by querying the matching table constructed above, the similar features of each audio-video fusion feature are obtained, giving the feature retrieval result.
Step S105: based on the frame-set matching result of the audio-video image and the reference video, perform copy judgment and positioning on the audio-video image.
According to the feature retrieval result obtained in the above process, combined with a video copy segment positioning method, it is determined whether the query video is a copy. If the query video is determined to be a copy, the corresponding copy segment position is given.
This embodiment takes the following into account: for two videos, if the similarity between every pair of frames is computed, the similarity matrix shown at the far right of Figure 8 is obtained. The goal of finding similar segments of the two videos is thus transformed into finding line segments in the similarity matrix whose similarity exceeds a certain threshold; this approach, however, is expensive in time.
The principle of copy judgment and positioning for an audio-video image in this embodiment is: the above matching algorithm finds the brightest points in the similarity matrix (those with the highest similarity), such as the bright spots shown at the far left of Figure 8, and extends these points in time, yielding the similar segments (i.e., candidate copy segments) shown in the middle of Figure 8. These are then screened by a threshold, which determines whether the two videos constitute a copy; if they do, the start and end times of the similar segments can be recorded.
Specifically, when performing copy judgment and positioning on the audio-video image, the audio/video frames of the reference video corresponding to the similar features obtained above (the bright spots in the leftmost panel of Figure 8) are first extended in time to obtain a reference video segment, and the audio/video frames of the audio-video image corresponding to those similar features are likewise extended in time to obtain the similar segment that the audio-video image forms with respect to the reference video (as shown in the middle panel of Figure 8). The similarity between the similar segment of the audio-video image and the reference video segment is then calculated: the similarity between each audio/video frame of the similar segment and the corresponding frame of the reference video segment is computed, and the per-frame similarities are averaged. If this similarity exceeds a set threshold, the audio-video image is judged to constitute a copy, and the start and end positions of the similar segment of the audio-video image are recorded.
That is, when calculating the similarity between the audio/video frames of the similar segment and the reference video, each frame of the similar segment (including its 64-bit feature) is compared against the corresponding frame of the reference video segment to compute a per-frame similarity; the values are averaged, and the average is compared with the preset threshold. If it exceeds the threshold, the audio-video image is judged to constitute a copy, and the start and end positions of its similar segment are recorded.
For example: suppose that within a similar segment, the 100 frames between seconds 10 and 20 of the query video (i.e., one audio-video sequence) correspond to the 100 frames between seconds 30 and 40 of the reference video. Each of the 100 query frames is then compared with the corresponding reference frame, and the similarity of each frame is computed separately. For instance, if 50 of the 64 bits of the first frame are identical to the reference frame, the similarity of the first frame is S1 = 50/64 ≈ 0.78125. In the same way, the similarities S2, ..., S100 of the remaining frames are obtained and averaged, giving the similarity between the query video and the reference video within the similar segment. Suppose this average is 0.95; comparing it with the set threshold (say 0.9), the query video is determined to constitute a copy, and the start and end positions of the similar segment are recorded.
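The per-frame bit comparison and averaging in this example can be written compactly. The following is an illustrative sketch; the function names and the threshold default are assumptions, and the 1/64 bit-agreement measure follows the S1 = 50/64 example above.

```python
def segment_similarity(query_feats, ref_feats):
    """Average per-frame bit agreement between two aligned sequences
    of 64-bit features; 1.0 means all bits identical."""
    assert len(query_feats) == len(ref_feats)
    sims = [1 - bin(q ^ r).count('1') / 64
            for q, r in zip(query_feats, ref_feats)]
    return sum(sims) / len(sims)

def is_copy(query_feats, ref_feats, threshold=0.9):
    """Copy decision: average segment similarity versus a set threshold."""
    return segment_similarity(query_feats, ref_feats) > threshold
```

A frame whose feature differs from the reference in 14 of 64 bits scores 50/64 ≈ 0.78125, matching the worked example.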
In the above copy judgment and positioning process, a query video may contain multiple similar segments; these can be chained together and recorded.
It should be noted that in the above process, when judging from the frame-set matching result whether the query video is a copy of some video in the reference video library, other algorithms may also be used, such as the Hough transform, the Smith-Waterman algorithm, the BLAST algorithm, or temporal pyramid algorithms. These algorithms find the sequence segment of some reference video that is most similar to the query video, and a threshold determines whether it constitutes a copy. For a video judged to be a copy, the start and end of the copied segment are determined, and that segment is marked as the copy segment.
Through the above scheme, this embodiment combines audio and video, which not only improves the robustness of the video copy detection system but also, by fusing audio and video features, greatly speeds up the execution of the copy detection system; joint audio-video analysis further improves the accuracy of copy segment positioning.
As shown in Figure 9, the second embodiment of the present invention proposes an audio-video copy detection method. Based on the above embodiment, before the step of obtaining the audio-video image, the method further includes:
Step S100: create the matching table in the feature database of the reference video.
Specifically, the matching table is created so that the corresponding features of a video to be detected can be retrieved quickly. The matching table is created from the reference videos; the specific creation process is as follows:
First, reference video segments are collected; each is decoded and preprocessed to obtain the audio part and the video frames of the reference video.
Then, feature extraction is performed on the audio part and the video frames of the reference video, yielding the audio features and the image features of the video frames of the reference video.
Next, audio and video feature fusion is performed on the reference video, yielding its audio-video fusion features.
Finally, the matching table is created based on the audio-video fusion features of the reference video, for feature index retrieval and matching of subsequent query videos.
The principle underlying the creation of the matching table from the audio-video fusion features of the reference video is as follows: for a query video (a video to be checked for copying) and a reference video, comparing the similarity of their features frame by frame requires time proportional to the lengths of both videos, which does not scale to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query strategy for audio-video fusion features.
The basic goal of a simhash index is: given a database of many 64-bit features and a 64-bit query feature, quickly find all features whose Hamming distance to the query is at most 3 (i.e., at most 3 of the 64 bits differ from the query feature). The principle of the algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into four 16-bit blocks, then at least one 16-bit block must be identical to the corresponding block of the query feature. Similarly, within the remaining 48 bits there must be a 12-bit block that is identical to the query feature. After matching through these two levels of index lookup, at most 3 differing positions remain to be enumerated in the remaining 36 bits, which greatly reduces the complexity of the original algorithm.
The 64-bit audio-video fusion features used in the present invention have the same query requirement as simhash: for a given 64-bit feature, all features differing from it in at most 3 bits must be found (two such features are considered related). In addition, there is the following restriction: the first 32 bits of two related features differ in at most 2 bits, and the last 32 bits likewise differ in at most 2 bits. Based on this, this embodiment follows the simhash approach but expands the number of index tables to 24. The specific expansion is shown in Figure 7:
In the matching algorithm design shown in Figure 7, consider first the case where the last 32 bits differ in at most 1 bit; the first 32 bits then differ in at most 2 bits. Referring to Figure 7, at least 2 of the blocks A, B, C, D are completely identical, and at least one of the blocks E, F is completely identical, so a matching table keyed on those 32 identical bits can be constructed. There are C(4,2)*C(2,1)*2 such query tables in total, the final factor of 2 arising because the roles of the two halves may be swapped. Therefore 24 sub-tables can be constructed in total as the created matching table, which is used for fast lookup of audio-video fusion features.
Correspondingly, functional module embodiments of the audio-video copy detection device of the present invention are proposed.
As shown in Figure 10, the first embodiment of the present invention proposes an audio-video copy detection device, comprising: a decoding and preprocessing module 201, a feature extraction module 202, a fusion module 203, a matching module 204, and a copy determination module 205, in which:
the decoding and preprocessing module 201 is configured to obtain an audio-video image, and to decode and preprocess the audio-video image to obtain its audio part and video frames;
the feature extraction module 202 is configured to perform feature extraction on the audio part and video frames of the audio-video image, obtaining the corresponding audio features and the image features of the video frames;
the fusion module 203 is configured to fuse the audio features of the audio-video image with the image features of its video frames, obtaining the audio-video fusion features of the audio-video image;
the matching module 204 is configured to match the audio-video fusion features based on the feature database of a preset reference video, obtaining the frame-set matching result of the audio-video image;
the copy determination module 205 is configured to perform copy judgment and positioning on the audio-video image based on its frame-set matching result and the reference video.
Specifically, first, the audio-video image on which copy detection is to be performed is obtained; it may be obtained locally or from an external source over a network.
The obtained audio-video image is decoded and preprocessed: the audio of the video is extracted and downsampled to mono at 5512.5 Hz, and the video frames are extracted frame by frame, yielding the audio part of the audio-video image and the individual video frames.
Afterwards, feature extraction is performed on the audio part and the video frames of the audio-video image, obtaining the corresponding audio features and the image features of the video frames.
This part performs feature extraction mainly on the audio corresponding to a video and on all of its video frames. Because audio features are themselves easy to represent as binary bits, binary indexing or LSH is often used to speed up queries. The audio features extracted in the present invention are audio sub-band energy difference features, and the image features extracted for the video frames are DCT (Discrete Cosine Transform) features.
The process of extracting features from the audio part of the audio-video image to obtain its corresponding audio features includes: filtering each audio frame of the audio part and transforming it into frequency-domain energy via the Fourier transform; dividing the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship; computing the differences between the absolute energy values of adjacent sub-bands to obtain the audio sub-band energy difference feature of each audio frame; and sampling audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio part of the audio-video image.
More specifically, the extraction process of the audio sub-band energy difference features in this embodiment is shown in Figure 3. The main steps of the algorithm are:
First, each 0.37-second window of time-domain audio (an audio frame) is filtered with a Hanning window and transformed into frequency-domain energy via the Fourier transform.
Second, the obtained frequency-domain energy is divided, according to a logarithmic relationship (the Bark scale), into 33 sub-bands within the range of the human auditory model (300 Hz to 2000 Hz), and the differences between the absolute energy values of adjacent sub-bands are compared between consecutive frames (11 ms apart), so that a 32-bit audio feature is obtained for each audio frame.
A bit of "1" indicates that the energy difference between two adjacent sub-bands of the current audio frame is greater than the energy difference of the corresponding adjacent sub-bands of the next audio frame; otherwise the bit is 0.
The detailed process is as follows:
In Figure 3, the input is a segment of audio; the output is the n audio features corresponding to that segment.
Framing: the audio segment is cut into n audio frames. In the example, 2048 audio frames are collected per second, each containing 0.37 seconds of audio (adjacent audio frames overlap by 2047/2048).
Fourier Transform: the time-domain waveform (the original audio) is converted into frequency-domain energy information over the different frequency bands, for convenient analysis and processing.
ABS: the absolute value of the wave energy information is taken (that is, only the amplitude is considered, not the direction of vibration).
Band Division: the frequency domain between 300 Hz and 2000 Hz is divided into 33 non-overlapping frequency bands according to a logarithmic relationship (i.e., the lower the frequency, the narrower its band). In this way, the energy of the original audio in each of these bands is obtained.
Energy Computation: the energy value of each audio frame in each of the 33 bands is computed (each audio frame yields 33 energy values).
Bit Derivation: the 33 energy values are compared in sequence (the energy of the i-th sub-band is compared with that of the (i+1)-th sub-band) to obtain 32 energy differences. These 32 differences are then compared between the current audio frame a and the next audio frame b: if the j-th energy difference of a is greater than the j-th energy difference of b, the j-th bit of a's feature is 1; otherwise it is 0. The resulting 32 comparisons form the 32-bit feature of audio frame a.
The present invention adopts this audio feature and samples audio frames at intervals of 1/2048 second, so that 2048 32-bit audio features are generated for each second of audio.
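The band division and bit derivation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the exact band edges, FFT framing, and normalization are unspecified in the text, so the log-spaced edges and the function names here are assumptions; only the 33-band / 32-bit comparison logic follows the description.

```python
import numpy as np

SR = 5512.5              # mono sample rate stated in the text
FRAME = int(0.37 * SR)   # ~0.37 s analysis window

def band_energies(frame, n_bands=33, lo=300.0, hi=2000.0, sr=SR):
    """33 log-spaced band energies between 300 and 2000 Hz
    (Hanning window, FFT magnitude, absolute-value energy)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    edges = np.logspace(np.log10(lo), np.log10(hi), n_bands + 1)
    return np.array([spec[(freqs >= edges[i]) & (freqs < edges[i + 1])].sum()
                     for i in range(n_bands)])

def frame_feature(e_cur, e_next):
    """32-bit feature: bit j is 1 iff the energy difference between
    adjacent bands j and j+1 is larger in the current frame than in
    the next frame."""
    d_cur, d_next = np.diff(e_cur), np.diff(e_next)  # 32 values each
    bits = d_cur > d_next
    return int(sum(int(b) << j for j, b in enumerate(bits)))
```

Applying `frame_feature` to each pair of consecutive frames, sampled 2048 times per second, yields the 2048 32-bit features per second described above.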
The process of extracting features from the video frames of the audio-video image to obtain the image features of the corresponding video frames may include: for each video frame, converting its image into a grayscale image and compressing it; dividing the compressed grayscale image into several sub-blocks; computing the DCT energy value of each sub-block; and comparing the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame. Following this process for every frame yields the image DCT features of the video frames of the audio-video image.
More specifically, the process by which this embodiment extracts the image DCT features of the video frames of the audio-video image is shown in Figure 4:
Given that the overall picture of internet videos tends to change little, the embodiment of the present invention selects an efficient global image feature as the image feature of a video frame: the DCT feature.
The idea of the DCT feature is to divide the image into several sub-blocks and compare the energies of adjacent sub-blocks, thereby capturing the energy distribution of the whole image. The specific algorithm steps are:
First, the color image is converted into a grayscale image and compressed (changing the aspect ratio) to 64 pixels wide and 32 pixels high.
Then, the grayscale image is divided into 32 sub-blocks (numbered 0 to 31 as shown in Figure 4), each comprising 8x8 pixels.
For each sub-block, the DCT energy value of the sub-block is computed; the absolute value of the selected band's energy is taken to represent the energy of the sub-block.
Finally, the relative sizes of adjacent sub-block energies are compared to obtain a 32-bit feature: if the energy of the i-th sub-block is greater than that of the (i+1)-th sub-block, the i-th bit is 1, otherwise 0. In particular, the 31st sub-block is compared with the 0th sub-block.
Through the above process, each video frame yields a 32-bit image DCT feature.
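The steps above can be sketched as follows. This is an illustrative sketch, not the patented implementation: the text does not specify which DCT band represents a block's energy, so this sketch assumes the DC coefficient; the function names and the plain (unnormalized) DCT-II are also assumptions.

```python
import numpy as np

def dct2(block):
    """Plain 2-D DCT-II of a square block (no external dependencies)."""
    n = block.shape[0]
    k = np.arange(n)
    c = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    return c @ block @ c.T

def dct_feature(gray):
    """32-bit frame feature: split a 32x64 grayscale image into 32
    blocks of 8x8, take |DC coefficient| of each block's DCT as its
    energy, and set bit i iff block i has more energy than block i+1
    (block 31 wraps around to block 0)."""
    h, w = gray.shape  # expected (32, 64): 32 rows high, 64 columns wide
    blocks = [gray[r:r + 8, c:c + 8]
              for r in range(0, h, 8) for c in range(0, w, 8)]
    energy = [abs(dct2(b)[0, 0]) for b in blocks]  # 32 energy values
    bits = [energy[i] > energy[(i + 1) % 32] for i in range(32)]
    return int(sum(int(b) << i for i, b in enumerate(bits)))
```

Because only relative energies between adjacent blocks are kept, the feature is insensitive to uniform brightness or contrast changes, which suits the copy-detection setting.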
After the audio features and the image features of the video frames have been obtained through the above process, the image features and audio features are fused. The fusion method is shown in Figure 5 (where the vertical axis is the time axis).
As shown in Figure 5, in this embodiment the audio features consist of M = 2048 (this value is configurable) 32-bit features per second, and the image features of the video frames consist of n 32-bit features per second (where n is the frame rate of the video, usually no more than 60).
Accordingly, this embodiment fuses features by mapping one video frame to several audio frames: 2048 64-bit audio-video fusion features are generated per second, where each fusion feature contains the feature of one individual audio frame, and every 2048/n adjacent fusion features share the image DCT feature of the same video frame.
By fusing the audio features of the audio-video image with the image features of its video frames as described above, the audio-video fusion features of the audio-video image are obtained.
Afterwards, based on the feature database of the preset reference video, the audio-video fusion features are matched to obtain the frame-set matching result of the audio-video image.
This embodiment presets a feature database of reference videos, in which a matching table has been created so that the corresponding features of a video to be detected can be retrieved quickly.
When matching the audio-video fusion features, the matching table is first obtained from the feature database of the preset reference video. For each audio-video fusion feature, the features satisfying a preset condition are queried from the matching table and taken as the similar features of that fusion feature. For example, the features in the matching table whose Hamming distance to the audio-video fusion feature does not exceed a preset threshold (such as 3) are taken as its similar features. The similar features of the audio-video fusion features together constitute the frame-set matching result of the audio-video image.
More specifically, this embodiment takes the following into account: for a query video (a video to be checked for copying) and a reference video, if the similarity of their features is compared frame by frame, the required time complexity is proportional to the lengths of both videos, which does not scale to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query strategy for audio-video fusion features.
The basic goal of a simhash index is: given a database of many 64-bit features and a 64-bit query feature, quickly find all features whose Hamming distance to the query is at most 3 (i.e., at most 3 of the 64 bits differ from the query feature). The principle of the algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into four 16-bit blocks, then at least one 16-bit block must be identical to the corresponding block of the query feature. Similarly, within the remaining 48 bits there must be a 12-bit block that is identical to the query feature. After matching through these two levels of index lookup, at most 3 differing positions remain to be enumerated in the remaining 36 bits, which greatly reduces the complexity of the original algorithm.
The 64-bit audio-video fusion features used in the present invention have the same query requirement as simhash: for a given 64-bit feature, all features differing from it in at most 3 bits must be found (two such features are considered related). In addition, there is the following restriction: the first 32 bits of two related features differ in at most 2 bits, and the last 32 bits likewise differ in at most 2 bits. Based on this, this embodiment follows the simhash approach but expands the number of index tables to 24. The specific expansion is shown in Figure 7:
In the matching algorithm design shown in Figure 7, consider first the case where the last 32 bits differ in at most 1 bit; the first 32 bits then differ in at most 2 bits. Referring to Figure 7, at least 2 of the blocks A, B, C, D are completely identical, and at least one of the blocks E, F is completely identical, so a matching table keyed on those 32 identical bits can be constructed. There are C(4,2)*C(2,1)*2 such query tables in total, the final factor of 2 arising because the roles of the two halves may be swapped. Therefore 24 sub-tables can be constructed in total as the created matching table, which is used for fast lookup of audio-video fusion features.
Then, by querying the matching table constructed above, the similar features of each audio-video fusion feature are obtained, giving the feature retrieval result.
According to the feature retrieval result obtained in the above process, combined with a video copy segment positioning method, it is determined whether the query video is a copy. If the query video is determined to be a copy, the corresponding copy segment position is given.
This embodiment takes the following into account: for two videos, if the similarity between every pair of frames is computed, the similarity matrix shown at the far right of Figure 8 is obtained. The goal of finding similar segments of the two videos is thus transformed into finding line segments in the similarity matrix whose similarity exceeds a certain threshold; this approach, however, is expensive in time.
The principle of copy judgment and positioning for an audio-video image in this embodiment is: the above indexing algorithm finds the brightest points in the similarity matrix (those with the highest similarity), such as the bright spots shown at the far left of Figure 8, and extends these points in time, yielding the similar segments (i.e., candidate copy segments) shown in the middle of Figure 8. These are then screened by a threshold, which determines whether the two videos constitute a copy; if they do, the start and end times of the similar segments can be recorded.
Specifically, when performing copy judgment and positioning on the audio-video image, the audio/video frames of the reference video corresponding to the similar features obtained above (the bright spots in the leftmost panel of Figure 8) are first extended in time to obtain a reference video segment, and the audio/video frames of the audio-video image corresponding to those similar features are likewise extended in time to obtain the similar segment that the audio-video image forms with respect to the reference video (as shown in the middle panel of Figure 8). The similarity between the similar segment of the audio-video image and the reference video segment is then calculated: the similarity between each audio/video frame of the similar segment and the corresponding frame of the reference video segment is computed, and the per-frame similarities are averaged. If this similarity exceeds a set threshold, the audio-video image is judged to constitute a copy, and the start and end positions of the similar segment of the audio-video image are recorded.
That is, when calculating the similarity between the audio/video frames of the similar segment and the reference video, each frame of the similar segment (including its 64-bit feature) is compared against the corresponding frame of the reference video segment to compute a per-frame similarity; the values are averaged, and the average is compared with the preset threshold. If it exceeds the threshold, the audio-video image is judged to constitute a copy, and the start and end positions of its similar segment are recorded.
For example: suppose that within a similar segment, the 100 frames between seconds 10 and 20 of the query video (i.e., one audio-video sequence) correspond to the 100 frames between seconds 30 and 40 of the reference video. Each of the 100 query frames is then compared with the corresponding reference frame, and the similarity of each frame is computed separately. For instance, if 50 of the 64 bits of the first frame are identical to the reference frame, the similarity of the first frame is S1 = 50/64 ≈ 0.78125. In the same way, the similarities S2, ..., S100 of the remaining frames are obtained and averaged, giving the similarity between the query video and the reference video within the similar segment. Suppose this average is 0.95; comparing it with the set threshold (say 0.9), the query video is determined to constitute a copy, and the start and end positions of the similar segment are recorded.
In the above copy determination and localization process, a query video may contain multiple similar segments; in that case, the multiple similar segments are recorded together.
It should be noted that, in the above process of this embodiment, when judging from the frame-set matching result whether the query video is a copy of some video in the reference video library, other algorithms may also be used, such as the Hough transform, the Smith-Waterman algorithm, the BLAST algorithm, or a temporal pyramid algorithm. These algorithms find the most similar segment pair between the query video and some reference video, and a threshold decides whether a copy is constituted. For a video judged to be a copy, the start and end of the copied segment are determined, and that partial segment is marked as the copy segment.
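Of the alternatives named above, the Hough-transform style of alignment is simple to sketch: every matched frame pair votes for a temporal offset, and the dominant offset delimits the best-aligned segment. The data and names below are hypothetical:

```python
from collections import Counter

def hough_align(matches):
    """matches: (query_frame_idx, ref_frame_idx) pairs from feature lookup.
    Each pair votes for the offset ref - query; the most-voted offset wins,
    and the segment bounds are the extreme query indices voting for it."""
    votes = Counter(r - q for q, r in matches)
    offset = votes.most_common(1)[0][0]
    voters = [q for q, r in matches if r - q == offset]
    return offset, min(voters), max(voters)

# Query frames 10..19 hit reference frames 30..39, plus one spurious match.
matches = [(q, q + 20) for q in range(10, 20)] + [(3, 99)]
print(hough_align(matches))  # (20, 10, 19)
```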
Through the above scheme, this embodiment combines audio and video, which not only improves the robustness of the video copy detection system but also, by fusing audio and video features, greatly accelerates the execution efficiency of the copy detection system; joint audio-video analysis further improves the localization accuracy of the copy segment.
As shown in Figure 11, a second embodiment of the present invention proposes an audio-video copy detection device which, based on the above embodiment, further includes:
a creation module 200 for creating the matching table in the feature database of the reference videos.
Specifically, the matching table is created so that the relevant features of a video to be detected can be retrieved quickly. The matching table is created from the reference videos; the specific creation process is as follows:
First, reference video segments are collected, and audio/video decoding and preprocessing are performed on them to obtain the audio portion and the video frames of each reference video.
Then, feature extraction is performed on the audio portion and the video frames of the reference video to obtain the audio features and the video-frame image features of the reference video.
Next, audio-video feature fusion is performed on the reference video to obtain its audio-video fusion features.
Finally, the matching table is created from the audio-video fusion features of the reference video, for use in feature-index retrieval when a query video is subsequently processed.
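The creation steps above amount to building an inverted index over the reference fingerprints. A simplified single-level sketch, ignoring the 24-table expansion the description introduces below (identifiers are illustrative):

```python
def build_matching_table(ref_features):
    """ref_features: {(video_id, frame_no): 64-bit fingerprint}.
    Returns four inverted tables, one per 16-bit block position, each
    mapping a block value to the reference frames containing it there."""
    tables = [{} for _ in range(4)]
    for frame, fp in ref_features.items():
        for i in range(4):
            block = (fp >> (16 * i)) & 0xFFFF
            tables[i].setdefault(block, []).append(frame)
    return tables

tables = build_matching_table({("ref1", 0): 0x1234_5678_9ABC_DEF0})
print(tables[0][0xDEF0])  # [('ref1', 0)]
```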
The creation of the matching table from the audio-video fusion features of the reference videos is based on the following consideration: for one query video (the video to be checked for copying) and one reference video, comparing their features frame by frame requires time proportional to the lengths of both videos, which does not scale to a large database. The present invention therefore builds on the existing simhash technique and proposes an indexing and query strategy based on audio-video fusion features.
The basic goal of simhash indexing is: in a database of many 64-bit features, for one 64-bit query feature, quickly find all features whose Hamming distance from it is at most 3 (i.e., at most 3 of the 64 bits differ from the query feature). A schematic of the algorithm is shown in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into four 16-bit blocks, at least one 16-bit block must be completely identical to the query feature. Similarly, within the remaining 48 bits, there must exist a 12-bit block that is completely identical to the query feature. After two index lookups, the at most 3 differing bit positions can be enumerated within the remaining 36 bits, which greatly reduces the complexity of the naive algorithm.
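The pigeonhole argument above can be sketched directly. This is a self-contained illustration of the block-index-then-verify idea, not the patented 24-table layout; all names are hypothetical:

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def blocks(fp: int):
    """The four 16-bit blocks of a 64-bit fingerprint."""
    return [(fp >> (16 * i)) & 0xFFFF for i in range(4)]

def simhash_lookup(query_fp, ref_fps, max_dist=3):
    """If two 64-bit fingerprints differ in at most 3 bits, at least one of
    their four 16-bit blocks is identical (pigeonhole).  Index every block
    position, probe with the query's blocks, then verify each candidate."""
    tables = [{} for _ in range(4)]
    for name, fp in ref_fps.items():
        for i, b in enumerate(blocks(fp)):
            tables[i].setdefault(b, set()).add(name)
    candidates = set()
    for i, b in enumerate(blocks(query_fp)):
        candidates |= tables[i].get(b, set())
    return {n for n in candidates if hamming(query_fp, ref_fps[n]) <= max_dist}

print(simhash_lookup(0, {"near": 0b111, "far": (1 << 40) - 1}))  # {'near'}
```

Only candidates sharing an exact block with the query are verified, so the full Hamming distance is computed for a small fraction of the database.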
The 64-bit audio-video fusion features used in the present invention have the same query characteristic as simhash, namely: all features differing from a given 64-bit feature by at most 3 bits must be found (two such features are considered related). In addition, the following constraint holds: the first 32 bits of two related features differ by at most 2 bits, and the last 32 bits likewise differ by at most 2 bits. Based on this, the present embodiment follows the simhash approach but expands the number of index tables to 24; the specific expansion is shown in Figure 7:
In the matching-table design shown in Figure 7, consider first the case where the last 32 bits differ by at most 1 bit; the first 32 bits then differ by at most 2 bits. Referring to Figure 7, at least 2 of the blocks A, B, C, D are completely identical, and at least one of the blocks E, F is completely identical, so a matching table keyed on 32 completely identical bits can be constructed. There are C(4,2) × C(2,1) × 2 such lookup tables, the factor 2 arising because, symmetrically, it may instead be the first 32 bits that differ by at most 1 bit while the last 32 bits differ by at most 2. Altogether, 24 sub-tables can thus be constructed and used as the created matching table for fast lookup of audio-video fusion features.
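The table count C(4,2) × C(2,1) × 2 = 24 can be checked by enumerating the block combinations; the block names A-F and the key layout below are illustrative:

```python
from itertools import combinations
from math import comb

# Case 1: the last 32 bits differ by at most 1 bit, the first 32 by at most 2.
#   First 32 bits -> four 8-bit blocks A, B, C, D; some C(4,2) pair is exact.
#   Last 32 bits  -> two 16-bit blocks E, F; some C(2,1) block is exact.
# Case 2 mirrors this with the halves swapped, giving the final factor of 2.
table_keys = [
    (case, exact_pair, exact_block)
    for case in ("tight_last_half", "tight_first_half")
    for exact_pair in combinations("ABCD", 2)
    for exact_block in combinations("EF", 1)
]
print(len(table_keys), comb(4, 2) * comb(2, 1) * 2)  # 24 24
```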
In the audio-video copy detection method and device of the embodiments of the present invention, an audiovisual presentation is obtained, decoded, and preprocessed to obtain its audio portion and video frames; feature extraction is performed on the audio portion and the video frames of the audiovisual presentation to obtain the corresponding audio features and video-frame image features; the audio features and the video-frame image features are fused to obtain the audio-video fusion features of the audiovisual presentation; the audio-video fusion features are matched against the feature database of preset reference videos to obtain the frame-set matching result of the audiovisual presentation; and copy judgment and localization are performed on the audiovisual presentation based on its frame-set matching result and the reference videos. Combining audio and video in this way not only improves the robustness of the video copy detection system but also, by fusing audio and video features, greatly accelerates the execution efficiency of the copy detection system; joint audio-video analysis further improves the localization accuracy of the copy segment.
It should also be noted that, herein, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product that is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above is only a preferred embodiment of the present invention and does not limit the scope of the invention. Any equivalent structural or process transformation made using the contents of the description and the drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (13)
1. An audio-video copy detection method, comprising:
obtaining an audiovisual presentation, and decoding and preprocessing the audiovisual presentation to obtain an audio portion and video frames of the audiovisual presentation;
performing feature extraction on the audio portion and the video frames of the audiovisual presentation to obtain corresponding audio features and video-frame image features of the audiovisual presentation;
fusing the audio features and the video-frame image features of the audiovisual presentation to obtain audio-video fusion features of the audiovisual presentation, including: fusing an audio sub-band energy difference feature of the audio portion of the audiovisual presentation with an image DCT feature of the video frames to obtain the audio-video fusion features of the audiovisual presentation;
matching the audio-video fusion features based on a feature database of preset reference videos to obtain a frame-set matching result of the audiovisual presentation, including: obtaining a matching table from the feature database of the preset reference videos; for each audio-video fusion feature, querying the matching table for features whose Hamming distance from the audio-video fusion feature does not exceed a preset threshold, as similar features of the audio-video fusion feature; and collecting the similar features of the audio-video fusion features to obtain the frame-set matching result of the audiovisual presentation; and
performing copy judgment and localization on the audiovisual presentation based on the frame-set matching result of the audiovisual presentation and the reference videos.
2. The method according to claim 1, wherein the step of performing feature extraction on the audio portion of the audiovisual presentation to obtain the corresponding audio features of the audiovisual presentation comprises:
filtering audio frames of the audio portion of the audiovisual presentation and transforming them by Fourier transform into frequency-domain energy;
dividing the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship;
calculating the difference of the absolute values of the energy between adjacent sub-bands to obtain an audio sub-band energy difference feature of each audio frame; and
sampling the audio frames at a predetermined interval to obtain the audio sub-band energy difference feature of the audio portion of the audiovisual presentation.
3. The method according to claim 1, wherein the step of performing feature extraction on the video frames of the audiovisual presentation to obtain the corresponding video-frame image features of the audiovisual presentation comprises:
for each video frame of the audiovisual presentation, converting its image into a grayscale image and performing compression processing;
dividing the compressed grayscale image into several sub-blocks;
calculating a DCT energy value of each sub-block;
comparing the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame; and
obtaining, according to the above process, the image DCT features of the video frames of the audiovisual presentation.
4. The method according to claim 1, 2 or 3, wherein the step of fusing the audio features and the video-frame image features of the audiovisual presentation to obtain the audio-video fusion features of the audiovisual presentation comprises:
letting the audio features be M 32-bit features per second and the video-frame image features be n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60; and
performing feature concatenation in such a way that one video frame corresponds to several audio frames, generating M 64-bit audio-video fusion features per second, wherein each audio-video fusion feature corresponds to the audio feature of an individual audio frame, and M/n adjacent audio-video fusion features correspond to the image feature of one and the same video frame.
5. The method according to claim 1, wherein the step of performing copy judgment and localization on the audiovisual presentation based on the frame-set matching result of the audiovisual presentation and the reference videos comprises:
performing temporal extension on the audio/video frames of the reference video corresponding to the similar features to obtain a reference video segment of the reference video, and performing temporal extension on the audio/video frames in the audiovisual presentation corresponding to the similar features to obtain a similar segment in the audiovisual presentation relative to the reference video;
calculating the similarity between the similar segment in the audiovisual presentation and the reference video segment; and
if the similarity is greater than a set threshold, judging that the audiovisual presentation constitutes a copy, and recording a start position and an end position of the similar segment of the audiovisual presentation.
6. The method according to claim 1, wherein, before the step of obtaining the audiovisual presentation, the method further comprises:
creating the matching table in the feature database of the reference videos.
7. An audio-video copy detection device, comprising:
a decoding and preprocessing module for obtaining an audiovisual presentation, and decoding and preprocessing the audiovisual presentation to obtain an audio portion and video frames of the audiovisual presentation;
a feature extraction module for performing feature extraction on the audio portion and the video frames of the audiovisual presentation to obtain corresponding audio features and video-frame image features of the audiovisual presentation;
a fusion module for fusing the audio features and the video-frame image features of the audiovisual presentation to obtain audio-video fusion features of the audiovisual presentation, including: fusing an audio sub-band energy difference feature of the audio portion of the audiovisual presentation with an image DCT feature of the video frames to obtain the audio-video fusion features of the audiovisual presentation;
a matching module for matching the audio-video fusion features based on a feature database of preset reference videos to obtain a frame-set matching result of the audiovisual presentation; and
a copy determination module for performing copy judgment and localization on the audiovisual presentation based on the frame-set matching result of the audiovisual presentation and the reference videos;
wherein the matching module is further configured to obtain a matching table from the feature database of the preset reference videos; for each audio-video fusion feature, query the matching table for features whose Hamming distance from the audio-video fusion feature does not exceed a preset threshold, as similar features of the audio-video fusion feature; and collect the similar features of the audio-video fusion features to obtain the frame-set matching result of the audiovisual presentation.
8. The device according to claim 7, wherein
the feature extraction module is further configured to filter audio frames of the audio portion of the audiovisual presentation and transform them by Fourier transform into frequency-domain energy; divide the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship; calculate the difference of the absolute values of the energy between adjacent sub-bands to obtain an audio sub-band energy difference feature of each audio frame; and sample the audio frames at a predetermined interval to obtain the audio sub-band energy difference feature of the audio portion of the audiovisual presentation.
9. The device according to claim 7, wherein
the feature extraction module is further configured to, for each video frame of the audiovisual presentation, convert its image into a grayscale image and perform compression processing; divide the compressed grayscale image into several sub-blocks; calculate a DCT energy value of each sub-block; compare the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame; and obtain, according to the above process, the image DCT features of the video frames of the audiovisual presentation.
10. The device according to claim 7, 8 or 9, wherein
the fusion module is further configured to let the audio features be M 32-bit features per second and the video-frame image features be n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60; and perform feature concatenation in such a way that one video frame corresponds to several audio frames, generating M 64-bit audio-video fusion features per second, wherein each audio-video fusion feature corresponds to the audio feature of an individual audio frame, and M/n adjacent audio-video fusion features correspond to the image feature of one and the same video frame.
11. The device according to claim 7, wherein
the copy determination module is further configured to perform temporal extension on the audio/video frames of the reference video corresponding to the similar features to obtain a reference video segment of the reference video; perform temporal extension on the audio/video frames in the audiovisual presentation corresponding to the similar features to obtain a similar segment in the audiovisual presentation relative to the reference video; calculate the similarity between the similar segment in the audiovisual presentation and the reference video segment; and, if the similarity is greater than a set threshold, judge that the audiovisual presentation constitutes a copy and record a start position and an end position of the similar segment of the audiovisual presentation.
12. The device according to claim 7, further comprising:
a creation module for creating the matching table in the feature database of the reference videos.
13. A storage medium, wherein computer instructions are stored in the storage medium; when executed, the computer instructions implement the audio-video copy detection method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510041044.3A CN105989000B (en) | 2015-01-27 | 2015-01-27 | Audio-video copy detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105989000A CN105989000A (en) | 2016-10-05 |
CN105989000B true CN105989000B (en) | 2019-11-19 |
Family
ID=57034765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510041044.3A Active CN105989000B (en) | 2015-01-27 | 2015-01-27 | Audio-video copy detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105989000B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110019895B (en) * | 2017-07-27 | 2021-05-14 | 杭州海康威视数字技术股份有限公司 | Image retrieval method and device and electronic equipment |
CN110110502B (en) * | 2019-04-28 | 2023-07-14 | 得一微电子股份有限公司 | Anti-copy method and device for audio files and mobile storage device |
CN110222719B (en) * | 2019-05-10 | 2021-09-24 | 中国科学院计算技术研究所 | Figure identification method and system based on multi-frame audio and video fusion network |
CN111274449B (en) * | 2020-02-18 | 2023-08-29 | 腾讯科技(深圳)有限公司 | Video playing method, device, electronic equipment and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737135A (en) * | 2012-07-10 | 2012-10-17 | 北京大学 | Video copy detection method and system based on soft cascade model sensitive to deformation |
Non-Patent Citations (2)
Title |
---|
Listen, Look, and Gotcha: Instant Video Search with Mobile Phones by Layered Audio-Video Indexing; Wu Liu et al.; Proceedings of the 21st ACM International Conference on Multimedia; 2013-10-25; text pp. 2-9 *
Content-based duplicate audio-video detection; Wu Siyuan; China Master's Theses Full-text Database, Information Science and Technology Series; 2013-11-15 (No. 11); pp. I138-743, text pp. 14-16 and 28-32 *
Also Published As
Publication number | Publication date |
---|---|
CN105989000A (en) | 2016-10-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11564001B2 (en) | Media content identification on mobile devices | |
CN111062871B (en) | Image processing method and device, computer equipment and readable storage medium | |
US11482242B2 (en) | Audio recognition method, device and server | |
CN111046235B (en) | Method, system, equipment and medium for searching acoustic image archive based on face recognition | |
CN108989882B (en) | Method and apparatus for outputting music pieces in video | |
CN110246512A (en) | Sound separation method, device and computer readable storage medium | |
US11729458B2 (en) | Media content identification on mobile devices | |
US10757468B2 (en) | Systems and methods for performing playout of multiple media recordings based on a matching segment among the recordings | |
CN105989000B (en) | Audio-video copy detection method and device | |
CN107293307A (en) | Audio-frequency detection and device | |
CN109271533A (en) | A kind of multimedia document retrieval method | |
CN105430494A (en) | Method and device for identifying audio from video in video playback equipment | |
CN109949798A (en) | Commercial detection method and device based on audio | |
CN111818385B (en) | Video processing method, video processing device and terminal equipment | |
CN109117622A (en) | A kind of identity identifying method based on audio-frequency fingerprint | |
US11537636B2 (en) | System and method for using multimedia content as search queries | |
US20130191368A1 (en) | System and method for using multimedia content as search queries | |
Kawale et al. | Analysis and simulation of sound classification system using machine learning techniques | |
CN110019907A (en) | A kind of image search method and device | |
CN113099283B (en) | Method for synchronizing monitoring picture and sound and related equipment | |
CN114677627A (en) | Target clue finding method, device, equipment and medium | |
CN114722234A (en) | Music recommendation method, device and storage medium based on artificial intelligence | |
KR20080107143A (en) | System and method for recommendation of music and moving video based on audio signal processing | |
WO2023160515A1 (en) | Video processing method and apparatus, device and medium | |
CN103955708A (en) | Face photo library fast-reduction method for face synthesis portrait recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20211012 Address after: 518000 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 Floors Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd. Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. |
|
TR01 | Transfer of patent right |