CN105989000A - Audio/video (AV) copy detection method and device - Google Patents
- Publication number: CN105989000A
- Application number: CN201510041044.3A
- Authority
- CN
- China
- Prior art keywords
- video
- audio
- frame
- audiovisual presentation
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Television Signal Processing For Recording (AREA)
Abstract
The invention relates to an audio/video (AV) copy detection method and device. The AV copy detection method includes: acquiring an AV image, decoding and preprocessing the AV image, and performing feature extraction on the resulting audio part and video frames to obtain the corresponding audio features and the image features of the video frames; fusing the audio features and the image features of the video frames corresponding to the AV image to obtain AV fusion features; matching the AV fusion features against a preset feature library of a reference video to obtain a frame set matching result; and performing copy judgment and localization on the AV image based on the frame set matching result and the reference video. By combining audio and video, the robustness of the video copy detection system is enhanced; by fusing the audio features and the video features, the execution efficiency of the copy detection system is greatly improved; and by analyzing audio and video jointly, the localization accuracy of copied segments is improved.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to an audio/video copy detection method and device.
Background art
When performing copy detection on video images, existing schemes mainly use content-based video copy detection methods. The two prevailing approaches are video copy detection based on image features of video key frames, and video copy detection that combines audio and video feature detection results.
In the key-frame-based scheme, the main process includes: video decoding and preprocessing, video image feature extraction, feature indexing and retrieval, and copy judgment and localization. The scheme ultimately judges whether a query video constitutes a copy and, for videos judged to be copies, determines the start and end of the copied segment so that the segment can be marked. However, this approach does not incorporate audio information into the copy detection scheme, even though audio is an important complement to the visual content of a video. It therefore not only weakens the robustness of the video copy detection system, but also yields poor localization accuracy for copied segments, particularly when the video picture changes little.
Compared with the key-frame-based scheme, the scheme that combines audio and video feature detection results includes audio features, and can therefore take full advantage of the fact that audio queries are fast and relatively accurate. However, because audio and video features differ substantially, existing copy detection schemes usually run copy detection on audio and video separately and fuse only at the result level to judge whether the query video is a copy. Fusing at the result level requires extracting more features and running most of them through the complete copy detection pipeline, which incurs a large time overhead and increases algorithm complexity.
Summary of the invention
Embodiments of the present invention provide an audio/video copy detection method and device, intended to improve the efficiency and precision of video copy detection.
An embodiment of the present invention proposes an audio/video copy detection method, including:
acquiring an audio/video image, and decoding and preprocessing the audio/video image to obtain its audio part and video frames;
performing feature extraction on the audio part and the video frames of the audio/video image to obtain the corresponding audio features and the image features of the video frames;
fusing the audio features and the image features of the video frames corresponding to the audio/video image to obtain the audio/video fusion features of the audio/video image;
matching the audio/video fusion features against a preset feature library of a reference video to obtain the frame set matching result of the audio/video image;
performing copy judgment and localization on the audio/video image based on the frame set matching result and the reference video.
An embodiment of the present invention also proposes an audio/video copy detection device, including:
a decoding and preprocessing module, configured to acquire an audio/video image, and to decode and preprocess the audio/video image to obtain its audio part and video frames;
a feature extraction module, configured to perform feature extraction on the audio part and the video frames of the audio/video image to obtain the corresponding audio features and the image features of the video frames;
a fusion module, configured to fuse the audio features and the image features of the video frames corresponding to the audio/video image to obtain the audio/video fusion features of the audio/video image;
a matching module, configured to match the audio/video fusion features against a preset feature library of a reference video to obtain the frame set matching result of the audio/video image;
a copy judgment module, configured to perform copy judgment and localization on the audio/video image based on the frame set matching result and the reference video.
In the audio/video copy detection method and device proposed by the embodiments of the present invention, an audio/video image is acquired, decoded, and preprocessed to obtain its audio part and video frames; feature extraction is performed on the audio part and the video frames to obtain the corresponding audio features and the image features of the video frames; the audio features and image features are fused to obtain audio/video fusion features; the fusion features are matched against a preset feature library of a reference video to obtain a frame set matching result; and copy judgment and localization are performed based on the frame set matching result and the reference video. By combining audio and video in this way, the robustness of the video copy detection system is improved; by fusing the audio and video features, the execution efficiency of the copy detection system is greatly accelerated; and by analyzing audio and video jointly, the localization precision for copied segments is improved.
Brief description of the drawings
Fig. 1 is a hardware architecture diagram of the audio/video copy detection device of the present invention;
Fig. 2 is a flow diagram of the first embodiment of the audio/video copy detection method of the present invention;
Fig. 3 is a flow diagram of audio subband energy difference feature extraction in an embodiment of the present invention;
Fig. 4 is a flow diagram of extracting the image DCT features of the video frames of an audio/video image in an embodiment of the present invention;
Fig. 5 is a diagram of fusing image features and audio features in an embodiment of the present invention;
Fig. 6 is an example diagram of the simhash matching algorithm involved in an embodiment of the present invention;
Fig. 7 is a design diagram of the matching algorithm involved in an embodiment of the present invention;
Fig. 8 is a diagram of copy localization and extension involved in an embodiment of the present invention;
Fig. 9 is a flow diagram of the second embodiment of the audio/video copy detection method of the present invention;
Fig. 10 is a functional block diagram of the first embodiment of the audio/video copy detection device of the present invention;
Fig. 11 is a functional block diagram of the second embodiment of the audio/video copy detection device of the present invention.
To make the technical scheme of the present invention clearer, it is described in detail below in conjunction with the accompanying drawings.
Detailed description of the invention
It should be understood that the specific embodiments described herein are intended only to explain the present invention, and are not intended to limit it.
The primary solution of the embodiments of the present invention is to incorporate the audio information of a video into the video copy detection scheme. By combining audio and video, it is possible not only to strengthen the robustness of the video copy detection system, but also, by fusing the audio and video features, to greatly accelerate the execution efficiency of the copy detection system, and, by analyzing audio and video jointly, to improve the localization precision for copied segments.
Specifically, the embodiments of the present invention consider that existing video copy detection schemes either use only image features of video key frames, which weakens the robustness of the video copy detection system and yields poor localization accuracy for copied segments, or combine audio and video feature detection results, in which case fusing at the result level requires extracting more features and running most of them through the complete copy detection pipeline, increasing the time overhead; moreover, the corresponding algorithm complexity grows linearly with the amount of data being fused, further increasing complexity.
The scheme of this embodiment incorporates the audio information of the video into the copy detection scheme and combines audio and video through processing stages such as audio/video decoding and preprocessing, audio and video feature extraction, audio/video feature fusion, and copy judgment and localization. This not only strengthens the robustness of the video copy detection system, but also greatly accelerates the execution efficiency of the copy detection system through feature fusion, and improves copied-segment localization precision through joint audio/video analysis.
Specifically, the hardware architecture of the audio/video copy detection device involved in the audio/video copy detection scheme of the embodiments of the present invention may be as shown in Fig. 1. The detection device may be hosted on a PC, on a mobile terminal such as a mobile phone, tablet computer, or portable handheld device, or on other electronic equipment with audio/video copy detection functionality, such as a media playback apparatus.
As shown in Fig. 1, the detection device may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002, and a camera 1006. The communication bus 1002 realizes the connections and communication between these components of the detection device. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory, or a stable non-volatile memory such as a disk memory; optionally, it may also be a storage device independent of the aforementioned processor 1001.
Optionally, when hosted on a mobile terminal, the detection device may also include an RF (Radio Frequency) circuit, sensors, an audio circuit, a Wi-Fi module, and so on. The sensors include, for example, a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display according to the ambient light, and the proximity sensor can turn off the display and/or backlight when the mobile terminal is moved close to the user's ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes) and, when stationary, the magnitude and direction of gravity; it can be used for applications that recognize the attitude of the mobile terminal (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer or tap detection). The detection device may of course also be equipped with other sensors such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, which are not detailed here.
Those skilled in the art will understand that the device structure shown in Fig. 1 does not constitute a limitation on the detection device, which may include more or fewer components than illustrated, combine some components, or use a different arrangement of components.
As shown in Fig. 1, the memory 1005, as a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and an audio/video copy detection application program.
In the detection device shown in Fig. 1, the network interface 1004 is mainly used to connect to a back-end management platform and exchange data with it; the user interface 1003 is mainly used to connect to a client and exchange data with it; and the processor 1001 may be used to call the audio/video copy detection application program stored in the memory 1005 and perform the following operations:
acquiring an audio/video image, and decoding and preprocessing the audio/video image to obtain its audio part and video frames;
performing feature extraction on the audio part and the video frames of the audio/video image to obtain the corresponding audio features and the image features of the video frames;
fusing the audio features and the image features of the video frames corresponding to the audio/video image to obtain the audio/video fusion features of the audio/video image;
matching the audio/video fusion features against a preset feature library of a reference video to obtain the frame set matching result of the audio/video image;
performing copy judgment and localization on the audio/video image based on the frame set matching result and the reference video.
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operations:
filtering the audio frames of the audio part of the audio/video image, and converting them into frequency-domain energy by Fourier transform;
dividing the obtained frequency-domain energy, according to a logarithmic relationship, into several subbands within a predetermined frequency range;
computing the difference of the absolute values of the energies between adjacent subbands to obtain the audio subband energy difference feature of each audio frame;
sampling audio frames at a predetermined interval to obtain the audio subband energy difference features of the audio part of the audio/video image.
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operations:
for each video frame of the audio/video image, converting its image into a grayscale image and compressing it;
dividing the compressed grayscale image into several sub-blocks;
computing the DCT energy value of each sub-block;
comparing the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame;
repeating this process to obtain the image DCT features of the video frames of the audio/video image.
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operations:
taking the audio features to be M 32-bit features per second, and the image features of the video frames to be n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60;
concatenating features in such a way that one video frame corresponds to several audio frames, producing M 64-bit audio/video fusion features per second, where each fusion feature corresponds to the audio feature of a single audio frame, and every M/n adjacent fusion features correspond to the image feature of the same video frame.
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operations:
obtaining a matching table from the preset feature library of the reference video;
for each audio/video fusion feature, querying the matching table for features whose Hamming distance to the fusion feature is less than a predetermined threshold, as the similar features of that fusion feature;
collecting the similar features of the fusion features to obtain the frame set matching result of the audio/video image.
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operations:
temporally extending the audio/video frames of the reference video corresponding to the similar features, to obtain the similar segments formed by the corresponding audio/video frames of the audio/video image relative to the reference video;
based on the similar segments, computing the similarity between the corresponding audio/video frames of the audio/video image and the reference video;
if the similarity is greater than a set threshold, judging that the audio/video image constitutes a copy, and recording the start position and end position of the similar segment of the audio/video image.
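The temporal extension and judgment steps above can be sketched in Python. This is an illustrative reconstruction under stated assumptions, not the patent's exact procedure (Fig. 8 is not reproduced here): grouping matched reference-frame indices into segments with a `max_gap` tolerance, and using matched-frame density within a segment as the similarity score, are both assumptions.

```python
def locate_copy(matched_idx, sim_threshold=0.8, max_gap=5):
    """Group matched reference-frame indices into candidate copy segments.

    matched_idx: sorted frame indices of the reference video that matched
    the query. Returns (start, end) index pairs whose matched-frame density
    exceeds sim_threshold (a stand-in for the patent's similarity score).
    """
    segments, start, prev = [], None, None
    for i in matched_idx:
        if start is None:
            start = prev = i
        elif i - prev <= max_gap:        # temporal extension of the segment
            prev = i
        else:                            # gap too large: close the segment
            segments.append((start, prev))
            start = prev = i
    if start is not None:
        segments.append((start, prev))
    # keep only segments dense enough in matches to count as copies
    return [(s, e) for s, e in segments
            if sum(1 for i in matched_idx if s <= i <= e) / (e - s + 1)
            >= sim_threshold]
```

A sparse run of matches (density below the threshold) is discarded, while tight runs are reported with their start and end positions, mirroring the recorded start/end of the similar segment described above.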
In one embodiment, the processor 1001, calling the audio/video copy detection application program stored in the memory 1005, may perform the following operation:
creating the matching table in the feature library of the reference video.
Through the above scheme, this embodiment acquires an audio/video image, decodes and preprocesses it to obtain its audio part and video frames; performs feature extraction on the audio part and video frames to obtain the corresponding audio features and the image features of the video frames; fuses the audio features and image features to obtain the audio/video fusion features; matches the fusion features against a preset feature library of a reference video to obtain a frame set matching result; and performs copy judgment and localization on the audio/video image based on the frame set matching result and the reference video. By combining audio and video, it not only improves the robustness of the video copy detection system, but also greatly accelerates the execution efficiency of the copy detection system through feature fusion, and improves copied-segment localization precision through joint audio/video analysis.
Based on the above hardware architecture, embodiments of the audio/video copy detection method of the present invention are proposed.
As shown in Fig. 2, the first embodiment of the present invention proposes an audio/video copy detection method, including:
Step S101: acquiring an audio/video image, and decoding and preprocessing the audio/video image to obtain its audio part and video frames.
Specifically, first, the audio/video image on which copy detection is to be performed is acquired; it may be obtained locally, or obtained from an external source over a network.
The acquired audio/video image is decoded and preprocessed: the audio of the video is extracted and downsampled to mono at 5512.5 Hz, and each video frame is extracted frame by frame, thereby obtaining the audio part of the audio/video image and its video frames.
Step S102: performing feature extraction on the audio part and the video frames of the audio/video image to obtain the corresponding audio features and the image features of the video frames.
This part performs feature extraction mainly on the audio corresponding to the video and on all video frames. Because audio features lend themselves to binary-bit representations, binary indexes or LSH (locality-sensitive hashing) are often used to accelerate queries. The audio feature extracted by the present invention is the audio subband energy difference feature; the image feature extracted for the video frames is the DCT (Discrete Cosine Transform) feature.
The process of performing feature extraction on the audio part of the audio/video image to obtain the corresponding audio features includes: filtering each audio frame of the audio part and converting it into frequency-domain energy by Fourier transform; dividing the obtained frequency-domain energy, according to a logarithmic relationship, into several subbands within a predetermined frequency range; computing the difference of the absolute values of the energies between adjacent subbands to obtain the audio subband energy difference feature of each audio frame; and sampling audio frames at a predetermined interval to obtain the audio subband energy difference features of the audio part of the audio/video image.
More specifically, the extraction flow of the audio subband energy difference feature in this embodiment is shown in Fig. 3. The main steps of the algorithm are:
First, the time-domain audio waveform of every 0.37 seconds (an audio frame) is filtered through a Hanning window and transformed into frequency-domain energy by Fourier transform.
Second, the obtained frequency-domain energy is divided, according to a logarithmic relationship (Bark scale), into 33 subbands within the most sensitive range of human hearing (300 Hz to 2000 Hz), and the differences of the absolute values of the energies between adjacent subbands of consecutive frames (spaced 11 milliseconds apart) are computed, so that each audio frame yields a 32-bit audio feature.
A bit of "1" indicates that the energy difference of two adjacent subbands in the current audio frame is greater than the energy difference of the corresponding adjacent subbands in the next audio frame; otherwise the bit is 0.
The detailed process is as follows. In Fig. 3, the input is a segment of audio; the output is the several (n) audio features corresponding to that segment.
Framing: the audio segment is cut into several (n) audio frames. In this example, M = 2048 audio frames are collected per second (in other examples M may be set to other values), and each audio frame contains 0.37 seconds of audio content (adjacent audio frames overlap by 2047/2048).
Fourier Transform: converts the time-domain waveform (the original audio) into the energy of the different frequency bands in the frequency domain, which is convenient for analysis.
ABS: takes the absolute value of the wave energy information (i.e., considers only amplitude, not direction of vibration).
Band Division: divides the whole frequency domain between 300 Hz and 2000 Hz into 33 mutually non-overlapping bands (divided according to a logarithmic relationship, i.e., the lower the frequency, the narrower the band it belongs to). In this way, the energy of the original audio in each of these bands is obtained.
Energy Computation: computes the energy value of each audio frame on each of the 33 bands (each audio frame yields 33 energy values).
Bit Derivation: the 33 energy values are compared in sequence (the energy of the i-th subband with the energy of the (i+1)-th subband) to obtain 32 energy-value differences. These 32 differences are then compared between the current audio frame a and the next audio frame b: if the j-th difference of a is greater than the j-th difference of b, the j-th bit of a's feature is 1; otherwise it is 0. This yields the ordering relationship of the 32 energy-value differences between a and b, which is the 32-bit feature of audio frame a.
The present invention adopts this audio feature and samples audio frames at intervals of 1/2048 second, so that each second of audio generates 2048 32-bit audio features.
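The pipeline above can be sketched in Python. This is a hedged reconstruction, not the patent's implementation: the band-energy definition (sum of spectral magnitudes), the geometric band spacing used in place of true Bark-scale edges, and the `hop` value are all assumptions, since the stated figures (0.37 s frames, 2048 frames per second, 11 ms spacing) do not pin down a single parameterization at a 5512.5 Hz sample rate.

```python
import numpy as np

def band_edges(n_bands=33, f_lo=300.0, f_hi=2000.0):
    # 34 logarithmically spaced edges delimiting 33 non-overlapping subbands
    return np.geomspace(f_lo, f_hi, n_bands + 1)

def frame_band_energies(frame, sr, edges):
    # Hanning window -> Fourier transform -> absolute value -> per-band energy
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

def subband_diff_bits(e_cur, e_next):
    # bit j is 1 iff the j-th adjacent-subband energy difference of the
    # current frame exceeds the corresponding difference of the next frame
    d_cur, d_next = np.diff(e_cur), np.diff(e_next)
    bits = 0
    for j in range(32):
        bits = (bits << 1) | int(d_cur[j] > d_next[j])
    return bits

def audio_features(signal, sr=5512.5, frame_len=2048, hop=64):
    # hop controls frame spacing; 2048 frames/s would need a sub-3-sample
    # hop at this rate, so a coarser hop is used here for illustration
    edges = band_edges()
    energies = [frame_band_energies(signal[i:i + frame_len], sr, edges)
                for i in range(0, len(signal) - frame_len + 1, hop)]
    return [subband_diff_bits(a, b) for a, b in zip(energies, energies[1:])]
```

Each returned value is one 32-bit integer fingerprint for one audio frame, which is what the fusion step below consumes.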
Performing feature extraction on the video frames of the audio/video image to obtain the image features of the video frames may include: converting the image of each video frame into a grayscale image and compressing it; dividing the compressed grayscale image into several sub-blocks; computing the DCT energy value of each sub-block; comparing the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame; and repeating this process to obtain the image DCT features of the video frames of the audio/video image.
More specifically, the flow of extracting the image DCT features of the video frames of the audio/video image in this embodiment is shown in Fig. 4.
Given that the overall picture of internet video tends to change little, the embodiment of the present invention selects an efficient global image feature as the image feature of the video frames: the DCT feature.
The idea of the DCT feature is to divide the image into several sub-blocks and compare the energy levels of adjacent sub-blocks, thereby capturing the energy distribution of the entire image. The concrete steps of the algorithm are:
First, the color image is converted into a grayscale image and compressed (changing the aspect ratio) to 64 pixels wide by 32 pixels high.
Then, the grayscale image is divided into 32 sub-blocks (numbered 0 to 31 as shown in Fig. 4), each containing 8x8 pixels.
For each sub-block, the DCT energy value of that sub-block is computed; the absolute values of the energy coefficients are used to represent the energy of the sub-block.
Finally, the relative sizes of the energy values of adjacent sub-blocks are computed to obtain a 32-bit feature: if the energy of the i-th sub-block is greater than that of the (i+1)-th sub-block, the i-th bit is 1, otherwise 0. In particular, the 31st sub-block is compared with the 0th sub-block.
Through this process, each video frame yields a 32-bit image DCT feature.
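As a concrete illustration, the block-DCT feature above can be sketched as follows. This is a minimal sketch under stated assumptions: the row-major block numbering and the definition of "DCT energy" as the sum of absolute DCT coefficients are not fixed by the description and are assumptions here.

```python
import numpy as np

def dct2(block):
    # naive orthonormal 2-D DCT-II of a square block (8x8 here)
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m @ block @ m.T

def frame_dct_feature(gray):
    # gray: 32 (rows) x 64 (cols) grayscale frame, already resized
    assert gray.shape == (32, 64)
    blocks = [gray[r:r + 8, c:c + 8]
              for r in range(0, 32, 8) for c in range(0, 64, 8)]  # 32 blocks
    # per-block "DCT energy": sum of absolute DCT coefficients (assumption)
    energy = [np.abs(dct2(b)).sum() for b in blocks]
    bits = 0
    for i in range(32):
        nxt = energy[(i + 1) % 32]      # block 31 wraps around to block 0
        bits = (bits << 1) | int(energy[i] > nxt)
    return bits
```

Note that a perfectly uniform frame gives equal block energies, so every strict comparison fails and the feature is 0, which matches the "greater than" rule stated above.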
Step S103, fusing the audio features corresponding to the audio/video image and the image features of the video frames to obtain the audio/video fusion features of the audio/video image;
After the audio features corresponding to the video and the image features of the video frames have been obtained by the above process, the obtained image features and audio features are fused. The concrete fusion method is shown in Figure 5 (where the vertical axis is the time axis).
As shown in Figure 5, in the present embodiment, the audio features are set to M=2048 (this value can be configured) 32-bit features per second, while the image features of the video frames are n 32-bit features per second (n is the frame rate of the video, usually no more than 60).
Thus, the present embodiment performs feature concatenation by mapping one video frame to several audio frames, i.e. 2048 64-bit audio/video fusion features are generated per second, where each fusion feature corresponds to the feature of a single audio frame, and every 2048/n adjacent fusion features correspond to the image DCT feature of the same video frame.
By fusing the audio features corresponding to the audio/video image and the image features of the video frames as above, the audio/video fusion features of the audio/video image are obtained.
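The concatenation step can be sketched as follows. This is a minimal illustration; placing the audio feature in the high 32 bits and repeating the last video frame's feature at the tail are assumptions, and the function name is hypothetical.

```python
def fuse_features(audio_feats, video_feats, m=2048, fps=32):
    """Concatenate each 32-bit audio feature with the 32-bit DCT feature
    of the video frame covering the same instant -> 64-bit fusion features.

    audio_feats: list of 32-bit ints, m per second of audio.
    video_feats: list of 32-bit ints, fps per second of video.
    """
    step = m // fps                  # audio frames per video frame (2048/n)
    fused = []
    for i, a in enumerate(audio_feats):
        # every `step` adjacent fusion features share one video frame's feature
        v = video_feats[min(i // step, len(video_feats) - 1)]
        fused.append((a << 32) | v)  # assumption: audio in the high 32 bits
    return fused
```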
Step S104, matching the audio/video fusion features based on a preset feature library of a reference video to obtain a frame set matching result of the audio/video image;
The present embodiment presets a feature library of the reference video, and a matching table is created in the feature library of the reference video so that the corresponding features of the video to be detected can be retrieved quickly.
When matching the audio/video fusion features, first, the matching table is obtained from the preset feature library of the reference video; for each audio/video fusion feature, the features satisfying a preset condition are queried from the matching table as the similar features of that audio/video fusion feature. For example, the features whose Hamming distance to the audio/video fusion feature is less than a predetermined threshold (such as 3) are queried from the matching table as the similar features of the audio/video fusion feature. The similar features of all the audio/video fusion features are obtained, yielding the frame set matching result of the audio/video image.
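The matching criterion can be illustrated with a brute-force scan. The patent accelerates this with the matching table described next; the linear scan below only shows the Hamming-distance test itself.

```python
def hamming(a, b):
    """Number of differing bits between two integer features."""
    return bin(a ^ b).count("1")

def match_features(query_feats, reference_feats, max_dist=3):
    """For every 64-bit query fusion feature, collect the reference
    features within max_dist as its similar features."""
    result = []
    for q in query_feats:
        similar = [r for r in reference_feats if hamming(q, r) <= max_dist]
        result.append(similar)
    return result
```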
More specifically, the present embodiment considers the following:
For a query video (the video on which copy detection is to be performed) and a reference video, if the similarity of their features is compared frame by frame, the required time complexity is proportional to the lengths of both videos, which is unfavorable for scaling to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query matching strategy based on the audio/video fusion features.
The basic objective of the simhash index is: in a library of numerous 64-bit features, for a queried 64-bit feature, to quickly find all features whose Hamming distance to this feature is less than or equal to 3 (i.e. at most 3 of the 64 bits differ from the query feature). The principle of this algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into 4 blocks of 16 bits, there must exist one 16-bit block that is completely consistent with the query feature. Similarly, among the remaining 48 bits, there must exist one 12-bit block that is completely consistent with the query feature. After two rounds of index lookup and matching, at most 3 differing positions need to be enumerated within the remaining 36 bits, so the complexity of the original algorithm can be substantially reduced.
The 64-bit audio/video fusion features used in the present invention have the same query characteristic as simhash, i.e. all features differing from a given 64-bit feature by at most 3 bits need to be found (the two features are then considered correlated). In addition, the following restriction applies: the first 32 bits of two correlated features differ by at most 2 bits, and the last 32 bits of the two features likewise differ by at most 2 bits. Based on this, the present embodiment follows the simhash approach but expands the number of index tables to 24; the concrete expansion method is shown in Figure 7:
In the matching algorithm design shown in Figure 7, consider the case where the last 32 bits differ by at most 1 bit, so that the first 32 bits differ by at most 2 bits. Then, for Figure 7, at least 2 of the blocks A, B, C, D are completely consistent, and at least one of the blocks E, F is completely consistent, so a matching table keyed on 32 fully consistent bits can be built. There are C(4,2)*C(2,1)*2 such lookup tables in total, the factor of 2 arising because it may instead be the first 32 bits that differ by at most 1 bit. Therefore, 24 sub-tables can be constructed in total as the created matching table, used for fast lookup of the audio/video fusion features.
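One plausible sketch of the 24-sub-table scheme is given below. It assumes 8-bit blocks A-D for the first half, 16-bit blocks E and F for the second half, and a mirrored split for the symmetric case; these splits are an interpretation of Figure 7, not a confirmed layout.

```python
from itertools import combinations

def subkeys(f):
    """Generate the 24 sub-table keys for a 64-bit fusion feature.
    One orientation keys on 2 of {A,B,C,D} plus 1 of {E,F}
    (C(4,2)*C(2,1)=12 tables); the mirrored orientation swaps the
    roles of the two halves, giving 24 tables in total."""
    hi, lo = f >> 32, f & 0xFFFFFFFF
    keys = []
    for tag, half8, half16 in (("hi", hi, lo), ("lo", lo, hi)):
        abcd = [(half8 >> (8 * i)) & 0xFF for i in range(4)]      # 4 x 8 bits
        ef = [(half16 >> (16 * i)) & 0xFFFF for i in range(2)]    # 2 x 16 bits
        for (i, j) in combinations(range(4), 2):
            for k in range(2):
                keys.append((tag, i, j, k, abcd[i], abcd[j], ef[k]))
    return keys

def build_index(reference_feats):
    """Insert every reference feature into all 24 sub-tables."""
    index = {}
    for f in reference_feats:
        for key in subkeys(f):
            index.setdefault(key, set()).add(f)
    return index

def query(index, f, per_half=2):
    """Candidate lookup plus exact verification: at most 3 differing bits
    overall and at most 2 in each 32-bit half, per the patent's restriction."""
    cands = set()
    for key in subkeys(f):
        cands |= index.get(key, set())
    def ok(r):
        d_hi = bin((f ^ r) >> 32).count("1")
        d_lo = bin((f ^ r) & 0xFFFFFFFF).count("1")
        return d_hi <= per_half and d_lo <= per_half and d_hi + d_lo <= 3
    return {r for r in cands if ok(r)}
```

A feature differing by one bit in each half shares a sub-table key with the stored feature in at least one orientation, so it is always recalled; far-away features are discarded by the verification step.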
Then, by querying the matching tables constructed above, the similar features of the audio/video fusion features are obtained, yielding the result of the feature retrieval.
Step S105, performing copy determination and positioning on the audio/video image based on the frame set matching result of the audio/video image and the reference video.
According to the result of the feature retrieval obtained in the above process, combined with a video copy segment positioning method, it is determined whether the query video is a copy video. If the query video is determined to be a copy video, the corresponding copy segment location is given.
The present embodiment considers: for two videos, if the similarity between every pair of frames of the two videos is calculated, the similarity matrix shown at the far right of Figure 8 is obtained. The goal of finding similar segments of two videos is thus converted into finding line segments in the similarity matrix whose similarity is higher than a certain threshold; however, this processing approach has a large time overhead.
The principle of copy determination and positioning for the audio/video image in the present embodiment is: through the above matching algorithm, the brightest points in the similarity matrix (representing very high similarities) can be found, such as the bright spots shown at the far left of Figure 8; these points are then extended in time, yielding the similar segments (i.e. possible copy segments) shown in the middle of Figure 8, which are afterwards screened by a threshold. In this way it can be determined whether two videos constitute a copy, and if they do, the start and end moments of the similar segment can be recorded.
Specifically, when performing copy determination and positioning on the audio/video image, the audio/video frames of the reference video corresponding to the similar features obtained by the above process (the bright spots in the leftmost diagram of Figure 8) are first extended in time to obtain a reference video segment of the reference video; the audio/video frames in the audio/video image corresponding to the similar features are likewise extended in time to obtain the similar segment that the audio/video image forms relative to the reference video (as shown in the middle diagram of Figure 8). The similarity between the similar segment in the audio/video image and the reference video segment is then calculated, i.e. the similarity between the audio/video frames corresponding to the similar segment in the audio/video image and the corresponding audio/video frames of the reference video segment, and the similarities obtained for the individual audio/video frames are averaged. If the similarity is greater than a set threshold, it is determined that the audio/video image constitutes a copy, and the start position and end position of the similar segment of the audio/video image are recorded.
That is to say, when calculating the similarity of the corresponding audio/video frames of the similar segment in the audio/video image and in the reference video, each frame in the similar segment (including its 64-bit feature) is compared against the corresponding frame in the reference video segment, the per-frame similarity is calculated, and the mean is then taken. This mean value is compared with the predetermined threshold; if the similarity is greater than the set threshold, it is determined that the audio/video image constitutes a copy, and the start position and end position of the similar segment of the audio/video image are recorded.
An example follows:
Suppose the similar segment contains 100 frames (i.e. one audio/video sequence), with the 100 frames between seconds 10-20 of the query video corresponding to the 100 frames between seconds 30-40 of the reference video. Each of the 100 frames between seconds 10-20 of the query video is then compared with the corresponding frame among the 100 frames between seconds 30-40 of the reference video, and the similarity of each frame is calculated separately. For example, if 50 of the 64 bits of the first frame's feature are identical to those of the reference video frame, the similarity of this first frame is S1 = 50/64 = 0.78125. On the same principle, the similarity S2 of the second frame, ..., and the similarity S100 of the 100th frame are obtained. The similarities are averaged to obtain the similarity of the query video and the reference video within the similar segment, assumed here to be 0.95; this is compared with the set threshold (set to 0.9), whereby it can be determined that the query video constitutes a copy, and the start position and end position of this similar segment are recorded.
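The per-frame averaging in this example can be sketched as follows (64-bit features; a frame's similarity is the fraction of identical bits):

```python
def segment_similarity(query_feats, ref_feats):
    """Average per-frame similarity over an aligned similar segment."""
    assert len(query_feats) == len(ref_feats)
    sims = [1 - bin(q ^ r).count("1") / 64
            for q, r in zip(query_feats, ref_feats)]
    return sum(sims) / len(sims)

def is_copy(query_feats, ref_feats, threshold=0.9):
    """Copy determination: mean similarity must exceed the set threshold."""
    return segment_similarity(query_feats, ref_feats) > threshold
```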
In the above copy determination and positioning process, a query video may contain multiple similar segments; in that case, the multiple similar segments can be strung together and recorded.
It should be noted that, in the above process of the present embodiment, when determining whether the query video is a copy of a certain video in the reference video library according to the frame set matching result, other algorithms can also be used, for example: the Hough transform, the Smith-Waterman algorithm, the BLAST algorithm, the temporal pyramid algorithm, and the like. Through these algorithms, the sequence of the query video most similar to a certain reference video is found, and whether a copy is constituted is determined by a threshold. For a video determined to be a copy, the start and end of the copy segment are determined, and this partial segment is marked as a copy segment.
Through the above scheme, the present embodiment uses the method of combining audio and video, which not only enhances the robustness of the video copy detection system, but also, by fusing the audio and video features, greatly accelerates the execution efficiency of the copy detection system; by analyzing the audio and video jointly, the positioning precision of copy segments is improved.
As shown in Figure 9, the second embodiment of the present invention proposes an audio/video copy detection method; based on the above embodiment, before the step of obtaining the audio/video image, the method further includes:
Step S100, creating the matching table in the feature library of the reference video.
Specifically, the matching table is created so that the corresponding features of the video to be detected can later be retrieved quickly.
The matching table is created based on the reference video; the concrete creation process is as follows:
First, reference video segments are collected, and audio/video decoding and preprocessing are performed on the reference video segments to obtain the audio part and the video frames of the reference video.
Then, feature extraction is performed on the audio part and the video frames of the reference video to obtain the audio features of the reference video and the image features of its video frames.
Afterwards, audio/video feature fusion is performed on the reference video to obtain the audio/video fusion features of the reference video.
Finally, the matching table is created based on the audio/video fusion features of the reference video, for the feature index retrieval and matching of subsequent query videos.
When creating the matching table based on the audio/video fusion features of the reference video, the following principle applies:
Consider a query video (the video on which copy detection is to be performed) and a reference video: if the similarity of their features is compared frame by frame, the required time complexity is proportional to the lengths of both videos, which is unfavorable for scaling to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query matching strategy based on the audio/video fusion features.
The basic objective of the simhash index is: in a library of numerous 64-bit features, for a queried 64-bit feature, to quickly find all features whose Hamming distance to this feature is less than or equal to 3 (i.e. at most 3 of the 64 bits differ from the query feature). The principle of this algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into 4 blocks of 16 bits, there must exist one 16-bit block that is completely consistent with the query feature. Similarly, among the remaining 48 bits, there must exist one 12-bit block that is completely consistent with the query feature. After two rounds of index lookup and matching, at most 3 differing positions need to be enumerated within the remaining 36 bits, so the complexity of the original algorithm can be substantially reduced.
The 64-bit audio/video fusion features used in the present invention have the same query characteristic as simhash, i.e. all features differing from a given 64-bit feature by at most 3 bits need to be found (the two features are then considered correlated). In addition, the following restriction applies: the first 32 bits of two correlated features differ by at most 2 bits, and the last 32 bits of the two features likewise differ by at most 2 bits. Based on this, the present embodiment follows the simhash approach but expands the number of index tables to 24; the concrete expansion method is shown in Figure 7:
In the matching algorithm design shown in Figure 7, consider the case where the last 32 bits differ by at most 1 bit, so that the first 32 bits differ by at most 2 bits. Then, for Figure 7, at least 2 of the blocks A, B, C, D are completely consistent, and at least one of the blocks E, F is completely consistent, so a matching table keyed on 32 fully consistent bits can be built. There are C(4,2)*C(2,1)*2 such lookup tables in total, the factor of 2 arising because it may instead be the first 32 bits that differ by at most 1 bit. Therefore, 24 sub-tables can be constructed in total as the created matching table, used for fast lookup of the audio/video fusion features.
Correspondingly, a functional module embodiment of the audio/video copy detection device of the embodiments of the present invention is proposed.
As shown in Figure 10, the first embodiment of the present invention proposes an audio/video copy detection device, including: a decoding and preprocessing module 201, a feature extraction module 202, a fusion module 203, a matching module 204 and a copy determination module 205, wherein:
the decoding and preprocessing module 201 is used to obtain an audio/video image, and to decode and preprocess the audio/video image to obtain the audio part and the video frames of the audio/video image;
the feature extraction module 202 is used to perform feature extraction on the audio part and the video frames of the audio/video image to obtain the audio features corresponding to the audio/video image and the image features of the video frames;
the fusion module 203 is used to fuse the audio features corresponding to the audio/video image and the image features of the video frames to obtain the audio/video fusion features of the audio/video image;
the matching module 204 is used to match the audio/video fusion features based on a preset feature library of a reference video to obtain a frame set matching result of the audio/video image;
the copy determination module 205 is used to perform copy determination and positioning on the audio/video image based on the frame set matching result of the audio/video image and the reference video.
Specifically, first, the audio/video image on which copy detection is to be performed is obtained; this audio/video image can be obtained locally, or obtained externally via a network.
The obtained audio/video image is decoded and preprocessed: the audio of the video is extracted and downsampled to monophonic 5512.5 Hz, and each video frame is extracted frame by frame, thereby obtaining the audio part and the individual video frames of the audio/video image.
Afterwards, feature extraction is performed on the audio part and the video frames of the audio/video image to obtain the audio features corresponding to the audio/video image and the image features of the video frames.
This part performs feature extraction mainly on the audio corresponding to the video and on all the video frames. Because audio features lend themselves to binary bit representations, binary indexing or LSH is often used to accelerate queries. The audio feature extracted by the present invention is the audio sub-band energy difference feature, and the image feature extracted for the video frames is the DCT (Discrete Cosine Transform) feature.
Wherein, the process of performing feature extraction on the audio part of the audio/video image to obtain the audio features corresponding to the audio/video image includes:
filtering each audio frame of the audio part of the audio/video image and transforming it into frequency-domain energy via the Fourier transform; dividing the obtained frequency-domain energy into several sub-bands of predetermined frequency ranges according to a logarithmic relationship; calculating the differences of the absolute values of the energies between adjacent sub-bands to obtain the audio sub-band energy difference feature of each audio frame; and sampling the audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio part of the audio/video image.
More specifically, the extraction flow of the audio sub-band energy difference feature in the present embodiment is shown in Figure 3:
The main steps of the algorithm involved in the extraction of this audio sub-band energy difference feature are:
First, the time-domain audio waveform information (audio frame) of every 0.37 seconds is filtered through a Hanning window and transformed into frequency-domain energy via the Fourier transform;
Secondly, the obtained frequency-domain energy is divided according to a logarithmic relationship (the Bark scale) into 33 sub-bands lying within the human auditory range (300 Hz - 2000 Hz), and the differences of the absolute values of the energies between adjacent sub-bands of consecutive frames (spaced 11 milliseconds apart) are calculated, so that a 32-bit audio feature is obtained for each audio frame.
A '1' therein indicates that the energy difference of two adjacent sub-bands of the current audio frame is greater than the energy difference of the corresponding adjacent sub-bands of the next audio frame; otherwise the bit is 0.
The detailed process is as follows:
In Figure 3, the input content is a segment of audio, and the output content is the several (n) audio features corresponding to this audio segment.
Framing: cutting the audio segment into several (n) audio frames. In the example, 2048 audio frames are collected per second, and each audio frame comprises 0.37 seconds of audio content (adjacent audio frames overlap by 2047/2048).
Fourier Transform: converting the time-domain waveform information (the original audio) into the frequency-domain energy information of waves in different frequency ranges, to facilitate analysis and processing.
ABS: taking the absolute value of the wave energy information (i.e. only the amplitude is considered, not the direction of vibration).
Band Division: dividing the whole frequency domain between 300 Hz and 2000 Hz into 33 mutually non-overlapping frequency bands (divided according to a logarithmic relationship, i.e. the lower the frequency, the narrower the range of the band it belongs to). In this way, the energies of the original audio on these different bands can be obtained.
Energy Computation: calculating the energy value of each audio frame on each of these 33 bands (each audio frame yields 33 energy values).
Bit Derivation: the above 33 energy values are compared in turn (the energy of the i-th sub-band is compared with the energy of the (i+1)-th sub-band) to obtain 32 energy value differences. The sizes of these 32 energy differences are compared between the current audio frame a and the next audio frame b: if the j-th energy difference of a is larger than the j-th energy difference of b, the j-th bit of a's feature is 1; otherwise, the j-th bit of a's feature is 0. In this way, the size relationship of the 32 energy differences between a and b is obtained, which constitutes the 32-bit feature of audio frame a.
The present invention adopts this audio feature and samples the audio frames at an interval of 1/2048 second, so that 2048 32-bit audio features are generated for each second of audio.
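The extraction steps above can be sketched as follows. This is a simplified illustration: plain logarithmic spacing stands in for the Bark-scale division, and the FFT length and band-edge handling are assumptions.

```python
import numpy as np

def band_edges(n_bands=33, f_lo=300.0, f_hi=2000.0):
    """33 log-spaced bands between 300 Hz and 2000 Hz (lower bands narrower)."""
    return np.geomspace(f_lo, f_hi, n_bands + 1)

def frame_band_energies(frame, sr=5512.5):
    """33 per-band energies of one Hanning-windowed audio frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))  # ABS step
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    edges = band_edges()
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

def audio_feature(frame_a, frame_b, sr=5512.5):
    """32-bit feature of frame a: bit j is 1 iff the j-th adjacent-band
    energy difference of a exceeds that of the next frame b."""
    da = np.diff(frame_band_energies(frame_a, sr))   # 32 differences
    db = np.diff(frame_band_energies(frame_b, sr))
    bits = 0
    for j in range(32):
        if da[j] > db[j]:
            bits |= 1 << j
    return bits
```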
The process of performing feature extraction on the video frames of the audio/video image to obtain the image features of the video frames corresponding to the audio/video image may include:
for each video frame of the audio/video image, converting its image into a gray-level image and compressing it; dividing the compressed gray-level image into several sub-blocks; calculating the DCT energy value of each sub-block; comparing the DCT energy values between adjacent sub-blocks to obtain the image DCT feature of the video frame; and, according to the above processing, obtaining the image DCT features of the video frames of the audio/video image.
More specifically, the flow of extracting the image DCT features of the video frames of the audio/video image in the present embodiment is shown in Figure 4:
In view of the characteristic that the overall variation amplitude of internet video pictures is small, the embodiment of the present invention selects an efficient global image feature as the image feature of the video frames: the DCT feature.
The idea of the DCT feature algorithm is: divide the image into several sub-blocks, and compare the energy levels of adjacent sub-blocks, thereby obtaining the energy distribution of the entire image. The concrete algorithm steps are:
First, the color image is converted into a gray-level image and compressed (changing the aspect ratio) to 64 pixels wide and 32 pixels high.
Then, the gray-level image is divided into 32 sub-blocks (numbered 0 to 31, as shown in Figure 4), each block comprising an 8x8-pixel image.
For each sub-block, the DCT energy value of the sub-block is calculated, and its absolute value is taken to represent the energy of the sub-block.
Finally, the relative sizes of the energy values of adjacent sub-blocks are compared to obtain a 32-bit feature: if the energy of the i-th sub-block is greater than the energy of the (i+1)-th sub-block, the i-th bit is set to 1, and otherwise to 0. In particular, the 31st sub-block is compared with the 0th sub-block.
Through the above process, each video frame yields a 32-bit image DCT feature.
After the audio features corresponding to the video and the image features of the video frames have been obtained by the above process, the obtained image features and audio features are fused. The concrete fusion method is shown in Figure 5 (where the vertical axis is the time axis).
As shown in Figure 5, in the present embodiment, the audio features are set to M=2048 (this value can be configured) 32-bit features per second, while the image features of the video frames are n 32-bit features per second (n is the frame rate of the video, usually no more than 60).
Thus, the present embodiment performs feature concatenation by mapping one video frame to several audio frames, i.e. 2048 64-bit audio/video fusion features are generated per second, where each fusion feature corresponds to the feature of a single audio frame, and every 2048/n adjacent fusion features correspond to the image DCT feature of the same video frame.
By fusing the audio features corresponding to the audio/video image and the image features of the video frames as above, the audio/video fusion features of the audio/video image are obtained.
Afterwards, the audio/video fusion features are matched based on the preset feature library of the reference video to obtain the frame set matching result of the audio/video image.
The present embodiment presets a feature library of the reference video, and a matching table is created in the feature library of the reference video so that the corresponding features of the video to be detected can be retrieved quickly.
When matching the audio/video fusion features, first, the matching table is obtained from the preset feature library of the reference video; for each audio/video fusion feature, the features satisfying a preset condition are queried from the matching table as the similar features of that audio/video fusion feature. For example, the features whose Hamming distance to the audio/video fusion feature is less than a predetermined threshold (such as 3) are queried from the matching table as the similar features of the audio/video fusion feature. The similar features of the audio/video fusion features are obtained, yielding the frame set matching result of the audio/video image.
More specifically, the present embodiment considers the following:
For a query video (the video on which copy detection is to be performed) and a reference video, if the similarity of their features is compared frame by frame, the required time complexity is proportional to the lengths of both videos, which is unfavorable for scaling to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query matching strategy based on the audio/video fusion features.
The basic objective of the simhash index is: in a library of numerous 64-bit features, for a queried 64-bit feature, to quickly find all features whose Hamming distance to this feature is less than or equal to 3 (i.e. at most 3 of the 64 bits differ from the query feature). The principle of this algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into 4 blocks of 16 bits, there must exist one 16-bit block that is completely consistent with the query feature. Similarly, among the remaining 48 bits, there must exist one 12-bit block that is completely consistent with the query feature. After two rounds of index lookup, at most 3 differing positions need to be enumerated within the remaining 36 bits, so the complexity of the original algorithm can be substantially reduced.
The 64-bit audio/video fusion features used in the present invention have the same query characteristic as simhash, i.e. all features differing from a given 64-bit feature by at most 3 bits need to be found (the two features are then considered correlated). In addition, the following restriction applies: the first 32 bits of two correlated features differ by at most 2 bits, and the last 32 bits of the two features likewise differ by at most 2 bits. Based on this, the present embodiment follows the simhash approach but expands the number of index tables to 24; the concrete expansion method is shown in Figure 7:
In the matching algorithm design shown in Figure 7, consider the case where the last 32 bits differ by at most 1 bit, so that the first 32 bits differ by at most 2 bits. Then, for Figure 7, at least 2 of the blocks A, B, C, D are completely consistent, and at least one of the blocks E, F is completely consistent, so a matching table keyed on 32 fully consistent bits can be built. There are C(4,2)*C(2,1)*2 such lookup tables in total, the factor of 2 arising because it may instead be the first 32 bits that differ by at most 1 bit. Therefore, 24 sub-tables can be constructed in total as the created matching table, used for fast lookup of the audio/video fusion features.
Then, by querying the matching tables constructed above, the similar features of the audio/video fusion features are obtained, yielding the result of the feature retrieval.
According to the result of the feature retrieval obtained in the above process, combined with a video copy segment positioning method, it is determined whether the query video is a copy video. If the query video is determined to be a copy video, the corresponding copy segment location is given.
The present embodiment considers: for two videos, if the similarity between every pair of frames of the two videos is calculated, the similarity matrix shown at the far right of Figure 8 is obtained. The goal of finding similar segments of two videos is thus converted into finding line segments in the similarity matrix whose similarity is higher than a certain threshold; however, this processing approach has a large time overhead.
The principle of copy determination and positioning for the audio/video image in the present embodiment is: through the above indexing algorithm, the brightest points in the similarity matrix (representing very high similarities) can be found, such as the bright spots shown at the far left of Figure 8; these points are then extended in time, yielding the similar segments (i.e. possible copy segments) shown in the middle of Figure 8, which are afterwards screened by a threshold. In this way it can be determined whether two videos constitute a copy, and if they do, the start and end moments of the similar segment can be recorded.
Specifically, when performing copy judgment and localization on the audio/video image, the audio/video frames of the reference video corresponding to the similar features obtained by the above process (corresponding to the bright points in the leftmost diagram of Fig. 8) are first extended in time, yielding a reference video segment of the reference video; the audio/video frames of the audio/video image corresponding to the similar features are likewise extended in time, yielding the similar segment of the audio/video image relative to the reference video (as shown in the middle diagram of Fig. 8). The similarity between the similar segment of the audio/video image and the reference video segment is then calculated: the similarity between each audio/video frame of the similar segment and the corresponding audio/video frame of the reference video segment is computed, and the per-frame similarities are averaged. If the resulting similarity exceeds a set threshold, the audio/video image is judged to constitute a copy, and the start position and end position of the similar segment of the audio/video image are recorded.
That is, when calculating the similarity between the corresponding audio/video frames of the similar segment and the reference video, each frame of the similar segment (comprising a 64-bit feature) is compared bit-by-bit against the corresponding frame of the reference video segment, a per-frame similarity is computed, and the results are averaged. The average is compared with a predetermined threshold; if it exceeds the set threshold, the audio/video image is judged to constitute a copy, and the start position and end position of the similar segment of the audio/video image are recorded.
An example follows:

Suppose the similar segment contains 100 frames (i.e., one audio/video sequence), with the 100 frames between seconds 10-20 of the query video corresponding to the 100 frames between seconds 30-40 of the reference video. Each of the 100 query frames is compared with its counterpart among the 100 reference frames, and a similarity is computed for each pair. For instance, if 50 of the 64 bits of the first frame's feature are identical to those of the corresponding reference frame, the similarity of the first frame is S1 = 50/64 ≈ 0.78125. Proceeding in the same way yields the similarity S2 of the second frame, ..., up to the similarity S100 of the 100th frame. Averaging these similarities gives the similarity between the query video and the reference video over the similar segment; supposing it is 0.95, it is compared with the set threshold (assumed to be 0.9), whereby the query video is determined to constitute a copy, and the start position and end position of the similar segment are recorded.
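The worked example above can be reproduced with a short sketch, assuming each frame's fingerprint is held as a 64-bit integer (the representation is an assumption; the patent only specifies 64-bit features):

```python
# Per-frame similarity = fraction of the 64 fingerprint bits two frames share.

def frame_similarity(a: int, b: int) -> float:
    differing = bin((a ^ b) & (2**64 - 1)).count("1")
    return (64 - differing) / 64

def segment_similarity(query, ref) -> float:
    """Average per-frame similarity of two aligned, equal-length segments."""
    return sum(frame_similarity(q, r) for q, r in zip(query, ref)) / len(query)

def is_copy(query, ref, threshold=0.9) -> bool:
    return segment_similarity(query, ref) > threshold

# As in the example: a frame agreeing on 50 of 64 bits scores 50/64 = 0.78125.
```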
In the above copy judgment and localization process, a query video may contain multiple similar segments; in that case, the multiple similar segments can be recorded in series.
It should be noted that, in the above process of this embodiment, when judging from the frame-set matching result whether the query video is a copy of some video in the reference video library, other algorithms may also be used, for example: the Hough transform, the Smith-Waterman algorithm, the BLAST algorithm, or a temporal pyramid algorithm. These algorithms find the sequence of the query video most similar to a given reference video, and a threshold determines whether a copy is constituted. For a video judged to be a copy, the start and end of the copied segment are determined, so that this partial segment is marked as a copy segment.
Through the above scheme, the present embodiment uses a method combining audio and video, which not only enhances the robustness of the video copy detection system but also, by fusing the audio and video features, greatly improves the execution efficiency of the copy detection system; and by analyzing audio and video jointly, it improves the positioning accuracy of copy segments.
As shown in Fig. 11, a second embodiment of the invention proposes an audio/video copy detection apparatus which, based on the above embodiment, further includes:

a creation module 200 for creating the matching tables in the feature library of the reference video.

Specifically, the matching tables are created so that the corresponding features of a video to be detected can be retrieved quickly.
The matching tables are created on the basis of the reference videos; the specific creation process is as follows:

First, reference video segments are collected, and audio/video decoding and preprocessing are performed on them to obtain the audio portion and the video frames of each reference video.

Next, feature extraction is performed on the audio portion and the video frames of the reference video to obtain the audio features and the video-frame image features of the reference video.

Then, audio/video feature fusion is performed on the reference video to obtain the audio/video fusion features of the reference video.

Finally, the matching tables are created based on the audio/video fusion features of the reference video, for use in subsequent feature-index retrieval of query videos.
The creation of the matching tables from the audio/video fusion features of the reference video is based on the following principle:

Consider a query video (a video to be subjected to copy detection) and a reference video. If the similarity of their features were compared frame by frame, the required time complexity would be proportional to the lengths of both videos, which does not scale to large databases. The present invention therefore proposes an index and query strategy for audio/video fusion features based on the simhash technique.

The basic goal of the simhash index is: in a library of many 64-bit features, for a queried 64-bit feature, quickly find all features whose Hamming distance from it is less than or equal to 3 (i.e., at most 3 of the 64 bits differ from the query feature). A diagram of this algorithm is shown in Fig. 6. For 64-bit data with the Hamming distance limited to 3, if the 64 bits are divided into four 16-bit blocks, there must exist one 16-bit block completely identical to the corresponding block of the query feature. Similarly, within the remaining 48 bits, there must exist a 12-bit block completely identical to the query feature. After this two-level index lookup, the at most 3 differing positions can be enumerated within the remaining 36 bits, so the complexity of the naive algorithm is substantially reduced.
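A minimal sketch of the first level of the simhash lookup described above (the pigeonhole idea of Fig. 6: at Hamming radius 3, at least one of four 16-bit blocks must match exactly); class and method names are illustrative, not taken from the patent:

```python
from collections import defaultdict

BLOCKS, BLOCK_BITS = 4, 16  # 4 x 16 = 64 bits

def blocks_of(code: int):
    """The four 16-bit blocks of a 64-bit feature, low block first."""
    return [(code >> (BLOCK_BITS * i)) & 0xFFFF for i in range(BLOCKS)]

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class SimhashIndex:
    """One inverted table per block position; a query probes each table
    with its own block value, then verifies candidates exactly."""

    def __init__(self):
        self.tables = [defaultdict(list) for _ in range(BLOCKS)]

    def add(self, code: int):
        for i, block in enumerate(blocks_of(code)):
            self.tables[i][block].append(code)

    def query(self, code: int, radius: int = 3):
        candidates = set()
        for i, block in enumerate(blocks_of(code)):
            candidates.update(self.tables[i].get(block, ()))
        return [c for c in candidates if hamming(c, code) <= radius]
```

The second-level 12-bit split in the description refines this further; the sketch shows only the first level.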
The 64-bit audio/video fusion features used in the present invention have the same query property as simhash, namely: all features differing from a given 64-bit feature in at most 3 bits must be found (the two features are then considered related). In addition, the following constraint is imposed: the first 32 bits of two related features differ in at most 2 bits, and the last 32 bits differ in at most 2 bits. On this basis, the present embodiment follows the approach of simhash but expands the number of lookup tables to 24; the specific expansion is shown in Fig. 7:
In the matching-algorithm design shown in Fig. 7, consider the case where the last 32 bits differ in at most 1 bit; the first 32 bits then differ in at most 2 bits. In terms of Fig. 7, at least 2 of the blocks A, B, C, D are then completely identical, and at least one of the blocks E, F is completely identical, so a matching table keyed on 32 fully identical bits can be built. There are C(4,2) × C(2,1) × 2 such lookup tables in all, the factor 2 arising because it may instead be the first 32 bits that differ in at most 2 bits. Altogether, therefore, 24 sub-tables can be constructed as the created matching tables, used for fast lookup of audio/video fusion features.
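The table count can be checked arithmetically. The reading of the final factor 2 below (swapping which half carries the larger error budget) is an interpretation of the passage, not stated verbatim in the patent:

```python
from math import comb
from itertools import combinations

# Front 32 bits split into A, B, C, D (8 bits each): with at most 2
# differing bits in that half, at least 2 of the 4 blocks match exactly.
front_choices = list(combinations("ABCD", 2))   # C(4,2) = 6

# Rear 32 bits split into E, F (16 bits each): with at most 1 differing
# bit in that half, at least 1 of the 2 blocks matches exactly.
rear_choices = list(combinations("EF", 1))      # C(2,1) = 2

# Assumed reading of the factor 2: the half with the larger error budget
# may be either the front or the rear half.
n_tables = len(front_choices) * len(rear_choices) * 2
```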
In the audio/video copy detection method and apparatus of the embodiments of the present invention, an audio/video image is acquired, then decoded and preprocessed to obtain the audio portion and the video frames of the audio/video image; feature extraction is performed on the audio portion and the video frames of the audio/video image to obtain the corresponding audio features and video-frame image features; the audio features and video-frame image features corresponding to the audio/video image are fused to obtain the audio/video fusion features of the audio/video image; based on a preset feature library of reference videos, the audio/video fusion features are matched to obtain a frame-set matching result of the audio/video image; and, based on the frame-set matching result of the audio/video image and the reference video, copy judgment and localization are performed on the audio/video image. By combining audio and video, the method not only enhances the robustness of the video copy detection system but also, through the fusion of audio and video features, greatly improves the execution efficiency of the copy detection system; and by analyzing audio and video jointly, it improves the positioning accuracy of copy segments.
It should also be noted that, herein, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the preferable implementation. Based on such an understanding, the technical solution of the present invention, in essence or in the part contributing over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The foregoing are only preferred embodiments of the present invention and do not thereby limit the scope of the claims of the present invention. Any equivalent structural or flow transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application thereof in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (14)
1. An audio/video copy detection method, characterized by comprising:
acquiring an audio/video image, and decoding and preprocessing the audio/video image to obtain an audio portion and video frames of the audio/video image;
performing feature extraction on the audio portion and the video frames of the audio/video image to obtain audio features and video-frame image features corresponding to the audio/video image;
fusing the audio features and the video-frame image features corresponding to the audio/video image to obtain audio/video fusion features of the audio/video image;
matching the audio/video fusion features based on a preset feature library of reference videos to obtain a frame-set matching result of the audio/video image;
performing copy judgment and localization on the audio/video image based on the frame-set matching result of the audio/video image and the reference video.
2. The method according to claim 1, characterized in that the step of performing feature extraction on the audio portion of the audio/video image to obtain the audio features corresponding to the audio/video image comprises:
filtering audio frames of the audio portion of the audio/video image, and converting them to frequency-domain energy by Fourier transform;
dividing the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship;
calculating the difference of the absolute values of the energies between adjacent sub-bands to obtain an audio sub-band energy difference feature of each audio frame;
sampling the audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio portion of the audio/video image.
3. The method according to claim 1, characterized in that the step of performing feature extraction on the video frames of the audio/video image to obtain the video-frame image features corresponding to the audio/video image comprises:
for each video frame of the audio/video image, converting its image to a grayscale image and compressing it;
dividing the compressed grayscale image into several sub-blocks;
calculating a DCT energy value of each sub-block;
comparing the DCT energy values of adjacent sub-blocks to obtain an image DCT feature of the video frame;
obtaining the image DCT features of the video frames of the audio/video image according to the above process.
4. The method according to claim 1, 2 or 3, characterized in that the step of fusing the audio features and the video-frame image features corresponding to the audio/video image to obtain the audio/video fusion features of the audio/video image comprises:
setting the audio features to M 32-bit features per second and the video-frame image features to n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60;
concatenating the features in such a way that one video frame corresponds to several audio frames, producing M 64-bit audio/video fusion features per second, wherein each audio/video fusion feature corresponds to the audio feature of a single audio frame, and M/n adjacent audio/video fusion features correspond to the image feature of a same video frame.
5. The method according to claim 1, characterized in that the step of matching the audio/video fusion features based on the preset feature library of reference videos to obtain the frame-set matching result of the audio/video image comprises:
obtaining matching tables from the preset feature library of reference videos;
for each audio/video fusion feature, querying the matching tables for features whose Hamming distance from the audio/video fusion feature is less than a predetermined threshold, as similar features of the audio/video fusion feature;
obtaining the similar features of the audio/video fusion features to obtain the frame-set matching result of the audio/video image.
6. The method according to claim 5, characterized in that the step of performing copy judgment and localization on the audio/video image based on the frame-set matching result of the audio/video image and the reference video comprises:
performing temporal extension on the audio/video frames of the reference video corresponding to the similar features to obtain a reference video segment of the reference video, and performing temporal extension on the audio/video frames of the audio/video image corresponding to the similar features to obtain a similar segment of the audio/video image relative to the reference video;
calculating the similarity between the similar segment of the audio/video image and the reference video segment;
if the similarity is greater than a set threshold, judging that the audio/video image constitutes a copy, and recording a start position and an end position of the similar segment of the audio/video image.
7. The method according to claim 5, characterized in that, before the step of acquiring the audio/video image, the method further comprises:
creating the matching tables in the feature library of the reference video.
8. An audio/video copy detection apparatus, characterized by comprising:
a decoding and preprocessing module for acquiring an audio/video image, and decoding and preprocessing the audio/video image to obtain an audio portion and video frames of the audio/video image;
a feature extraction module for performing feature extraction on the audio portion and the video frames of the audio/video image to obtain audio features and video-frame image features corresponding to the audio/video image;
a fusion module for fusing the audio features and the video-frame image features corresponding to the audio/video image to obtain audio/video fusion features of the audio/video image;
a matching module for matching the audio/video fusion features based on a preset feature library of reference videos to obtain a frame-set matching result of the audio/video image;
a copy determination module for performing copy judgment and localization on the audio/video image based on the frame-set matching result of the audio/video image and the reference video.
9. The apparatus according to claim 8, characterized in that:
the feature extraction module is further configured to filter audio frames of the audio portion of the audio/video image and convert them to frequency-domain energy by Fourier transform; divide the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship; calculate the difference of the absolute values of the energies between adjacent sub-bands to obtain an audio sub-band energy difference feature of each audio frame; and sample the audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio portion of the audio/video image.
10. The apparatus according to claim 8, characterized in that:
the feature extraction module is further configured to, for each video frame of the audio/video image, convert its image to a grayscale image and compress it; divide the compressed grayscale image into several sub-blocks; calculate a DCT energy value of each sub-block; compare the DCT energy values of adjacent sub-blocks to obtain an image DCT feature of the video frame; and obtain the image DCT features of the video frames of the audio/video image according to the above process.
11. The apparatus according to claim 8, 9 or 10, characterized in that:
the fusion module is further configured to set the audio features to M 32-bit features per second and the video-frame image features to n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60; and to concatenate the features in such a way that one video frame corresponds to several audio frames, producing M 64-bit audio/video fusion features per second, wherein each audio/video fusion feature corresponds to the audio feature of a single audio frame, and M/n adjacent audio/video fusion features correspond to the image feature of a same video frame.
12. The apparatus according to claim 8, characterized in that:
the matching module is further configured to obtain matching tables from the preset feature library of reference videos; for each audio/video fusion feature, query the matching tables for features whose Hamming distance from the audio/video fusion feature is less than a predetermined threshold, as similar features of the audio/video fusion feature; and obtain the similar features of the audio/video fusion features to obtain the frame-set matching result of the audio/video image.
13. The apparatus according to claim 12, characterized in that:
the copy determination module is further configured to perform temporal extension on the audio/video frames of the reference video corresponding to the similar features to obtain a reference video segment of the reference video, and perform temporal extension on the audio/video frames of the audio/video image corresponding to the similar features to obtain a similar segment of the audio/video image relative to the reference video; calculate the similarity between the similar segment of the audio/video image and the reference video segment; and, if the similarity is greater than a set threshold, judge that the audio/video image constitutes a copy and record a start position and an end position of the similar segment of the audio/video image.
14. The apparatus according to claim 12, characterized by further comprising:
a creation module for creating the matching tables in the feature library of the reference video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510041044.3A CN105989000B (en) | 2015-01-27 | 2015-01-27 | Audio-video copy detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105989000A true CN105989000A (en) | 2016-10-05 |
CN105989000B CN105989000B (en) | 2019-11-19 |
Family
ID=57034765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510041044.3A Active CN105989000B (en) | 2015-01-27 | 2015-01-27 | Audio-video copy detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105989000B (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737135A (en) * | 2012-07-10 | 2012-10-17 | 北京大学 | Video copy detection method and system based on soft cascade model sensitive to deformation |
Non-Patent Citations (2)
Title |
---|
WU LIU 等: "Listen, Look, and Gotcha: Instant Video Search with Mobile Phones by Layered Audio-Video Indexing", 《PROCEEDINGS OF THE 21ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA》 * |
吴思远: "基于内容的重复音视频检测", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110019895A (en) * | 2017-07-27 | 2019-07-16 | 杭州海康威视数字技术股份有限公司 | A kind of image search method, device and electronic equipment |
CN110110502A (en) * | 2019-04-28 | 2019-08-09 | 深圳市得一微电子有限责任公司 | Anti-copy method, device and the removable storage device of audio file |
CN110110502B (en) * | 2019-04-28 | 2023-07-14 | 得一微电子股份有限公司 | Anti-copy method and device for audio files and mobile storage device |
CN110222719A (en) * | 2019-05-10 | 2019-09-10 | 中国科学院计算技术研究所 | A kind of character recognition method and system based on multiframe audio-video converged network |
CN110222719B (en) * | 2019-05-10 | 2021-09-24 | 中国科学院计算技术研究所 | Figure identification method and system based on multi-frame audio and video fusion network |
CN111274449A (en) * | 2020-02-18 | 2020-06-12 | 腾讯科技(深圳)有限公司 | Video playing method and device, electronic equipment and storage medium |
CN111274449B (en) * | 2020-02-18 | 2023-08-29 | 腾讯科技(深圳)有限公司 | Video playing method, device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN105989000B (en) | 2019-11-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right |
Effective date of registration: 20211012 Address after: 518000 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 Floors Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd. Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. |