CN105989000B - Audio-video copy detection method and device - Google Patents

Audio-video copy detection method and device

Info

Publication number
CN105989000B
Authority
CN
China
Prior art keywords
audio
video
feature
frame
audiovisual presentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510041044.3A
Other languages
Chinese (zh)
Other versions
CN105989000A (en)
Inventor
钱梦仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201510041044.3A
Publication of CN105989000A
Application granted
Publication of CN105989000B


Abstract

The present invention relates to an audio-video copy detection method and device. The method includes: obtaining audio-video content, decoding and pre-processing it, and performing feature extraction on the resulting audio portion and video frames to obtain the corresponding audio features and image features of the video frames; fusing the audio features with the image features of the video frames to obtain audio-video fusion features; matching the audio-video fusion features based on a feature database of preset reference videos to obtain a frame-set matching result; and performing copy judgment and localization on the audio-video content based on the frame-set matching result and the reference videos. By combining audio and video, the present invention not only improves the robustness of the video copy detection system but also, by fusing audio and video features, greatly accelerates the execution of the copy detection system; joint audio-video analysis improves the localization accuracy of copied segments.

Description

Audio-video copy detection method and device
Technical field
The present invention relates to the field of computer technology, and more particularly to an audio-video copy detection method and device.
Background technique
Existing schemes for copy detection on video mainly use content-based video copy detection. At present there are chiefly two kinds: video copy detection schemes based on image features of video key frames, and video copy detection schemes that combine audio and video detection results, in which:
In a scheme based on image features of video key frames, the main process includes: video decoding and pre-processing, video image feature extraction, feature indexing and retrieval, and copy judgment and localization, finally determining whether the query video constitutes a copy; for a video judged to be a copy, the start and end of the copied segment are determined so that that portion can be marked as the copied segment. However, because this implementation does not include audio information in the video copy detection scheme, and audio information is an important complement to the picture content of a video, it not only weakens the robustness of the video copy detection system but also yields low localization accuracy for copied segments, especially when the video picture changes little.
A scheme that combines audio and video detection results, compared with the scheme based on image features of key frames, incorporates audio features and can thus take full advantage of the higher speed and accuracy of audio queries. However, because audio and video features are fundamentally different, existing copy detection schemes usually run copy detection on audio and video separately and fuse the results at the result level to determine whether the query video is a copy. Fusing at the result level requires extracting more features, and most of those features must go through the entire copy detection process, so the time overhead is large and the algorithmic complexity increases accordingly.
Summary of the invention
Embodiments of the present invention provide an audio-video copy detection method and device, aiming to improve the efficiency and accuracy of video copy detection.
An embodiment of the present invention proposes an audio-video copy detection method, comprising:
obtaining audio-video content, and decoding and pre-processing the audio-video content to obtain an audio portion and video frames of the audio-video content;
performing feature extraction on the audio portion and the video frames of the audio-video content to obtain the corresponding audio features and image features of the video frames;
fusing the audio features with the image features of the video frames to obtain audio-video fusion features of the audio-video content;
matching the audio-video fusion features based on a feature database of preset reference videos to obtain a frame-set matching result of the audio-video content;
performing copy judgment and localization on the audio-video content based on the frame-set matching result and the reference videos.
An embodiment of the present invention also proposes an audio-video copy detection device, comprising:
a decoding and pre-processing module, configured to obtain audio-video content, and decode and pre-process the audio-video content to obtain an audio portion and video frames of the audio-video content;
a feature extraction module, configured to perform feature extraction on the audio portion and the video frames of the audio-video content to obtain the corresponding audio features and image features of the video frames;
a fusion module, configured to fuse the audio features with the image features of the video frames to obtain audio-video fusion features of the audio-video content;
a matching module, configured to match the audio-video fusion features based on a feature database of preset reference videos to obtain a frame-set matching result of the audio-video content;
a copy determination module, configured to perform copy judgment and localization on the audio-video content based on the frame-set matching result and the reference videos.
In the audio-video copy detection method and device proposed by the embodiments of the present invention, audio-video content is obtained, decoded, and pre-processed to obtain its audio portion and video frames; feature extraction is performed on the audio portion and the video frames to obtain the corresponding audio features and image features of the video frames; the audio features are fused with the image features of the video frames to obtain audio-video fusion features; the audio-video fusion features are matched based on a feature database of preset reference videos to obtain a frame-set matching result; and copy judgment and localization are performed on the audio-video content based on the frame-set matching result and the reference videos. By combining audio and video, the method not only enhances the robustness of the video copy detection system but also, by fusing audio and video features, greatly accelerates the execution of the copy detection system; joint audio-video analysis improves the localization accuracy of copied segments.
Detailed description of the invention
Fig. 1 is a schematic diagram of the hardware structure of the audio-video copy detection device of the present invention;
Fig. 2 is a flow diagram of a first embodiment of the audio-video copy detection method of the present invention;
Fig. 3 is a flow diagram of audio sub-band energy difference feature extraction in an embodiment of the present invention;
Fig. 4 is a flow diagram of extracting the image DCT features of the video frames of audio-video content in an embodiment of the present invention;
Fig. 5 is a schematic diagram of fusing image features and audio features in an embodiment of the present invention;
Fig. 6 is an example diagram of the simhash matching algorithm involved in an embodiment of the present invention;
Fig. 7 is a schematic diagram of the matching algorithm design involved in an embodiment of the present invention;
Fig. 8 is a schematic diagram of copy localization and extension involved in an embodiment of the present invention;
Fig. 9 is a flow diagram of a second embodiment of the audio-video copy detection method of the present invention;
Fig. 10 is a functional block diagram of a first embodiment of the audio-video copy detection device of the present invention;
Fig. 11 is a functional block diagram of a second embodiment of the audio-video copy detection device of the present invention.
In order to make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.
Specific embodiment
It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit the present invention.
The primary solution of the embodiments of the present invention is to include the audio information of a video in the video copy detection scheme. Combining audio and video not only enhances the robustness of the video copy detection system but also, by fusing audio and video features, greatly speeds up the execution of the copy detection system; joint audio-video analysis improves the localization accuracy of copied segments.
Specifically, the embodiments of the present invention consider that existing video copy detection schemes either use only image features of video key frames, which not only weakens the robustness of the video copy detection system but also yields low localization accuracy for copied segments, or combine audio and video detection results at the result level; the latter requires extracting more features, most of which must go through the entire copy detection process, increasing the time overhead, while the corresponding algorithmic complexity grows linearly with the amount of data being fused.
This embodiment therefore includes the audio information of the video in the video copy detection scheme. Through a processing flow of audio/video decoding and pre-processing, audio and video feature extraction, audio-video feature fusion, and copy judgment and localization, combining audio and video not only enhances the robustness of the video copy detection system but also, by fusing audio and video features, greatly speeds up the copy detection system; joint audio-video analysis improves the localization accuracy of copied segments.
Specifically, the hardware structure of the audio-video copy detection device involved in the audio-video copy detection scheme of the embodiments may be as shown in Fig. 1. The detection device may be carried on a PC, on a mobile terminal such as a mobile phone, tablet computer, or portable handheld device, or on other electronic equipment with an audio-video copy detection function, such as a media playback apparatus.
As shown in Fig. 1, the detection device may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002, and a camera 1006. The communication bus 1002 implements connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include standard wired and wireless interfaces. The network interface 1004 optionally may include standard wired and wireless interfaces (such as a WI-FI interface). The memory 1005 may be high-speed RAM, or stable storage (non-volatile memory) such as magnetic disk storage; optionally, it may also be a storage device independent of the aforementioned processor 1001.
Optionally, when carried on a mobile terminal, the detection device may also include an RF (Radio Frequency) circuit, sensors, an audio circuit, a WiFi module, and the like. Sensors include, for example, optical sensors, motion sensors, and others. Specifically, an optical sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display screen according to the ambient light, and the proximity sensor can turn off the display screen and/or backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize terminal posture (such as landscape/portrait switching, related games, and magnetometer pose calibration) and for vibration-recognition functions (such as a pedometer or tap detection); the mobile terminal may of course also be equipped with other sensors such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, which are not described here.
Those skilled in the art will understand that the device structure shown in Fig. 1 does not limit the detection device, which may include more or fewer components than illustrated, combine certain components, or use a different component arrangement.
As shown in Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an audio-video copy detection application.
In the detection device shown in Fig. 1, the network interface 1004 is mainly used to connect to a background management platform for data communication with it; the user interface 1003 is mainly used to connect to a client for data communication with it; and the processor 1001 may be used to call the audio-video copy detection application stored in the memory 1005 and perform the following operations:
obtaining audio-video content, and decoding and pre-processing the audio-video content to obtain an audio portion and video frames of the audio-video content;
performing feature extraction on the audio portion and the video frames to obtain the corresponding audio features and image features of the video frames;
fusing the audio features with the image features of the video frames to obtain audio-video fusion features of the audio-video content;
matching the audio-video fusion features based on a feature database of preset reference videos to obtain a frame-set matching result of the audio-video content;
performing copy judgment and localization on the audio-video content based on the frame-set matching result and the reference videos.
In one embodiment, the processor 1001, calling the audio-video copy detection application stored in the memory 1005, may perform the following operations:
filtering the audio frames of the audio portion of the audio-video content, and transforming them into frequency-domain energy by Fourier transform;
dividing the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship;
computing the differences of the absolute energy values between adjacent sub-bands to obtain the audio sub-band energy difference feature of each audio frame;
sampling audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio portion of the audio-video content.
In one embodiment, the processor 1001, calling the audio-video copy detection application stored in the memory 1005, may perform the following operations:
for each video frame of the audio-video content, converting its image into a grayscale image and compressing it;
dividing the compressed grayscale image into several sub-blocks;
computing the DCT energy value of each sub-block;
comparing the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame;
and, according to the above process, obtaining the image DCT features of the video frames of the audio-video content.
In one embodiment, the processor 1001, calling the audio-video copy detection application stored in the memory 1005, may perform the following operations:
setting the audio features as M 32-bit features per second, and the image features of the video frames as n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60;
concatenating features in such a way that one video frame corresponds to several audio frames, generating M 64-bit audio-video fusion features per second, where each audio-video fusion feature corresponds to the audio feature of a single audio frame, and every M/n adjacent audio-video fusion features correspond to the image feature of the same video frame.
In one embodiment, the processor 1001, calling the audio-video copy detection application stored in the memory 1005, may perform the following operations:
obtaining a matching table from the feature database of the preset reference videos;
for each audio-video fusion feature, querying the matching table for features whose Hamming distance to the audio-video fusion feature does not exceed a preset threshold, as the similar features of the audio-video fusion feature;
obtaining the similar features of the audio-video fusion features to form the frame-set matching result of the audio-video content.
In one embodiment, the processor 1001, calling the audio-video copy detection application stored in the memory 1005, may perform the following operations:
performing time extension on the audio/video frames of the reference video corresponding to the similar features, to obtain the similar segments that the corresponding audio/video frames of the audio-video content form with respect to the reference video;
based on the similar segments, computing the similarity between the corresponding audio/video frames of the audio-video content and the reference video;
if the similarity is greater than a set threshold, judging that the audio-video content constitutes a copy, and recording the start position and end position of the similar segments of the audio-video content.
In one embodiment, the processor 1001, calling the audio-video copy detection application stored in the memory 1005, may perform the following operation:
creating the matching table in the feature database of the reference videos.
Through the above scheme, this embodiment obtains audio-video content, decodes and pre-processes it to obtain its audio portion and video frames; performs feature extraction on the audio portion and the video frames to obtain the corresponding audio features and image features of the video frames; fuses the audio features with the image features of the video frames to obtain audio-video fusion features; matches the audio-video fusion features based on a feature database of preset reference videos to obtain a frame-set matching result; and performs copy judgment and localization on the audio-video content based on the frame-set matching result and the reference videos. By combining audio and video, the scheme not only improves the robustness of the video copy detection system but also, by fusing audio and video features, greatly accelerates the execution of the copy detection system; joint audio-video analysis improves the localization accuracy of copied segments.
Based on the above hardware structure, embodiments of the audio-video copy detection method of the present invention are proposed.
As shown in Fig. 2, a first embodiment of the present invention proposes an audio-video copy detection method, comprising:
Step S101: obtain audio-video content, decode and pre-process the audio-video content, and obtain an audio portion and video frames of the audio-video content;
Specifically, first, the audio-video content that needs copy detection is obtained; it may be obtained locally, or from an external source over a network.
The obtained audio-video content is decoded and pre-processed: the audio of the video is extracted and downsampled to mono 5512.5 Hz, and each video frame is extracted frame by frame, thereby obtaining the audio portion and the individual video frames of the audio-video content.
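The audio pre-processing step above can be sketched as follows. This is a minimal sketch under stated assumptions: the input is 44.1 kHz float PCM (so the 5512.5 Hz target rate is exactly a decimation by 8), SciPy's polyphase resampler is used, and the function name is illustrative rather than from the patent.

```python
import numpy as np
from scipy.signal import resample_poly

def to_mono_5512(samples: np.ndarray, sr: int = 44100) -> np.ndarray:
    """Mix to mono and downsample to 5512.5 Hz (= 44100 / 8).

    `samples` is float PCM, shape (n,) for mono or (n, channels).
    The 44.1 kHz input rate is an assumption of this sketch.
    """
    if samples.ndim == 2:                 # average the channels -> mono
        samples = samples.mean(axis=1)
    assert sr == 44100, "sketch assumes 44.1 kHz input"
    # 5512.5 Hz is exactly 44100/8, so a polyphase decimation by 8 suffices
    return resample_poly(samples, up=1, down=8)
```

For inputs at other sample rates, `up` and `down` would be chosen as the reduced ratio of target to source rate.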
Step S102: perform feature extraction on the audio portion and the video frames of the audio-video content to obtain the corresponding audio features and image features of the video frames;
This part performs feature extraction on the audio corresponding to a video and on all of its video frames. Because audio features are naturally represented as binary bits, binary indexing or LSH is often used to accelerate queries. The audio features extracted in the present invention are audio sub-band energy difference features, and the image features extracted from the video frames are DCT (Discrete Cosine Transform) features.
The process of performing feature extraction on the audio portion of the audio-video content to obtain its corresponding audio features includes:
filtering each audio frame of the audio portion and transforming it into frequency-domain energy by Fourier transform; dividing the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship; computing the differences of the absolute energy values between adjacent sub-bands to obtain the audio sub-band energy difference feature of each audio frame; and sampling audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio portion.
More specifically, the extraction process of the audio sub-band energy difference feature in this embodiment is shown in Fig. 3:
The main steps of the algorithm for extracting the audio sub-band energy difference feature are:
First, every 0.37 seconds of time-domain audio waveform (an audio frame) is filtered by a Hanning window and transformed by Fourier transform into frequency-domain energy;
Second, the obtained frequency-domain energy is divided according to a logarithmic (Bark-scale) relationship into 33 sub-bands lying within the range of the human auditory model (300 Hz to 2000 Hz), and the differences of the absolute energy values between adjacent sub-bands of consecutive frames (11 milliseconds apart) are computed, so that a 32-bit audio feature is obtained for each audio frame.
A "1" indicates that the energy difference between two adjacent sub-bands of the current audio frame is greater than the corresponding adjacent-sub-band energy difference of the next audio frame; otherwise the bit is 0.
The detailed process is as follows:
In Fig. 3, the input is a segment of audio; the output is the several (n) audio features corresponding to that segment.
Framing: the audio segment is cut into several (n) audio frames. In this example, M = 2048 audio frames are taken per second (M may be another value in other examples), each audio frame containing 0.37 seconds of audio (adjacent audio frames overlap by 2047/2048).
Fourier Transform: converts the time-domain waveform (the original audio) into the energy of waves in different frequency ranges in the frequency domain, for ease of analysis.
ABS: takes the absolute value of the wave energy (i.e., only amplitude is considered, not direction of vibration).
Band Division: the spectrum between 300 Hz and 2000 Hz is divided into 33 non-overlapping frequency bands according to a logarithmic relationship (i.e., the lower the frequency, the narrower the band it belongs to). This yields the energy of the original audio on these different bands.
Energy Computation: computes the energy value of each audio frame on these 33 bands (each audio frame yields 33 energy values).
Bit Derivation: the 33 energy values are compared in turn (the energy of the i-th sub-band is compared with the energy of the (i+1)-th sub-band), giving 32 energy-value differences. These 32 differences are then compared between the current audio frame a and the next audio frame b: if the j-th difference of a is greater than the j-th difference of b, the j-th bit of a's feature is 1; otherwise it is 0. The resulting 32 comparisons between a and b form the 32-bit feature of audio frame a.
The present invention adopts this audio feature and samples audio frames at intervals of 1/2048 second, so that 2048 32-bit audio features are generated for every second of audio.
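The framing, windowing, band-division, and bit-derivation steps above can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the hop size, exact band-edge spacing, and energy pooling are assumptions consistent with the description (33 log-spaced bands between 300 Hz and 2000 Hz, and 32 bits per frame from adjacent-band, adjacent-frame differences).

```python
import numpy as np

def audio_fingerprint(audio, sr=5512, n_bands=33, frame_len=2048, hop=64):
    """Sketch of the sub-band energy-difference feature described above.

    Each ~0.37 s frame (2048 samples at ~5512 Hz) is Hanning-windowed and
    FFT'd; spectral energy is pooled into 33 log-spaced bands between
    300 Hz and 2000 Hz, and comparing adjacent-band energy differences of
    consecutive frames yields 32 bits per frame. The hop of 64 samples is
    illustrative (the embodiment samples frames every 1/2048 s).
    """
    window = np.hanning(frame_len)
    edges = np.geomspace(300.0, 2000.0, n_bands + 1)   # 33 log-spaced bands
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)

    band_energy = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(audio[start:start + frame_len] * window))
        band_energy.append([np.sum(spectrum[(freqs >= lo) & (freqs < hi)] ** 2)
                            for lo, hi in zip(edges[:-1], edges[1:])])
    E = np.asarray(band_energy)            # shape (n_frames, 33)

    diffs = E[:, :-1] - E[:, 1:]           # 32 adjacent-band differences
    # bit j of frame t is 1 when its j-th difference exceeds frame t+1's
    return (diffs[:-1] > diffs[1:]).astype(np.uint8)   # (n_frames - 1, 32)
```

Each row of the returned array is one frame's 32-bit feature, ready to be packed into a 32-bit integer.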
The process of performing feature extraction on the video frames of the audio-video content to obtain the image features of the corresponding video frames may include:
for each video frame, converting its image into a grayscale image and compressing it; dividing the compressed grayscale image into several sub-blocks; computing the DCT energy value of each sub-block; comparing the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame; and, according to this process, obtaining the image DCT features of all the video frames of the audio-video content.
More specifically, the process by which this embodiment extracts the image DCT features of the video frames is shown in Fig. 4:
In view of the fact that the overall picture of internet video tends to change little, the embodiment of the present invention selects an efficient global image feature as the image feature of a video frame: the DCT feature.
The idea of the DCT feature is to divide the image into several sub-blocks and compare the energy levels of adjacent sub-blocks, thereby capturing the energy distribution of the entire image. The concrete algorithm steps are:
First, the color image is converted into a grayscale image and compressed (changing the aspect ratio) to 64 pixels wide by 32 pixels high.
Then, the grayscale image is divided into 32 sub-blocks (0 to 31, as shown in Fig. 4), each block containing 8x8 pixels of the image.
For each sub-block, its DCT energy value is computed; the absolute value of the full-band DCT energy is chosen to represent the energy of the sub-block.
Finally, the relative sizes of the energy values of adjacent sub-blocks are computed to obtain a 32-bit feature: if the energy of the i-th sub-block is greater than the energy of the (i+1)-th sub-block, the i-th bit is 1, otherwise it is 0. In particular, the 31st sub-block is compared with the 0th sub-block.
Through the above process, each video frame yields a 32-bit image DCT feature.
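Under one plausible reading of the steps above (block energy taken as the sum of absolute 2-D DCT coefficients, which the description leaves open), the per-frame feature can be sketched as:

```python
import numpy as np
from scipy.fft import dct

def frame_dct_feature(gray: np.ndarray) -> int:
    """Sketch of the 32-bit image DCT feature described above.

    `gray` is a grayscale frame already resized to 64x32 (width x height),
    i.e. an array of shape (32, 64). It is cut into 32 blocks of 8x8
    pixels; each block's energy is taken here as the sum of the absolute
    2-D DCT coefficients (an assumption), and comparing block i with
    block i+1 (block 31 wraps around to block 0) yields one bit each.
    """
    assert gray.shape == (32, 64), "expects a 64x32 (w x h) grayscale frame"
    energies = []
    for r in range(0, 32, 8):                    # 4 rows of blocks
        for c in range(0, 64, 8):                # 8 columns -> 32 blocks
            block = gray[r:r + 8, c:c + 8].astype(float)
            coeffs = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
            energies.append(np.abs(coeffs).sum())
    bits = 0
    for i in range(32):
        if energies[i] > energies[(i + 1) % 32]: # block 31 vs block 0
            bits |= 1 << i
    return bits
```

Because only relative block energies matter, the resulting 32-bit value is robust to global brightness and contrast changes.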
Step S103: fusing the audio features corresponding to the audiovisual presentation with the image features of the video frames, to obtain the audio-video fusion feature of the audiovisual presentation;

After the audio features of the video and the image features of the video frames have been obtained by the above process, the obtained image features and audio features are fused. The specific fusion method is shown in Figure 5 (where the vertical axis is the time axis).

As shown in Figure 5, in this embodiment the audio part yields M=2048 (this value is configurable) 32-bit features per second, while the video frames yield n 32-bit features per second (n being the frame rate of the video, usually no more than 60).

This embodiment therefore performs feature fusion by mapping one video frame to several audio frames: 2048 64-bit audio-video fusion features are generated per second, where each fusion feature corresponds to an individual audio frame, and every 2048/n adjacent audio-video fusion features share the image DCT feature of the same video frame.

By fusing the audio features of the audiovisual presentation with the image features of its video frames as described above, the audio-video fusion feature of the audiovisual presentation is obtained.
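A sketch of this fusion step under stated assumptions: the bit layout (image feature in the high 32 bits, audio feature in the low 32 bits) is a hypothetical choice, since the text only says that each 64-bit fusion feature combines one 32-bit audio-frame feature with the DCT feature of the matching video frame.

```python
def fuse_features(audio_feats, video_feats, m=2048):
    """Fuse one second of features: audio_feats holds m 32-bit audio features,
    video_feats holds n 32-bit image features (n = frame rate, m % n == 0 assumed)."""
    n = len(video_feats)
    per_frame = m // n                    # audio frames sharing one video frame
    fused = []
    for i, a in enumerate(audio_feats):
        v = video_feats[i // per_frame]   # image feature of the matching video frame
        fused.append((v << 32) | a)       # hypothetical layout: video high, audio low
    return fused
```

With n = 32 video frames per second, every run of 2048/32 = 64 consecutive fusion features carries the same image feature, as described above.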
Step S104: matching the audio-video fusion features based on the feature database of the preset reference video, to obtain the frame collection matching result of the audiovisual presentation;

This embodiment presets a feature database of reference videos, in which a matching table has been created so that the corresponding features of a video to be detected can be retrieved quickly.

When matching the audio-video fusion features, the matching table is first obtained from the feature database of the preset reference video. For each audio-video fusion feature, features satisfying a preset condition are queried from the matching table as the similar features of that fusion feature; for example, features whose Hamming distance from the audio-video fusion feature does not exceed a preset threshold (e.g. 3) are queried from the matching table as its similar features. The similar features of all the audio-video fusion features together constitute the frame collection matching result of the audiovisual presentation.
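The similar-feature criterion (Hamming distance at most a threshold such as 3) can be stated directly. The brute-force scan below is only the reference definition; the matching tables described in the following paragraphs exist to avoid exactly this linear scan.

```python
def hamming(a, b):
    """Hamming distance between two 64-bit features held as Python ints."""
    return bin(a ^ b).count("1")

def similar_features(query, feature_db, max_dist=3):
    """All features within Hamming distance max_dist of the query (reference check)."""
    return [f for f in feature_db if hamming(query, f) <= max_dist]
```

For example, 0b1110 is within distance 3 of both 0b1111 (1 bit differs) and 0 (3 bits differ).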
More specifically, this embodiment takes the following into consideration:

For a query video (the video to be checked for copying) and a reference video, comparing the similarity of their features frame by frame requires time proportional to the length of each of the two videos, which makes the approach hard to scale to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query matching strategy for audio-video fusion features.

The basic goal of the simhash index is: in a database of many 64-bit features, for a queried 64-bit feature, quickly find all features whose Hamming distance from it is at most 3 (i.e. at most 3 of the 64 bits differ from the query feature). The principle of the algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into four 16-bit blocks, at least one 16-bit block must be exactly identical to the query feature. Similarly, within the remaining 48 bits, there must exist a 12-bit block exactly identical to the query feature. After two levels of index lookup, the at most 3 differing bits can be enumerated within the remaining 36 bits, which greatly reduces the complexity of the original algorithm.
The 64-bit audio-video fusion features used in the present invention have the same query characteristics as simhash: for a given 64-bit feature, all features differing from it by at most 3 bits must be found (two such features are considered related). In addition, there is a further restriction: the first 32 bits of two related features differ by at most 2 bits, and the last 32 bits also differ by at most 2 bits. Based on this, this embodiment follows the simhash approach but expands the number of index tables to 24. The specific expansion is shown in Figure 7:

In the matching algorithm design of Figure 7, consider the case where the last 32 bits differ by at most 1 bit while the first 32 bits differ by at most 2 bits. As Figure 7 shows, at least 2 of the blocks A, B, C, D are then exactly identical, and at least one of E, F is exactly identical, so a matching table over 32 exactly-matching bits can be constructed. There are C(4,2)*C(2,1)*2 such query tables, the final factor of 2 covering the symmetric case in which the roles of the two halves are exchanged. In total, therefore, 24 sub-tables can be constructed and used as the created matching tables for fast lookup of audio-video fusion features.
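The sub-table count can be checked mechanically. The sketch below assumes, as described, four 8-bit blocks A to D over the first 32 bits and two 16-bit blocks E and F over the last 32 bits.

```python
from math import comb

def num_subtables():
    """Number of exact-match sub-tables under the block partition described above."""
    # Choose 2 intact 8-bit blocks out of A..D and 1 intact 16-bit block out of E, F;
    # the factor of 2 covers the symmetric case with the roles of the halves swapped.
    return comb(4, 2) * comb(2, 1) * 2
```

Each choice yields a 32-bit exact-match key (two 8-bit blocks plus one 16-bit block), and 6 * 2 * 2 = 24 sub-tables result, matching the count in the text.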
Then, by querying the matching tables constructed above, the similar features of the audio-video fusion features are obtained as the result of the feature retrieval.

Step S105: performing copy judgement and positioning on the audiovisual presentation, based on the frame collection matching result of the audiovisual presentation and the reference video.

According to the feature retrieval result obtained in the above process, combined with a video copy segment positioning method, it is judged whether the query video is a copy. If the query video is judged to be a copy, the corresponding copy segment positioning is given.
This embodiment takes the following into consideration: for two videos, computing the similarity between every pair of frames yields the similarity matrix shown at the far right of Figure 8. Finding the similar segments of the two videos is thereby converted into finding line segments in the similarity matrix whose similarity exceeds a certain threshold; this way of processing, however, is costly in time.

The principle of copy judgement and positioning in this embodiment is: the above matching algorithm can find the brightest points in the similarity matrix (representing the highest similarities), such as the bright spots shown at the far left of Figure 8, and extend these points in time, thereby obtaining the candidate similar segments (i.e. possible copy segments) shown in the middle of Figure 8. These candidates are then screened by a threshold, which makes it possible to decide whether two videos constitute a copy and, if they do, to record the start and end times of the similar segments.

Specifically, when performing copy judgement and positioning on the audiovisual presentation, the audio/video frames of the reference video corresponding to the similar features obtained above (the bright spots in the leftmost diagram of Figure 8) are first extended in time to obtain a reference video segment, and the audio/video frames of the corresponding audiovisual presentation are likewise extended in time to obtain the similar segment that the audiovisual presentation forms relative to the reference video (as shown in the middle diagram of Figure 8). The similarity between the similar segment of the audiovisual presentation and the reference video segment is then calculated, i.e. the similarity between each audio/video frame of the similar segment and the corresponding audio/video frame of the reference video segment is computed, and the per-frame similarities are averaged. If this similarity exceeds the set threshold, the audiovisual presentation is judged to constitute a copy, and the start and end positions of its similar segment are recorded.

That is, when calculating the similarity between the audio/video frames of the similar segment of the audiovisual presentation and the reference video, each frame of the similar segment (a 64-bit feature) is compared against the corresponding frame of the reference video segment, the per-frame similarities are computed and averaged, and the average is compared with the preset threshold; if it exceeds the threshold, the audiovisual presentation is judged to constitute a copy, and the start and end positions of the similar segment are recorded.
An example follows:

Suppose the similar segment maps the 100 frames between seconds 10 and 20 of the query video (i.e. one audio-video sequence) to the 100 frames between seconds 30 and 40 of the reference video. Each of the 100 query frames is then compared with the corresponding reference frame and the per-frame similarity is calculated: for example, if 50 of the 64 bits of the first frame match the reference video frame, the similarity of the first frame is S1 = 50/64 ≈ 0.78125. In the same way, the similarity S2 of the second frame is obtained, and so on up to the similarity S100 of the 100th frame. Averaging these similarities gives the similarity between the query video and the reference video over the similar segment; suppose it is 0.95. Comparing it with the set threshold (set to 0.9), the query video is judged to constitute a copy, and the start and end positions of the similar segment are recorded.
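The worked example corresponds to the following sketch; the 0.9 threshold and the 50/64 per-frame similarity are the figures from the example above.

```python
def frame_similarity(a, b):
    """Fraction of the 64 bits two fusion features share."""
    return (64 - bin(a ^ b).count("1")) / 64

def segment_similarity(query_frames, ref_frames):
    """Average per-frame similarity over an aligned candidate segment."""
    sims = [frame_similarity(q, r) for q, r in zip(query_frames, ref_frames)]
    return sum(sims) / len(sims)

def is_copy(query_frames, ref_frames, threshold=0.9):
    """Copy judgement: segment average compared with the set threshold."""
    return segment_similarity(query_frames, ref_frames) >= threshold
```

Two features differing in 14 of 64 bits give exactly the 50/64 similarity of the first frame in the example.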
In the above copy judgement and positioning process, a query video may contain multiple similar segments; in that case the multiple similar segments are chained together and recorded.
It should be noted that, in the above process of this embodiment, when judging from the frame collection matching result whether the query video is a copy of some video in the reference video library, other algorithms may also be used, such as the Hough transform, the Smith-Waterman algorithm, the BLAST algorithm, or the temporal pyramid algorithm. These algorithms find the most similar sequence of segments between the query video and a reference video, and a threshold decides whether a copy is constituted. For a video judged to be a copy, the start and end of the copied segment are determined, and that part is marked as the copy segment.

Through the above scheme, this embodiment combines audio with video, which not only improves the robustness of the video copy detection system but also, by fusing audio and video features, greatly accelerates the execution of the copy detection system; joint audio-video analysis improves the accuracy of copy segment positioning.
As shown in Figure 9, the second embodiment of the present invention proposes an audio-video copy detection method. On the basis of the above embodiment, before the step of obtaining the audiovisual presentation, the method further includes:

Step S100: creating the matching table in the feature database of the reference video.

Specifically, the matching table is created so that the corresponding features of a video to be detected can be retrieved quickly.

The matching table is created based on the reference videos; the specific creation process is as follows:
First, reference video segments are collected, and audio/video decoding and preprocessing are performed on them to obtain the audio part and the video frames of each reference video.

Then, feature extraction is performed on the audio part and the video frames of the reference video, to obtain the audio features of the reference video and the image features of its video frames.

Next, audio and video feature fusion is performed on the reference video, to obtain the audio-video fusion features of the reference video.

Finally, the matching table is created based on the audio-video fusion features of the reference video, for the feature index retrieval and matching of subsequent query videos.
The matching table is created from the audio-video fusion features of the reference video based on the following principle:

For a query video (the video to be checked for copying) and a reference video, comparing the similarity of their features frame by frame requires time proportional to the length of each of the two videos, which makes the approach hard to scale to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query matching strategy for audio-video fusion features.

The basic goal of the simhash index is: in a database of many 64-bit features, for a queried 64-bit feature, quickly find all features whose Hamming distance from it is at most 3 (i.e. at most 3 of the 64 bits differ from the query feature). The principle of the algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into four 16-bit blocks, at least one 16-bit block must be exactly identical to the query feature. Similarly, within the remaining 48 bits, there must exist a 12-bit block exactly identical to the query feature. After two levels of index lookup, the at most 3 differing bits can be enumerated within the remaining 36 bits, which greatly reduces the complexity of the original algorithm.

The 64-bit audio-video fusion features used in the present invention have the same query characteristics as simhash: for a given 64-bit feature, all features differing from it by at most 3 bits must be found (two such features are considered related). In addition, there is a further restriction: the first 32 bits of two related features differ by at most 2 bits, and the last 32 bits also differ by at most 2 bits. Based on this, this embodiment follows the simhash approach but expands the number of index tables to 24. The specific expansion is shown in Figure 7:

In the matching algorithm design of Figure 7, consider the case where the last 32 bits differ by at most 1 bit while the first 32 bits differ by at most 2 bits. As Figure 7 shows, at least 2 of the blocks A, B, C, D are then exactly identical, and at least one of E, F is exactly identical, so a matching table over 32 exactly-matching bits can be constructed. There are C(4,2)*C(2,1)*2 such query tables, the final factor of 2 covering the symmetric case in which the roles of the two halves are exchanged. In total, therefore, 24 sub-tables can be constructed and used as the created matching tables for fast lookup of audio-video fusion features.
Correspondingly, a functional-module embodiment of the audio-video copy detection device of the embodiments of the present invention is proposed.

As shown in Figure 10, the first embodiment of the present invention proposes an audio-video copy detection device, comprising: a decoding and preprocessing module 201, a feature extraction module 202, a fusion module 203, a matching module 204, and a copy determination module 205, wherein:

the decoding and preprocessing module 201 is configured to obtain an audiovisual presentation, and to decode and preprocess it to obtain the audio part and the video frames of the audiovisual presentation;

the feature extraction module 202 is configured to perform feature extraction on the audio part and the video frames of the audiovisual presentation, to obtain the corresponding audio features and the image features of the video frames;

the fusion module 203 is configured to fuse the audio features corresponding to the audiovisual presentation with the image features of the video frames, to obtain the audio-video fusion feature of the audiovisual presentation;

the matching module 204 is configured to match the audio-video fusion features based on the feature database of the preset reference video, to obtain the frame collection matching result of the audiovisual presentation;

the copy determination module 205 is configured to perform copy judgement and positioning on the audiovisual presentation, based on the frame collection matching result of the audiovisual presentation and the reference video.
Specifically, first, the audiovisual presentation to be checked for copying is obtained; it can be obtained locally, or obtained from outside through the network.

The obtained audiovisual presentation is decoded and preprocessed: the audio of the video is extracted and downsampled to mono 5512.5 Hz, and each video frame is extracted frame by frame, thereby obtaining the audio part of the audiovisual presentation and the video frame of each frame.

Afterwards, feature extraction is performed on the audio part and the video frames of the audiovisual presentation, to obtain the corresponding audio features and the image features of the video frames.

This part performs feature extraction on the audio corresponding to a video and on all its video frames. Since audio features are easily represented as binary bits, binary indexes or LSH are often used to accelerate queries. The audio feature extracted by the present invention is the audio sub-band energy difference feature, and the image feature extracted from the video frames is the DCT (Discrete Cosine Transform) feature.
The process of performing feature extraction on the audio part of the audiovisual presentation, to obtain the corresponding audio features of the audiovisual presentation, includes:

filtering each audio frame of the audio part of the audiovisual presentation and transforming it into frequency-domain energy by Fourier transform; dividing the obtained frequency-domain energy into several sub-bands within a predetermined frequency range according to a logarithmic relationship; calculating the difference of the absolute values of the energies between adjacent sub-bands, to obtain the audio sub-band energy difference feature of each audio frame; and sampling the audio frames at a predetermined interval, to obtain the audio sub-band energy difference features of the audio part of the audiovisual presentation.
More specifically, the extraction process of the audio sub-band energy difference feature in this embodiment is shown in Figure 3.

The main steps of the algorithm involved in extracting the audio sub-band energy difference feature are:

First, every 0.37 seconds of time-domain audio waveform (an audio frame) is filtered by a Hanning window and then transformed into frequency-domain energy by Fourier transform.

Second, the obtained frequency-domain energy is divided, according to a logarithmic (Bark-scale) relationship, into 33 sub-bands lying within the range of the human auditory model (300 Hz to 2000 Hz), and the differences of the absolute energy values between adjacent sub-bands of consecutive frames (11 ms apart) are calculated, so that a 32-bit audio feature is obtained for each audio frame.

A '1' bit means that the energy difference between the two corresponding adjacent sub-bands of the current audio frame is greater than the corresponding adjacent-sub-band energy difference of the next audio frame; otherwise the bit is 0.
The detailed process is as follows:

In Figure 3, the input is an audio segment; the output is the several (n) audio features corresponding to that segment.

Framing: the audio segment is cut into several (n) audio frames. In the example, 2048 audio frames are taken per second, each containing 0.37 seconds of audio (adjacent audio frames overlap by 2047/2048).

Fourier Transform: the time-domain waveform (raw audio) is converted into the energy of waves of different frequency ranges in the frequency domain, for ease of analysis and processing.

ABS: the absolute value of the wave energy information is taken (i.e. only the amplitude is considered, not the direction of vibration).

Band Division: the frequency domain between 300 Hz and 2000 Hz is divided into 33 non-overlapping frequency bands, divided according to a logarithmic relationship (i.e. the lower the frequency, the narrower the band it belongs to). In this way, the energy of the raw audio on each of these bands is obtained.

Energy Computation: the energy value of each audio frame on these 33 bands is calculated (each audio frame yields 33 energy values).

Bit Derivation: the 33 energy values are compared pairwise in turn (the energy of the i-th sub-band is compared with that of the (i+1)-th sub-band), to obtain 32 energy-value differences. The 32 energy-value differences of the current audio frame a are then compared with those of the next audio frame b: if the j-th energy-value difference of a is greater than the j-th energy-value difference of b, the j-th bit of a's feature is 1; otherwise it is 0. The resulting size relationships of the 32 energy-value differences between a and b form the 32-bit feature of audio frame a.

The present invention adopts this audio feature and samples the audio frames at 1/2048-second intervals, so that 2048 32-bit audio features are generated for each second of audio.
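A rough sketch of this extraction pipeline, under assumptions the text leaves open: the exact band edges of the logarithmic division (here `np.geomspace`) and the definition of band energy (here the sum of spectral magnitudes) are hypothetical fill-ins.

```python
import numpy as np

SR = 5512.5                           # sample rate after downmixing, from the text
FRAME = int(SR * 0.37)                # ~0.37 s of samples per audio frame
EDGES = np.geomspace(300, 2000, 34)   # 33 log-spaced bands in 300-2000 Hz

def band_energies(frame):
    """33 band energies of one Hann-windowed audio frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1 / SR)
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(EDGES[:-1], EDGES[1:])])

def subband_feature(frame_a, frame_b):
    """32-bit energy-difference feature of frame a relative to the next frame b."""
    da = np.diff(band_energies(frame_a))   # 32 adjacent-band energy differences
    db = np.diff(band_energies(frame_b))
    bits = 0
    for j in range(32):
        if da[j] > db[j]:                  # bit j is 1 iff a's j-th difference is larger
            bits |= 1 << j
    return bits
```

Comparing a frame with itself yields identical difference vectors, so every strict comparison fails and the feature is 0, consistent with the bit rule above.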
The process of performing feature extraction on the video frames of the audiovisual presentation, to obtain the image features of the corresponding video frames, may include:

for each video frame of the audiovisual presentation, converting its image into a grayscale image and compressing it; dividing the compressed grayscale image into several sub-blocks; calculating the DCT energy value of each sub-block; comparing the DCT energy values of adjacent sub-blocks, to obtain the image DCT feature of the video frame; and, following the above process, obtaining the image DCT features of the video frames of the audiovisual presentation.

More specifically, the process by which this embodiment extracts the image DCT feature of a video frame of the audiovisual presentation is shown in Figure 4.

In view of the fact that the overall picture of internet videos changes little in amplitude, the embodiment of the present invention selects an efficient global image feature as the image feature of the video frame: the DCT feature.

The idea of the DCT feature is to divide the image into several sub-blocks and compare the energy levels of adjacent sub-blocks, so as to obtain the energy distribution of the whole image. The specific steps are:

First, the color image is converted into a grayscale image and compressed (changing the aspect ratio) to 64 pixels wide and 32 pixels high.
Then, the grayscale image is divided into 32 sub-blocks (numbered 0 to 31, as shown in Figure 4), each comprising 8x8 pixels.

For each sub-block, the DCT energy value of the sub-block is calculated, and the absolute value of the band energy value is selected to represent the energy of the sub-block.

Finally, the relative sizes of the energy values of adjacent sub-blocks are compared to obtain a 32-bit feature: if the energy of the i-th sub-block is greater than that of the (i+1)-th sub-block, the i-th bit is 1, otherwise it is 0. In particular, the 31st sub-block is compared with the 0th sub-block.

Through the above process, each video frame yields a 32-bit image DCT feature.
After the audio features of the video and the image features of the video frames have been obtained by the above process, the obtained image features and audio features are fused. The specific fusion method is shown in Figure 5 (where the vertical axis is the time axis).

As shown in Figure 5, in this embodiment the audio part yields M=2048 (this value is configurable) 32-bit features per second, while the video frames yield n 32-bit features per second (n being the frame rate of the video, usually no more than 60).

This embodiment therefore performs feature fusion by mapping one video frame to several audio frames: 2048 64-bit audio-video fusion features are generated per second, where each fusion feature corresponds to an individual audio frame, and every 2048/n adjacent audio-video fusion features share the image DCT feature of the same video frame.

By fusing the audio features of the audiovisual presentation with the image features of its video frames as described above, the audio-video fusion feature of the audiovisual presentation is obtained.
Afterwards, based on the feature database of the preset reference video, the audio-video fusion features are matched, to obtain the frame collection matching result of the audiovisual presentation.

This embodiment presets a feature database of reference videos, in which a matching table has been created so that the corresponding features of a video to be detected can be retrieved quickly.

When matching the audio-video fusion features, the matching table is first obtained from the feature database of the preset reference video. For each audio-video fusion feature, features satisfying a preset condition are queried from the matching table as the similar features of that fusion feature; for example, features whose Hamming distance from the audio-video fusion feature does not exceed a preset threshold (e.g. 3) are queried from the matching table as its similar features. The similar features of the audio-video fusion features together constitute the frame collection matching result of the audiovisual presentation.
More specifically, this embodiment takes the following into consideration:

For a query video (the video to be checked for copying) and a reference video, comparing the similarity of their features frame by frame requires time proportional to the length of each of the two videos, which makes the approach hard to scale to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query matching strategy for audio-video fusion features.

The basic goal of the simhash index is: in a database of many 64-bit features, for a queried 64-bit feature, quickly find all features whose Hamming distance from it is at most 3 (i.e. at most 3 of the 64 bits differ from the query feature). The principle of the algorithm is illustrated in Figure 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into four 16-bit blocks, at least one 16-bit block must be exactly identical to the query feature. Similarly, within the remaining 48 bits, there must exist a 12-bit block exactly identical to the query feature. After two levels of index lookup, the at most 3 differing bits can be enumerated within the remaining 36 bits, which greatly reduces the complexity of the original algorithm.

The 64-bit audio-video fusion features used in the present invention have the same query characteristics as simhash: for a given 64-bit feature, all features differing from it by at most 3 bits must be found (two such features are considered related). In addition, there is a further restriction: the first 32 bits of two related features differ by at most 2 bits, and the last 32 bits also differ by at most 2 bits. Based on this, this embodiment follows the simhash approach but expands the number of index tables to 24. The specific expansion is shown in Figure 7:

In the matching algorithm design of Figure 7, consider the case where the last 32 bits differ by at most 1 bit while the first 32 bits differ by at most 2 bits. As Figure 7 shows, at least 2 of the blocks A, B, C, D are then exactly identical, and at least one of E, F is exactly identical, so a matching table over 32 exactly-matching bits can be constructed. There are C(4,2)*C(2,1)*2 such query tables, the final factor of 2 covering the symmetric case in which the roles of the two halves are exchanged. In total, therefore, 24 sub-tables can be constructed and used as the created matching tables for fast lookup of audio-video fusion features.
Then, by querying the matching tables constructed above, the similar features of the audio-video fusion features are obtained as the result of the feature retrieval.

According to the feature retrieval result obtained in the above process, combined with a video copy segment positioning method, it is judged whether the query video is a copy. If the query video is judged to be a copy, the corresponding copy segment positioning is given.
This embodiment takes the following into consideration: for two videos, computing the similarity between every pair of frames yields the similarity matrix shown at the far right of Figure 8. Finding the similar segments of the two videos is thereby converted into finding line segments in the similarity matrix whose similarity exceeds a certain threshold; this way of processing, however, is costly in time.

The principle of copy judgement and positioning in this embodiment is: the above indexing algorithm can find the brightest points in the similarity matrix (representing the highest similarities), such as the bright spots shown at the far left of Figure 8, and extend these points in time, thereby obtaining the candidate similar segments (i.e. possible copy segments) shown in the middle of Figure 8. These candidates are then screened by a threshold, which makes it possible to decide whether two videos constitute a copy and, if they do, to record the start and end times of the similar segments.

Specifically, when performing copy judgement and positioning on the audiovisual presentation, the audio/video frames of the reference video corresponding to the similar features obtained above (the bright spots in the leftmost diagram of Figure 8) are first extended in time to obtain a reference video segment, and the audio/video frames of the corresponding audiovisual presentation are likewise extended in time to obtain the similar segment that the audiovisual presentation forms relative to the reference video (as shown in the middle diagram of Figure 8). The similarity between the similar segment of the audiovisual presentation and the reference video segment is then calculated, i.e. the similarity between each audio/video frame of the similar segment and the corresponding audio/video frame of the reference video segment is computed, and the per-frame similarities are averaged. If this similarity exceeds the set threshold, the audiovisual presentation is judged to constitute a copy, and the start and end positions of its similar segment are recorded.

That is, when calculating the similarity between the audio/video frames of the similar segment of the audiovisual presentation and the reference video, each frame of the similar segment (a 64-bit feature) is compared against the corresponding frame of the reference video segment, the per-frame similarities are computed and averaged, and the average is compared with the preset threshold; if it exceeds the threshold, the audiovisual presentation is judged to constitute a copy, and the start and end positions of the similar segment are recorded.
An example follows:
Suppose the similar segment matches 100 frames (an audio-video sequence) between seconds 10-20 of the query video against 100 frames between seconds 30-40 of the reference video. Each of the query's 100 frames is then compared with the corresponding reference frame, and a per-frame similarity is computed. For example, if 50 of the 64 bits of the first frame's feature match the reference frame, the similarity of the first frame is S1 = 50/64 ≈ 0.78125. By the same principle the similarity S2 of the second frame is obtained, and so on through S100 for the hundredth frame. Averaging these similarities gives the similarity between the query video and the reference video over the similar segment; suppose it is 0.95. Comparing it with the set threshold (say 0.9), the query video is judged to constitute a copy, and the start position and end position of the similar segment are recorded.
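The per-frame comparison and averaging above can be sketched as follows. This is a minimal illustration, not the patent's implementation: each frame is assumed to be represented by a 64-bit integer fingerprint, and the function names and threshold value are invented for the example.

```python
def frame_similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of bits two fingerprints have in common (matching bits / 64)."""
    differing = bin((a ^ b) & ((1 << bits) - 1)).count("1")
    return (bits - differing) / bits

def segment_is_copy(query_frames, ref_frames, threshold: float = 0.9):
    """Average the per-frame similarities over the segment and compare
    the average with the threshold, as in the worked example above."""
    sims = [frame_similarity(q, r) for q, r in zip(query_frames, ref_frames)]
    avg = sum(sims) / len(sims)
    return avg > threshold, avg
```

A frame whose fingerprint differs from the reference in 14 of 64 bits gets similarity 50/64 ≈ 0.78125, matching the S1 value in the example.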
In the copy judgment and localization process above, a query video may contain multiple similar segments; in that case the multiple similar segments can be chained together in the record.
It should be noted that, in the above process of this embodiment, when judging from the frame-set matching result whether the query video is a copy of some video in the reference library, other algorithms can also be used, such as the Hough transform, the Smith-Waterman algorithm, the BLAST algorithm, or temporal pyramid algorithms. These algorithms find the sequence segment of the query video most similar to some reference video, and a threshold determines whether a copy is constituted. For a video judged to be a copy, the start and end of the copied segment are determined, and that segment is marked as the copied segment.
Through the above scheme, this embodiment combines audio and video, which not only improves the robustness of the video copy detection system but also, by fusing audio and video features, greatly accelerates the execution efficiency of the copy detection system; joint audio-video analysis improves the localization accuracy of copied segments.
As shown in Fig. 11, a second embodiment of the present invention proposes an audio-video copy detection device which, in addition to the above embodiment, further includes:
A creation module 200, configured to create the matching table in the feature database of the reference videos.
Specifically, the matching table is created so that the relevant features of a video to be detected can be retrieved quickly.
The matching table is created from the reference videos; the specific creation process is as follows:
First, reference video clips are collected, and audio/video decoding and preprocessing are performed on them to obtain the audio part and video frames of each reference video.
Then, feature extraction is performed on the audio part and video frames of the reference video to obtain the audio features and video-frame image features of the reference video.
Next, audio-video feature fusion is performed on the reference video to obtain its audio-video fused features.
Finally, the matching table is created from the audio-video fused features of the reference video, for use in feature index retrieval of subsequent query videos.
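The final indexing step can be sketched as follows. This is a simplified, single-level variant of the simhash-style table described in the following paragraphs (one key per 16-bit block of each 64-bit fused feature); the function name and data layout are illustrative, not from the patent.

```python
from collections import defaultdict

def build_matching_table(reference_features):
    """reference_features: {video_id: [64-bit fused feature per frame]}.
    Index each feature under each of its four 16-bit blocks, so that any
    query within Hamming distance 3 shares at least one exact block key
    with it (3 differing bits cannot hit all 4 blocks)."""
    table = defaultdict(list)
    for vid, feats in reference_features.items():
        for idx, f in enumerate(feats):
            for blk in range(4):
                key = (blk, (f >> (16 * blk)) & 0xFFFF)
                table[key].append((vid, idx, f))
    return table
```

A real deployment would persist such a table per sub-index; here a plain in-memory dict stands in for the feature database.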
The principle behind creating the matching table from the audio-video fused features of the reference video is as follows:
Consider a query video (a video to be checked for copying) and a reference video. If their similarity is computed by comparing features frame by frame, the required time complexity is proportional to the lengths of both videos, which does not scale to large databases. Therefore, based on the existing simhash technique, the present invention proposes an index and query strategy for audio-video fused features.
The basic goal of simhash indexing is: in a database of many 64-bit features, for a given 64-bit query feature, quickly find all features whose Hamming distance from it is at most 3 (i.e., at most 3 of the 64 bits differ from the query). The algorithm is illustrated in Fig. 6. For 64-bit data with the Hamming distance limited to 3, if the 64 bits are divided into four 16-bit blocks, at least one 16-bit block must match the query feature exactly. Similarly, within the remaining 48 bits, some 12-bit block must match the query exactly. After these two index lookups, at most 3 differing positions need to be enumerated in the remaining 36 bits, which greatly reduces the complexity of the naive algorithm.
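The first-level lookup can be sketched as follows, assuming a table that indexes each reference feature under its four 16-bit blocks (a simplification of the two-level scheme just described; the names and table layout are illustrative):

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")

def query_matching_table(table, q: int, max_dist: int = 3):
    """Candidates must share at least one exact 16-bit block with q
    (pigeonhole: 3 differing bits cannot touch all 4 blocks); each
    candidate is then verified by its full Hamming distance."""
    seen, results = set(), []
    for blk in range(4):
        key = (blk, (q >> (16 * blk)) & 0xFFFF)
        for vid, idx, f in table.get(key, []):
            if (vid, idx) not in seen and hamming(q, f) <= max_dist:
                seen.add((vid, idx))
                results.append((vid, idx))
    return results
```

Only block-sharing candidates are ever compared in full, which is the source of the speedup over scanning the whole feature database.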
The 64-bit audio-video fused features used by the present invention have the same query characteristic as simhash, namely: for a given 64-bit feature, find all features differing from it by at most 3 bits (such pairs are considered related). In addition, there is the following constraint: the first 32 bits of two related features differ by at most 2 bits, and the last 32 bits likewise differ by at most 2 bits. Based on this, the present embodiment follows the simhash approach but expands the number of index tables to 24; the specific expansion is shown in Fig. 7:
In the matching algorithm design of Fig. 7, first consider the case where the last 32 bits differ by at most 1 bit; the first 32 bits then differ by at most 2 bits. Referring to Fig. 7, at least 2 of the blocks A, B, C, D are fully identical, and at least one of the blocks E, F is fully identical, so a matching table keyed on 32 fully identical bits can be constructed. There are C(4,2) * C(2,1) * 2 such tables, the final factor of 2 arising because it may equally be the first 32 bits that differ by at most 1 bit. In total, therefore, 24 sub-tables can be constructed and used as the matching table for fast lookup of audio-video fused features.
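The count C(4,2) × C(2,1) × 2 = 24 can be checked by enumerating the sub-table layouts. The block labels below follow Fig. 7's A-F naming as described above, treating A-D as the four blocks of the half allowed at most 2 differing bits and E, F as the two blocks of the half allowed at most 1; the enumeration itself is a sketch, not the patent's data structure.

```python
from itertools import combinations

def enumerate_subtables():
    """One layout per sub-table: fix two of the four blocks A-D exactly,
    plus one of the two blocks E, F exactly; the roles of the two 32-bit
    halves can be swapped, giving the final factor of 2."""
    layouts = []
    for swapped in (False, True):               # which half plays the A-D role
        for pair in combinations("ABCD", 2):    # C(4,2) = 6
            for big in "EF":                    # C(2,1) = 2
                layouts.append((swapped, pair, big))
    return layouts

print(len(enumerate_subtables()))  # 24
```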
The audio-video copy detection method and device of the embodiments of the present invention obtain an audio-video image, decode and preprocess it to obtain its audio part and video frames; perform feature extraction on the audio part and video frames to obtain the audio features and video-frame image features corresponding to the audio-video image; fuse the audio features and image features to obtain the audio-video fused features of the audio-video image; match the audio-video fused features against the feature database of preset reference videos to obtain the frame-set matching result of the audio-video image; and, based on the frame-set matching result and the reference videos, perform copy judgment and localization on the audio-video image. Combining audio and video in this way not only improves the robustness of the video copy detection system but also, by fusing audio and video features, greatly accelerates the execution efficiency of copy detection; joint audio-video analysis improves the localization accuracy of copied segments.
It should also be noted that, herein, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further restrictions, an element qualified by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part thereof that contributes over the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and including instructions that cause a terminal device (which may be a mobile phone, computer, server, network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit its scope; all equivalent structural or process transformations made using the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims (13)

1. An audio-video copy detection method, comprising:
obtaining an audio-video image, decoding and preprocessing the audio-video image, and obtaining an audio part and video frames of the audio-video image;
performing feature extraction on the audio part and video frames of the audio-video image to obtain audio features and video-frame image features corresponding to the audio-video image;
fusing the audio features and the video-frame image features corresponding to the audio-video image to obtain audio-video fused features of the audio-video image, comprising: fusing audio subband energy difference features of the audio part of the audio-video image with image DCT features of the video frames to obtain the audio-video fused features of the audio-video image;
matching the audio-video fused features against a feature database of preset reference videos to obtain a frame-set matching result of the audio-video image, comprising: obtaining a matching table from the feature database of the preset reference videos; for each audio-video fused feature, querying the matching table for features whose Hamming distance from the audio-video fused feature does not exceed a preset threshold, as the similar features of that audio-video fused feature; and obtaining the similar features of each audio-video fused feature to obtain the frame-set matching result of the audio-video image;
performing copy judgment and localization on the audio-video image based on the frame-set matching result of the audio-video image and the reference videos.
2. The method according to claim 1, wherein the step of performing feature extraction on the audio part of the audio-video image to obtain the corresponding audio features comprises:
filtering the audio frames of the audio part of the audio-video image, and transforming them into frequency-domain energy by Fourier transform;
dividing the obtained frequency-domain energy into several subbands over a predetermined frequency range according to a logarithmic relationship;
calculating the difference of the absolute values of the energy between adjacent subbands to obtain the audio subband energy difference feature of an audio frame;
sampling audio frames at a predetermined interval to obtain the audio subband energy difference features of the audio part of the audio-video image.
3. The method according to claim 1, wherein the step of performing feature extraction on the video frames of the audio-video image to obtain the image features of the corresponding video frames comprises:
for each video frame of the audio-video image, converting its image into a grayscale image and compressing it;
dividing the compressed grayscale image into several sub-blocks;
calculating the DCT energy value of each sub-block;
comparing the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame;
repeating the above process to obtain the image DCT features of the video frames of the audio-video image.
4. The method according to claim 1, 2 or 3, wherein the step of fusing the audio features and the video-frame image features corresponding to the audio-video image to obtain the audio-video fused features of the audio-video image comprises:
setting the audio features to be M 32-bit features per second, and the video-frame image features to be n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60;
concatenating features in a manner that maps one video frame to several audio frames, generating M 64-bit audio-video fused features per second, wherein each audio-video fused feature corresponds to the audio features of a single audio frame, and every M/n adjacent audio-video fused features correspond to the image features of one and the same video frame.
5. The method according to claim 1, wherein the step of performing copy judgment and localization on the audio-video image based on the frame-set matching result of the audio-video image and the reference videos comprises:
performing temporal extension on the audio/video frames of the reference video corresponding to the similar features to obtain a reference video segment of the reference video, and performing temporal extension on the audio/video frames of the audio-video image corresponding to the similar features to obtain a similar segment that the audio-video image forms with respect to the reference video;
calculating the similarity between the similar segment of the audio-video image and the reference video segment;
if the similarity is greater than a set threshold, judging that the audio-video image constitutes a copy, and recording the start position and end position of the similar segment of the audio-video image.
6. The method according to claim 1, wherein before the step of obtaining the audio-video image, the method further comprises:
creating the matching table in the feature database of the reference videos.
7. An audio-video copy detection device, comprising:
a decoding and preprocessing module, configured to obtain an audio-video image, decode and preprocess the audio-video image, and obtain an audio part and video frames of the audio-video image;
a feature extraction module, configured to perform feature extraction on the audio part and video frames of the audio-video image to obtain audio features and video-frame image features corresponding to the audio-video image;
a fusion module, configured to fuse the audio features and the video-frame image features corresponding to the audio-video image to obtain audio-video fused features of the audio-video image, comprising: fusing audio subband energy difference features of the audio part of the audio-video image with image DCT features of the video frames to obtain the audio-video fused features of the audio-video image;
a matching module, configured to match the audio-video fused features against a feature database of preset reference videos to obtain a frame-set matching result of the audio-video image;
a copy determination module, configured to perform copy judgment and localization on the audio-video image based on the frame-set matching result of the audio-video image and the reference videos;
wherein the matching module is further configured to obtain a matching table from the feature database of the preset reference videos; for each audio-video fused feature, query the matching table for features whose Hamming distance from the audio-video fused feature does not exceed a preset threshold, as the similar features of that audio-video fused feature; and obtain the similar features of each audio-video fused feature to obtain the frame-set matching result of the audio-video image.
8. The device according to claim 7, wherein
the feature extraction module is further configured to filter the audio frames of the audio part of the audio-video image and transform them into frequency-domain energy by Fourier transform; divide the obtained frequency-domain energy into several subbands over a predetermined frequency range according to a logarithmic relationship; calculate the difference of the absolute values of the energy between adjacent subbands to obtain the audio subband energy difference feature of an audio frame; and sample audio frames at a predetermined interval to obtain the audio subband energy difference features of the audio part of the audio-video image.
9. The device according to claim 7, wherein
the feature extraction module is further configured to, for each video frame of the audio-video image, convert its image into a grayscale image and compress it; divide the compressed grayscale image into several sub-blocks; calculate the DCT energy value of each sub-block; compare the DCT energy values of adjacent sub-blocks to obtain the image DCT feature of the video frame; and repeat the above process to obtain the image DCT features of the video frames of the audio-video image.
10. The device according to claim 7, 8 or 9, wherein
the fusion module is further configured to set the audio features to be M 32-bit features per second, and the video-frame image features to be n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60; and to concatenate features in a manner that maps one video frame to several audio frames, generating M 64-bit audio-video fused features per second, wherein each audio-video fused feature corresponds to the audio features of a single audio frame, and every M/n adjacent audio-video fused features correspond to the image features of one and the same video frame.
11. The device according to claim 7, wherein
the copy determination module is further configured to perform temporal extension on the audio/video frames of the reference video corresponding to the similar features to obtain a reference video segment of the reference video, and perform temporal extension on the audio/video frames of the audio-video image corresponding to the similar features to obtain a similar segment that the audio-video image forms with respect to the reference video; calculate the similarity between the similar segment of the audio-video image and the reference video segment; and, if the similarity is greater than a set threshold, judge that the audio-video image constitutes a copy and record the start position and end position of the similar segment of the audio-video image.
12. The device according to claim 7, further comprising:
a creation module, configured to create the matching table in the feature database of the reference videos.
13. A storage medium, wherein computer instructions are stored in the storage medium; when executed, the computer instructions implement the audio-video copy detection method according to any one of claims 1 to 6.
CN201510041044.3A 2015-01-27 2015-01-27 Audio-video copy detection method and device Active CN105989000B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510041044.3A CN105989000B (en) 2015-01-27 2015-01-27 Audio-video copy detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510041044.3A CN105989000B (en) 2015-01-27 2015-01-27 Audio-video copy detection method and device

Publications (2)

Publication Number Publication Date
CN105989000A CN105989000A (en) 2016-10-05
CN105989000B true CN105989000B (en) 2019-11-19

Family

ID=57034765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510041044.3A Active CN105989000B (en) 2015-01-27 2015-01-27 Audio-video copy detection method and device

Country Status (1)

Country Link
CN (1) CN105989000B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019895B (en) * 2017-07-27 2021-05-14 杭州海康威视数字技术股份有限公司 Image retrieval method and device and electronic equipment
CN110110502B (en) * 2019-04-28 2023-07-14 得一微电子股份有限公司 Anti-copy method and device for audio files and mobile storage device
CN110222719B (en) * 2019-05-10 2021-09-24 中国科学院计算技术研究所 Figure identification method and system based on multi-frame audio and video fusion network
CN111274449B (en) * 2020-02-18 2023-08-29 腾讯科技(深圳)有限公司 Video playing method, device, electronic equipment and storage medium

Citations (1)

Publication number Priority date Publication date Assignee Title
CN102737135A (en) * 2012-07-10 2012-10-17 北京大学 Video copy detection method and system based on soft cascade model sensitive to deformation

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN102737135A (en) * 2012-07-10 2012-10-17 北京大学 Video copy detection method and system based on soft cascade model sensitive to deformation

Non-Patent Citations (2)

Title
Listen, Look, and Gotcha: Instant Video Search with Mobile Phones by Layered Audio-Video Indexing; Wu Liu et al.; Proceedings of the 21st ACM International Conference on Multimedia; 2013-10-25; main text pp. 2-9 *
Content-based duplicate audio-video detection; Wu Siyuan; China Master's Theses Full-text Database, Information Science and Technology; 2013-11-15 (No. 11); pp. I138-743, main text pp. 14-16 and 28-32 *

Also Published As

Publication number Publication date
CN105989000A (en) 2016-10-05

Similar Documents

Publication Publication Date Title
US11564001B2 (en) Media content identification on mobile devices
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
US11482242B2 (en) Audio recognition method, device and server
CN111046235B (en) Method, system, equipment and medium for searching acoustic image archive based on face recognition
CN108989882B (en) Method and apparatus for outputting music pieces in video
CN110246512A (en) Sound separation method, device and computer readable storage medium
US11729458B2 (en) Media content identification on mobile devices
US10757468B2 (en) Systems and methods for performing playout of multiple media recordings based on a matching segment among the recordings
CN105989000B (en) Audio-video copy detection method and device
CN107293307A (en) Audio-frequency detection and device
CN109271533A (en) A kind of multimedia document retrieval method
CN105430494A (en) Method and device for identifying audio from video in video playback equipment
CN109949798A (en) Commercial detection method and device based on audio
CN111818385B (en) Video processing method, video processing device and terminal equipment
CN109117622A (en) A kind of identity identifying method based on audio-frequency fingerprint
US11537636B2 (en) System and method for using multimedia content as search queries
US20130191368A1 (en) System and method for using multimedia content as search queries
Kawale et al. Analysis and simulation of sound classification system using machine learning techniques
CN110019907A (en) A kind of image search method and device
CN113099283B (en) Method for synchronizing monitoring picture and sound and related equipment
CN114677627A (en) Target clue finding method, device, equipment and medium
CN114722234A (en) Music recommendation method, device and storage medium based on artificial intelligence
KR20080107143A (en) System and method for recommendation of music and moving video based on audio signal processing
WO2023160515A1 (en) Video processing method and apparatus, device and medium
CN103955708A (en) Face photo library fast-reduction method for face synthesis portrait recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211012

Address after: 35th Floor, Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen, Guangdong 518000

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.
