Disclosure of Invention
The invention mainly aims to provide a video similarity detection method, a video similarity detection apparatus, a terminal device and a computer-readable storage medium, so as to solve the technical problem that existing video similarity detection methods are inefficient at obtaining similarity detection results.
In order to achieve the above object, the present invention provides a video similarity detection method, including the following steps:
when a video to be detected is obtained, determining a video frame to be detected in the video to be detected;
screening out a video segment to be detected from the video to be detected by using the video frame to be detected;
matching a selected preset video segment in a preset video library, which meets a preset condition with respect to the video segment to be detected, with the video segment to be detected to obtain a video pair;
and carrying out similarity detection on the video pair to obtain a similarity detection result of the video to be detected.
Optionally, the step of screening out the video segment to be detected from the video to be detected by using the video frame to be detected includes:
determining a hash value of the video frame to be detected;
dividing the video to be detected into a plurality of video segments based on the hash value of the video frame to be detected;
sorting the plurality of video segments by duration from longest to shortest to obtain an ordered video segment set;
and determining the video segments at the front of the ordered video segment set as the video segments to be detected.
Optionally, the video frames to be detected are arranged according to a preset sequence; the step of dividing the video to be detected into a plurality of video segments based on the hash value of the video frame to be detected comprises:
determining a hash difference value of adjacent video frames in the video frames to be detected which are arranged according to the preset sequence based on the hash values of the video frames to be detected which are arranged according to the preset sequence;
determining a cutting frame in the video frames to be detected which are arranged according to the preset sequence based on the hash difference value;
and dividing the video to be detected into a plurality of video segments by using the cutting frame.
Optionally, before the step of matching the selected preset video segment in the preset video library, which meets the preset condition with respect to the video segment to be detected, with the video segment to be detected to obtain a video pair, the method further includes:
determining a first average hash value of the video segment to be detected based on the hash value of the video frame to be detected in the video segment to be detected;
acquiring a second average hash value of a preset video segment corresponding to each preset video in a preset video library;
establishing a similarity vector pool by using the first average hash value and the second average hash value;
the step of matching the selected preset video segment in the preset video library, which meets the preset condition with respect to the video segment to be detected, with the video segment to be detected to obtain a video pair includes the following steps:
determining a selected preset video segment in the preset video library by using the similarity vector pool;
and matching the video segment to be detected with the selected preset video segment to obtain a video pair.
Optionally, the step of determining the selected preset video segment in the preset video library by using the similarity vector pool includes:
determining a selected similarity vector with the similarity greater than or equal to a first preset threshold value in the similarity vector pool;
and determining the preset video segment in the preset video library that corresponds to the selected similarity vector as the selected preset video segment.
Optionally, the step of performing similarity detection on the video pair to obtain a similarity detection result of the video to be detected includes:
dividing the video pairs of which the selected preset video segments belong to the same preset video into a video group to obtain a plurality of video groups;
acquiring starting time information of video segments included in each video pair in each video group in the plurality of video groups;
determining the starting time difference between the video segment to be detected and the selected preset video segment included in each video pair in each video group based on the starting time information of the video segment included in each video pair in each video group;
dividing the video pairs with the same starting time difference in each video group into a sub-video group to obtain a plurality of sub-video groups corresponding to the plurality of video groups respectively;
acquiring a first hash difference value of a video segment to be detected included in each video pair in each sub video group in the plurality of sub video groups and a second hash difference value of a selected preset video segment;
determining the hash difference similarity between a video segment to be detected and a selected preset video segment in each video pair in each sub-video group based on the first hash difference value and the second hash difference value;
determining the area of which the hash difference similarity of each video pair in each sub-video group is greater than or equal to a second preset threshold as a similar area of each video pair in each sub-video group;
obtaining similarity detection results corresponding to the plurality of sub-video groups based on the similar regions of the video pairs in each sub-video group;
and obtaining a similarity detection result of the video to be detected based on the similarity detection results corresponding to the plurality of sub-video groups.
Optionally, before the step of determining the area where the hash difference similarity of each video pair in each sub-video group is greater than or equal to a second preset threshold as the similar area of each video pair in each sub-video group, the method further includes:
acquiring a first audio of a video segment to be detected and a second audio of a selected preset video segment in each video pair in the plurality of sub-video groups;
determining the audio similarity between a video segment to be detected and a selected preset video segment in each video pair in each sub-video group based on the first audio and the second audio;
the step of determining the area in which the hash difference similarity of each video pair in each sub-video group is greater than or equal to a second preset threshold as the similar area of each video pair in each sub-video group includes:
and determining the area in which the hash difference similarity of each video pair in each sub-video group is greater than or equal to a second preset threshold and the audio similarity of each video pair in each sub-video group is greater than or equal to a third preset threshold as the similar area of each video pair in each sub-video group.
In addition, to achieve the above object, the present invention further provides a video similarity detection apparatus, including:
the acquisition module is used for determining a video frame to be detected in the video to be detected when the video to be detected is acquired;
the screening module is used for screening a video segment to be detected from the video to be detected by utilizing the video frame to be detected;
the matching module is used for matching a selected preset video segment in a preset video library, which meets a preset condition with respect to the video segment to be detected, with the video segment to be detected so as to obtain a video pair;
and the detection module is used for carrying out similarity detection on the video pair so as to obtain a similarity detection result of the video to be detected.
In addition, to achieve the above object, the present invention further provides a terminal device, including: a memory, a processor, and a video similarity detection program stored on the memory and executable on the processor, wherein the video similarity detection program, when executed by the processor, implements the steps of the video similarity detection method according to any one of the above.
Furthermore, to achieve the above object, the present invention further provides a computer-readable storage medium, having a video similarity detection program stored thereon, where the video similarity detection program, when executed by a processor, implements the steps of the video similarity detection method according to any one of the above items.
The technical scheme of the invention provides a video similarity detection method, which includes: when a video to be detected is obtained, determining a video frame to be detected in the video to be detected; screening out a video segment to be detected from the video to be detected by using the video frame to be detected; matching a selected preset video segment in a preset video library, which meets a preset condition with respect to the video segment to be detected, with the video segment to be detected to obtain a video pair; and carrying out similarity detection on the video pair to obtain a similarity detection result of the video to be detected.
In the existing video similarity detection method, all video frames of the video to be detected and all video frames of a preset video are obtained and processed to obtain image feature points; the image feature points corresponding to the video to be detected are compared with those corresponding to the preset video, and a similarity detection result is obtained based on the comparison result. Feature point extraction must be performed on every video frame of both the video to be detected and the preset video, so the data processing amount is large, more computation time is consumed, and the similarity detection result is obtained slowly and inefficiently. In the present application, only part of the video frames, namely the video frames to be detected, need to be processed, and only the screened video segments to be detected are processed further, so the data processing amount is greatly reduced and the similarity detection result is obtained quickly. Therefore, the video similarity detection method improves the efficiency of video similarity detection.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a terminal device in a hardware operating environment according to an embodiment of the present invention.
The terminal device may be a User Equipment (UE) such as a mobile phone, a smart phone, a laptop, a digital broadcast receiver, a Personal Digital Assistant (PDA), a tablet computer (PAD), a handheld device, a vehicle-mounted device, a wearable device, a computing device or other processing device connected to a wireless modem, a Mobile Station (MS), etc. The terminal device may also be referred to as a user terminal, a portable terminal, a desktop terminal, etc.
In general, a terminal device includes: at least one processor 301, a memory 302, and a video similarity detection program stored on the memory and executable on the processor, the video similarity detection program configured to implement the steps of the video similarity detection method as described above.
The processor 301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 301 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 301 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. The processor 301 may further include an AI (Artificial Intelligence) processor for processing operations related to the video similarity detection method, so that the video similarity detection model can be trained and learn autonomously, improving efficiency and accuracy.
Memory 302 may include one or more computer-readable storage media, which may be non-transitory. Memory 302 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 302 is used to store at least one instruction for execution by processor 301 to implement the video similarity detection method provided by the method embodiments herein.
In some embodiments, the terminal may further include: a communication interface 303 and at least one peripheral device. The processor 301, the memory 302 and the communication interface 303 may be connected by a bus or signal lines. Various peripheral devices may be connected to communication interface 303 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 304, a display screen 305, and a power source 306.
The communication interface 303 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 301 and the memory 302. In some embodiments, processor 301, memory 302, and communication interface 303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 301, the memory 302 and the communication interface 303 may be implemented on a single chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 304 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 304 communicates with communication networks and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 304 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 304 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 304 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 305 is a touch display screen, it also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 301 as a control signal for processing. At this point, the display screen 305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 305, disposed on the front panel of the electronic device; in other embodiments, there may be at least two display screens 305, respectively disposed on different surfaces of the electronic device or in a folded design; in still other embodiments, the display screen 305 may be a flexible display screen disposed on a curved surface or a folded surface of the electronic device. The display screen 305 may even be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The display screen 305 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The power supply 306 is used to power various components in the electronic device. The power source 306 may be alternating current, direct current, disposable or rechargeable. When the power source 306 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the terminal device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, on which a video similarity detection program is stored; when executed by a processor, the video similarity detection program implements the steps of the video similarity detection method described above, so a detailed description and the corresponding beneficial effects are omitted here. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, reference is made to the description of the method embodiments of the present application. By way of example, the program instructions may be deployed to be executed on one terminal device, on multiple terminal devices located at one site, or on multiple terminal devices distributed across multiple sites and interconnected by a communication network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The computer-readable storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
At present, existing video similarity detection methods include the following:
1. Basic information comparison: detection is performed only through basic information of the video; for example, a similarity detection result is obtained jointly from one or more kinds of information such as the title, description, cover picture, duration, and resolution.
2. MD5 comparison: the MD5 of each video file is computed, and the MD5 values are compared to obtain a video similarity detection result.
3. Image comparison: several frames are extracted from the head and the tail of each video (all frames cannot be used because the computation cost would be excessive) and the pictures are compared, converting video similarity detection into an image similarity detection problem.
4. Feature point detection: image feature points (such as SIFT) are detected in the frames and used as video features, and feature points of different videos are compared to obtain a video similarity detection result.
5. Deep learning: using the ability of a neural network to encode video features, a unique N-dimensional feature vector is generated for each video, and the feature vectors are compared for similarity to obtain a video similarity detection result.
However, the above-mentioned several techniques have problems:
1. For basic information, although processing is fast and a large amount of data can easily be compared, such information cannot represent the actual content of the video, so the probability of false positives and missed detections is high; moreover, the information is easy to modify, which makes the detection easy to bypass.
2. MD5 detection is effective only for completely identical videos; as soon as a video undergoes minor operations such as editing, transcoding, resolution or frame-rate adjustment, or watermarking, its MD5 changes, so adaptability is poor.
3. For the scheme of extracting frames and then comparing images, video frames extracted from fixed positions cannot carry overall video information and can only indicate the similarity of two videos at those fixed positions.
4. For the feature point scheme, feature points describe picture information very well, which makes it convenient to accurately locate the similar ranges of two videos; however, the large amount of feature point data is difficult to compare across many videos, the time complexity of the algorithm is extremely high, and the requirement of processing videos at large scale in practice cannot be met.
5. For deep learning, the feature vectors extracted by a neural network describe a video well, but because the extraction process captures semantic information, two feature vectors are very similar whenever the two videos contain basically the same kind of content (for example, street dance or news broadcasts). Mixed-cut videos (several videos split into sections and re-edited into one) and intercepted videos (video A is a clip taken from part of video B) cannot be detected effectively and may be judged as non-duplicates, while news broadcasts with different content may be mistakenly judged as duplicates.
Based on the hardware structure, the embodiment of the video similarity detection method is provided.
Referring to fig. 2, fig. 2 is a schematic flowchart of a video similarity detection method according to a first embodiment of the present invention, where the method is used for a terminal device, and the method includes the following steps:
step S11: when the video to be detected is obtained, determining a video frame to be detected in the video to be detected.
It should be noted that the execution main body of the present invention is a terminal device, the terminal device is installed with a video similarity detection program, and the video similarity detection method of the present invention is implemented when the terminal device executes the video similarity detection program. The terminal device refers to the above-described structure, and details are not repeated here.
In a specific application, there may be one video to be detected or multiple videos to be detected. When there is one video to be detected, the video similarity detection method is executed on that video; when there are multiple videos to be detected, the video similarity detection method is executed on each video separately, and detection of the individual videos may be performed in parallel. The embodiment of the present invention is explained with a single video to be detected.
In a specific application, the video to be detected generally includes 45 frames or more per second; processing all video frames would greatly increase the amount of data computation and would not help improve similarity detection efficiency, so the video frames to be detected, namely a subset of the video frames, need to be determined from all the video frames of the video to be detected. Generally, video frames are extracted from the video to be detected at equal time intervals. For example, if the video to be detected includes 90000 frames at 45 frames per second, i.e., the duration of the video to be detected is 2000 s, then the 10th frame of every 10 consecutive frames may be determined as a video frame to be detected, i.e., 9000 frames are determined as video frames to be detected. Preferably, 4 frames are taken every second; the user can set other values as needed.
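By way of illustration only, the equal-interval sampling described above might look like the following sketch (OpenCV is assumed; the 4-frames-per-second rate is the preferred value mentioned above, and all names are illustrative):

```python
import cv2

def sample_frames_to_detect(video_path, samples_per_second=4):
    """Extract video frames to be detected at (roughly) equal time intervals."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0     # fall back if metadata is missing
    step = max(int(round(fps / samples_per_second)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                   # keep every step-th frame
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```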
Step S12: screening out a video segment to be detected from the video to be detected by using the video frame to be detected;
specifically, step S12 includes: determining a hash value of the video frame to be detected; dividing the video to be detected into a plurality of video segments based on the hash value of the video frame to be detected; sorting the plurality of video segments according to time length from high to low to obtain an ordered video segment set; and determining the video segment to be detected at the front end in the ordered video segment set.
The video frames to be detected are arranged according to a preset sequence; the step of dividing the video to be detected into a plurality of video segments based on the hash value of the video frame to be detected comprises: determining a hash difference value of adjacent video frames in the video frames to be detected which are arranged according to the preset sequence based on the hash values of the video frames to be detected which are arranged according to the preset sequence; determining a cutting frame in the video frames to be detected which are arranged according to the preset sequence based on the Hash difference value; and dividing the video to be detected into a plurality of video segments by using the cutting frame.
It should be noted that the video frames to be detected have a time sequence, that is, the preset sequence is the time sequence. Generally, perceptual hashing or mean hashing may be used to calculate the hash values of the video frames to be detected (perceptual hashing is used in the present invention). A video frame to be detected may be converted into an 8 × 8 preprocessed image, and hash value calculation is performed on the preprocessed image to obtain the hash value of the video frame to be detected; the hash value may be a 64-dimensional floating-point vector, that is, the obtained hash value is a hash value vector.
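As a non-authoritative sketch, one common perceptual-hash construction that yields a 64-dimensional vector as described above is the 32 × 32 DCT variant below; the exact preprocessing used by the invention (an 8 × 8 preprocessed image) may differ:

```python
import cv2
import numpy as np

def perceptual_hash_64(frame):
    """64-dimensional perceptual hash vector of one video frame (a sketch)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (32, 32))          # preprocess to a small image
    dct = cv2.dct(np.float32(small))            # 2-D discrete cosine transform
    low = dct[:8, :8]                           # keep the low-frequency 8x8 block
    return (low > np.median(low)).astype(np.float32).flatten()

def hash_difference(h1, h2):
    """Number of differing entries between two 64-dimensional hash vectors."""
    return int(np.sum(np.asarray(h1) != np.asarray(h2)))
```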
The video frames to be detected are arranged in the preset sequence (time sequence), so the corresponding hash values are also arranged in that sequence, and the hash difference value of each pair of adjacent frames is calculated starting from the first video frame to be detected: the 64-dimensional hash value vectors of adjacent frames are compared, and the number of differing entries is taken as the hash difference value. All hash difference values are traversed in the time order of the video frames to be detected, starting from the first frame; when a hash difference value is determined to exceed a preset difference value (preferably 24 in this application), the later frame of the pair is a cutting frame, that frame is taken as the new first frame to be detected, and subsequent hash difference values are computed from it; this is repeated until all hash difference values have been traversed and all cutting frames determined. The video to be detected is then divided into a plurality of video segments using all the cutting frames as cutting points.
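A minimal sketch of this segmentation step, under one reading of the procedure above (adjacent-frame hash differences against the preset difference value 24); hash_difference is repeated from the previous sketch so the block is self-contained:

```python
import numpy as np

def hash_difference(h1, h2):
    """Number of differing entries between two 64-dimensional hash vectors."""
    return int(np.sum(np.asarray(h1) != np.asarray(h2)))

def split_into_segments(frame_hashes, preset_difference=24):
    """Split the sampled frames into video segments at the cutting frames.

    frame_hashes: per-frame hash vectors in time order.
    Returns (start, end) index pairs, end exclusive.
    """
    segments, start = [], 0
    for i in range(1, len(frame_hashes)):
        # a large adjacent-frame difference marks frame i as a cutting frame
        if hash_difference(frame_hashes[i - 1], frame_hashes[i]) > preset_difference:
            segments.append((start, i))
            start = i
    segments.append((start, len(frame_hashes)))
    return segments
```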
Generally, the number of video segments obtained is large, and only some of the longer ones need to be selected; that is, the 10 video segments with the longest duration are selected as the video segments to be detected. The user may choose another number as needed; in the present invention, 10 video segments to be detected is a preferred choice. When there are fewer than 10 video segments, all of them are taken as video segments to be detected; when there are more than 10, the 10 with the longest duration are selected.
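The top-10 selection then reduces to a sort by duration, for example:

```python
def select_segments_to_detect(segments, n=10):
    """Keep the n longest segments; if fewer than n exist, keep them all."""
    return sorted(segments, key=lambda s: s[1] - s[0], reverse=True)[:n]
```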
Step S13: matching a selected preset video segment in a preset video library, which meets a preset condition with respect to the video segment to be detected, with the video segment to be detected to obtain a video pair.
Specifically, before step S13, the method further includes: determining a first average hash value of the video segment to be detected based on the hash values of the video frames to be detected in the video segment to be detected; acquiring a second average hash value of the preset video segments corresponding to each preset video in the preset video library; and establishing a similarity vector pool by using the first average hash value and the second average hash value. Accordingly, step S13 includes: determining a selected preset video segment in the preset video library by using the similarity vector pool; and matching the video segment to be detected with the selected preset video segment to obtain a video pair. The step of determining the selected preset video segment in the preset video library by using the similarity vector pool includes: determining a selected similarity vector whose similarity is greater than or equal to a first preset threshold in the similarity vector pool; and determining the preset video segment in the preset video library that corresponds to the selected similarity vector as the selected preset video segment.
When determining the average hash values (the first average hash value and the second average hash value), a 64-dimensional average hash value vector may be obtained from the 64-dimensional hash value vectors. The preset video library can be determined by the user as needed. All preset videos in the preset video library have already been processed by the video similarity detection method, and all related data (hash values, hash difference values, similarity vector pools, and the like) have been calculated and stored; the corresponding data are read directly when step S13 is performed. Therefore, when performing similarity detection on a video to be detected, only the video to be detected needs to be processed, the data in the preset video library do not need to be recomputed, and a large amount of data processing time is saved.
In a specific application, the preset video library may include a relatively large number of videos, where the videos in the preset video library are the preset videos. Similarity calculation needs to be performed between the preset video segments corresponding to the preset videos in the preset video library and the video segments to be detected, so as to obtain the similarity vector pool. For each video segment to be detected, the similarity vector between its average hash value and the average hash value of a preset video segment is calculated through a similarity computation framework (such as faiss), and the M similarity vectors with the highest values are taken as the initially selected similarity vectors (M is a nonzero natural number; M = 100 is a preferred choice, as in the example below). The similarity vector pool is obtained based on the initially selected similarity vectors of all the video segments to be detected; the similarity vectors may be cosine similarity vectors. Generally, the video to be detected corresponds to N video segments (N is a nonzero natural number; preferably N = 10 in the present invention, as described above), and the number of preset video segments corresponding to each preset video in the preset video library is likewise 10.
It can be understood that the average hash values of the preset video segments in the preset video library are in vector form, and the vectors have spatial relationships. For a video segment to be detected, once its average hash value (vector) is determined, the vector space information of that average hash value can be used to determine, in the preset video library, a number of initially selected preset video segments slightly larger than M; the similarity vectors between the video segment to be detected and these initially selected preset video segments are then calculated, and the top M similarity vectors among them are determined so as to obtain the similarity vector pool.
For example, the preset video library includes 1000 preset videos, each preset video has 10 preset video segments, and one video to be detected has 10 video segments to be detected. For each video segment to be detected, based on the vector space information of the average hash values, 300 initially selected preset video segments are determined from the 10000 (1000 × 10 = 10000) preset video segments; the similarity vectors between the video segment to be detected and these 300 segments are calculated to obtain 300 similarity vectors, and the 100 with the highest values are determined as the initially selected similarity vectors. Finally, 1000 initially selected similarity vectors are determined over all the video segments to be detected, and the similarity vector pool is obtained from these 1000 vectors.
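A minimal sketch of building the similarity vector pool with faiss (named above as an example framework) under the numbers of this example; with a flat index the coarse 300-candidate stage and the top-100 cut collapse into one search, and the field layout is illustrative:

```python
import numpy as np
import faiss

def build_similarity_vector_pool(query_vecs, library_vecs, coarse_k=300, m=100):
    """Return (segment to be detected, preset segment, cosine similarity) triples.

    query_vecs: average hash vectors of the video segments to be detected.
    library_vecs: average hash vectors of all preset video segments.
    """
    lib = np.ascontiguousarray(library_vecs, dtype=np.float32)
    qry = np.ascontiguousarray(query_vecs, dtype=np.float32)
    faiss.normalize_L2(lib)                 # after L2 normalization, the inner
    faiss.normalize_L2(qry)                 # product equals cosine similarity
    index = faiss.IndexFlatIP(lib.shape[1])
    index.add(lib)
    sims, ids = index.search(qry, coarse_k)  # 300 candidates per query segment
    pool = []
    for q in range(qry.shape[0]):
        # results are already sorted by similarity, so keep the top m directly
        for s, i in zip(sims[q][:m], ids[q][:m]):
            pool.append((q, int(i), float(s)))
    return pool
```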
In addition, in the present invention, the first preset threshold may be 0.9; a similarity vector whose similarity is greater than or equal to 0.9 is a selected similarity vector. Each selected similarity vector corresponds to one video segment to be detected and one preset video segment, and the preset video segment corresponding to a selected similarity vector is a selected preset video segment. A pairing relationship is established between each video segment to be detected and its corresponding selected preset video segments to obtain the video pairs.
For example, similarity vectors are calculated between the video segment a to be detected and the preset video segments a1, a2, a3, and a4, and 4 similarities are obtained: the similarity of a and a1 is 0.91, of a and a2 is 0.93, of a and a3 is 0.89, and of a and a4 is 0.90. Based on the first preset threshold of 0.9, the selected preset video segments screened out are a1, a2, and a4, and the obtained video pairs are three: (a, a1), (a, a2), and (a, a4).
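The first-threshold screening of this example amounts to a simple filter, sketched below:

```python
def select_video_pairs(pool, first_preset_threshold=0.9):
    """Keep only the pairs whose similarity meets the first preset threshold."""
    return [entry for entry in pool if entry[2] >= first_preset_threshold]

# The worked example above: segment a against a1..a4
pool = [("a", "a1", 0.91), ("a", "a2", 0.93), ("a", "a3", 0.89), ("a", "a4", 0.90)]
print(select_video_pairs(pool))   # -> (a, a1), (a, a2) and (a, a4)
```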
Step S14: and carrying out similarity detection on the video pair to obtain a similarity detection result of the video to be detected.
Specifically, step S14 includes: dividing the video pairs of which the selected preset video segments belong to the same preset video into a video group to obtain a plurality of video groups; acquiring starting time information of video segments included in each video pair in each video group in the plurality of video groups; determining the starting time difference between the video segment to be detected and the selected preset video segment included in each video pair in each video group based on the starting time information of the video segment included in each video pair in each video group; dividing the video pairs with the same starting time difference in each video group into a sub-video group to obtain a plurality of sub-video groups corresponding to the plurality of video groups respectively; acquiring a first hash difference value of a video segment to be detected included in each video pair in each sub video group in the plurality of sub video groups and a second hash difference value of a selected preset video segment; determining the hash difference similarity between a video segment to be detected and a selected preset video segment in each video pair in each sub-video group based on the first hash difference value and the second hash difference value; determining the area of which the hash difference similarity of each video pair in each sub-video group is greater than or equal to a second preset threshold as a similar area of each video pair in each sub-video group; obtaining similarity detection results corresponding to the plurality of sub-video groups based on the similar regions of the video pairs in each sub-video group; and obtaining a similarity detection result of the video to be detected based on the similarity detection results corresponding to the plurality of sub-video groups.
In addition, before the step of determining the area where the hash difference similarity of each video pair in each sub-video group is greater than or equal to a second preset threshold as the similar area of each video pair in each sub-video group, the method further includes: acquiring a first audio of a video segment to be detected and a second audio of a selected preset video segment in each video pair in the plurality of sub-video groups; determining the audio similarity between a video segment to be detected and a selected preset video segment in each video pair in each sub-video group based on the first audio and the second audio; correspondingly, the step of determining the area in which the hash difference similarity of each video pair in each sub-video group is greater than or equal to a second preset threshold as the similar area of each video pair in each sub-video group includes: and determining the area in which the hash difference similarity of each video pair in each sub-video group is greater than or equal to a second preset threshold and the audio similarity of each video pair in each sub-video group is greater than or equal to a third preset threshold as the similar area of each video pair in each sub-video group.
It should be noted that there are usually a large number of video pairs; the present invention is illustrated with the video pairs obtained from one video to be detected. Since the preset video library usually includes preset video segments corresponding to multiple preset videos, the selected preset video segments in the obtained video pairs may belong to different preset videos. The video pairs whose selected preset video segments belong to the same preset video therefore need to be divided into one video group, so as to obtain multiple video groups.
For example, the preset video library includes preset video segments corresponding to 2 preset videos (B video and C video), and the obtained video pairs may relate to all 2 preset videos, so that the video pairs are divided into two video groups: the video pair in the first video group comprises the video segment of the video to be detected and the selected preset video segment corresponding to the B video, the video pair in the second video group comprises the video segment of the video to be detected and the selected preset video segment corresponding to the C video, the first video group does not comprise the video segment of the C video, and meanwhile, the second video group does not comprise the video segment of the B video.
The video segment to be detected in a video pair and the corresponding selected preset video segment each have start time information, and the start time difference of the video pair is determined from this information. The start time differences of all video pairs in a video group are calculated; the video pairs with the same start time difference are screened out of the video group to form a sub-video group, and the video pairs with differing start time differences are discarded. All video groups are traversed to obtain all sub-video groups.
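A sketch of the two-level grouping (by preset video, then by start time difference); the field names are illustrative, and discarding groups that contain a single pair is one reading of the discarding rule above:

```python
from collections import defaultdict

def group_into_sub_video_groups(video_pairs):
    """Group video pairs by preset video, then by start time difference."""
    sub_groups = defaultdict(list)
    for pair in video_pairs:
        delta = round(pair["query_start"] - pair["preset_start"], 2)
        sub_groups[(pair["preset_video_id"], delta)].append(pair)
    # keep only groups in which at least two pairs share the start time difference
    return {key: pairs for key, pairs in sub_groups.items() if len(pairs) > 1}
```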
The hash difference values obtained in step S12 may be stored; when step S14 is performed, the first hash difference values of the video segment to be detected and the second hash difference values of the selected preset video segment included in each video pair in each sub-video group (both in time order, i.e., ordered by the time of the corresponding video frames) are read directly from the stored hash difference values.
For any video pair in each sub-video group, the start time information of the video segment to be detected is aligned with the start time information of the selected preset video segment, and similarity matching of the hash difference values (which may be cosine similarity) is performed backward from the start, i.e., the hash difference similarity is determined; when the similarity falls below a second preset threshold (0.8 in the present invention), the point below the threshold is taken as a difference point. Likewise, the end time information of the video segment to be detected is aligned with the end time information of the selected preset video segment, and similarity matching of the hash difference values is performed forward from the end; when the similarity falls below the second preset threshold, that point is taken as another difference point. The region between the two difference points is the similar region of the video pair. The matching process can be accelerated by using binary search.
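A sketch of locating the similar region of one aligned video pair; the sliding window over the hash difference sequences is an assumption (cosine similarity of single scalars is not meaningful), and the binary-search acceleration mentioned above is omitted for clarity:

```python
import numpy as np

def cosine(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def similar_region(query_diffs, preset_diffs, threshold=0.8, window=8):
    """Return the (start, end) indices bounding the similar region, or None.

    query_diffs / preset_diffs: hash difference value sequences of the two
    segments after start/end alignment, in time order.
    """
    n = min(len(query_diffs), len(preset_diffs))
    left, right = 0, n
    for i in range(n - window):                # forward scan from the start
        if cosine(query_diffs[i:i + window], preset_diffs[i:i + window]) < threshold:
            left = i                           # first difference point
            break
    for i in range(n, window, -1):             # backward scan from the end
        if cosine(query_diffs[i - window:i], preset_diffs[i - window:i]) < threshold:
            right = i                          # second difference point
            break
    return (left, right) if left < right else None
```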
In addition, referring to the description in the previous paragraph, the audio similarity (which may be cosine similarity) is further determined by using the first audio of the video segment to be detected and the second audio of the selected preset video segment included in the video pair. The audio similarity is matched by the same method; when determining a difference point, the similar region must satisfy two constraints, namely the hash difference similarity being greater than or equal to the second preset threshold and the audio similarity being greater than or equal to a third preset threshold, and a point that fails either constraint is taken as a difference point.
It should be noted that the audios (the first audio and the second audio) are the audio corresponding to the video frames to be detected included in the video pairs of the sub-video groups, not the complete audio. For example, with 4 video frames to be detected per second, the first audio corresponding to the video segment to be detected is the sampled audio obtained by sampling the audio at the times corresponding to those 4 frames in each second.
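Sampling the audio only at the detected-frame timestamps might look like the following sketch (a decoded 1-D PCM array is assumed, and the snippet length is an illustrative value):

```python
import numpy as np

def sample_audio_at_frames(audio, sample_rate, frame_times, snippet_seconds=0.05):
    """Take short audio snippets at the timestamps of the frames to be detected."""
    half = int(snippet_seconds * sample_rate / 2)
    snippets = []
    for t in frame_times:
        center = int(t * sample_rate)
        lo, hi = max(center - half, 0), min(center + half, len(audio))
        snippets.append(np.asarray(audio[lo:hi]))
    return snippets
```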
And determining all similar regions corresponding to the plurality of sub-video groups based on the obtained similar regions of the video pairs in each sub-video group, obtaining similarity detection results corresponding to the plurality of sub-video groups based on all similar regions, and integrating the similarity detection results corresponding to the plurality of sub-video groups, wherein the similarity detection results are the similarity detection results of the video to be detected.
In a specific application, the similarity detection result of the video to be detected may include the similarity of the repeated portion of the video to be detected (the similar regions of the video pairs in the sub-video groups), the proportion of the repeated portion in the video to be detected (the duration ratio of the repeated portion to the video to be detected), and the proportion of the repeated portion in the compared video (the duration ratio of the repeated portion to the comparison video, where the comparison video is a preset video in the preset video library). The similarity detection result may also include the start time point, end time point, and duration of the repeated portion in the original video, as well as the start time point, end time point, and duration of the repeated portion in the comparison video.
Referring to fig. 3, fig. 3 is a schematic diagram of the video similarity detection result of a video to be detected according to the present invention; the left picture shows the video to be detected (original video), and the right picture shows the comparison video (repeated video, a preset video in the preset video library). The annotations in the figure are as follows: (1)-(3) the three values in the corresponding box are, respectively, the similarity of the repeated portion of the video to be detected (the similar region of a video pair in the last sub-video group), the proportion of the repeated portion in the video to be detected (the duration ratio of the repeated portion to the video to be detected), and the proportion of the repeated portion in the compared video (the duration ratio of the repeated portion to the comparison video, where the comparison video is a preset video in the preset video library); (4) the number in the corresponding box represents the similarity of a particular repeated portion; (5) the numbers in the corresponding box are the start time points of a particular repeated portion in the video to be detected and in the comparison video; (6) the numbers in the corresponding box represent the durations of a particular repeated portion in the video to be detected and in the comparison video.
The technical scheme of the invention provides a video similarity detection method, which includes: when a video to be detected is obtained, determining a video frame to be detected in the video to be detected; screening out a video segment to be detected from the video to be detected by using the video frame to be detected; matching a selected preset video segment in a preset video library, which meets a preset condition with respect to the video segment to be detected, with the video segment to be detected to obtain a video pair; and carrying out similarity detection on the video pair to obtain a similarity detection result of the video to be detected.
In the existing video similarity detection method, all video frames of the video to be detected and all video frames of a preset video are obtained and processed to obtain image feature points; the image feature points corresponding to the video to be detected are compared with those corresponding to the preset video, and a similarity detection result is obtained based on the comparison result. Feature point extraction must be performed on every video frame of both the video to be detected and the preset video, so the data processing amount is large, more computation time is consumed, and the similarity detection result is obtained slowly and inefficiently. In the present application, only part of the video frames, namely the video frames to be detected, need to be processed, and only the screened video segments to be detected are processed further, so the data processing amount is greatly reduced and the similarity detection result is obtained quickly. Therefore, the video similarity detection method improves the efficiency of video similarity detection.
In addition, the hash value is used as a basic basis for similarity detection, and a large number of hash values of the video frames can be obtained only by consuming a small amount of calculation, so that the operation time of data processing of the terminal equipment is saved, and the efficiency of similarity detection is improved.
Referring to fig. 4, fig. 4 is a block diagram of a video similarity detection apparatus according to a first embodiment of the present invention, where the apparatus is used in a terminal device, and the apparatus includes:
the acquisition module 10 is configured to determine a video frame to be detected in a video to be detected when the video to be detected is acquired;
the screening module 20 is configured to screen a video segment to be detected from the video to be detected by using the video frame to be detected;
the matching module 30 is configured to match a selected preset video segment in a preset video library, which meets a preset condition with respect to the video segment to be detected, with the video segment to be detected, so as to obtain a video pair;
and the detection module 40 is configured to perform similarity detection on the video pair to obtain a similarity detection result of the video to be detected.
The above description is only an alternative embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.