CN116489449A - Video redundant segment detection method and system


Info

Publication number
CN116489449A
CN116489449A (application CN202310353228.8A)
Authority
CN
China
Prior art keywords
video
audio
target video
target
redundant
Prior art date
Legal status
Pending
Application number
CN202310353228.8A
Other languages
Chinese (zh)
Inventor
刘威 (Liu Wei)
杨昕磊 (Yang Xinlei)
李振华 (Li Zhenhua)
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202310353228.8A
Publication of CN116489449A
Status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439: Processing of audio elementary streams
    • H04N 21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a video redundant segment detection method and system. The method comprises the following steps: extracting audio features from audio clips of a target video; clustering the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result; dividing the target video into a plurality of corresponding video scenes based on the clustering result, and determining scene switching points of the video scenes; and acquiring target video frames corresponding to the scene switching points, and comparing the perceptual hashes of the target video frames with the perceptual hashes of a reference video to determine redundant segments in the target video; wherein the target video and the reference video are associated with each other. The video redundant segment detection method provided by the invention achieves fast and accurate redundant segment detection, effectively improving both the efficiency and the accuracy of video redundant segment detection and thereby improving the user experience.

Description

Video redundant segment detection method and system
Technical Field
The invention relates to the technical field of computer multimedia, and in particular to a method and a system for detecting video redundant segments. In addition, the invention also relates to an electronic device and a processor-readable storage medium.
Background
Web video is an important multimedia application, and its transmission currently accounts for the major part of Internet data traffic. Among Web videos, cross-correlated videos (e.g., the episodes of a television series, documentary, or cartoon) are the most commonly transmitted. For ordinary users, cross-correlated videos are accompanied by a large number of redundant segments, most of which are contained in the opening credits, closing credits, and advertisements. For example, each episode of a television series contains redundant segments, such as the opening or ending, lasting up to several minutes, sometimes even 10% of the episode's duration. Viewers often want to skip these redundant segments for a more continuous viewing experience, while also saving time and network traffic.
Driven by user demand, today's mainstream Web video content providers typically mark redundant segments on the video progress bar ahead of time and provide an automatic skip function during playback. The most straightforward way to mark these redundant segments is manual labeling; however, this imposes significant labor costs on video content providers and is therefore rarely used in practice. Conventional redundant data detection algorithms based on file byte patterns are also inapplicable in this scenario, since video content providers typically use different video coding parameters for different videos to achieve an optimal balance between video quality and file size. For this reason, some video websites attempt to detect redundant segments using computer vision methods; however, these involve training computer vision models, which is costly. Other video websites reduce the overhead by simplifying the model and detecting only the beginning and ending portions of the video, but this reduces detection accuracy and leads to many false or missed marks. Still other video websites attempt to identify redundant segments by comparing audio spectral features using more lightweight audio fingerprinting methods. However, audio-based methods are affected by audio coding, which easily introduces noise and loses key information, making the actual detection accuracy unstable.
Therefore, existing cross-correlated video redundant segment detection methods cannot strike a good balance between detection accuracy and the required computational cost, which is the main reason that the automatic redundant-segment skipping offered by current video content providers does not satisfy users. How to provide a more accurate and efficient video redundant segment detection scheme is therefore an urgent problem to be solved.
Disclosure of Invention
Therefore, the invention provides a video redundant segment detection method and system, to overcome the poor user experience caused by the low efficiency and accuracy of video redundant segment detection and marking schemes in the prior art.
In a first aspect, the present invention provides a method for detecting a video redundant segment, including:
extracting audio features from audio clips of a target video;
clustering the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result; dividing the target video into a plurality of corresponding video scenes based on the clustering result, and determining scene switching points of the video scenes;
acquiring target video frames corresponding to the scene switching points, and comparing the perceptual hashes of the target video frames with the perceptual hashes of a reference video to determine redundant segments in the target video; wherein the target video and the reference video are associated with each other.
Further, before extracting audio features from the audio clips of the target video, the method further includes: determining a reference video from video data to be detected;
determining reference perceptual hashes corresponding to reference video frames in the reference video, and constructing, based on these reference perceptual hashes, a perceptual hash table that completes matching queries within linear time complexity; the perceptual hash table contains the reference perceptual hashes corresponding to the reference video frames, where the reference video frames include the video frames of redundant segments in the reference video.
Further, extracting audio features from the audio clips of the target video specifically includes: determining the target video from the video data to be detected;
dividing the target video into a plurality of video clips at a preset time interval, and extracting the corresponding audio features from the audio clips corresponding to the video clips; the audio features include frequency domain features and time domain features of the audio data.
Further, clustering the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result specifically includes:
determining the Ward distance between every two adjacent audio clips based on the audio features, and merging the two adjacent audio clips with the smallest Ward distance, to obtain the clustering result.
Further, before comparing the perceptual hashes of the target video frames with the perceptual hashes of the reference video to determine the redundant segments in the target video, the method further comprises:
determining the perceptual hashes of the target video frames; and modeling the video scenes into a corresponding scene division tree according to the scene switching points of the video scenes;
comparing the perceptual hashes of the target video frames with the perceptual hashes of the reference video to determine the redundant segments in the target video specifically includes:
taking the perceptual hash of a target video frame as an index, searching the perceptual hash table of the reference video from coarse granularity to fine granularity, guided by the scene division tree, for a reference perceptual hash matching the index, and determining the redundant segments in the target video based on the matching reference perceptual hash.
Further, dividing the target video into a plurality of video scenes based on the clustering result and determining the scene switching points of the video scenes specifically includes:
dividing the target video into a plurality of corresponding video scenes according to the cluster boundaries between adjacent audio clips in the clustering result, and determining those cluster boundaries as the scene switching points of the video scenes.
Further, acquiring the target video frames corresponding to the scene switching points specifically includes: determining the number of video frames to extract at each scene switching point;
and acquiring that number of video frames from before and after the scene switching point, respectively, as the target video frames.
In a second aspect, the present invention further provides a video redundant segment detection system, including:
an audio feature extraction module used for extracting audio features from the audio clips of a target video;
a scene division module used for clustering the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result; dividing the target video into a plurality of corresponding video scenes based on the clustering result, and determining scene switching points of the video scenes;
a video comparison module used for acquiring the target video frames corresponding to the scene switching points, comparing the perceptual hashes of the target video frames with the perceptual hashes of a reference video, and determining the redundant segments in the target video; wherein the target video and the reference video are associated with each other.
Further, the system also includes a video frame hash calculation module, configured, before audio features are extracted from the audio clips of the target video, to determine the reference video from the video data to be detected; determine reference perceptual hashes corresponding to reference video frames in the reference video, and construct, based on these reference perceptual hashes, a perceptual hash table that completes matching queries within linear time complexity; the perceptual hash table contains the reference perceptual hashes corresponding to the reference video frames, where the reference video frames include the video frames of redundant segments in the reference video.
Further, the audio feature extraction module is specifically configured to: determine the target video from the video data to be detected; divide the target video into a plurality of video clips at a preset time interval, and extract the corresponding audio features from the audio clips corresponding to the video clips; the audio features include frequency domain features and time domain features of the audio data.
Further, the scene division module is specifically configured to:
determine the Ward distance between every two adjacent audio clips based on the audio features, and merge the two adjacent audio clips with the smallest Ward distance, to obtain the clustering result.
Further, the scene division module is specifically configured to:
divide the target video into a plurality of corresponding video scenes according to the cluster boundaries between adjacent audio clips in the clustering result, and determine those cluster boundaries as the scene switching points of the video scenes.
Further, the video comparison module is specifically configured to: determine the number of video frames to extract at each scene switching point;
and acquire that number of video frames from before and after the scene switching point, respectively, as the target video frames.
Further, before the perceptual hashes of the target video frames are compared with the perceptual hashes of the reference video to determine the redundant segments in the target video:
the video frame hash calculation module is also used to determine the perceptual hashes of the target video frames; and the scene division module is further used to model the video scenes into a corresponding scene division tree according to the scene switching points of the video scenes;
the video comparison module is specifically used for:
taking the perceptual hash of a target video frame as an index, searching the perceptual hash table of the reference video from coarse granularity to fine granularity, guided by the scene division tree, for a reference perceptual hash matching the index, and determining the redundant segments in the target video based on the matching reference perceptual hash.
In a third aspect, the present invention also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the video redundant segment detection method described in any one of the above when executing the computer program.
In a fourth aspect, the present invention also provides a processor-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video redundant segment detection method described in any one of the above.
According to the video redundant segment detection method provided by the invention, audio features are extracted from the audio clips of a target video; the audio clips are clustered under a temporal-order constraint based on the audio features to obtain a clustering result; the target video is divided into a plurality of corresponding video scenes based on the clustering result, and the scene switching points of the video scenes are determined; target video frames corresponding to the scene switching points are acquired, and the perceptual hashes of the target video frames are compared with those of a reference video to determine the redundant segments in the target video. The method achieves fast and accurate redundant segment detection and effectively improves both the efficiency and the accuracy of video redundant segment detection.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show some embodiments of the present invention, and that a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a video redundant segment detection method according to an embodiment of the present invention;
FIG. 2 is a complete flowchart of a method for detecting video redundant segments according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of time-constrained hierarchical clustering provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a process for searching for redundant segments according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a video redundant segment detection system according to an embodiment of the present invention;
fig. 6 is a schematic diagram of an entity structure of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which are derived by a person skilled in the art from the embodiments according to the invention without creative efforts, fall within the protection scope of the invention.
The invention provides an audio-driven, high-performance redundant segment detection method for cross-correlated Web videos. The video scene is a very important concept in video authoring: it represents a series of logically related, sequentially presented plot events. In practice, openings, endings, and advertisements are inserted as complete units into the video content after authoring, so that such redundant data consists of one or more complete scenes and never starts or ends inside a scene. Starting from the video scene, the invention uses lightweight audio information to roughly locate scene switches, and compares only the video frames near the scene switching points, so as to quickly and accurately locate the start and end positions of redundant segments, thereby avoiding the inefficient processing of all video frames of the video data.
The following describes embodiments of the video redundant segment detection method according to the present invention in detail. As shown in fig. 1, which is a schematic flow chart of a video redundant segment detection method according to an embodiment of the present invention, the specific implementation process includes the following steps:
Step 101: audio features are extracted from the audio clips of a target video.
In the embodiment of the invention, before this step is executed, a reference video must be determined in advance from the video data to be detected, the reference perceptual hashes corresponding to the reference video frames in the reference video are calculated, and a perceptual hash table that completes matching queries within linear time complexity is constructed from them. The perceptual hash table contains the reference perceptual hashes corresponding to the reference video frames, where the reference video frames include the video frames of redundant segments in the reference video. For example, before detecting redundant segments, one episode of the whole set of cross-correlated videos (for example, the first episode of a television series) is arbitrarily selected as the reference video, and the remaining videos serve as target videos (i.e., query videos), that is, the video data other than the reference video determined from the video data to be detected. When redundant segments are detected, video-frame comparison between each target video and the reference video is conveniently performed under the guidance of the audio scene switching information, so that the redundant segments are detected. Cross-correlated videos are video data with an association relationship, such as the episodes of a complete television series or documentary. In the implementation of this step, the target video can be divided into a plurality of video clips at a preset time interval, and the corresponding audio features are extracted from the audio clips corresponding to the video clips; the audio features include frequency domain features and time domain features of the audio data.
That is, in the embodiment of the present invention, all video frames of the reference video are first decoded, the perceptual hash (i.e., the perceptual hash value or perceptual hash feature) of each decoded video frame is calculated, and finally a 256-bit perceptual hash is output for each frame of the reference video. After the 256-bit perceptual hashes of all video frames are generated, multi-index hashing is used to build a perceptual hash table in which queries complete within sub-linear time complexity. Then, audio features are extracted from the audio data of the target video: the target video can be divided into video clips at a time interval of, for example, 180 milliseconds, and the frequency domain and time domain features of the audio data are then extracted from the audio clips of these video clips. The frequency domain features mainly comprise the Mel-frequency cepstral coefficients (MFCCs) and the spectral centroid, and the time domain features mainly comprise the zero-crossing rate and the root-mean-square (RMS) energy. Extracting these acoustic features from the audio data is very lightweight, yet provides a large amount of useful guiding information for the subsequent detection of video redundant segments; segmenting the scenes of the video data by its acoustic features effectively improves detection efficiency.
As shown in fig. 2, for a given set of cross-correlated videos (e.g., a television series or documentary), the present invention first selects one episode as the reference video. The other episodes serve as target videos, and the redundant video segments are searched for against the reference video. After the reference video is selected, all of its video frames are decoded and the perceptual hash is calculated for every decoded frame. Specifically, a given frame image is first reduced to a 16×16 grayscale image using fast bilinear interpolation; a discrete cosine transform is then applied to the 16×16 grayscale image to extract visual features, yielding a 16×16 frequency domain feature matrix. Unlike conventional image hashing, this step of the invention does not discard the high-frequency components but retains all frequency domain components. This is because the difference between adjacent video frames is very small and is reflected in the high-frequency components of the discrete cosine transform; if these were discarded, adjacent frames would be difficult to distinguish. The invention then generates the perceptual hash H from the 16×16 frequency domain feature matrix F of the video frame based on the following formula:
H(i, j) = 1 if F(i, j) > A, and H(i, j) = 0 otherwise,
where i and j are the row and column indices of the 16×16 frequency domain feature matrix F and of the 16×16 perceptual hash H extracted from the video frame; that is, F(i, j) is the component in the i-th row and j-th column of the frequency domain feature matrix, and H(i, j) is the component in the i-th row and j-th column of the perceptual hash. A is the average of all 256 frequency domain components of the matrix F.
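For illustration, the following is a minimal Python sketch of this hashing step, assuming OpenCV and SciPy are available; the function name and helper choices are ours, not from the patent:

```python
import cv2
import numpy as np
from scipy.fftpack import dct

def perceptual_hash(frame_bgr: np.ndarray) -> np.ndarray:
    """Hedged sketch: 256-bit perceptual hash of one decoded frame."""
    # Reduce the frame to a 16x16 grayscale image (bilinear interpolation).
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (16, 16), interpolation=cv2.INTER_LINEAR)

    # 2-D discrete cosine transform; keep ALL 256 frequency components,
    # including the high frequencies that distinguish adjacent frames.
    F = dct(dct(small.astype(np.float64), axis=0, norm='ortho'),
            axis=1, norm='ortho')

    # H(i, j) = 1 if F(i, j) > A, where A is the mean of all components.
    A = F.mean()
    return (F > A).astype(np.uint8).flatten()  # 0-1 vector of length 256
```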
Thus, the perceptual hash H of a video frame is a 0-1 vector of length 256. After the perceptual hash of every video frame of the reference video has been calculated, the invention organizes the hash values using multi-index hashing to construct the perceptual hash table, thereby providing sub-linear time complexity for the video frame comparisons in the subsequent steps.
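A minimal sketch of such a multi-index hash table follows, assuming the 256-bit hash is split into sixteen 16-bit substrings (the segment count, class name, and query radius are illustrative assumptions, not values fixed by the patent); by the pigeonhole principle, two hashes within Hamming distance r must agree exactly on at least one substring whenever r is smaller than the number of substrings:

```python
from collections import defaultdict

class MultiIndexHashTable:
    """Hedged sketch: index 256-bit perceptual hashes (as Python ints)
    in m disjoint-substring tables for sub-linear Hamming lookups."""

    def __init__(self, n_segments: int = 16):
        self.m = n_segments
        self.width = 256 // n_segments          # bits per substring
        self.tables = [defaultdict(list) for _ in range(n_segments)]
        self.entries = []                       # (full code, frame id)

    def _substrings(self, code: int):
        mask = (1 << self.width) - 1
        return [(code >> (k * self.width)) & mask for k in range(self.m)]

    def insert(self, code: int, frame_id) -> None:
        idx = len(self.entries)
        self.entries.append((code, frame_id))
        for table, sub in zip(self.tables, self._substrings(code)):
            table[sub].append(idx)

    def query(self, code: int, radius: int = 10):
        """Frame ids whose hash is within `radius` bits of `code`.
        Complete as long as radius < number of substrings."""
        candidates = set()
        for table, sub in zip(self.tables, self._substrings(code)):
            candidates.update(table.get(sub, ()))
        return [fid for stored, fid in (self.entries[i] for i in candidates)
                if bin(stored ^ code).count('1') <= radius]
```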
Step 102: clustering the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result; dividing the target video into a plurality of corresponding video scenes based on the clustering result, and determining the scene switching points of the video scenes.
In the implementation of this step, the audio features (i.e., acoustic features) of the extracted audio clips are obtained, and hierarchical clustering constrained by temporal order is performed on the audio clips based on these features, so as to divide the corresponding video scenes and detect the possible scene switching points (i.e., scene switching time points) of the target video. Specifically, the Ward distance between every two adjacent audio clips can be determined based on the audio features, and the two adjacent clips with the smallest Ward distance are merged, yielding the clustering result. The target video is then divided into a plurality of corresponding video scenes according to the cluster boundaries between adjacent audio clips in the clustering result, and those cluster boundaries are determined as the scene switching points of the video scenes.
In an actual implementation, for each target video, audio features are first extracted from that video's audio data. Specifically, this step extracts the Mel-frequency cepstral coefficients, spectral centroid, zero-crossing rate, and root-mean-square energy from the audio data within a 180-millisecond sliding window whose step size is 90 milliseconds. That is, adjacent 180-millisecond windows overlap by 90 milliseconds, which allows finer-grained acoustic features to be extracted. The invention takes the first 13 Mel-frequency cepstral coefficients and combines them with the spectral centroid, zero-crossing rate, and root-mean-square energy, finally obtaining a 16-dimensional feature vector for each 180-millisecond audio clip.
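As a sketch of this feature extraction under the stated window parameters (librosa is our assumed toolkit; the patent does not name a library):

```python
import numpy as np
import librosa

def audio_features(path: str, win_s: float = 0.180, hop_s: float = 0.090):
    """Hedged sketch: one 16-dim vector (13 MFCCs + spectral centroid +
    zero-crossing rate + RMS energy) per 180 ms window, 90 ms step."""
    y, sr = librosa.load(path, sr=None, mono=True)
    n_fft = int(win_s * sr)        # 180 ms analysis window
    hop = int(hop_s * sr)          # 90 ms step -> 50% overlap

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)        # (13, T)
    centroid = librosa.feature.spectral_centroid(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop)                    # (1, T)
    zcr = librosa.feature.zero_crossing_rate(
        y, frame_length=n_fft, hop_length=hop)                      # (1, T)
    rms = librosa.feature.rms(y=y, frame_length=n_fft,
                              hop_length=hop)                       # (1, T)

    T = min(f.shape[1] for f in (mfcc, centroid, zcr, rms))
    return np.vstack([mfcc[:, :T], centroid[:, :T],
                      zcr[:, :T], rms[:, :T]]).T                    # (T, 16)
```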
Time-constrained hierarchical clustering is then performed on all audio features of the target video, so that the scene division information is obtained by clustering. Fig. 3 illustrates the process of time-constrained hierarchical clustering of the audio clips. At the beginning of this process, all audio clips are unclustered. In each iteration of the algorithm, this step calculates the Ward distance between every two adjacent audio clips (or clusters merged in previous iterations) and merges the two adjacent clusters with the smallest Ward distance. For two adjacent sets of audio clips A and B, the Ward distance between them, i.e., the increase in the within-cluster sum of squares caused by merging them, is calculated as follows:
D(A, B) = Σ_{x∈A∪B} ||x − μ_{A∪B}||² − Σ_{x∈A} ||x − μ_A||² − Σ_{x∈B} ||x − μ_B||²,
where μ denotes the centroid of all audio feature vectors in the respective set, and x denotes an audio clip (feature vector) in that set. Accordingly, the boundary between every two adjacent clusters can be regarded as a candidate scene switching point. As shown in fig. 3, the boundary between audio clip 6 and audio clip 7 can be regarded as a scene cut that bisects the video; continuing to subdivide downward, these scenes can be divided into sub-scenes, e.g., scenes 1-6 can be subdivided into sub-scenes 1-3 and 4-6, and scenes 7-16 into sub-scenes 7-11 and 12-16. This partitioning process models the video scenes as a scene division tree.
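The following sketch illustrates the temporal-order constraint, assuming the features from the extraction step above; the stopping rule (a fixed scene count) is our simplification, since the patent leaves termination to the later coarse-to-fine search stage:

```python
import numpy as np

def time_constrained_ward(features: np.ndarray, n_scenes: int):
    """Hedged sketch: agglomerative clustering in which only temporally
    ADJACENT clusters may merge. `features` is (T, 16); returns the
    indices of the surviving boundaries, i.e. scene switching points."""
    clusters = [[i] for i in range(len(features))]  # contiguous runs

    def ward(a: list, b: list) -> float:
        # Increase in within-cluster sum of squares if a and b merge,
        # computed via the equivalent centroid-distance form.
        d = features[a].mean(axis=0) - features[b].mean(axis=0)
        return len(a) * len(b) / (len(a) + len(b)) * float(d @ d)

    while len(clusters) > n_scenes:
        # Candidates are adjacent pairs only: this is the time constraint.
        dists = [ward(clusters[i], clusters[i + 1])
                 for i in range(len(clusters) - 1)]
        i = int(np.argmin(dists))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]

    return [c[0] for c in clusters[1:]]  # boundaries between clusters
```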
Step 103: acquiring target video frames corresponding to the scene switching points, and comparing the perceptual hashes of the target video frames with the perceptual hashes of the reference video to determine the redundant segments in the target video; wherein the target video and the reference video are associated with each other.
In the implementation of this step, the number of video frames to extract at each scene switching point is first determined, and that number of video frames is acquired from before and after the scene switching point, respectively, as the target video frames. All target videos are traversed; for each target video, the video frames nearest each scene switching time point (obtained by clustering the audio features in the previous step) are extracted step by step, in the determined number, as the target video frames, which are then decoded and for which the 256-bit perceptual hashes are calculated.
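A minimal sketch of grabbing frames on either side of a switching point follows; OpenCV's millisecond seek is only approximately frame-accurate, and the frame count and spacing below are illustrative assumptions (the embodiment described later prefers decoding the nearest key frames):

```python
import cv2

def frames_around(path: str, switch_s: float, n: int = 3,
                  spacing_s: float = 0.5):
    """Hedged sketch: grab n frames before and n frames after a scene
    switching point (in seconds) for subsequent hashing."""
    cap = cv2.VideoCapture(path)
    frames = []
    for k in range(-n, n + 1):
        if k == 0:
            continue
        t = max(0.0, switch_s + k * spacing_s)
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)
        ok, frame = cap.read()
        if ok:
            frames.append((t, frame))
    cap.release()
    return frames
```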
It should be noted that, before the perceptual hashes of the target video frames are compared with the perceptual hashes of the reference video to determine the redundant segments in the target video, the perceptual hashes of the target video frames must be determined in advance, and the video scenes must be modeled as a corresponding scene division tree according to the scene switching points. Then, taking the perceptual hash of a target video frame as an index, the perceptual hash table of the reference video is searched from coarse granularity to fine granularity, guided by the scene division tree, for a reference perceptual hash matching the index, and the redundant segments in the target video are determined based on the matching reference perceptual hash. That is, proceeding from coarse to fine granularity, the video frames decoded near the scene switching time points are looked up in the perceptual hash table of the reference video to finally determine the boundary positions of the redundant segments; this avoids comparing unnecessary video frames inside the video scenes and greatly improves detection efficiency.
Based on the scene division tree generated by hierarchically clustering the target video, this step can search the target video, top-down and from coarse to fine granularity, for video segments that are redundant with respect to the reference video. As shown in fig. 4, starting from the root node of the scene division tree, each node divides one video segment into two scenes. This step extracts video frames near the scene cut point of the target video (typically the nearest n key frames, since key frames can be decoded independently of other video frames; here n denotes a specific number of video key frames), calculates their perceptual hash values, and looks them up in the multi-index hash table of the reference video to determine whether the same video frame exists there (as in (1) of fig. 4). If so, the target video may have a redundant segment there; the frame is marked accordingly and a redundancy flag bit is set (indicating that one endpoint of a redundant segment, its start or end, has been found and the other endpoint must still be located so as to frame the segment), and the recursive search of the scene division tree continues (as in (2) of fig. 4). If the reference video contains no video frame matching the scene switching point of the target video (as in (2) of fig. 4), the search proceeds over the subtree of the scene division tree toward the video frames of the target video in the direction of the redundancy flag bit, narrowing the range until a frame is again found to be identical, which finally determines the position of the other end of the redundant segment (as in (3) of fig. 4). The redundancy flag bit is then cleared, and no further recursive downward search is performed from the current scene division tree node. As can be seen from fig. 4, this process skips most video frames and decodes and compares only those at the scene cut points, saving a significant portion of the video data processing overhead. It should be noted that if no matching video frame is found during the search, the algorithm recurses downward until it converges to a granularity at which the scene can no longer be divided, that is, until the duration of the video segment represented by the current scene division tree node is less than a certain time interval or threshold, for example 1 second. When no more scene division tree nodes can be searched, the redundancy detection process for this episode's target video ends, and the start and end positions of the redundant segments are output. The process then returns to step 101 to search and compare the next episode, and execution ends once the redundant segments of all target videos have been detected.
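The traversal can be sketched as follows, reusing the helpers above; the node attributes (switch_s, span_s, left, right) are illustrative, and the flag-driven endpoint pairing and subtree pruning described above are compressed into a comment for brevity:

```python
import numpy as np

def hash_to_int(bits: np.ndarray) -> int:
    # Pack the 256-entry 0-1 vector into one 256-bit integer key.
    return int.from_bytes(np.packbits(bits).tobytes(), 'big')

def find_matching_cuts(node, ref_table, target_path, matches,
                       min_span_s=1.0):
    """Hedged, simplified sketch of the coarse-to-fine search of fig. 4:
    visit scene-division-tree nodes top-down, hash only the frames near
    each scene switching point, and record which cut points also occur
    in the reference video."""
    if node is None or node.span_s < min_span_s:
        return  # converged: this scene can no longer be divided
    hit = any(ref_table.query(hash_to_int(perceptual_hash(f)), radius=10)
              for _, f in frames_around(target_path, node.switch_s))
    matches[node.switch_s] = hit
    # In the full algorithm, a transition of `hit` sets or clears the
    # redundancy flag bit, only the subtree toward the open endpoint is
    # searched, and a node is pruned once both endpoints of a redundant
    # segment are framed; both children are visited here for clarity.
    find_matching_cuts(node.left, ref_table, target_path, matches, min_span_s)
    find_matching_cuts(node.right, ref_table, target_path, matches, min_span_s)
```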
The invention innovatively uses acoustically divided scene information to guide the comparison of redundant segments across cross-correlated videos, and further accelerates the video frame comparison with multi-index hashing. Compared with conventional computer vision methods, it greatly reduces the required computational cost; compared with conventional audio fingerprinting methods, it greatly improves comparison precision. In practice, the precision of redundant segment detection reaches 98%, the recall reaches 93%, and the detection speed is on average 9.3 times that of a computer vision method. The method can be deployed in the backend of a video website to help video content providers mark redundant segments before videos are released, and it can also be seamlessly integrated into a browser plug-in, helping users detect redundant content in real time and display corresponding prompts when they watch cross-correlated videos whose redundant segments have not yet been marked.
According to the video redundant segment detection method described above, audio features are extracted from the audio clips of a target video; the audio clips are clustered under a temporal-order constraint based on the audio features to obtain a clustering result; the target video is divided into a plurality of corresponding video scenes based on the clustering result, and the scene switching points of the video scenes are determined; target video frames corresponding to the scene switching points are acquired, and the perceptual hashes of the target video frames are compared with those of the reference video to determine the redundant segments in the target video. The method achieves fast and accurate redundant segment detection and effectively improves both the efficiency and the accuracy of video redundant segment detection.
Corresponding to the video redundant segment detection method provided above, the invention also provides a video redundant segment detection system. Since the embodiments of the system are similar to the method embodiments described above, the description is relatively brief; reference may be made to the description of the method embodiments, and the embodiments of the video redundant segment detection system described below are merely illustrative. Fig. 5 is a schematic structural diagram of a video redundant segment detection system according to an embodiment of the present invention.
The video redundant segment detection system of the invention specifically comprises the following parts:
an audio feature extraction module 501, configured to extract audio features from the audio clips of a target video;
a scene division module 502, configured to cluster the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result; divide the target video into a plurality of corresponding video scenes based on the clustering result; and determine the scene switching points of the video scenes;
a video comparison module 503, configured to acquire the target video frames corresponding to the scene switching points, compare the perceptual hashes of the target video frames with the perceptual hashes of the reference video, and determine the redundant segments in the target video; wherein the target video and the reference video are associated with each other.
Further, the system also includes a video frame hash calculation module, configured, before audio features are extracted from the audio clips of the target video, to determine the reference video from the video data to be detected; determine reference perceptual hashes corresponding to reference video frames in the reference video, and construct, based on these reference perceptual hashes, a perceptual hash table that completes matching queries within linear time complexity; the perceptual hash table contains the reference perceptual hashes corresponding to the reference video frames, where the reference video frames include the video frames of redundant segments in the reference video.
Further, the audio feature extraction module is specifically configured to: determine the target video from the video data to be detected; divide the target video into a plurality of video clips at a preset time interval, and extract the corresponding audio features from the audio clips corresponding to the video clips; the audio features include frequency domain features and time domain features of the audio data.
Further, the scene division module is specifically configured to:
determine the Ward distance between every two adjacent audio clips based on the audio features, and merge the two adjacent audio clips with the smallest Ward distance, to obtain the clustering result.
Further, the scene division module is specifically configured to:
divide the target video into a plurality of corresponding video scenes according to the cluster boundaries between adjacent audio clips in the clustering result, and determine those cluster boundaries as the scene switching points of the video scenes.
Further, the video comparison module is specifically configured to: determine the number of video frames to extract at each scene switching point;
and acquire that number of video frames from before and after the scene switching point, respectively, as the target video frames.
Further, before the perceptual hashes of the target video frames are compared with the perceptual hashes of the reference video to determine the redundant segments in the target video:
the video frame hash calculation module is also used to determine the perceptual hashes of the target video frames; and the scene division module is further used to model the video scenes into a corresponding scene division tree according to the scene switching points of the video scenes;
the video comparison module is specifically used for:
taking the perceptual hash of a target video frame as an index, searching the perceptual hash table of the reference video from coarse granularity to fine granularity, guided by the scene division tree, for a reference perceptual hash matching the index, and determining the redundant segments in the target video based on the matching reference perceptual hash.
According to the video redundant segment detection system described above, audio features are extracted from the audio clips of a target video; the audio clips are clustered under a temporal-order constraint based on the audio features to obtain a clustering result; the target video is divided into a plurality of corresponding video scenes based on the clustering result, and the scene switching points of the video scenes are determined; target video frames corresponding to the scene switching points are acquired, and the perceptual hashes of the target video frames are compared with those of the reference video to determine the redundant segments in the target video. The system achieves fast and accurate redundant segment detection and effectively improves both the efficiency and the accuracy of video redundant segment detection.
The invention further provides an electronic device corresponding to the video redundant segment detection method or system described above. Since the embodiments of the electronic device are similar to the method embodiments described above, the description is relatively brief; reference may be made to the description of the method embodiments, and the electronic device described below is merely illustrative. Fig. 6 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. The electronic device may include: a processor (processor) 601, a memory (memory) 602, and a communication bus 603, wherein the processor 601 and the memory 602 communicate with each other over the communication bus 603 and with the outside through a communication interface 604. The processor 601 may call logic instructions in the memory 602 to perform a video redundant segment detection method comprising: extracting audio features from audio clips of a target video; clustering the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result; dividing the target video into a plurality of corresponding video scenes based on the clustering result, and determining the scene switching points of the video scenes; acquiring target video frames corresponding to the scene switching points, and comparing the perceptual hashes of the target video frames with the perceptual hashes of a reference video to determine the redundant segments in the target video; wherein the target video and the reference video are associated with each other.
Furthermore, the logic instructions in the memory 602 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a memory chip, a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, embodiments of the present invention further provide a computer program product comprising a computer program stored on a processor-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to execute the video redundant segment detection method provided in the foregoing method embodiments. The method comprises: extracting audio features from audio clips of a target video; clustering the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result; dividing the target video into a plurality of corresponding video scenes based on the clustering result, and determining the scene switching points of the video scenes; and acquiring target video frames corresponding to the scene switching points, and comparing the perceptual hashes of the target video frames with the perceptual hashes of a reference video to determine the redundant segments in the target video; wherein the target video and the reference video are associated with each other.
In yet another aspect, embodiments of the present invention further provide a processor-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the video redundant segment detection method provided in the foregoing embodiments. The method comprises: extracting audio features from audio clips of a target video; clustering the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result; dividing the target video into a plurality of corresponding video scenes based on the clustering result, and determining the scene switching points of the video scenes; and acquiring target video frames corresponding to the scene switching points, and comparing the perceptual hashes of the target video frames with the perceptual hashes of a reference video to determine the redundant segments in the target video; wherein the target video and the reference video are associated with each other.
The processor-readable storage medium may be any available medium or data storage device that can be accessed by a processor, including, but not limited to, magnetic storage (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MO), etc.), optical storage (e.g., CD, DVD, BD, HVD, etc.), and semiconductor storage (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid-state drives (SSD)), and the like.
The system embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for detecting redundant segments of video, comprising:
extracting audio features from audio clips of a target video;
clustering the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result; dividing the target video into a plurality of corresponding video scenes based on the clustering result, and determining scene switching points of the video scenes;
acquiring target video frames corresponding to the scene switching points, and comparing the perceptual hashes of the target video frames with the perceptual hashes of a reference video to determine redundant segments in the target video; wherein the target video and the reference video are associated with each other.
2. The video redundant segment detection method of claim 1, further comprising, before extracting audio features from the audio clips of the target video:
determining a reference video from video data to be detected;
determining reference perceptual hashes corresponding to reference video frames in the reference video, and constructing, based on these reference perceptual hashes, a perceptual hash table that completes matching queries within linear time complexity; the perceptual hash table contains the reference perceptual hashes corresponding to the reference video frames, where the reference video frames include the video frames of redundant segments in the reference video.
3. The video redundant segment detection method of claim 1, wherein extracting audio features from the audio clips of the target video specifically comprises:
determining a target video from video data to be detected;
dividing the target video into a plurality of video clips at a preset time interval, and extracting the corresponding audio features from the audio clips corresponding to the video clips; the audio features include frequency domain features and time domain features of the audio data.
4. The video redundant segment detection method of claim 1, wherein clustering the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result specifically comprises:
determining the Ward distance between every two adjacent audio clips based on the audio features, and merging the two adjacent audio clips with the smallest Ward distance, to obtain the clustering result.
5. The video redundant segment detection method of claim 1, further comprising, before comparing the perceptual hashes of the target video frames with the perceptual hashes of the reference video to determine the redundant segments in the target video:
determining the perceptual hashes of the target video frames; and modeling the video scenes into a corresponding scene division tree according to the scene switching points of the video scenes;
wherein comparing the perceptual hashes of the target video frames with the perceptual hashes of the reference video to determine the redundant segments in the target video specifically comprises:
taking the perceptual hash of a target video frame as an index, searching the perceptual hash table of the reference video from coarse granularity to fine granularity, guided by the scene division tree, for a reference perceptual hash matching the index, and determining the redundant segments in the target video based on the matching reference perceptual hash.
6. The video redundant segment detection method of claim 1, wherein dividing the target video into a plurality of video scenes based on the clustering result and determining the scene switching points of the video scenes specifically comprises:
dividing the target video into a plurality of corresponding video scenes according to the cluster boundaries between adjacent audio clips in the clustering result, and determining those cluster boundaries as the scene switching points of the video scenes.
7. The video redundant segment detection method of claim 1, wherein acquiring the target video frames corresponding to the scene switching points specifically comprises:
determining the number of video frames to extract at each scene switching point;
and acquiring that number of video frames from before and after the scene switching point, respectively, as the target video frames.
8. A video redundant segment detection system, comprising:
an audio feature extraction module used for extracting audio features from the audio clips of a target video;
a scene division module used for clustering the audio clips under a temporal-order constraint based on the audio features to obtain a clustering result; dividing the target video into a plurality of corresponding video scenes based on the clustering result, and determining scene switching points of the video scenes;
a video comparison module used for acquiring the target video frames corresponding to the scene switching points, comparing the perceptual hashes of the target video frames with the perceptual hashes of a reference video, and determining the redundant segments in the target video; wherein the target video and the reference video are associated with each other.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the video redundant segment detection method of any one of claims 1 to 7 when executing the computer program.
10. A processor-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the video redundant segment detection method of any one of claims 1 to 7.
CN202310353228.8A, filed 2023-04-04, priority date 2023-04-04: Video redundant segment detection method and system. Status: Pending. Publication: CN116489449A (en).

Priority Applications (1)

CN202310353228.8A (priority date 2023-04-04, filing date 2023-04-04): Video redundant segment detection method and system; published as CN116489449A (en)

Applications Claiming Priority (1)

CN202310353228.8A (priority date 2023-04-04, filing date 2023-04-04): Video redundant segment detection method and system; published as CN116489449A (en)

Publications (1)

Publication Number: CN116489449A; Publication Date: 2023-07-25

Family

ID=87224334

Family Applications (1)

CN202310353228.8A (priority date 2023-04-04, filing date 2023-04-04): CN116489449A (en), status Pending

Country Status (1)

Country Link
CN (1) CN116489449A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117201873A (en) * 2023-11-07 2023-12-08 湖南博远翔电子科技有限公司 Intelligent analysis method and device for video image
CN117201873B (en) * 2023-11-07 2024-01-02 湖南博远翔电子科技有限公司 Intelligent analysis method and device for video image

Similar Documents

Publication Title
US11197036B2 (en) Multimedia stream analysis and retrieval
US20230199264A1 (en) Automated voice translation dubbing for prerecorded video
US8064641B2 (en) System and method for identifying objects in video
US20130006625A1 (en) Extended videolens media engine for audio recognition
CN107533850B (en) Audio content identification method and device
CN112733654B (en) Method and device for splitting video
JP4332700B2 (en) Method and apparatus for segmenting and indexing television programs using multimedia cues
CN112733660B (en) Method and device for splitting video strip
CN113766314B (en) Video segmentation method, device, equipment, system and storage medium
WO2016188329A1 (en) Audio processing method and apparatus, and terminal
CN113347489B (en) Video clip detection method, device, equipment and storage medium
CN116489449A (en) Video redundancy fragment detection method and system
CN116361510A (en) Method and device for automatically extracting and retrieving scenario segment video established by utilizing film and television works and scenario
KR20200098381A (en) methods and apparatuses for content retrieval, devices and storage media
JP2000285242A (en) Signal processing method and video sound processing device
Liang et al. A novel role-based movie scene segmentation method
CN112597335B (en) Output device and output method for selecting drama
CN114299415A (en) Video segmentation method and device, electronic equipment and storage medium
CN113012723B (en) Multimedia file playing method and device and electronic equipment
Stein et al. From raw data to semantically enriched hyperlinking: Recent advances in the LinkedTV analysis workflow
CN115080792A (en) Video association method and device, electronic equipment and storage medium
KR20040001306A (en) Multimedia Video Indexing Method for using Audio Features
JP2007060606A (en) Computer program comprised of automatic video structure extraction/provision scheme
CN117750148A (en) Method and system for identifying dramatic subtitles, electronic equipment and storage medium
Park et al. A novel DTW-based distance measure for speaker segmentation

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination