CN114329063B - Video clip detection method, device and equipment

Info

Publication number
CN114329063B
CN114329063B
Authority
CN
China
Prior art keywords
information
video
segment
source video
target
Prior art date
Legal status
Active
Application number
CN202111275890.3A
Other languages
Chinese (zh)
Other versions
CN114329063A
Inventor
刘刚
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111275890.3A
Publication of CN114329063A
Application granted
Publication of CN114329063B


Abstract

The application discloses a video clip detection method, device and equipment. The method includes: performing multidimensional feature extraction on a video clip to be detected to obtain a plurality of pieces of clip feature information; determining first source video clips from a source video set based on the clip feature information, and determining a target source video identifier based on the first source video clips; determining second source video clips from the source video set based on first clip feature information and second clip feature information among the plurality of pieces of clip feature information; and determining a target start time point and a target end time point based on the second source video clips to obtain source video positioning information. The method can strengthen the association between the video clip to be detected and the source video and improve the accuracy and efficiency of video clip detection, thereby improving the continuity and efficiency of video clip recommendation.

Description

Video clip detection method, device and equipment
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a method, an apparatus, and a device for detecting video clips.
Background
Short video refers to high-frequency pushed video content played on various new media platforms, suitable for viewing while mobile or during short breaks, with durations ranging from a few seconds to a few minutes. By clipping and splitting a source video, a plurality of short videos of different lengths can be obtained. In the related art, when associating a short video with its source video, the short video needs to be manually annotated with position information relative to the source video, so the efficiency of video clip detection is low, and the efficiency of video clip recommendation is correspondingly low.
Disclosure of Invention
The application provides a video clip detection method, a video clip detection device and video clip detection equipment, which can solve the technical problems of low video clip detection efficiency and low video clip recommendation efficiency.
In one aspect, the present application provides a video clip detection method, the method comprising:
performing multidimensional feature extraction on a video clip to be detected to obtain a plurality of pieces of clip feature information;
determining at least one first source video clip that matches each piece of clip feature information from a source video set based on the plurality of pieces of clip feature information;
performing matching verification on the source video identifiers of the plurality of first source video clips corresponding to the pieces of clip feature information to obtain a target source video identifier;
determining at least one second source video segment which matches the first segment feature information and matches the second segment feature information from the source video set based on first segment feature information and second segment feature information in the plurality of pieces of segment feature information, wherein the first segment feature information is feature information unrelated to time information, and the second segment feature information is feature information related to time information;
determining target time point information from the time point information corresponding to the second source video segment;
and taking the target source video identifier and the target time point information as source video positioning information corresponding to the video clip to be detected.
Another aspect provides a video clip detection apparatus, the apparatus comprising:
The segment feature extraction module is used for extracting multidimensional features of the video segment to be detected to obtain a plurality of segment feature information;
A first source video segment determining module, configured to determine, from a source video set, at least one first source video segment that matches each segment feature information based on the plurality of segment feature information;
The source video identification matching verification module is configured to perform matching verification on source video identifications of a plurality of first source video clips corresponding to the clip characteristic information to obtain target source video identifications;
A second source video segment determining module, configured to determine, from the source video set, at least one second source video segment that matches the first segment feature information and matches the second segment feature information based on first segment feature information and second segment feature information in the plurality of segment feature information, where the first segment feature information is feature information unrelated to time information, and the second segment feature information is feature information related to time information;
The time point determining module is used for determining target time point information from the time point information corresponding to the second source video clip;
and the positioning information determining module is used for taking the target source video identification and the target time point information as source video positioning information corresponding to the video clip to be detected.
In another aspect, an electronic device is provided, where the electronic device includes a processor and a memory, where at least one instruction or at least one program is stored, where the at least one instruction or the at least one program is loaded and executed by the processor to implement a video clip detection method as described above.
Another aspect provides a computer readable storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement the video clip detection method described above.
In a further aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the video clip detection method described above.
The application provides a video clip detection method, device and equipment, the method comprising: performing multidimensional feature extraction on a video clip to be detected to obtain a plurality of pieces of clip feature information, determining first source video clips from a source video set based on the clip feature information, and determining a target source video identifier based on the first source video clips; determining candidate source video segments from the source video set based on first segment feature information among the plurality of pieces of segment feature information, determining second source video segments from those candidates based on second segment feature information, and determining a target start time point and a target end time point based on the second source video segments to obtain source video positioning information. The method can strengthen the association between the video clip to be detected and the source video and improve the accuracy and efficiency of video clip detection, thereby improving the continuity and efficiency of video clip recommendation.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of an application scenario of a video clip detection method according to an embodiment of the present application;
Fig. 2 is a flowchart of a video clip detection method according to an embodiment of the present application;
Fig. 3 is a flowchart of a method for extracting multidimensional features in a video clip detection method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of multi-dimensional feature extraction in a video clip detection method according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a training mode of a text feature extraction model in a video clip detection method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of an application method of an image feature extraction model in a video clip detection method according to an embodiment of the present application;
Fig. 7 is a flowchart of obtaining a target source video identifier in a video clip detection method according to an embodiment of the present application;
Fig. 8 is a flowchart of a method for determining a target start time point and a target end time point in a video clip detection method according to an embodiment of the present application;
Fig. 9 is a flowchart of performing duration verification in a video clip detection method according to an embodiment of the present application;
Fig. 10 is a schematic diagram of obtaining source video positioning information by a video clip detection method according to an embodiment of the present application;
Fig. 11 is a flowchart of video recommendation based on a video clip detection method according to an embodiment of the present application;
Fig. 12 is a schematic diagram showing a target start time point and a source video identifier in a video clip detection method according to an embodiment of the present application;
Fig. 13 is a schematic view of an application scenario for video clip detection in the video clip detection method according to an embodiment of the present application;
Fig. 14 is a schematic structural diagram of a video clip detection apparatus according to an embodiment of the present application;
Fig. 15 is a schematic structural diagram of a video clip detection system according to an embodiment of the present application;
Fig. 16 is a schematic diagram of the hardware structure of an apparatus for implementing the method provided by an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present application more apparent. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the description of the present application, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. Moreover, the terms "first," "second," and the like, are used to distinguish between similar objects and do not necessarily describe a particular order or precedence. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein.
Referring to fig. 1, an application scenario schematic diagram of a video clip detection method provided by the embodiment of the present application is shown, where the application scenario includes a client 110 and a server 120, the server 120 may determine source video positioning information corresponding to a video clip to be detected based on multi-dimensional feature extraction, and when a current video clip is played by the client 110, the server 120 may determine a next video clip of the current video clip through the source video positioning information corresponding to the current video clip, so that when the playing of the current video clip is finished, the next video clip is recommended to the client 110.
In an embodiment of the present application, the client 110 includes a smart phone, a desktop computer, a tablet computer, a notebook computer, a digital assistant, a smart wearable device, or other types of physical devices, and may also include software running on those physical devices, such as an application program. The operating system running on the physical device in the embodiment of the present application may include, but is not limited to, an Android system, an iOS system, Linux, Unix, Windows, etc. The client 110 includes a UI (User Interface) layer through which the client 110 externally displays video clips, and in addition receives the next video clip transmitted from the server 120 based on an API (Application Programming Interface).
In an embodiment of the present application, the server 120 may include a server that operates independently, or a distributed server, or a server cluster that is composed of a plurality of servers. The server 120 may include a network communication unit, a processor, a memory, and the like. Specifically, the server 120 may determine source video positioning information corresponding to the video segment to be detected based on multi-dimensional feature extraction.
Referring to fig. 2, a video clip detection method is shown, which can be applied to a server side, and the method includes:
S210, performing multidimensional feature extraction on the video segment to be detected to obtain a plurality of pieces of segment feature information;
in some embodiments, feature extraction is performed on the video segments to be detected based on a plurality of feature extraction models, so that segment feature information corresponding to each feature extraction model can be obtained.
In some embodiments, referring to fig. 3, as shown in fig. 3, the plurality of segment feature information includes segment text feature information, segment image feature information and segment audio feature information, and the multi-dimensional feature extraction is performed on the video segment to be detected to obtain a plurality of segment feature information includes:
S310, extracting text features of a video segment to be detected based on a preset text feature extraction model to obtain segment text feature information;
S320, extracting image features of the video clips to be detected based on a preset image feature extraction model to obtain clip image feature information;
S330, extracting audio features of the video clips to be detected based on a preset audio feature extraction model to obtain clip audio feature information.
In some embodiments, referring to fig. 4, as shown in fig. 4, segment text feature information is obtained from corresponding text information of a video segment to be detected, such as a short video title, based on a preset text feature extraction model. Based on a preset image feature extraction model, segment image feature information is obtained from video signals of the video segments to be detected, and based on a preset audio feature extraction model, segment audio feature information is obtained from audio signals of the video segments to be detected.
In some embodiments, text feature extraction is performed on the video segment to be detected based on a preset text feature extraction model to obtain the segment text feature information. The text feature information of the video segment to be detected can be extracted from the title of the video segment and a text abstract of the video content, and can also be extracted from text displayed during playback of the video segment, which can be recognized through an optical character recognition (OCR) algorithm.
Based on the preset text feature extraction model, text feature extraction is performed on the video clip to be detected to obtain global feature vectors of the video title, the text summary and the OCR recognition results, yielding the segment text feature information. The text feature extraction model may be a Bert model. As shown in fig. 5, a pre-training stage is used during model training: part of the preset text information is cut out, the cut text information is used as training data and the cut-out text segment is used as labeling data; the cut text information is input into the model to be trained for context recognition to obtain a training text segment, and the model to be trained is trained based on the training text segment and the cut-out text segment to obtain a pre-trained model. For example, sentence A and sentence B are cut, the cut sentence A and sentence B are input into the model to be trained for context recognition to obtain the training text segments missing from sentence A and sentence B, and the model is trained based on these training text segments and the text segments actually cut out of sentence A and sentence B. By recognizing the missing parts of the cut text information, the model learns the context within the text information. The pre-trained model is then adjusted with text information as training data; data sets such as the MNLI (Multi-Genre Natural Language Inference) data set and the SQuAD data set may be selected as training data sets, the training data sets are input into the pre-trained model for feature extraction, and the model is trained according to the feature extraction results and the labeling data to obtain the text feature extraction model. At this time, a Named Entity Recognition (NER) function in the pre-trained model may also be trained, so that entity-class, time-class and number-class information in the text information corresponding to the video clip to be detected, such as person names, organization names, place names, times, dates, currencies and percentages, can be recognized based on this function.
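As a non-authoritative illustration of this step, the sketch below assumes a pretrained encoder such as the Bert model above is available behind a hypothetical text_encoder callable that maps a string to an embedding; the title, text summary and OCR results are pooled into one global segment text feature vector. All names are illustrative rather than part of the patent.

```python
import numpy as np

def extract_segment_text_feature(title, summary, ocr_lines, text_encoder):
    """Pool the title, text summary and OCR results of the clip into one
    global segment text feature vector.

    text_encoder: assumed callable mapping a string to a 1-D embedding,
    e.g. the pooled output of a fine-tuned Bert model (hypothetical here).
    """
    pieces = [p for p in [title, summary, *ocr_lines] if p]   # drop empty fields
    vectors = [np.asarray(text_encoder(p), dtype=np.float32) for p in pieces]
    feature = np.mean(vectors, axis=0)                        # average pooling
    return feature / (np.linalg.norm(feature) + 1e-12)        # L2-normalise for matching
```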
In some embodiments, before extracting the image features of the video clip to be detected, video frames in the clip may be extracted as the image information used for feature extraction, for example one video frame per second. The frame extraction yields the video frame sequence to be detected; feature extraction is performed on the frame sequence based on the preset image feature extraction model to obtain the image feature information of each video frame, and the per-frame image feature information is then added and averaged to obtain the segment image feature information. Several image feature extraction models may be used. Referring to fig. 6, the video frame sequence is input into a temporal segment network (Temporal Segment Networks, TSN) for processing, image features are then extracted through an Xception model, and the image feature vectors of each video frame obtained from an intermediate layer of a NeXtVLAD model (from the YouTube-8M competition) are added and averaged to obtain the segment image feature information.
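The frame-sampling and averaging described above can be sketched as follows, assuming a hypothetical frame_encoder (a stand-in for the TSN/Xception/NeXtVLAD pipeline) that maps one decoded frame to a feature vector; one frame per second is sampled and the per-frame vectors are added and averaged.

```python
import numpy as np

def extract_segment_image_feature(frames, fps, frame_encoder):
    """frames: list of decoded RGB frames of the video clip to be detected.
    fps: frame rate of the clip.
    frame_encoder: assumed callable mapping one frame to a 1-D feature vector
    (stand-in for the TSN / Xception / NeXtVLAD models in the description).
    """
    step = max(int(round(fps)), 1)        # roughly one video frame per second
    sampled = frames[::step]
    per_frame = [np.asarray(frame_encoder(f), dtype=np.float32) for f in sampled]
    return np.mean(per_frame, axis=0)     # add and average the per-frame features
```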
In some embodiments, when extracting the audio features, an audio signal may be obtained from the video clip to be detected, the audio signal is converted into spectral image input information by computing mel-frequency cepstral coefficient (MFCC) features, and feature extraction is performed on the spectral image input information to obtain the segment audio feature information. The variation characteristics and energy values of the audio can be obtained on the basis of a preset audio feature extraction window and sliding step, so that a feature extraction result is obtained for each audio feature extraction window; the feature extraction results are then weighted and summed to obtain the segment audio feature information. The audio feature extraction model may be a VGGish model; VGGish is an audio model trained on the AudioSet dataset and generates 128-dimensional feature vectors. When extracting features from the spectrum image corresponding to each audio feature extraction window, the extraction can be performed through NeXtVLAD to obtain the feature extraction result.
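A minimal sketch of the windowed audio feature extraction, assuming MFCC features have already been computed for the clip and a hypothetical window_encoder (a stand-in for a VGGish/NeXtVLAD-style model) maps one window of MFCC frames to a feature vector; the per-window results are weighted and summed.

```python
import numpy as np

def extract_segment_audio_feature(mfcc, window, step, window_encoder, weights=None):
    """mfcc: 2-D array (num_frames, num_coefficients) of MFCC features of the clip.
    window, step: size and sliding step of the audio feature extraction window.
    window_encoder: assumed callable mapping one MFCC window to a feature vector.
    """
    starts = range(0, max(len(mfcc) - window + 1, 1), step)
    per_window = np.stack([np.asarray(window_encoder(mfcc[s:s + window]), dtype=np.float32)
                           for s in starts])
    if weights is None:                              # equal weights if none are configured
        weights = np.ones(len(per_window)) / len(per_window)
    return np.average(per_window, axis=0, weights=weights)   # weighted sum over windows
```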
Different types of segment characteristic information are obtained through different characteristic extraction models, so that cross verification can be performed based on the different types of segment characteristic information in the video segment detection process, and the accuracy of video segment detection is improved.
S220, determining at least one first source video segment matched with each segment characteristic information from a source video set based on the segment characteristic information;
In some embodiments, matching each segment characteristic information with video characteristic information of source videos in the source video set, a first source video segment that matches each segment characteristic information separately may be determined from the source video set.
In some embodiments, determining, based on the plurality of segment feature information, a first source video segment from the source video set for which each segment feature information corresponds respectively comprises:
Acquiring video characteristic information of each source video in a source video set, wherein the video characteristic information is characteristic information corresponding to characteristic types of a plurality of fragment characteristic information respectively;
And matching the segment characteristic information and the video characteristic information, and determining at least one first source video segment corresponding to each segment characteristic information.
In some embodiments, the plurality of segment feature information may include segment text feature information, segment image feature information, and segment audio feature information, where the plurality of segment feature information corresponds to different feature types, and the video feature information of each source video in the source video set also needs to correspond to the feature types to match the segment feature information, that is, where the segment feature information includes segment text feature information, segment image feature information, and segment audio feature information, the video feature information corresponds to and includes video text feature information, video image feature information, and video audio feature information.
In some embodiments, the segment text feature information is matched with the video text feature information, source video segments corresponding to the matched video text feature information are obtained, and the source video segments are used as first source video segments corresponding to the segment text feature information.
And matching the segment image characteristic information with the video image characteristic information to obtain source video segments corresponding to the matched video image characteristic information, and taking the source video segments as first source video segments corresponding to the segment image characteristic information.
And matching the segment audio feature information with the video audio feature information to obtain source video segments corresponding to the matched video audio feature information, and taking the source video segments as first source video segments corresponding to the segment audio feature information.
Video feature information corresponding to the fragment feature information of different feature types is determined, and matching recall is independently carried out on each fragment feature information, so that the matching recall results can be mutually verified later, and the accuracy of video fragment detection is improved.
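As a hedged sketch of this per-modality matching recall, the function below compares each kind of clip feature information against a hypothetical in-memory index of source video segment features using cosine similarity and returns the first source video segments recalled for each feature type; the index layout, top_k and threshold values are assumptions for illustration.

```python
import numpy as np

def match_first_source_segments(segment_features, source_index, top_k=5, threshold=0.8):
    """segment_features: dict such as {"text": vec, "image": vec, "audio": vec}.
    source_index: dict feature_type -> list of (source_video_id, segment_id, feature_vec),
    i.e. video feature information of the source video set, per feature type.
    Returns, per feature type, the first source video segments whose cosine
    similarity to the clip feature exceeds the threshold.
    """
    recalled = {}
    for ftype, query in segment_features.items():
        q = query / (np.linalg.norm(query) + 1e-12)
        scored = []
        for video_id, segment_id, vec in source_index.get(ftype, []):
            v = vec / (np.linalg.norm(vec) + 1e-12)
            scored.append((float(np.dot(q, v)), video_id, segment_id))
        scored.sort(key=lambda item: item[0], reverse=True)
        recalled[ftype] = [(vid, seg) for score, vid, seg in scored[:top_k]
                           if score >= threshold]
    return recalled
```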
S230, carrying out matching verification on source video identifications of a plurality of first source video clips corresponding to the clip characteristic information to obtain target source video identifications;
In some embodiments, the target source video identifier is the source video identifier corresponding to the video clip to be detected. The same source video identifier among the source video identifiers of the plurality of first source video clips corresponding to the pieces of clip feature information is determined, and whether the first source video clips are correct is verified based on that identifier, so that the target source video identifier can be obtained. The source video identifier is the name information of the creative work corresponding to the source video, and the creative work may include literature, movies, animation, games, short videos and the like. The source video identifier may be a source video IP, and the target source video identifier may likewise be the source video IP corresponding to the clip to be detected. For example, for a certain web novel, a series of derivative works such as an adapted television series, game or cartoon belong to that web novel, so the IP of the adapted television series, game or cartoon is the web novel. Alternatively, on some short video platforms, the series of videos published by a certain account belong to the IP corresponding to that account, so the IP of those videos is the account.
In some embodiments, referring to fig. 7, performing matching verification on source video identifiers of a plurality of first source video clips corresponding to a plurality of clip feature information, and obtaining a target source video identifier includes:
S710, acquiring the number of each source video identifier;
s720, weighting the number of each source video identifier based on preset weight information corresponding to each fragment characteristic information;
S730, determining a target source video identification from the source video identifications according to the weighted number of the source video identifications.
In some embodiments, each piece of feature information may correspond to a plurality of first source video pieces, source video identifications corresponding to the first source video pieces are acquired, and the number of each source video identification is determined. For example, the segment text feature information corresponds to 5 source video segments, the source video identifications being A, B, C and D, respectively. The segment image characteristic information corresponds to 3 source video segments, and source video identifiers are A, B and E respectively. The segment audio feature information corresponds to 3 source video segments, and the source video identifications are A and E respectively. Then there are five source video identifications corresponding to the source video clips, the number of source video identifications a is 3, the number of source video identifications B is 2, the number of source video identifications C is 1, the number of source video identifications D is 1, and the number of source video identifications E is 2.
The source video identifier corresponding to the maximum value of the number of source video identifiers may be selected as the target source video identifier without adding the weight information, for example, in the case where the number of source video identifiers a is 3, the number of source video identifiers B is 2, the number of source video identifiers C is 1, the number of source video identifiers D is 1, and the number of source video identifiers E is 2, the source video identifier a may be determined as the target source video identifier.
In some embodiments, corresponding weight information, typically empirical values, may be set for each segment characteristic information. The weight information can be adjusted according to key characteristics corresponding to different video contents, the key characteristics are characteristic information related to the source video, for example, for short video contents, the segment text characteristic information, the segment image characteristic information and the segment audio characteristic information can be set to be 0.25, 0.35 and 0.3, and for video contents with more random video titles, the segment text characteristic information has lower correlation degree with the source video, so that the weight of the segment text characteristic information can be reduced, and the weight information can be adjusted to be 0.1, 0.4 and 0.4.
And under the condition that the weight information is set, the number of each source video identifier is adjusted proportionally according to the weight information of the fragment characteristic information corresponding to each source video identifier, so that the number of each source video identifier is weighted. For example, in the case that the weight information is 0.1, 0.4, and 0.4, the number of source video identifications corresponding to the clip text feature information may be regarded as one source video identification, the number of source video identifications corresponding to the clip image feature information may be regarded as four source video identifications, and the number of source video identifications corresponding to the clip audio feature information may be regarded as four source video identifications. The source video identifications corresponding to the segment text feature information are A, B, C and D respectively. The source video identifications corresponding to the fragment image characteristic information are A, B and E respectively. When the source video identifications of the clip audio feature information are a and E respectively, it may be determined that the number of source video identifications a is 9, the number of source video identifications B is 5, the number of source video identifications C is 1, the number of source video identifications D is 1, and the number of source video identifications E is 8 based on the weight information.
In the case of setting the weight information, a source video identification corresponding to the maximum value of the number of source video identifications after the weighting process may be selected as the target source video identification. For example, in the case where the number of source video identifications a is 9, the number of source video identifications B is 5, the number of source video identifications C is 1, the number of source video identifications D is 1, and the number of source video identifications E is 8, the source video identification a is determined to be the target source video identification.
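The weighted counting in the example above can be sketched as follows; the helper name and data layout are illustrative, and the weights are the example values 0.1 / 0.4 / 0.4 from the description.

```python
from collections import Counter

def select_target_source_id(recalled_ids_by_type, weights):
    """recalled_ids_by_type: dict feature_type -> list of source video identifiers
    of the first source video segments recalled for that feature type.
    weights: dict feature_type -> preset weight of that clip feature information.
    Returns the identifier with the largest weighted count.
    """
    weighted = Counter()
    for ftype, ids in recalled_ids_by_type.items():
        for source_id in ids:
            weighted[source_id] += weights.get(ftype, 1.0)
    return weighted.most_common(1)[0][0] if weighted else None

# Worked example from the description: with weights 0.1 / 0.4 / 0.4 the weighted counts
# become A=0.9, B=0.5, C=0.1, D=0.1, E=0.8 (i.e. 9 : 5 : 1 : 1 : 8 in relative terms),
# so A is selected as the target source video identifier.
target = select_target_source_id(
    {"text": ["A", "B", "C", "D"], "image": ["A", "B", "E"], "audio": ["A", "E"]},
    {"text": 0.1, "image": 0.4, "audio": 0.4},
)
```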
The source video identification with the largest number is obtained from a plurality of mutually independent matching recall results, so that a target source video identification is obtained, and the correlation between the video segment to be detected and the source video can be established through segment characteristic information of each dimension.
S240, determining at least one second source video segment which is matched with the first segment characteristic information and matched with the second segment characteristic information from the source video set based on the first segment characteristic information and the second segment characteristic information in the plurality of segment characteristic information, wherein the first segment characteristic information is characteristic information irrelevant to time information, and the second segment characteristic information is characteristic information relevant to the time information;
In some embodiments, the first segment feature information is feature information that has no definite correspondence with the time axis of the video, such as the segment text feature information. Segment text feature information generally represents the content or a summary of a video segment; it either has no correspondence with the time axis of the video, or the start time point and end time point it corresponds to are ambiguous. For example, for the plot of a certain episode of a television series, the segment text feature information corresponds to a certain segment of the series, and the start and end time points of that segment could in principle be taken as the start and end time points corresponding to the segment text feature information; in general, however, the time range corresponding to a segment summarized by text is rather fuzzy, and it is difficult to accurately determine the start and end time points of the segment to be detected in the corresponding source video. The segment text feature information can therefore be regarded as unrelated to the time information.
In some embodiments, the second clip feature information is feature information that has a definite correspondence with the time axis of the video, such as the clip image feature information and the clip audio feature information. According to such features, a specific time point on the time axis of the source video can be determined, for example image 1 corresponds to 3 minutes 15 seconds and audio 2 corresponds to 2 minutes 13 seconds; the clip image feature information and the clip audio feature information are therefore feature information related to the time information.
In some embodiments, a source video segment matching the first segment feature information may be determined from the source video set based on the first segment feature information, and then the source video segment matching the second segment feature information may be acquired from the source video segment matching the first segment feature information based on the second segment feature information, so as to obtain the second source video segment. Or determining the source video segment matched with the second segment characteristic information from the source video set based on the second segment characteristic information, and then obtaining the source video segment matched with the first segment characteristic information from the source video segment matched with the second segment characteristic information based on the first segment characteristic information in a narrowing range so as to obtain the second source video segment.
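Both narrowing orders can be expressed with the same two-stage filter; the sketch below shows the first order (time-independent feature first), with the record attributes, the similarity function and the threshold all being hypothetical stand-ins.

```python
def determine_second_source_segments(first_info, second_info, source_segments,
                                      similarity, threshold=0.8):
    """first_info: time-independent feature information (e.g. segment text features).
    second_info: time-related feature information (e.g. segment image or audio features).
    source_segments: iterable of records with text_feature / image_feature /
    start_time / end_time attributes (a hypothetical index of the source video set).
    similarity(a, b): feature similarity function, e.g. cosine similarity.
    """
    # stage one: recall source video segments matching the time-independent feature
    stage_one = [s for s in source_segments
                 if similarity(first_info, s.text_feature) >= threshold]
    # stage two: keep only those also matching the time-related feature
    return [s for s in stage_one
            if similarity(second_info, s.image_feature) >= threshold]
```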
S250, determining target time point information from time point information corresponding to the second source video clip;
In some embodiments, in the case where the time point information includes a start time point, the start time point is subjected to matching verification, resulting in a target start time point. Under the condition that the time point information comprises a starting time point and an ending time point, respectively carrying out matching verification on the starting time point and the ending time point, and verifying duration information corresponding to the starting time point and the ending time point to obtain target time point information, wherein the target time point information comprises a target starting time point and a target ending time point.
In some embodiments, referring to fig. 8, the target time point information includes a target start time point and a target end time point, and determining the target time point information from the time point information corresponding to the second source video clip includes:
S810, performing matching verification on a starting time point of the second source video segment to obtain an initial starting time point;
s820, performing matching verification on the ending time point of the second source video segment to obtain an initial ending time point;
s830, obtaining initial duration information based on a difference value between an initial starting time point and an initial ending time point;
S840, performing duration verification on the initial duration information to obtain a duration verification result;
S850, determining a target starting time point and a target ending time point based on the time length verification result.
In some embodiments, the start time points of the second source video clips corresponding to the plurality of pieces of second clip feature information are acquired, the number of time points corresponding to the same time, or falling within the same time interval, among the start time points is determined, and the start time point corresponding to the maximum number is determined as the initial start time point. Likewise, the end time points of the second source video clips corresponding to the pieces of clip feature information are acquired, the number of time points corresponding to the same time or falling within the same time interval among the end time points is determined, and the end time point corresponding to the maximum number is determined as the initial end time point. The time interval is a preset empirical value and can be set, for example, to within 10 seconds or within 5 seconds. There may be a plurality of second source video clips, and correspondingly a plurality of start time points and end time points.
For example, suppose start time point 1 of the second source video clip corresponding to the clip image feature information is the 6th minute, start time point 2 is 6 minutes 5 seconds, and start time point 3 is 6 minutes 30 seconds, while start time point 4 of the second source video clip corresponding to the clip audio feature information is the 6th minute, start time point 5 is 6 minutes 40 seconds, and start time point 6 is 6 minutes 10 seconds. Start time point 1 and start time point 4 correspond to the same time, and with a time interval of 10 seconds, start time points 1, 4, 2 and 6 fall within the same time interval; these four time points are similar and their number is the maximum, so the 6th minute can be set as the initial start time point. The initial end time point can be determined in the same manner as the initial start time point.
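The voting on start (or end) time points can be sketched as below; the tie-breaking rule and the 10-second interval are illustrative assumptions, and the example values mirror the 6th-minute example above.

```python
def vote_time_point(time_points, interval=10.0):
    """time_points: candidate start (or end) time points, in seconds, of the
    second source video segments. interval: preset empirical time interval.
    Returns the time point supported by the most candidates within the interval;
    ties are broken toward the earlier time point (an illustrative choice).
    """
    votes = {}
    for t in time_points:
        votes[t] = sum(1 for u in time_points if abs(u - t) <= interval)
    best, _ = max(votes.items(), key=lambda kv: (kv[1], -kv[0]))
    return best

# Example from the description: 6:00, 6:05, 6:30, 6:00, 6:40 and 6:10 in seconds;
# the 360-second (6th minute) candidate gathers the most agreeing points and wins.
initial_start = vote_time_point([360, 365, 390, 360, 400, 370])
```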
In some embodiments, based on a difference value between an initial starting time point and an initial ending time point, initial duration information can be obtained, duration verification is performed on the initial duration information, whether the initial starting time point and the initial ending time point are set reasonably or not is determined, a duration verification result is obtained, and a target starting time point and a target ending time point are determined according to the duration verification result.
Based on the characteristic information related to time and the characteristic information irrelevant to time, matching recall is conducted again, so that mutual verification can be conducted on time points on a time axis on the basis of reducing recall range, a target starting time point and a target ending time point corresponding to the video clip to be detected are determined, and accuracy of time point determination is improved.
In some embodiments, referring to fig. 9, the duration verification result includes a first duration verification result and a second duration verification result, and performing duration verification on the initial duration information to obtain the duration verification result includes:
S910, determining a target source segment from the second source video segment;
s920, comparing the initial time length information with target time length information corresponding to a target source fragment to obtain a first time length verification result;
s930, comparing the initial duration information with duration information to be detected of the video segment to be detected to obtain a second duration verification result;
based on the duration verification result, determining the target start time point and the target end time point includes:
S940, determining an initial starting time point as a target starting time point and determining an initial ending time point as a target ending time point when the first time length verification result indicates that the initial time length information is smaller than or equal to the target time length information and the second time length verification result indicates that the initial time length information is larger than or equal to the time length information to be detected.
In some embodiments, the target source segment is determined from the second source video segments corresponding to the plurality of second segment feature information, and the second source video segments corresponding to the plurality of second segment feature information may be subjected to matching verification, so as to obtain the number of each second source video segment, and the second source video segment corresponding to the maximum value of the number is taken as the target source segment.
Comparing the initial time length information with the target time length information corresponding to the target source fragment, obtaining a first time length verification result, and determining that the initial time length information corresponding to the initial starting time point and the initial ending time point is matched with the target time length information when the initial time length information is smaller than or equal to the target time length information, namely, the initial time length information passes verification. Under the condition that the initial time length information is larger than the target time length information, it can be determined that the initial time length information corresponding to the initial starting time point and the initial ending time point is not matched with the target time length information, namely the initial time length information is not verified.
Comparing the initial duration information with the duration information to be detected of the video segment to be detected, a second duration verification result can be obtained, and under the condition that the initial duration information is greater than or equal to the duration information to be detected, the matching of the video segment to be detected and the positioned source video segment can be determined, and the video segment to be detected is not a video segment obtained by splicing, namely the initial duration information passes verification. Under the condition that the initial duration information is smaller than the duration information to be detected, it can be determined that the video segment to be detected is not matched with the positioned source video segment, the video segment to be detected may be a spliced video segment, and the ending time of the video segment to be detected exceeds the ending time of the positioned source video segment, namely the initial duration information is not verified.
In some embodiments, when the first time length verification result indicates that the initial time length information is less than or equal to the target time length information and the second time length verification result indicates that the initial time length information is greater than or equal to the time length information to be detected, determining an initial start time point as a target start time point and determining an initial end time point as a target end time point.
When the first time length verification result indicates that the initial time length information is smaller than or equal to the target time length information and the second time length verification result indicates that the initial time length information is smaller than the time length information to be detected, the initial starting time point can be actually determined to be the target starting time point, but the initial ending time point is not the target ending time point, so that the initial starting time point passes verification, the initial ending time point does not pass verification, and only the target starting time point is output when the source video positioning information is output.
When the first time length verification result indicates that the initial time length information is larger than the target time length information and the second time length verification result indicates that the initial time length information is smaller than the time length information to be detected, the initial starting time point and the initial ending time point are not verified.
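A compact sketch of the two duration checks and the resulting cases described above; the function name and return convention (None for a time point that failed verification) are illustrative.

```python
def verify_duration(initial_start, initial_end, target_duration, clip_duration):
    """initial_start / initial_end: initial time points, in seconds.
    target_duration: duration of the target source segment.
    clip_duration: duration information to be detected (the clip's own duration).
    Returns the verified (start, end); an unverified point is returned as None.
    """
    initial_duration = initial_end - initial_start
    first_ok = initial_duration <= target_duration    # first duration verification result
    second_ok = initial_duration >= clip_duration     # second duration verification result

    if first_ok and second_ok:
        return initial_start, initial_end             # both time points verified
    if first_ok and not second_ok:
        return initial_start, None                    # only the start time point verified
    return None, None                                 # neither time point verified
```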
After the initial starting time point and the initial ending time point are obtained, verifying the duration information corresponding to the initial starting time point and the initial ending time point, and determining different video clips to be detected, so that corresponding source video positioning information is output for different video clips to be detected and is respectively applied to different application scenes such as video continuous recommendation scenes or video editing recommendation scenes, and the applicability of the source video positioning information is enriched.
S260, taking the target source video identification and the target time point information as source video positioning information corresponding to the video clip to be detected.
In some embodiments, if the target ending time point can be detected, the target source video identifier, the target starting time point and the target ending time point are used as source video positioning information, and if the target ending time point cannot be detected, the target source video identifier and the target starting time point can be used as source video positioning information. For example, in a video obtained by clipping a plurality of video clips from different source videos, when there are a plurality of target source video identifications and some video clips have a short time, and a target ending time point is not detected, each target source video identification and a target starting time point corresponding to each target source video identification may be used as source video positioning information.
In some embodiments, where the target source video identification comprises a plurality of source video identifications, the method further comprises:
determining a target starting time point corresponding to each target source video identifier from starting time points corresponding to the second source video fragments;
and taking each target source video identifier and a target starting time point corresponding to each target source video identifier as source video positioning information corresponding to the video segment to be detected.
In some embodiments, the video clip to be detected may be a concatenation of a plurality of video clips, i.e. the video clip to be detected is a video clip result. For a video clip result there is no requirement for continuous playing of the video; in a video clip result, the picture from a certain source video may last only a few seconds, so the initial duration information corresponding to each video clip is necessarily smaller than the duration information of the video clip result and the target end time point cannot be determined. All target source video identifiers and the target start time point corresponding to each identifier therefore need to be determined. For example, referring to fig. 10, after video positioning detection, the source video positioning information corresponding to the video clip to be detected is: source video identifier: episode 3 of television drama X; target start time point corresponding to that source video identifier: 00:34:53.
Therefore, when the video clip result is detected, the source video identification of the video clip used in the video clip result can be determined, so that a user can determine which source videos are included in the video clip result, and the corresponding source videos can be acquired according to the requirement of the user. The target starting time point of the video clip used in the video clip result can also be determined, namely, the target starting time point corresponding to each target source video identifier is determined, and each target source video identifier and the target starting time point corresponding to each target source video identifier are used as source video positioning information corresponding to the video clip result.
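For a video clip result with several target source video identifiers, the positioning information can be assembled roughly as follows; the data structures are hypothetical stand-ins for the entities in the description.

```python
from collections import Counter

def build_positioning_info(target_ids, second_segments_by_id):
    """target_ids: the target source video identifiers detected for the clip result.
    second_segments_by_id: dict identifier -> list of matched second source video
    segments, each with a start_time attribute (hypothetical structure).
    Returns one (identifier, target start time point) pair per identifier.
    """
    info = []
    for source_id in target_ids:
        starts = [seg.start_time for seg in second_segments_by_id.get(source_id, [])]
        if starts:
            # take the start time supported by the most matched segments
            target_start = Counter(starts).most_common(1)[0][0]
            info.append({"source_video_id": source_id,
                         "target_start_time": target_start})
    return info
```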
In some embodiments, referring to fig. 11, the method further comprises:
S1110, responding to a playing instruction of a current video clip corresponding to a target object, and acquiring current source video positioning information of the current video clip;
s1120, determining a next video clip of the current video clip based on the current source video positioning information;
S1130, recommending the next video clip to the target object when the playing of the current video clip is finished.
In some embodiments, when the method is applied to a video recommendation scenario, it may respond to a play instruction of the current video clip corresponding to a target object, where the target object may be a user; that is, when the user clicks to play the current video clip, the current source video positioning information of the current video clip is obtained. Based on the current source video positioning information, the next video clip after the current video clip can be obtained, so that when playing of the current video clip ends, the next video clip is automatically recommended to the user. The next video clip may be a clip that continues the story line of the current video clip; when the current video clip is an episode of a series, the next video clip may also be the next episode. This realizes continuous playing along the same story line and organized playing of episodic content, gives the user an immersive playing experience, and improves the playing duration of the video and the user experience.
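A hedged sketch of the continuous-recommendation step: given the source video positioning information of the clip being played, the next clip is the one from the same source video whose target start time point comes next; the clip_catalog structure and field names are assumptions.

```python
def recommend_next_clip(current_positioning, clip_catalog):
    """current_positioning: dict with "source_video_id" and "target_start_time"
    of the currently playing clip (its source video positioning information).
    clip_catalog: iterable of dicts with "clip_id", "source_video_id" and
    "target_start_time" keys (a hypothetical index of candidate clips).
    Returns the next clip on the same story line, or None if there is none.
    """
    source_id = current_positioning["source_video_id"]
    current_start = current_positioning["target_start_time"]
    candidates = [c for c in clip_catalog
                  if c["source_video_id"] == source_id
                  and c["target_start_time"] > current_start]
    return min(candidates, key=lambda c: c["target_start_time"]) if candidates else None
```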
In some embodiments, referring to fig. 12, as shown in fig. 12, a target start time point and a source video identifier corresponding to a video clip may be displayed in a current video clip. Based on the source video positioning information, a next video clip can be determined when the current video clip is played, and the next video clip of the current video clip is displayed in a list below the current video clip. And after the playing of the current video clip is finished, switching to the next video clip for playing. Therefore, the video clip recommendation continuity and recommendation efficiency can be improved.
In the lower list of the current video clip, the video clip following the next video clip may also be displayed, as shown in fig. 12, where the current video clip is the first video clip, the next video clip is the second video clip, the third video clip is the next video clip of the second video clip, and the fourth video clip is the next video clip of the third video clip.
In some embodiments, referring to fig. 13, after the video clip to be detected is obtained, feature information of three dimensions, text, image and audio, is extracted respectively to obtain the clip text feature information, clip image feature information and clip audio feature information, as shown in fig. 13. When obtaining the clip text feature information, short-text feature information, i.e. the clip text feature information, may be determined by a sentence2vec model, which can map sentences into a vector space. When obtaining the clip image feature information, frame sampling may first be performed on the video clip to be detected; after the frame sampling yields a video frame sequence, feature extraction is performed through a convolutional network to obtain the clip image feature information. When obtaining the clip audio feature information, a feature window can be constructed, a Fourier transform is then applied to the audio signal to obtain frequency-domain features, and feature extraction is performed on the frequency-domain features based on the feature window to obtain the clip audio feature information.
And performing text matching, image matching and audio matching based on the multidimensional feature information, positioning the video clip to be detected based on the matching result, and outputting a video positioning result.
The embodiment of the application also provides a video clip detection method, which comprises the following steps: performing multidimensional feature extraction on a video clip to be detected to obtain a plurality of pieces of clip feature information, determining first source video clips from a source video set based on the clip feature information, and determining a target source video identifier based on the first source video clips; determining second source video clips from the source video set based on first clip feature information and second clip feature information among the plurality of pieces of clip feature information, and determining target time point information based on the second source video clips to obtain source video positioning information. The method can strengthen the association between the video clip to be detected and the source video and improve the accuracy and efficiency of video clip detection, thereby improving the continuity and efficiency of video clip recommendation.
Meanwhile, the method reduces the large amount of manpower and material resources required for manual video clip annotation, improves the user experience, and can encourage users to view more content they are interested in, thereby substantially increasing average video consumption and optimizing the content distribution efficiency of the recommendation engine. The method can also link the IP content of short videos and long videos, scheduling the playing of long videos through short videos and thus reducing the traffic-acquisition and promotion cost of long videos.
The embodiment of the application also provides a video clip detection device, please refer to fig. 14, which comprises:
The segment feature extraction module 1410 is configured to perform multidimensional feature extraction on a video segment to be detected to obtain a plurality of segment feature information;
a first source video clip determination module 1420 to determine at least one first source video clip from the source video collection that matches each clip feature information based on the plurality of clip feature information;
the source video identification matching verification module 1430 is configured to perform matching verification on source video identifications of a plurality of first source video clips corresponding to the plurality of clip feature information to obtain a target source video identification;
A second source video segment determining module 1440, configured to determine, from the source video set, at least one second source video segment that matches the first segment feature information and matches the second segment feature information based on the first segment feature information and the second segment feature information in the plurality of segment feature information, where the first segment feature information is feature information unrelated to time information, and the second segment feature information is feature information related to time information;
a time point determining module 1450, configured to determine target time point information from time point information corresponding to the second source video clip;
The positioning information determining module 1460 is configured to use the target source video identifier and the target time point information as source video positioning information corresponding to the video segment to be detected.
In some embodiments, the first source video clip determination module comprises:
The video feature information acquisition unit is used for acquiring video feature information of each source video in the source video set, wherein the video feature information is feature information corresponding to the respective feature types of the plurality of pieces of segment feature information;
The feature matching unit is used for matching the feature information of each segment with the feature information of the video and determining at least one first source video segment corresponding to the feature information of each segment.
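A brute-force sketch of this per-modality matching using cosine similarity is shown below; `source_videos` is a hypothetical list of (video_id, segment_id, feature_vector) entries of the same feature type, and the threshold and top-k values are assumptions. In the system described later, this lookup is performed by the vector retrieval service instead.

```python
import numpy as np

def match_segment_feature(seg_vec, source_videos, top_k=5, threshold=0.8):
    # Score every stored source video segment of the same feature type by cosine
    # similarity and keep the best candidates as first source video segments.
    scored = []
    for video_id, segment_id, vec in source_videos:
        sim = float(np.dot(seg_vec, vec) /
                    (np.linalg.norm(seg_vec) * np.linalg.norm(vec) + 1e-8))
        if sim >= threshold:
            scored.append((sim, video_id, segment_id))
    scored.sort(reverse=True)
    return scored[:top_k]
```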
In some embodiments, the source video identification match verification module comprises:
A source video identification number acquisition unit, configured to acquire the number of each source video identification;
the weighting processing unit is used for carrying out weighting processing on the number of each source video identifier based on preset weight information corresponding to each piece of segment feature information;
And the target source video identification determining unit is used for determining the target source video identification from the source video identifications according to the weighted number of the source video identifications.
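A minimal sketch of the weighted counting described above; the per-modality weight values are hypothetical, since the embodiment only states that preset weight information exists for each piece of segment feature information.

```python
from collections import Counter

# Hypothetical per-modality weights (not specified by the embodiment).
MODALITY_WEIGHTS = {"text": 1.0, "image": 2.0, "audio": 1.5}

def pick_target_source_video(ids_by_modality):
    # Weight the count of each source video identification by the modality
    # that produced it, then keep the identification with the highest score.
    scores = Counter()
    for modality, video_ids in ids_by_modality.items():
        for video_id in video_ids:
            scores[video_id] += MODALITY_WEIGHTS.get(modality, 1.0)
    return scores.most_common(1)[0][0]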
In some embodiments, the target time point information includes a target start time point and a target end time point, and the time point determination module includes:
the initial starting time point determining unit is used for carrying out matching verification on the starting time point corresponding to the second source video segment to obtain an initial starting time point;
The initial ending time point determining unit is used for carrying out matching verification on the ending time point corresponding to the second source video segment to obtain an initial ending time point;
the initial duration information determining unit is used for obtaining initial duration information based on the difference value between the initial starting time point and the initial ending time point;
The duration verification unit is used for performing duration verification on the initial duration information to obtain a duration verification result;
and the target time point determining unit is used for determining a target starting time point and a target ending time point based on the duration verification result.
In some embodiments, the duration verification result includes a first duration verification result and a second duration verification result, and the duration verification unit includes:
A target source segment determining unit, configured to determine a target source segment from the second source video segment;
The first comparison unit is used for comparing the initial time length information with the target time length information corresponding to the target source segment to obtain a first time length verification result;
the second comparison unit is used for comparing the initial duration information with the duration information to be detected of the video segment to be detected to obtain a second duration verification result;
The target time point determination unit includes:
The condition matching unit is used for determining an initial starting time point as a target starting time point and determining an initial ending time point as a target ending time point when the first time length verification result indicates that the initial time length information is smaller than or equal to the target time length information and the second time length verification result indicates that the initial time length information is larger than or equal to the time length information to be detected.
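A minimal sketch of this duration verification; the fallback behaviour when verification fails is an assumption, since the embodiment does not specify what happens in that case.

```python
def verify_time_points(initial_start, initial_end,
                       target_source_duration, clip_duration):
    # First check: the candidate span must not exceed the matched target source
    # segment. Second check: it must cover the duration of the clip under test.
    initial_duration = initial_end - initial_start
    first_ok = initial_duration <= target_source_duration
    second_ok = initial_duration >= clip_duration
    if first_ok and second_ok:
        return initial_start, initial_end  # target start / end time points
    return None  # verification failed; fallback handling is not specified here
```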
In some embodiments, where the target source video identification comprises a plurality of source video identifications, the apparatus further comprises:
a video editing time point determining unit, configured to determine a target starting time point corresponding to each target source video identifier from starting time points corresponding to the second source video segments;
and the video clipping positioning unit is used for taking each target source video identifier and the target starting time point corresponding to each target source video identifier as source video positioning information corresponding to the video clip to be detected.
In some embodiments, the plurality of clip feature information includes clip text feature information, clip image feature information, and clip audio feature information, the clip feature extraction module includes:
the text feature extraction unit is used for extracting text features of the video clips to be detected based on a preset text feature extraction model to obtain clip text feature information;
the image feature extraction unit is used for extracting image features of the video clips to be detected based on a preset image feature extraction model to obtain clip image feature information;
The audio feature extraction unit is used for extracting audio features of the video clips to be detected based on a preset audio feature extraction model to obtain clip audio feature information.
In some embodiments, the apparatus further comprises:
The current positioning information acquisition module is used for responding to a playing instruction of the current video clip corresponding to the target object to acquire current source video positioning information of the current video clip;
The next video segment determining module is used for determining the next video segment of the current video segment based on the current source video positioning information;
and the recommending module is used for recommending the next video clip to the target object when the playing of the current video clip is finished.
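A minimal sketch of next-clip selection from the source video positioning information; `clip_library` is a hypothetical collection of already-positioned clips with source_video_id / start / end fields.

```python
def recommend_next_clip(current_positioning, clip_library):
    # Among clips cut from the same source video, pick the one whose start
    # point follows the current clip's end point most closely.
    video_id = current_positioning["source_video_id"]
    current_end = current_positioning["end"]
    candidates = [c for c in clip_library
                  if c["source_video_id"] == video_id and c["start"] >= current_end]
    return min(candidates, key=lambda c: c["start"]) if candidates else None
```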
The device provided in the above embodiment can execute the method provided in any embodiment of the present application, and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiment, reference may be made to the video clip detection method provided in any embodiment of the present application.
The embodiment of the application also provides a video clip detection system, please refer to fig. 15, which includes: the system comprises a production end, a consumption end and a server. The server comprises a content distribution outlet service, a manual auditing system, a dispatching center service, an uplink and downlink interface service, a content database, a multi-mode film source positioning service, a vector retrieval service, a multi-mode vector generation service, a text analysis and audio-visual frame extraction service, a source video library, a content storage service and a file downloading system.
The production end and the consumption end are electrically connected with the uplink and downlink interface service, the uplink and downlink interface service is electrically connected with the dispatching center service, the content storage service and the content database, the dispatching center service is electrically connected with the multi-mode film source positioning service, the manual auditing system and the content distribution outlet service, the content distribution outlet service is electrically connected with the consumption end, and the manual auditing system is electrically connected with the content database. The multi-modal sheet source localization service and the vector retrieval service are electrically connected, and the multi-modal vector generation service and the vector retrieval service are electrically connected. The multi-mode vector generation service is electrically connected with the text analysis and audio-visual frame extraction service, the text analysis and audio-visual frame extraction service is electrically connected with the downloading file system and the source video library, the downloading file system is electrically connected with the content storage service, and the content storage service is electrically connected with the consumption terminal.
The production end may provide professionally generated content (Professional Generated Content, PGC) or user generated content (User Generated Content, UGC), and publishes video content through a mobile end or a back-end interface system; this is the primary content source of the recommended and distributed content.
The production end communicates with the uplink and downlink content interface service. A production end that publishes video content is usually a shooting end; during shooting, local video content can be selected and matched with music, clipping, cover images, filter templates, video beautifying functions and the like.
The consumption end communicates with the uplink and downlink content interface service, obtains index information of the content to be accessed through recommendation pushing, then communicates with the content storage service and obtains the corresponding content, including recommended content and content subscribed to by topic. The content storage service stores content entities such as video source files and picture source files of cover images, while meta information of the content, such as titles, authors, cover images, classifications and label information, is stored in the content database. Meanwhile, the consumption end can report behavior data generated during uploading, downloading and playing, such as stalling, loading time and play clicks, to the back end for statistical analysis. The consumption end typically browses the content data via a Feeds stream.
The uplink and downlink content interface service communicates directly with the production end; the content submitted from the production end, typically including the title, publisher, abstract, cover image and release time, is stored in the content database. The uplink and downlink content interface service writes meta information of the content, such as file size, cover image link, title, release time and author, into the content database, and synchronizes the submitted content to the dispatching center service for subsequent content processing and circulation.
The content database is the core database of the content. Meta information of the content released by all producers is stored in this service database, focusing on the meta information of the content itself, such as file size, cover image link, code rate, file format, title, release time, author, video file size, video format, and whether the content is marked as original or first-published; it also includes the classification of the content obtained in the manual auditing process.
The manual auditing system reads information in the content database in the manual auditing process, and meanwhile, the result and the state of the manual auditing are returned to the content database.
The dispatching center service can process the content, mainly through machine processing and manual auditing. The core of machine processing includes various quality judgments such as low-quality filtering, content labels such as classification and label information, and content similarity investigation; the results are written into the content database, and completely duplicated content is not sent for repeated secondary manual processing, thereby saving auditing manpower.
The dispatching center service is responsible for the whole dispatching process of content circulation: it receives the warehoused content through the uplink and downlink content interface service, and then acquires the meta information of the content from the content database.
The dispatching center service can schedule the manual auditing system and the machine processing system, and control the dispatching order and priority. Content enabled through manual auditing is then provided, via the content distribution outlet service, to the consumption end of the terminal for direct page display, i.e. the content index information obtained by the consumption end, typically the entry address for accessing the content.
The manual auditing system is a carrier of manual service capability and is mainly used for auditing content that cannot be judged by machines, such as sensitive, pornographic or legally prohibited content that needs to be filtered, and for labeling video content.
The content storage service may store content entity information other than the meta information of the content, such as video source files and picture source files of image-text content. When video content tag features are acquired, it provides temporary storage of intermediate data, including the source file itself and the extracted frames and audio information of the video source file, so that repeated extraction is avoided.
The download file system can download and acquire the original content from the content storage service and control the speed and progress of downloading; it is usually a group of parallel servers composed of related task scheduling and distribution clusters. The downloaded file is passed to the text parsing and video/audio frame extraction service to obtain the necessary video frames and audio information from the source file, as the basic input for subsequently constructing the image vectors and audio vectors of the video.
The text parsing and video/audio frame extraction service may perform primary feature processing, according to the above-mentioned algorithms and policies, on files downloaded by the download file system from the content storage service and the source video library. According to the feature construction method for the video modality and the audio modality, frame images of the video are extracted as the associated data source for video retrieval.
The multimodal vector generation service may obtain, in accordance with the methods described above, a text vector, a number of visual frame vectors and audio content vectors for each video clip to be detected. The generated vectors can be written into the vector retrieval service for storage and indexing, thereby facilitating positioning retrieval.
The source video library can be built by manually collecting or downloading source content libraries used for locating film sources, such as video, sports, animation, games and other vertical video content libraries, and then extracting key information such as work titles, episode numbers, actors and roles from a knowledge graph through manual operation and machine learning. The source video library communicates with the text parsing and video/audio frame extraction service, and a vector index is established for the content of the source video library.
The vector retrieval service uses the Faiss library to store and manage the constructed multimodal video vectors, uses the Faiss vector matching retrieval function as the basis of the retrieval service, and communicates with the multimodal film source positioning service, thereby completing the underlying implementation of the retrieval service for basic multimodal video vector management.
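A minimal sketch of how such a Faiss-backed vector store might look; the choice of a flat inner-product index over L2-normalised vectors (approximating cosine similarity) is an assumption, since the embodiment does not specify the index type.

```python
import faiss
import numpy as np

def build_source_index(source_vectors):
    # Inner-product index over L2-normalised float32 vectors, i.e. an
    # approximation of cosine-similarity retrieval.
    xb = np.ascontiguousarray(np.asarray(source_vectors, dtype="float32"))
    faiss.normalize_L2(xb)
    index = faiss.IndexFlatIP(xb.shape[1])
    index.add(xb)
    return index

def query_index(index, query_vectors, top_k=10):
    # Nearest-neighbour lookup used by the multimodal film source positioning
    # service; returns similarity scores and row ids into the source vectors.
    xq = np.ascontiguousarray(np.asarray(query_vectors, dtype="float32"))
    faiss.normalize_L2(xq)
    scores, ids = index.search(xq, top_k)
    return scores, ids
```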
The multimodal film source positioning service can perform feature extraction operations on the video clip to be detected, such as text information recognition and video frame extraction, then search the content of the source video library using the different vectors as query entries according to the method described above, and thereby realize accurate video clip positioning, where each video clip to be detected is positioned by its source video identification, target starting time point and target ending time point.
The system provided in the above embodiment can execute the method provided in any embodiment of the present application, and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiment, reference may be made to the video clip detection method provided in any embodiment of the present application.
The present embodiment also provides a computer-readable storage medium, in which computer-executable instructions are stored, and the computer-executable instructions are loaded by a processor and execute a video clip detection method according to the present embodiment.
The present embodiment also provides a computer program product comprising a computer program stored in a computer readable storage medium. The computer program is read from a computer readable storage medium by a processor of an electronic device, which executes the computer program, causing the electronic device to perform the methods provided in the various alternative implementations of video clip detection described above.
The present embodiment also provides an electronic device, which includes a processor and a memory, where the memory stores a computer program, and the computer program is adapted to be loaded by the processor and execute a video clip detection method according to the present embodiment.
The electronic device may be a computer terminal, a mobile terminal or a server, and may also participate in forming an apparatus or a system provided by the embodiments of the present application. As shown in fig. 16, the server 16 may include one or more processors 1602 (shown in the figure as 1602a, 1602b, ..., 1602n) (the processor 1602 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 1604 for storing data, and a transmission device 1606 for communication functions. In addition, the server 16 may further include: an input/output interface, a network interface, a power source, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 16 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the server 16 may also include more or fewer components than shown in fig. 16, or have a different configuration than shown in fig. 16.
It should be noted that the one or more processors 1602 and/or other data processing circuits described above may generally be referred to herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any other combination. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the server 16.
The memory 1604 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the methods in the embodiments of the present application; the processor 1602 executes the software programs and modules stored in the memory 1604 to perform various functional applications and data processing, i.e., to implement the video clip detection method described above. The memory 1604 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1604 may further include memory remotely located relative to the processor 1602, which may be connected to the server 16 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 1606 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the server 16. In one example, the transmission means 1606 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 1606 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
The present specification presents the method operation steps by way of examples or flowcharts, but more or fewer operation steps may be included based on conventional practice or without inventive effort. The steps and sequences recited in the embodiments are merely one way of executing the steps and do not represent the only execution order. In an actual system or product, the methods shown in the embodiments or figures may be executed sequentially or in parallel (for example, in a parallel processor or multi-threaded environment).
The structures shown in this embodiment are only partial structures related to the present application and do not constitute a limitation on the apparatus to which the present application is applied; a specific apparatus may include more or fewer components than those shown, combine some components, or have a different arrangement of components. It should be understood that the methods, apparatuses, etc. disclosed in the embodiments may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of modules is merely a division by logic function, and there may be other divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection via some interfaces, devices or unit modules.
Based on such understanding, the technical solution of the present application, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (17)

1. A method for detecting video clips, the method comprising:
carrying out multidimensional feature extraction on the video clips to be detected to obtain a plurality of clip feature information;
Determining at least one first source video clip from the source video collection that matches each clip feature information based on the plurality of clip feature information;
Performing matching verification on source video identifications of a plurality of first source video clips corresponding to the plurality of clip feature information to obtain a target source video identification, wherein the plurality of clip feature information comprises clip text feature information, clip image feature information and clip audio feature information;
Determining at least one second source video segment which is matched with the first segment characteristic information and matched with the second segment characteristic information from the source video set based on first segment characteristic information and second segment characteristic information in the plurality of segment characteristic information, wherein the first segment characteristic information is characteristic information irrelevant to time information, and the second segment characteristic information is characteristic information relevant to time information;
Performing matching verification on the starting time point of the second source video segment to obtain an initial starting time point; performing matching verification on the ending time point of the second source video segment to obtain an initial ending time point; obtaining initial duration information based on a difference value between the initial starting time point and the initial ending time point;
Performing duration verification on the initial duration information to obtain a duration verification result, wherein the duration verification result comprises a first duration verification result and a second duration verification result, the first duration verification result indicates whether the initial duration information is smaller than or equal to target duration information corresponding to a target source segment in the second source video segment, and the second duration verification result indicates whether the initial duration information is larger than or equal to the duration information to be detected of the video segment to be detected;
Determining target time point information based on the duration verification result, wherein the target time point information comprises a target starting time point and a target ending time point, or the target time point information comprises a target starting time point;
And taking the target source video identification and the target time point information as source video positioning information corresponding to the video clip to be detected.
2. The method according to claim 1, wherein determining, based on the plurality of segment feature information, a first source video segment corresponding to each segment feature information from the source video set includes:
Acquiring video feature information of each source video in the source video set, wherein the video feature information is feature information corresponding to the respective feature types of the plurality of segment feature information;
And matching each piece of segment feature information with the video feature information, and determining at least one first source video segment corresponding to each piece of segment feature information.
3. The method for detecting video clips according to claim 1, wherein the performing matching verification on source video identifications of a plurality of first source video clips corresponding to the plurality of clip feature information to obtain a target source video identification includes:
acquiring the number of each source video identifier;
Weighting the number of each source video identifier based on preset weight information corresponding to each piece of segment feature information;
And determining a target source video identification from the source video identifications according to the weighted number of the source video identifications.
4. The method for detecting video clips according to claim 1, wherein the performing duration verification on the initial duration information to obtain a duration verification result includes:
Determining the target source segment from the second source video segment;
Comparing the initial time length information with the target time length information corresponding to the target source segment to obtain the first time length verification result;
comparing the initial duration information with the duration information to be detected of the video segment to be detected to obtain a second duration verification result;
the determining the target time point information based on the duration verification result includes:
And when the first time length verification result indicates that the initial time length information is smaller than or equal to the target time length information and the second time length verification result indicates that the initial time length information is larger than or equal to the time length information to be detected, determining the initial starting time point as the target starting time point and determining the initial ending time point as the target ending time point.
5. The video clip detection method of claim 1, wherein in the case where the target source video identification comprises a plurality of source video identifications, the method further comprises:
Determining a target starting time point corresponding to each target source video identifier from starting time points corresponding to the second source video fragments;
And taking the target source video identifications and the target starting time points corresponding to the target source video identifications as source video positioning information corresponding to the video clips to be detected.
6. The method for detecting video clips according to claim 1, wherein the step of performing multi-dimensional feature extraction on the video clips to be detected to obtain a plurality of clip feature information includes:
Based on a preset text feature extraction model, extracting text features of the video segment to be detected to obtain segment text feature information;
Based on a preset image feature extraction model, extracting image features of the video segment to be detected to obtain segment image feature information;
and extracting the audio characteristics of the video clip to be detected based on a preset audio characteristic extraction model to obtain clip audio characteristic information.
7. The video clip detection method of claim 1, wherein the method further comprises:
Responding to a playing instruction of a current video clip corresponding to a target object, and acquiring current source video positioning information of the current video clip;
Determining a next video clip of the current video clip based on the current source video positioning information;
And recommending the next video clip to the target object when the playing of the current video clip is finished.
8. A video clip detection apparatus, the apparatus comprising:
the segment feature extraction module is used for carrying out multidimensional feature extraction on the video segment to be detected to obtain a plurality of segment feature information, wherein the plurality of segment feature information comprises segment text feature information, segment image feature information and segment audio feature information;
A first source video segment determining module, configured to determine, from a source video set, at least one first source video segment that matches each segment feature information based on the plurality of segment feature information;
The source video identification matching verification module is configured to perform matching verification on source video identifications of a plurality of first source video clips corresponding to the clip characteristic information to obtain target source video identifications;
A second source video segment determining module, configured to determine, from the source video set, at least one second source video segment that matches the first segment feature information and matches the second segment feature information based on first segment feature information and second segment feature information in the plurality of segment feature information, where the first segment feature information is feature information unrelated to time information, and the second segment feature information is feature information related to time information;
The time point determining module comprises an initial starting time point determining unit, and is used for carrying out matching verification on the starting time point of the second source video clip to obtain an initial starting time point; the initial ending time point determining unit is used for carrying out matching verification on the ending time point of the second source video clip to obtain an initial ending time point; an initial duration information determining unit, configured to obtain initial duration information based on a difference between the initial start time point and the initial end time point;
the time length verification unit is used for performing time length verification on the initial time length information to obtain a time length verification result, wherein the time length verification result comprises a first time length verification result and a second time length verification result, the first time length verification result indicates whether the initial time length information is smaller than or equal to target time length information corresponding to a target source segment in the second source video segment, and the second time length verification result indicates whether the initial time length information is larger than or equal to the to-be-detected time length information of the to-be-detected video segment;
A target time point determining unit configured to determine target time point information based on the duration verification result, the target time point information including a target start time point and a target end time point, or the target time point information including a target start time point;
and the positioning information determining module is used for taking the target source video identification and the target time point information as source video positioning information corresponding to the video clip to be detected.
9. The video clip detection apparatus of claim 8, wherein the first source video clip determination module comprises:
The video feature information acquisition unit is used for acquiring video feature information of each source video in the source video set, wherein the video feature information is feature information corresponding to the respective feature types of the plurality of pieces of segment feature information;
And the feature matching unit is used for matching the feature information of each segment with the video feature information and determining at least one first source video segment corresponding to the feature information of each segment.
10. The video clip detection apparatus of claim 8, wherein the source video identification match verification module comprises:
A source video identification number acquisition unit, configured to acquire the number of each source video identification;
the weighting processing unit is used for carrying out weighting processing on the number of each source video identifier based on preset weight information corresponding to each piece of segment feature information;
And the target source video identification determining unit is used for determining target source video identifications from the source video identifications according to the weighted number of the source video identifications.
11. The video clip detecting apparatus according to claim 8, wherein the duration verification unit includes:
A target source segment determining unit, configured to determine the target source segment from the second source video segment;
The first comparison unit is used for comparing the initial time length information with the target time length information corresponding to the target source segment to obtain the first time length verification result;
the second comparison unit is used for comparing the initial duration information with the duration information to be detected of the video segment to be detected to obtain a second duration verification result;
The target time point determination unit includes:
The condition matching unit is configured to determine the initial start time point as the target start time point and determine the initial end time point as the target end time point when the first time length verification result indicates that the initial time length information is less than or equal to the target time length information and the second time length verification result indicates that the initial time length information is greater than or equal to the time length information to be detected.
12. The video clip detection apparatus of claim 8, wherein in the case where the target source video identification comprises a plurality of source video identifications, the apparatus further comprises:
a video editing time point determining unit, configured to determine a target starting time point corresponding to each target source video identifier from starting time points corresponding to the second source video segments;
and the video clipping positioning unit is used for taking each target source video identifier and a target starting time point corresponding to each target source video identifier as source video positioning information corresponding to the video clip to be detected.
13. The video clip detection apparatus of claim 8, wherein the clip feature extraction module comprises:
The text feature extraction unit is used for extracting text features of the video clips to be detected based on a preset text feature extraction model to obtain clip text feature information;
the image feature extraction unit is used for extracting image features of the video clips to be detected based on a preset image feature extraction model to obtain clip image feature information;
And the audio feature extraction unit is used for extracting the audio features of the video clips to be detected based on a preset audio feature extraction model to obtain the clip audio feature information.
14. The video clip detection apparatus of claim 8, wherein the apparatus further comprises:
The current positioning information acquisition module is used for responding to a playing instruction of a current video clip corresponding to a target object to acquire current source video positioning information of the current video clip;
A next video clip determining module, configured to determine a next video clip of the current video clip based on the current source video positioning information;
And the recommending module is used for recommending the next video clip to the target object when the playing of the current video clip is finished.
15. An electronic device comprising a processor and a memory, wherein the memory has stored therein at least one instruction or at least one program that is loaded and executed by the processor to implement the video clip detection method of any of claims 1-7.
16. A computer readable storage medium, wherein the storage medium has stored therein at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by a processor to implement the video clip detection method of any of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the video clip detection method of any one of claims 1-7.
CN202111275890.3A 2021-10-29 Video clip detection method, device and equipment Active CN114329063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111275890.3A CN114329063B (en) 2021-10-29 Video clip detection method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111275890.3A CN114329063B (en) 2021-10-29 Video clip detection method, device and equipment

Publications (2)

Publication Number Publication Date
CN114329063A CN114329063A (en) 2022-04-12
CN114329063B true CN114329063B (en) 2024-06-11


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7356830B1 (en) * 1999-07-09 2008-04-08 Koninklijke Philips Electronics N.V. Method and apparatus for linking a video segment to another segment or information source
CN103931199A (en) * 2011-11-14 2014-07-16 苹果公司 Generation of multi-view media clips
CN106663099A (en) * 2014-04-10 2017-05-10 谷歌公司 Methods, systems, and media for searching for video content
CN107729381A (en) * 2017-09-15 2018-02-23 广州嘉影软件有限公司 Interactive multimedia resource polymerization method and system based on multidimensional characteristic identification
CN108447501A (en) * 2018-03-27 2018-08-24 中南大学 Pirate video detection method and system based on audio word under a kind of cloud storage environment
CN110675432A (en) * 2019-10-11 2020-01-10 智慧视通(杭州)科技发展有限公司 Multi-dimensional feature fusion-based video multi-target tracking method
CN112528071A (en) * 2020-10-30 2021-03-19 百果园技术(新加坡)有限公司 Video data sorting method and device, computer equipment and storage medium
CN113010703A (en) * 2021-03-22 2021-06-22 腾讯科技(深圳)有限公司 Information recommendation method and device, electronic equipment and storage medium
CN113254684A (en) * 2021-06-18 2021-08-13 腾讯科技(深圳)有限公司 Content aging determination method, related device, equipment and storage medium
CN113395594A (en) * 2021-01-20 2021-09-14 腾讯科技(深圳)有限公司 Video processing method, device, equipment and medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Key-frame selection for video summarization: an approach of multidimensional time series analysis;Zhen Gao et al.;《Multidimensional Systems and Signal Processing》;20170726;1-21 *
Research on Key Technologies of Content-Based Video Retrieval; Lei Shaoshuai; China Doctoral Dissertations Full-text Database, Information Science and Technology (《中国博士学位论文全文数据库 信息科技辑》); 20120915; I138-58 *
Video Segment Detection Fusing Audio-Visual Contextual Temporal Features; Yue Zhanfeng; China Media Technology (《中国传媒科技》); 20161031; 25-28 *

Similar Documents

Publication Publication Date Title
CN112565825B (en) Video data processing method, device, equipment and medium
US10719551B2 (en) Song determining method and device and storage medium
CN110582025B (en) Method and apparatus for processing video
EP3855753B1 (en) Method and apparatus for locating video playing node, device and storage medium
CN107832434B (en) Method and device for generating multimedia play list based on voice interaction
WO2017096877A1 (en) Recommendation method and device
US20110258211A1 (en) System and method for synchronous matching of media samples with broadcast media streams
CN109121022B (en) Method and apparatus for marking video segments
WO2019134587A1 (en) Method and device for video data processing, electronic device, and storage medium
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
CN111368141B (en) Video tag expansion method, device, computer equipment and storage medium
CN104598502A (en) Method, device and system for obtaining background music information in played video
CN110430476A (en) Direct broadcasting room searching method, system, computer equipment and storage medium
CN112822563A (en) Method, device, electronic equipment and computer readable medium for generating video
CN111008321A (en) Recommendation method and device based on logistic regression, computing equipment and readable storage medium
CN109408672B (en) Article generation method, article generation device, server and storage medium
CN111314732A (en) Method for determining video label, server and storage medium
CN113596579B (en) Video generation method, device, medium and electronic equipment
CN109710801A (en) A kind of video searching method, terminal device and computer storage medium
CN109063200B (en) Resource searching method and device, electronic equipment and computer readable medium
CN114845149B (en) Video clip method, video recommendation method, device, equipment and medium
CN116665083A (en) Video classification method and device, electronic equipment and storage medium
CN113992944A (en) Video cataloging method, device, equipment, system and medium
KR20200024541A (en) Providing Method of video contents searching and service device thereof
CN103023923A (en) Information transmission method and information transmission device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant