CN112069952A - Video clip extraction method, video clip extraction device, and storage medium - Google Patents

Video clip extraction method, video clip extraction device, and storage medium

Info

Publication number
CN112069952A
CN112069952A (application CN202010866476.9A)
Authority
CN
China
Prior art keywords
video
segment
clips
positive
negative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010866476.9A
Other languages
Chinese (zh)
Inventor
胡佳高
王飞
余鹏飞
周代国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202010866476.9A priority Critical patent/CN112069952A/en
Publication of CN112069952A publication Critical patent/CN112069952A/en
Priority to US17/240,202 priority patent/US11900682B2/en
Priority to JP2021074892A priority patent/JP7292325B2/en
Priority to EP21171229.4A priority patent/EP3961490A1/en
Priority to KR1020210056620A priority patent/KR102456272B1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments

Abstract

The present disclosure relates to a video clip extraction method, a video clip extraction apparatus, and a storage medium. The video clip extraction method includes: acquiring a video and sampling N video frames from the video, where N is a positive integer; inputting the N video frames into a pre-trained frame feature extraction model to obtain a feature vector for each of the N video frames; determining scores for the N video frames based on a pre-trained scoring model; and extracting a target video segment from the video based on the scores of the N video frames. With this method, sampling can be performed on the video frames already acquired during online video loading, which reduces the computation required by the scoring model and speeds up video clip extraction, so that the target video clip the user needs can be extracted quickly once the video has been fully acquired, improving the user experience.

Description

Video clip extraction method, video clip extraction device, and storage medium
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video segment extraction method, a video segment extraction apparatus, and a storage medium.
Background
Video segment extraction targets one or more relatively short segments within a video. For example, a highlight segment may be extracted, i.e., one or more segments whose content is more exciting than that of the other segments in the video.
In the related art, extracting segments from a video requires the whole video to be acquired first; the video is then divided into multiple segments according to its content, each segment is scored, and extraction is performed based on those segment scores. With this approach, computing the segment scores requires a large amount of calculation, so extraction is time-consuming and degrades the user experience.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a video clip extracting method, a video clip extracting apparatus, and a storage medium.
According to a first aspect of the embodiments of the present disclosure, a video segment extraction method is provided, including: acquiring a video and sampling N video frames from the video, where N is a positive integer; inputting the N video frames into a pre-trained frame feature extraction model to obtain a feature vector for each of the N video frames; determining scores for the N video frames based on a pre-trained scoring model, where for the i-th frame of the N video frames, the feature vectors of K video frames centered on the i-th frame are input into the pre-trained scoring model to obtain the score of the i-th frame, i being a positive integer less than or equal to N and K being a positive integer; and extracting a target video segment from the video based on the scores of the N video frames.
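As a rough illustration of the first aspect, the following is a minimal sketch of the inference pipeline, assuming the frame feature extractor and scoring model are already available as callables; the function and parameter names are placeholders and not part of the disclosure.

```python
import numpy as np

def extract_target_segment(frames, feature_model, scoring_model, K=7, window=5):
    """Score sampled frames and return the highest-scoring window of frames.

    frames:        list of N sampled video frames (e.g. HxWx3 arrays)
    feature_model: callable mapping a frame to a feature vector
    scoring_model: callable mapping K stacked feature vectors to a scalar score
    K:             number of frames (centered on frame i) fused per score
    window:        length, in sampled frames, of the extracted segment
    """
    feats = [feature_model(f) for f in frames]          # one vector per frame
    N, half = len(feats), K // 2
    scores = []
    for i in range(N):
        # Clamp indices at the boundaries so every frame gets a K-frame context.
        idx = [min(max(j, 0), N - 1) for j in range(i - half, i - half + K)]
        scores.append(scoring_model(np.stack([feats[j] for j in idx])))
    scores = np.asarray(scores)
    # Average frame scores inside each sliding window; pick the best window.
    win_scores = np.convolve(scores, np.ones(window) / window, mode="valid")
    start = int(win_scores.argmax())
    return frames[start:start + window]
```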
In one embodiment, the scoring model is obtained by training based on a multi-frame fusion layer and a data pair consisting of a positive segment and a negative segment; the data composed of the positive segments and the negative segments are obtained based on a sample video segment marked with target attributes, the target attributes comprise attributes representing that the video segment is a target video segment or a non-target video segment, and the multi-frame fusion layer is used for fusing the feature vectors of the K video frames into fixed-length vectors.
In another embodiment, training the scoring model based on the multi-frame fusion layer and the data pairs consisting of positive and negative segments includes: sampling K video frames in a positive segment and extracting their feature vectors with the frame feature extraction model; sampling K video frames in a negative segment and extracting their feature vectors with the frame feature extraction model; fusing the feature vectors of the K video frames sampled in the positive segment into a fixed-length positive segment feature vector based on the multi-frame fusion layer, and fusing the feature vectors of the K video frames sampled in the negative segment into a fixed-length negative segment feature vector based on the multi-frame fusion layer; and inputting the positive segment feature vector and the negative segment feature vector into a twin (Siamese) neural network to obtain the score of the positive segment and the score of the negative segment, and performing back propagation with a ranking loss to obtain a trained twin neural network; where the twin neural network comprises two multilayer perceptron models sharing parameters, and the scoring model is a multilayer perceptron model of the trained twin neural network.
In yet another embodiment, the data pairs of positive and negative segments are obtained based on the sample video segments labeled with the target attribute as follows: a sample video is obtained that includes one or more sample video clips. And obtaining a data pair consisting of a positive segment and a negative segment based on the target attribute marked by the one or more sample video segments and the non-sample video segments included in the sample video, wherein the probability of the positive segment becoming the target video segment is greater than the probability of the negative segment becoming the target video segment.
In another embodiment, obtaining a data pair consisting of a positive segment and a negative segment based on the target attribute labeled by the one or more sample video segments and the non-sample video segments included in the sample video comprises: if the target attributes marked by the one or more sample video clips comprise attributes representing that the video clips are target video clips, the one or more sample video clips are used as positive clips, partial video clips are extracted from non-sample video clips included in the sample video and used as negative clips, and one or more data pairs are obtained from the positive clips and the negative clips. Or if the target attributes marked by the one or more sample video clips comprise attributes for representing the video clips as non-target video clips, taking the one or more sample video clips as negative clips, extracting partial video clips from the non-sample video clips included in the sample video as positive clips, and obtaining one or more data pairs from the positive clips and the negative clips. Or if the target attributes marked by the one or more sample video clips comprise attributes for representing the video clips as target video clips and attributes for representing the video clips as non-target video clips, taking the sample video clips marked with the attributes for representing the target video clips as positive clips, taking the sample video clips marked with the attributes for representing the non-target video clips as negative clips, extracting partial video clips from the non-sample video clips included in the sample video, obtaining data pairs from the positive clips and the negative clips, obtaining data pairs from the positive clips and the partial video clips, and obtaining data pairs from the negative clips and the partial video clips.
In another embodiment, extracting a target video segment from the video based on the scores of the N video frames includes: obtaining a plurality of video segments by sliding a fixed-length sliding window over the video in time order, each sliding window position corresponding to one video segment; for each sliding window, determining the average score of the video frames included in the window and taking that average score as the score of the corresponding video segment; and extracting one or more target video segments from the plurality of video segments based on the scores of the plurality of video segments.
According to a second aspect of the embodiments of the present disclosure, there is provided a video segment extracting apparatus including: the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a video and sampling the video to obtain N video frames, and N is a positive integer. And the feature extraction unit is used for inputting the N video frames into a pre-trained frame feature extraction model to obtain a feature vector of each video frame in the N video frames. The determining unit is used for determining the scores of the N video frames based on a pre-trained scoring model, wherein for the ith frame in the N video frames, the feature vectors of K video frames taking the ith frame as the center are input into the pre-trained scoring model to obtain the score of the ith frame, i is a positive integer less than or equal to N, and K is a positive integer. And the extracting unit is used for extracting a target video clip in the video based on the scores of the N video frames.
In an embodiment, the video segment extracting apparatus further includes a training unit; the training unit is used for training to obtain the scoring model based on a multi-frame fusion layer and a data pair consisting of a positive segment and a negative segment, the data pair consisting of the positive segment and the negative segment is obtained based on a sample video segment marked with a target attribute, the target attribute comprises an attribute representing that the video segment is a target video segment or a non-target video segment, and the multi-frame fusion layer is used for fusing the feature vectors of K video frames into fixed-length vectors.
In another embodiment, the training unit trains the scoring model based on the multi-frame fusion layer and the data pairs consisting of positive and negative segments as follows: sampling K video frames in a positive segment and extracting their feature vectors with the frame feature extraction model; sampling K video frames in a negative segment and extracting their feature vectors with the frame feature extraction model; fusing the feature vectors of the K video frames sampled in the positive segment into a fixed-length positive segment feature vector based on the multi-frame fusion layer, and fusing the feature vectors of the K video frames sampled in the negative segment into a fixed-length negative segment feature vector based on the multi-frame fusion layer; and inputting the positive segment feature vector and the negative segment feature vector into a twin (Siamese) neural network to obtain the score of the positive segment and the score of the negative segment, and performing back propagation with a ranking loss to obtain a trained twin neural network; where the twin neural network comprises two multilayer perceptron models sharing parameters, and the scoring model is a multilayer perceptron model of the trained twin neural network.
In yet another embodiment, the data pairs of positive and negative segments are obtained based on the sample video segments labeled with the target attribute as follows: a sample video is obtained that includes one or more sample video clips. And obtaining a data pair consisting of a positive segment and a negative segment based on the target attribute marked by the one or more sample video segments and the non-sample video segments included in the sample video, wherein the probability of the positive segment becoming the target video segment is greater than the probability of the negative segment becoming the target video segment.
In another embodiment, the data pair consisting of the positive segment and the negative segment is obtained based on the target attribute marked by the one or more sample video segments and the non-sample video segments included in the sample video in the following manner: if the target attributes marked by the one or more sample video clips comprise attributes representing that the video clips are target video clips, taking the one or more sample video clips as positive clips, extracting partial video clips from non-sample video clips included in the sample video as negative clips, and obtaining one or more data pairs from the positive clips and the negative clips; or if the target attributes marked by the one or more sample video clips comprise attributes for representing the video clips as non-target video clips, taking the one or more sample video clips as negative clips, extracting partial video clips from the non-sample video clips included in the sample video as positive clips, and obtaining one or more data pairs from the positive clips and the negative clips. Or if the target attributes marked by the one or more sample video clips comprise attributes for representing the video clips as target video clips and attributes for representing the video clips as non-target video clips, taking the sample video clips marked with the attributes for representing the target video clips as positive clips, taking the sample video clips marked with the attributes for representing the non-target video clips as negative clips, extracting partial video clips from the non-sample video clips included in the sample video, obtaining data pairs from the positive clips and the negative clips, obtaining data pairs from the positive clips and the partial video clips, and obtaining data pairs from the negative clips and the partial video clips.
In yet another embodiment, the extracting unit extracts a target video segment from the video based on the scores of the N video frames as follows: a plurality of video segments are obtained by sliding a fixed-length sliding window over the video in time order, each sliding window position corresponding to one video segment; for each sliding window, the average score of the video frames included in the window is determined and taken as the score of the corresponding video segment; and one or more target video segments are extracted from the plurality of video segments based on the scores of the plurality of video segments.
According to a third aspect of the embodiments of the present disclosure, there is provided a video segment extracting apparatus including: a memory to store instructions; and the processor is used for calling the instructions stored in the memory to execute any one of the video segment extraction methods.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium having stored therein instructions that, when executed by a processor, perform the video segment extraction method of any one of the above.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects: with the video clip extraction method provided by the present disclosure, sampling can be performed on the video frames already acquired during online video loading, which reduces the computation required by the scoring model and speeds up video clip extraction. Because the scores reflect the highlight degree of the video frames, the highlight degree of different parts of the video can be compared while frames are still being acquired, so the target video clip the user needs can be extracted quickly once the video has been fully acquired, improving the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow diagram illustrating a video segment extraction method according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a scoring model training method in accordance with an exemplary embodiment.
FIG. 3 is a schematic diagram illustrating another scoring model training method in accordance with an exemplary embodiment.
FIG. 4 is a flowchart illustrating a method of determining data pairs in accordance with an exemplary embodiment.
FIG. 5 is a sample video annotation schematic diagram in accordance with an exemplary embodiment.
FIG. 6 is a schematic diagram illustrating another sample video annotation, according to an example embodiment.
FIG. 7 is a schematic diagram illustrating yet another sample video annotation in accordance with an illustrative embodiment.
Fig. 8 is a block diagram illustrating a video segment extraction apparatus according to an exemplary embodiment.
Fig. 9 is a block diagram illustrating another video segment extraction apparatus according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The video clip extraction method provided by the embodiments of the present disclosure is applied to scenarios where a target video clip, for example a highlight clip, needs to be extracted. For instance: generating a highlight clip of a user's video in a mobile phone album and showing it to the user as a preview; in a short-video application, extracting a highlight clip of a short video to generate a GIF image shown to the user as the video cover; or, while a video watched online is playing, computing the highlight degree of each moment in the background and popping up the highlight clip as soon as the user finishes watching, so the user can review the highlights.
In the related art, extracting segments from a video requires the whole video to be available offline. The video is divided into multiple segments according to its content, video features are extracted for each segment, and the highlight score of each segment is computed to determine which segments to extract. With this approach, extraction can only start after the video has been fully acquired, and computing segment-level features segment by segment requires a large amount of calculation, so the process is slow, the desired segments cannot be obtained quickly, and the user experience suffers.
In view of this, the embodiments of the present disclosure provide a video segment extraction method in which the models used are trained on frame feature vectors, which helps reduce the computation of the scoring model and improve the scoring speed. Because the model is lightweight, it can also be deployed on terminals such as mobile phones, tablets, and computers, so users can use it at any time.
In the video clip extraction method provided by the present disclosure, the feature vectors of the sampled video frames are input into the scoring model to obtain a score for each video frame, and the target video clip is extracted based on these frame scores. The scoring model is trained on frame feature vectors, has a simple structure and a small computation cost, and can output the scores of all video frames quickly, so that the target video clip can be extracted in a short time, shortening the extraction process and improving the user experience.
Fig. 1 is a flowchart illustrating a video segment extracting method according to an exemplary embodiment, and as shown in fig. 1, the video segment extracting method includes the following steps S11 to S14.
In step S11, a video is acquired and N video frames are sampled in the video.
The number of sampled video frames may differ for videos of different durations. In the embodiments of the present disclosure, video frames may be sampled in a variety of ways.
In one embodiment, video frames may be sampled at equal intervals using a preset time step. In this case the total duration of the video does not need to be considered; frames are simply sampled at the given interval, which simplifies the sampling procedure and speeds up frame sampling. For example, with a preset time step of one frame every 2 seconds, frames are sampled at the 2nd, 4th, 6th, 8th second of the video, and so on until the video ends; a 10-second video yields 5 frames. In one example, when the video has not been fully loaded, the frames already loaded can be sampled, so the score of each moment can be computed while the video is still loading, delayed by only a few frames, without waiting for the whole video. This enables near-real-time online computation, shortens the extraction of the target video clip, and helps improve the user experience.
In another embodiment, video frames may be sampled by presetting a specified number of frames. This saves feature extraction time when computing frame feature vectors, eases the computation of the feature extraction model, and speeds up target segment extraction. In one example, the video can be sampled uniformly according to the specified number of frames, so that the content at different moments can still be distinguished and the target segment can be extracted quickly based on the frame scores. For example, if 5 frames are required, a 10-second video is sampled every two seconds and a 15-second video every three seconds.
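The two sampling strategies above can be illustrated with a short sketch. This is an assumed implementation using OpenCV frame reading; the function names and defaults are illustrative only.

```python
import cv2

def sample_by_time_step(video_path, step_seconds=2.0):
    """Sample one frame every `step_seconds`, regardless of total duration."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % int(round(fps * step_seconds)) == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def sample_fixed_count(video_path, n_frames=5):
    """Sample `n_frames` frames spread uniformly over the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    picks = {round(i * (total - 1) / max(n_frames - 1, 1)) for i in range(n_frames)}
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in picks:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```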
In step S12, the N video frames are input to a frame feature extraction model trained in advance, and a feature vector of each of the N video frames is obtained.
In the embodiments of the present disclosure, the N sampled video frames are input one by one into the trained frame feature extraction model to obtain the feature vector corresponding to each frame, so that the scoring model can score each frame based on its feature vector. This makes it easy to evaluate the score of the frame at each moment in the video and to extract the target video segment the user needs. The frame feature extraction model may be a standard convolutional neural network (CNN) or an online video understanding model such as the Temporal Shift Module for efficient video understanding (TSM). When a trained CNN is used to extract frame features, the output of the layer before the network's classification layer may be used as the frame feature vector of the input frame. When a trained online TSM is used, the output of the last layer of the backbone network may be taken as the feature vector. The present disclosure does not limit this.
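As one possible realization of the frame feature extraction model (the disclosure does not fix a specific backbone), the sketch below uses a torchvision ResNet-50 with its classification layer removed, so the pooled output of the preceding layer serves as the frame feature vector; the choice of ResNet-50 and the preprocessing constants are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Pretrained CNN with the final classification layer removed: the output of the
# layer before the classifier (global average pooling) is used as the feature.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_feature(frame_bgr):
    """Map one HxWx3 uint8 video frame to a 2048-dim feature vector."""
    rgb = frame_bgr[:, :, ::-1].copy()                  # BGR (OpenCV) -> RGB
    x = preprocess(rgb).unsqueeze(0)                    # 1x3x224x224
    return feature_extractor(x).flatten(1).squeeze(0)   # shape (2048,)
```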
In step S13, scores for the N video frames are determined based on the pre-trained scoring model.
In the embodiments of the present disclosure, the frame feature vectors of the N video frames are input into the trained scoring model to obtain a score for each frame. The scoring model may score each frame based on how exciting its image content is. For different frames of the same video, a higher score output by the scoring model indicates more exciting content, and the relative sizes of the scores characterize the relative difference in highlight degree between the contents of the frames. The scores of the video frames therefore make it possible to intuitively compare how exciting their contents are, and in turn to quickly extract the target video clip the user needs.
In the present disclosure, the score each video frame receives from the trained scoring model is based on the fused feature vector of several video frames centered on that frame's moment in the video. When the score of the i-th of the N frames is computed, several frames before and after the i-th frame's position in the video are taken to form K frames, and the score output for the fused feature vectors of these K frames is used as the score of the i-th frame, where i is a positive integer less than or equal to N and K is a positive integer. Combining the frame feature vectors of the surrounding frames makes the score of the current frame more accurate, so that when the target segment is extracted from the scores its content matches the content the scores describe, avoiding missed or mistaken extractions. For example, if the content of the current frame belongs to an ordinary part of the video but highlight segments lie just before and after it, the frame is only a brief transition; computing its score from the frames extracted before and after it helps avoid missing the target segment. In one example, to obtain more accurate scores, the same number of frames may be taken before and after the i-th frame, e.g., K/2 frames before the i-th frame's moment and K/2 - 1 frames after it, sampled uniformly, so that the resulting score better reflects the segment in which the current frame lies and abnormal data are suppressed. In another example, when the i-th frame is the first frame of the video, the feature vectors of the K/2 frames that would precede it may default to zero or be set equal to the feature vectors of the frames taken after it, which helps score the video smoothly frame by frame.
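A small sketch of how a centered K-frame window might be assembled for the i-th frame, including the boundary handling described above (zero padding or repeating the nearest available frame); this is an assumed helper, not part of the claims.

```python
import numpy as np

def centered_window(features, i, K, pad="zero"):
    """Collect the feature vectors of K frames centered on frame i.

    features: array of shape (N, D), one feature vector per sampled frame
    i:        index of the frame being scored
    K:        window size (K/2 frames before i, K/2 - 1 frames after i)
    pad:      'zero' pads missing frames with zeros, 'edge' repeats the
              nearest available frame
    """
    N, D = features.shape
    half = K // 2
    window = []
    for j in range(i - half, i - half + K):
        if 0 <= j < N:
            window.append(features[j])
        elif pad == "zero":
            window.append(np.zeros(D, dtype=features.dtype))
        else:  # 'edge'
            window.append(features[min(max(j, 0), N - 1)])
    return np.stack(window)   # shape (K, D), ready for the multi-frame fusion layer
```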
In step S14, a target video segment is extracted in the video based on the scores of the N video frames.
In the embodiment of the disclosure, according to the requirement of the user, the target video segment required by the user is extracted based on the obtained scores of the video frames.
In the embodiments of the present disclosure, a video contains both the target video segment to be extracted and non-target video segments. The target video segment has the attributes of a target video segment, and the non-target segments have the attributes of non-target segments. For example, when the target video segment is a highlight segment, the video has both a most exciting segment and a least exciting segment. Different users have different needs and therefore require different target segments. To quickly extract the target segment the user needs from the acquired video, the video can be sampled to obtain N video frames, where N is a positive integer; the score of each frame is obtained quickly through the scoring model, and the segment to be extracted is determined by evaluating those frame scores.
In general, a target video segment may be one or more segments of a video that have the target-segment attribute. For example, a highlight segment is one or more relatively short segments whose content is more exciting and attractive than that of the other segments. Taking a basketball game video as an example, segments such as shots and game-winning plays are the highlight segments, ordinary dribbling segments are non-highlight segments, and black screens or animations during shot transitions are the least exciting segments.
In the embodiments of the present disclosure, the target video segment is taken to be a highlight segment as an example. Highlight extraction is performed on the sampled frames based on their scores, and because each frame's score is evaluated together with the frames around it, the score represents the average highlight degree of the short segment before and after the frame. Determining the highlight degree of the extracted segment from frame scores involves far less computation than scoring whole segments, which helps evaluate quickly and provide a suitable target segment. For example, if the user wants the most exciting segment of the video, the segment containing the highest-scoring frame can be taken as the target segment. In one example, if the user wants several highlight segments, the frames can be sorted by score and the segments containing the highest-scoring frames taken as target segments.
With the above embodiments, scoring sampled video frames instead of whole video segments with the trained scoring model effectively reduces the computation of the scoring model, speeds up score calculation, and accelerates target segment extraction, thereby helping to improve the user experience.
In one embodiment, the target segment may be extracted by sliding a fixed-length window over the video in time order, the range covered by each window position being one video segment; that is, each slide of the window produces one video segment. For each window, the average of the scores of the frames it contains is taken as the score of the corresponding segment, and one or more target segments are then extracted from the candidate segments based on those scores according to the user's needs. Taking highlight extraction as an example, a fixed-length window slides over the video in time order and the average highlight score of the frames inside the window is computed as the highlight score of the corresponding segment; the segment corresponding to the highest-scoring window is the highlight of the video. When several highlight segments are needed, the segments with relatively high window scores can be extracted. To avoid heavy overlap between the extracted segments, a non-maximum suppression (NMS) algorithm can be used to discard heavily overlapping windows before extraction, so that the extracted segments are spread apart, which helps improve the viewing experience when the user watches them.
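The sliding-window scoring and the overlap removal can be sketched as follows; the window length, stride, and IoU threshold are assumed values, and the NMS shown is the standard greedy one-dimensional variant rather than a specific implementation from the disclosure.

```python
import numpy as np

def window_scores(frame_scores, win_len=5, stride=1):
    """Average frame scores inside each sliding-window position."""
    segs = []
    for start in range(0, len(frame_scores) - win_len + 1, stride):
        score = float(np.mean(frame_scores[start:start + win_len]))
        segs.append((start, start + win_len, score))
    return segs  # list of (start, end, score) in sampled-frame indices

def nms_1d(segments, iou_thresh=0.5, top_k=3):
    """Greedy non-maximum suppression over 1-D (temporal) segments."""
    kept = []
    for s in sorted(segments, key=lambda t: t[2], reverse=True):
        ok = True
        for k in kept:
            inter = max(0, min(s[1], k[1]) - max(s[0], k[0]))
            union = (s[1] - s[0]) + (k[1] - k[0]) - inter
            if inter / union > iou_thresh:
                ok = False
                break
        if ok:
            kept.append(s)
        if len(kept) == top_k:
            break
    return kept
```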
In the embodiment of the disclosure, the scoring model may be obtained by pre-training based on a multi-frame fusion layer and a data pair composed of a positive segment and a negative segment.
Fig. 2 is a flowchart illustrating a scoring model training method according to an exemplary embodiment, and as shown in fig. 2, the scoring model training method includes the following steps S21 to S24.
In step S21, a video is acquired and N video frames are sampled in the video.
In step S22, the N video frames are input to a frame feature extraction model trained in advance, and a feature vector of each of the N video frames is obtained.
In step S23, a multi-frame fusion layer that fuses the feature vectors of K video frames into fixed-length vectors is determined.
In the embodiments of the present disclosure, the target video segment is taken to be a highlight segment as an example. The score a video frame obtains from the scoring model should correspond to the highlight degree of the content of the segment in which the frame lies, and when the feature vector of the i-th frame is used, the feature vectors of the K - 1 surrounding frames centered on the i-th frame are used at the same time to make the output score more reliable. Therefore, before the scoring model is trained, a multi-frame fusion layer that fuses the feature vectors of K video frames into a fixed-length vector must be determined, so that the fixed-length vector output by the fusion layer can be fed into the scoring model. For example, if one video frame corresponds to one N-dimensional vector, acquiring 7 frames at once yields 7 N-dimensional vectors; to keep the scoring model's input well defined and its scores reliable, the 7 N-dimensional vectors are fused by the multi-frame fusion layer into a single M-dimensional vector suitable as the scoring model's input. The multi-frame fusion layer may fuse the vectors into a fixed-length vector by concatenation, pooling, or element-wise addition.
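A minimal sketch of the three fusion options mentioned above (concatenation, pooling, element-wise addition), assuming the K per-frame feature vectors are stacked into a (K, D) tensor; the module name is illustrative.

```python
import torch
import torch.nn as nn

class MultiFrameFusion(nn.Module):
    """Fuse K per-frame feature vectors of shape (K, D) into one fixed-length vector."""

    def __init__(self, mode="concat"):
        super().__init__()
        assert mode in ("concat", "pool", "add")
        self.mode = mode

    def forward(self, feats):               # feats: (K, D)
        if self.mode == "concat":
            return feats.flatten()          # length K*D
        if self.mode == "pool":
            return feats.max(dim=0).values  # length D (max pooling over frames)
        return feats.sum(dim=0)             # length D (element-wise addition)

# Example: fuse 7 frames of 2048-dim features into a 7*2048-dim vector.
fusion = MultiFrameFusion("concat")
fixed_vec = fusion(torch.randn(7, 2048))
```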
In step S24, the scoring model is trained based on the multi-frame fusion layer and the data pairs consisting of positive and negative segments.
In the embodiments of the present disclosure, the scoring model is trained based on the determined multi-frame fusion layer and the data pairs of positive and negative segments. When the data pairs are constructed, the annotated video segments are labeled according to their content, which determines whether each annotated segment is a positive segment or a negative segment. The target attribute includes an attribute characterizing the video segment as a target video segment or a non-target video segment. In one example, the target video segment may be the most exciting segment of the video and the non-target video segment the least exciting one. The difference between positive and negative segments is therefore clear-cut when the data pairs are built, so that during training the scoring model can quickly learn the characteristics of segments with different attributes. Moreover, labeling the annotated segments by attribute improves the accuracy of the training data, reduces the interference of noisy data on model training, keeps the training data clean, and allows a simpler model structure, with no additional network needed to estimate the reliability of the annotated segments. This helps the training of the scoring model converge faster and saves cost.
In one implementation scenario, the scoring model is trained as follows. K video frames are sampled, randomly or at equal intervals, from a positive segment and, as shown in fig. 3, their frame feature vectors are extracted with the frame feature extraction model to obtain the frame feature vectors of the frames in the positive segment. The K frame feature vectors are then fused by the multi-frame fusion layer into a fixed-length positive segment feature vector, denoted {P1, P2, ..., PK}. Likewise, K video frames are sampled, randomly or at equal intervals, from each negative segment, their frame feature vectors are extracted with the frame feature extraction model, and the K vectors are fused by the multi-frame fusion layer into a fixed-length negative segment feature vector, denoted {N1, N2, ..., NK}.
The scoring model may be derived from a multilayer perceptron model: the twin neural network is obtained by duplicating the multilayer perceptron and sharing its parameters, and the trained scoring model is obtained by training this twin neural network. The positive and negative segment feature vectors are fed into the twin neural network at the same time to obtain the score of the positive segment and the score of the negative segment, the loss is computed from the two scores, and the twin network is trained by a back-propagation algorithm. During training, the positive segment feature vector {P1, P2, ..., PK} and the negative segment feature vector {N1, N2, ..., NK} of each data pair are input into the twin neural network to obtain the positive segment score S(P) and the negative segment score S(N); the score output for the positive segment should be higher than that for the negative segment. The score outputs are then back-propagated through the ranking loss function to adjust the parameters and weights of the twin network, improving the accuracy of the scoring model and speeding up convergence during training. The ranking loss can be written as: L({P1, P2, ..., PK}, {N1, N2, ..., NK}) = max(0, 1 - S(P) + S(N)).
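The following is a minimal training-step sketch under the description above, assuming the fused segment feature vectors are precomputed; the margin of 1 follows the formula above, while the layer sizes, optimizer, and learning rate are assumptions.

```python
import torch
import torch.nn as nn

class ScoringMLP(nn.Module):
    """Multilayer perceptron mapping a fused segment vector to a scalar score."""

    def __init__(self, in_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Twin (Siamese) setup: the same MLP (shared parameters) scores both branches.
in_dim = 7 * 2048                      # e.g. K=7 frames, 2048-dim features, concat fusion
model = ScoringMLP(in_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(pos_vec, neg_vec):
    """One update on (positive, negative) pairs with L = max(0, 1 - S(P) + S(N))."""
    s_p = model(pos_vec)               # score of the positive segments
    s_n = model(neg_vec)               # score of the negative segments
    loss = torch.clamp(1.0 - s_p + s_n, min=0.0).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with a batch of fused segment vectors of shape (batch_size, in_dim):
pos = torch.randn(8, in_dim)
neg = torch.randn(8, in_dim)
print(train_step(pos, neg))
```

At inference time only one branch of the twin network is kept, i.e., the shared multilayer perceptron itself serves as the scoring model.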
The training process of the scoring model will be exemplified below with reference to practical applications.
Before the scoring model is trained, data pairs suitable for training are obtained in advance, so that during training the scoring model can distinguish the difference in highlight degree between different video contents, improving scoring accuracy. The data pairs used for training are obtained from the annotated segments in the sample videos that have been labeled with target attributes, which helps avoid mixing in noisy data and improves the quality of the training data, thereby reducing the training difficulty and speeding up training.
Fig. 4 is a flowchart illustrating a method of determining data pairs, as shown in fig. 4, according to an exemplary embodiment, including the following steps S31 through S32.
In step S31, a sample video including one or more annotated video segments is obtained.
In an embodiment, before training the scoring model, a certain amount of sample videos are obtained in advance to obtain a sample video set, so that the scoring model can have enough training data to be trained.
In step S32, a data pair consisting of a positive segment and a negative segment is obtained based on the target attribute tagged by one or more tagged video segments and the non-tagged video segments included in the sample video.
In the sample video set, each sample video contains one or more annotated video segments and non-annotated video segments. The annotated segments of each sample video are labeled according to their content, which determines the target attribute of each annotated segment, and data pairs consisting of a positive segment and a negative segment are obtained from the annotated and non-annotated segments based on the labeled target attributes. The probability that a positive segment is the target video segment is greater than the probability that a negative segment is. Based on the difference between positive and negative segments, the scoring model can accurately distinguish the features of target video segments from those of non-annotated segments, which improves its accuracy.
In one example, to help the scoring model better distinguish the difference in highlight degree between different segments of the same video, the positive segment and the negative segment of a data pair can come from the same sample video. The relative score difference between the segments of one video can then be obtained, so the highlight degree of the segments within the same video can be compared and the sample videos are used to the fullest. For example, in a dunk highlight compilation every dunk segment is a highlight segment; a scoring model trained on data pairs whose positive and negative segments come from the same sample video can still rank the scores of the individual dunk segments and pick out the relatively more exciting ones, which makes extracting the target video easier.
The following description of the embodiments of the present disclosure takes a target video segment as a highlight video segment as an example.
When annotating a segment whose target attribute is the target-video-segment attribute, the most exciting segment of the sample video is selected according to its content, i.e., a segment whose content is more exciting and attractive than that of the other moments of the same sample video; its start and end times are marked to obtain an annotated segment with the target-video-segment attribute. When annotating a segment whose target attribute is the non-target-video-segment attribute, the least exciting segment of the sample video is selected, i.e., a segment whose content is less exciting and attractive than that of the other moments of the same sample video; its start and end times are marked to obtain an annotated segment with the non-target-video-segment attribute.
In one example, the sample video may contain one or more annotated segments whose target attribute is the target-video-segment attribute, together with non-annotated segments. When building data pairs, the annotated segments may be taken as positive segments and parts of the non-annotated segments extracted as negative segments. If the sample video has only one annotated segment and the non-annotated portion has a similar duration, the annotated segment can be used directly as the positive segment and the non-annotated portion as the negative segment to obtain the data pair needed for training. For example, as shown in fig. 5, video segment 2 is an annotated segment whose target attribute is the target-video-segment attribute, and video segments 1 and 3 are non-annotated; the data pairs (segment 2 as positive, segment 1 as negative) and (segment 2 as positive, segment 3 as negative) can be obtained. If the sample video has only one annotated segment but the non-annotated portion is too long, the non-annotated portion can be divided into several sub-segments within a specified duration range, yielding several data pairs with the annotated segment as the positive segment and each sub-segment as a negative segment. This reduces the annotation effort: annotating a small number of segments yields a large number of training pairs. For example, if the sample video lasts 60 seconds, with a 10-second annotated segment and 50 seconds of non-annotated video, the non-annotated portion can be divided into sub-segments of similar duration to the annotated one; dividing it into sub-segments of at most 10 seconds yields at least 5 sub-segments (sub non-annotated segments 1 to 5) and hence 5 data pairs for training the scoring model, each pairing the annotated segment as the positive segment with one sub-segment as the negative segment.
In another example, the sample video may include one or more annotated video segments whose target attribute is the non-target video segment attribute, as well as non-annotated video segments. When obtaining the data pairs, the one or more annotated video segments may be taken as negative segments, and some video segments may be extracted as positive segments from the non-annotated video segments included in the sample video. If only one annotated video segment exists in the sample video and the duration of the non-annotated video segment is close to that of the annotated video segment, the annotated video segment can be taken directly as the negative segment and the non-annotated video segment as the positive segment, yielding a data pair required for training. For example, as shown in fig. 6, video segment 3 is an annotated video segment whose target attribute is the non-target video segment attribute, and video segments 1 and 2 are non-annotated video segments. When obtaining the data pairs, a data pair with video segment 1 as the positive segment and video segment 3 as the negative segment, and a data pair with video segment 2 as the positive segment and video segment 3 as the negative segment, can be obtained. If only one annotated video segment exists in the sample video but the non-annotated video segment is too long, the non-annotated video segment can be divided into a plurality of sub non-annotated video segments within a specified duration range, and a plurality of data pairs can then be obtained with each sub non-annotated video segment as a positive segment and the annotated video segment as the negative segment. This reduces the annotation difficulty, since a large number of training data pairs can be obtained from a small number of annotated video segments.
In yet another example, the sample video may include one or more annotated video segments whose target attribute is the target video segment attribute, one or more annotated video segments whose target attribute is the non-target video segment attribute, and non-annotated video segments. When obtaining the data pairs, an annotated video segment marked with the attribute representing the target video segment is taken as a positive segment, and either an annotated video segment marked with the attribute representing the non-target video segment or a video segment extracted from the non-annotated video segments is taken as the negative segment. Likewise, when an annotated video segment marked with the attribute representing the non-target video segment is taken as a negative segment, either an annotated video segment marked with the attribute representing the target video segment or a video segment extracted from the non-annotated video segments is taken as the positive segment. For example, as shown in fig. 7, video segment 2 is an annotated video segment with the target video segment attribute, video segment 3 is an annotated video segment with the non-target video segment attribute, and video segment 1 is a non-annotated video segment. When obtaining the data pairs, a data pair with video segment 2 as the positive segment and video segment 1 as the negative segment, a data pair with video segment 2 as the positive segment and video segment 3 as the negative segment, and a data pair with video segment 1 as the positive segment and video segment 3 as the negative segment can be obtained.
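As a non-authoritative sketch of the three annotation scenarios above, the following Python function returns (positive, negative) pairs depending on which annotations are present; the function name and the use of plain segment identifiers are assumptions for illustration only.

```python
def build_pairs_from_annotations(pos_segments, neg_segments, unannotated_segments):
    """Return (positive, negative) data pairs according to which annotations exist."""
    pairs = []
    if pos_segments and neg_segments:
        # Both kinds of annotation: pair annotated positives with annotated negatives,
        # and also pair each of them with the non-annotated segments.
        pairs += [(p, n) for p in pos_segments for n in neg_segments]
        pairs += [(p, u) for p in pos_segments for u in unannotated_segments]
        pairs += [(u, n) for u in unannotated_segments for n in neg_segments]
    elif pos_segments:
        # Only positive annotations: non-annotated segments serve as negatives.
        pairs += [(p, u) for p in pos_segments for u in unannotated_segments]
    elif neg_segments:
        # Only negative annotations: non-annotated segments serve as positives.
        pairs += [(u, n) for u in unannotated_segments for n in neg_segments]
    return pairs


# Fig. 7 style example: segment 2 annotated as target, segment 3 as non-target,
# segment 1 left unannotated, giving three data pairs.
print(build_pairs_from_annotations(["segment 2"], ["segment 3"], ["segment 1"]))
# [('segment 2', 'segment 3'), ('segment 2', 'segment 1'), ('segment 1', 'segment 3')]
```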
By obtaining annotated training data pairs, the generation of noisy data can be effectively reduced and its interference avoided, which helps improve the cleanliness of the training data. The scoring model can therefore keep a simple structure, without adopting other network models or adding extra parameters to improve the reliability of the training data; the training difficulty is low, which helps the scoring model converge faster during training.
Based on the same inventive concept, an embodiment of the present disclosure further provides a video segment extraction apparatus.
It is understood that, in order to perform the above functions, the video segment extraction apparatus provided by the embodiments of the present disclosure includes corresponding hardware structures and/or software modules. In combination with the units and algorithm steps of the examples disclosed herein, the embodiments of the present disclosure can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
Fig. 8 is a block diagram illustrating a video segment extraction apparatus according to an exemplary embodiment. Referring to fig. 8, the video segment extraction apparatus 100 includes an acquisition unit 101, a feature extraction unit 102, a determination unit 103, and an extraction unit 104.
The acquiring unit 101 is configured to acquire a video and sample the video to obtain N video frames, where N is a positive integer.
The feature extraction unit 102 is configured to input the N video frames to a pre-trained frame feature extraction model to obtain a feature vector of each video frame in the N video frames.
The determining unit 103 is configured to determine scores of the N video frames based on a pre-trained scoring model, where for an ith frame of the N video frames, feature vectors of K video frames centered on the ith frame are input into the pre-trained scoring model, so as to obtain a score of the ith frame, i is a positive integer less than or equal to N, and K is a positive integer.
The extracting unit 104 is configured to extract a target video segment from the video based on the scores of the N video frames.
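For illustration, a minimal Python sketch of the inference flow through units 101 to 104 is given below, assuming a per-frame feature extractor and a scoring model are available as callables and that the window is clamped at the video boundaries; the names extract_feature and scoring_model and the boundary handling are assumptions, not details fixed by the disclosure.

```python
import numpy as np


def score_frames(frames, extract_feature, scoring_model, k):
    """Score each of the N sampled frames from the K frame feature vectors centred on it (assumes N >= K)."""
    feats = np.stack([extract_feature(f) for f in frames])   # (N, D) per-frame feature vectors
    n, half = len(frames), k // 2
    scores = []
    for i in range(n):
        # Clamp the window at the video boundaries so that it always covers K frames.
        lo = max(0, min(i - half, n - k))
        window = feats[lo:lo + k]                             # (K, D) window centred on frame i
        # scoring_model is assumed to accept the (K, D) window (fusing it internally)
        # and return a scalar score for frame i.
        scores.append(scoring_model(window))
    return np.array(scores)
```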
In an embodiment, the video segment extraction apparatus further comprises a training unit. The training unit is configured to train the scoring model based on a multi-frame fusion layer and data pairs each consisting of a positive segment and a negative segment. The data pairs are obtained based on sample video segments annotated with a target attribute, where the target attribute includes an attribute representing that the video segment is a target video segment or a non-target video segment, and the multi-frame fusion layer is used to fuse the feature vectors of K video frames into a fixed-length vector.
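This section does not fix a specific fusion operator. As one possible realization (an assumption introduced here for illustration), the multi-frame fusion layer could be mean pooling over the K per-frame feature vectors, as in the short sketch below.

```python
import numpy as np


def fuse_frames(frame_features):
    """Fuse a (K, D) array of per-frame feature vectors into one fixed-length (D,) vector by mean pooling."""
    return np.asarray(frame_features).mean(axis=0)
```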
In another embodiment, the training unit trains the scoring model based on the multi-frame fusion layer and the data pairs consisting of positive and negative segments in the following manner: K video frames are sampled in the positive segment and their feature vectors are extracted based on the frame feature extraction model, and K video frames are sampled in the negative segment and their feature vectors are extracted based on the frame feature extraction model; based on the multi-frame fusion layer, the feature vectors of the K video frames sampled in the positive segment are fused into a fixed-length positive segment feature vector, and the feature vectors of the K video frames sampled in the negative segment are fused into a fixed-length negative segment feature vector. The positive segment feature vector and the negative segment feature vector are input into a twin neural network to obtain the score of the positive segment and the score of the negative segment, and back propagation is performed using a ranking loss to obtain a trained twin neural network. The twin neural network comprises two multilayer perceptron models sharing parameters, and the scoring model is a multilayer perceptron model of the trained twin neural network.
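A hedged PyTorch sketch of this training step is shown below: a single multilayer perceptron with shared parameters (the two "twin" branches) scores the fused positive and negative segment vectors, and a margin ranking loss is back-propagated so that positive segments are scored higher than negative segments. The layer sizes, margin, and learning rate are assumptions chosen for illustration, not values taken from the disclosure.

```python
import torch
import torch.nn as nn


class ScoringMLP(nn.Module):
    """Multilayer perceptron that maps a fused segment feature vector to a scalar score."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)


model = ScoringMLP(dim=512)                 # shared weights act as both twin branches
loss_fn = nn.MarginRankingLoss(margin=1.0)  # ranking loss over (positive, negative) pairs
optim = torch.optim.Adam(model.parameters(), lr=1e-4)


def train_step(pos_vec, neg_vec):
    """pos_vec, neg_vec: (B, 512) fused positive/negative segment feature vectors."""
    pos_score = model(pos_vec)              # the same network scores both branches
    neg_score = model(neg_vec)
    target = torch.ones_like(pos_score)     # positive should outrank negative
    loss = loss_fn(pos_score, neg_score, target)
    optim.zero_grad()
    loss.backward()                         # back-propagate the ranking loss
    optim.step()
    return loss.item()
```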
In yet another embodiment, the data pairs of positive and negative segments are derived based on the sample video segments labeled with the target attribute as follows: a sample video is obtained that includes one or more sample video clips. And obtaining a data pair consisting of a positive segment and a negative segment based on the target attribute marked by one or more sample video segments and the non-sample video segments included in the sample video, wherein the probability of the positive segment becoming the target video segment is greater than the probability of the negative segment becoming the target video segment.
In yet another embodiment, the data pairs consisting of positive and negative segments are obtained based on the target attribute labeled by the one or more sample video segments and the non-sample video segments included in the sample video in the following manner: if the target attributes marked by the one or more sample video clips comprise attributes representing that the video clips are target video clips, the one or more sample video clips are taken as positive clips, partial video clips are extracted from the non-sample video clips included in the sample video as negative clips, and one or more data pairs are obtained from the positive clips and the negative clips; or, if the target attributes marked by the one or more sample video clips comprise attributes representing that the video clips are non-target video clips, the one or more sample video clips are taken as negative clips, partial video clips are extracted from the non-sample video clips included in the sample video as positive clips, and one or more data pairs are obtained from the positive clips and the negative clips; or, if the target attributes marked by the one or more sample video clips comprise both attributes representing that the video clips are target video clips and attributes representing that the video clips are non-target video clips, the sample video clips marked with the attributes representing target video clips are taken as positive clips, the sample video clips marked with the attributes representing non-target video clips are taken as negative clips, partial video clips are extracted from the non-sample video clips included in the sample video, and data pairs are obtained from the positive clips and the negative clips, from the positive clips and the partial video clips, and from the negative clips and the partial video clips.
In yet another embodiment, the extracting unit extracts the target video segment from the video based on the scores of the N video frames in the following manner: a plurality of video segments are obtained by sliding a fixed-length sliding window over the video in time order, each sliding window position corresponding to one video segment. For each sliding window, the average score of the video frames included in the sliding window is determined and taken as the score of the corresponding video segment. One or more target video segments are then extracted from the plurality of video segments based on their scores.
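As an illustrative sketch only, the following Python function averages the per-frame scores inside each fixed-length window position and returns the highest-scoring windows; the window length in frames and the top-k selection are assumptions, since the disclosure does not fix how many target video segments are extracted.

```python
import numpy as np


def extract_segments(frame_scores, window, top_k=1):
    """Slide a fixed-length window over the frame scores and return the best-scoring windows."""
    scores = np.asarray(frame_scores, dtype=float)
    # Average score of the frames inside each window position (one position per candidate segment).
    window_scores = np.convolve(scores, np.ones(window) / window, mode="valid")
    order = np.argsort(window_scores)[::-1][:top_k]
    # Each entry is (start_frame, end_frame, mean_score) for a candidate target video segment.
    return [(int(s), int(s + window), float(window_scores[s])) for s in order]


segments = extract_segments([0.1, 0.9, 0.8, 0.7, 0.2, 0.1], window=3, top_k=1)
# The window covering frames 1..3 (mean score of about 0.8) is returned as the target segment.
```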
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 9 is a block diagram illustrating another video segment extraction apparatus according to an example embodiment. For example, the video clip extracting apparatus 200 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 9, the video clip extracting apparatus 200 may include one or more of the following components: a processing component 202, a memory 204, a power component 206, a multimedia component 208, an audio component 210, an input/output (I/O) interface 212, a sensor component 214, and a communication component 216.
The processing component 202 generally controls the overall operation of the video clip extraction apparatus 200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 202 may include one or more processors 220 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 202 can include one or more modules that facilitate interaction between the processing component 202 and other components. For example, the processing component 202 can include a multimedia module to facilitate interaction between the multimedia component 208 and the processing component 202.
The memory 204 is configured to store various types of data to support the operation at the video clip extraction apparatus 200. Examples of such data include instructions for any application or method operating on the video clip extraction apparatus 200, contact data, phonebook data, messages, pictures, videos, and the like. The memory 204 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 206 provides power to the various components of the video clip extraction device 200. The power components 206 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the video clip extraction apparatus 200.
The multimedia component 208 includes a screen that provides an output interface between the video clip extraction device 200 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 208 includes a front facing camera and/or a rear facing camera. When the video clip extraction apparatus 200 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 210 is configured to output and/or input audio signals. For example, the audio component 210 includes a Microphone (MIC) configured to receive an external audio signal when the video clip extracting apparatus 200 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 204 or transmitted via the communication component 216. In some embodiments, audio component 210 also includes a speaker for outputting audio signals.
The I/O interface 212 provides an interface between the processing component 202 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 214 includes one or more sensors for providing various aspects of status assessment for the video clip extraction apparatus 200. For example, the sensor component 214 may detect the open/closed state of the video clip extraction apparatus 200 and the relative positioning of components, such as the display and keypad of the video clip extraction apparatus 200. The sensor component 214 may also detect a change in position of the video clip extraction apparatus 200 or one of its components, the presence or absence of user contact with the video clip extraction apparatus 200, the orientation or acceleration/deceleration of the video clip extraction apparatus 200, and a change in its temperature. The sensor component 214 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 216 is configured to facilitate wired or wireless communication between the video clip extraction apparatus 200 and other devices. The video clip extraction apparatus 200 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 216 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 216 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the video clip extracting apparatus 200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 204 comprising instructions, executable by the processor 220 of the video segment extraction apparatus 200 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It is further understood that "a plurality" in this disclosure means two or more, and other quantifiers are construed analogously. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. The singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It will be further understood that the terms "first," "second," and the like are used to describe various information and that such information should not be limited by these terms. These terms are only used to distinguish one type of information from another and do not denote a particular order or importance. Indeed, the terms "first," "second," and the like are fully interchangeable. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure.
It will be further understood that, unless otherwise specified, "connected" includes direct connections between the two without the presence of other elements, as well as indirect connections between the two with the presence of other elements.
It is further to be understood that while operations are depicted in the drawings in a particular order, this is not to be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A video clip extraction method is characterized by comprising the following steps:
acquiring a video, and sampling in the video to obtain N video frames, wherein N is a positive integer;
inputting the N video frames into a pre-trained frame feature extraction model to obtain a feature vector of each video frame in the N video frames;
determining scores of the N video frames based on a pre-trained scoring model, wherein for an ith frame in the N video frames, feature vectors of K video frames taking the ith frame as a center are input into the pre-trained scoring model to obtain the score of the ith frame, wherein i is a positive integer less than or equal to N, and K is a positive integer;
and extracting a target video segment in the video based on the scores of the N video frames.
2. The method of claim 1, wherein the scoring model is trained based on a multi-frame fusion layer and data pairs consisting of positive segments and negative segments;
the data pairs composed of the positive segments and the negative segments are obtained based on a sample video segment marked with a target attribute, the target attribute comprises an attribute representing that the video segment is a target video segment or a non-target video segment, and the multi-frame fusion layer is used for fusing the feature vectors of the K video frames into fixed-length vectors.
3. The method of claim 2, wherein training the scoring model based on the multi-frame fusion layer and the data pair consisting of the positive segment and the negative segment comprises:
sampling K video frames in a positive segment, extracting feature vectors of the K video frames sampled in the positive segment based on a frame feature extraction model, sampling the K video frames in a negative segment, extracting the feature vectors of the K video frames sampled in the negative segment based on the frame feature extraction model, and
fusing the feature vectors of the K video frames sampled in the positive segment into a fixed-length positive segment feature vector based on the multi-frame fusion layer, and fusing the feature vectors of the K video frames sampled in the negative segment into a fixed-length negative segment feature vector based on the multi-frame fusion layer;
inputting the positive segment feature vector and the negative segment feature vector into a twin neural network to obtain the score of the positive segment and the score of the negative segment, and performing back propagation by using a ranking loss to obtain a trained twin neural network; wherein the twin neural network comprises two multi-layer perceptron models sharing parameters; and the scoring model is a multi-layer perceptron model of the trained twin neural network.
4. The method according to claim 2 or 3, wherein the data pair consisting of the positive segment and the negative segment is obtained based on the sample video segment labeled with the target attribute in the following manner:
acquiring a sample video comprising one or more sample video clips;
and obtaining a data pair consisting of a positive segment and a negative segment based on the target attribute marked by the one or more sample video segments and the non-sample video segments included in the sample video, wherein the probability of the positive segment becoming the target video segment is greater than the probability of the negative segment becoming the target video segment.
5. The method of claim 4, wherein obtaining the data pair of the positive segment and the negative segment based on the target attribute labeled by the one or more sample video segments and the non-sample video segments included in the sample video comprises:
if the target attributes marked by the one or more sample video clips comprise attributes representing that the video clips are target video clips, taking the one or more sample video clips as positive clips, extracting partial video clips from non-sample video clips included in the sample video as negative clips, and obtaining one or more data pairs from the positive clips and the negative clips; or
If the target attributes marked by the one or more sample video clips comprise attributes representing that the video clips are non-target video clips, taking the one or more sample video clips as negative clips, extracting partial video clips from the non-sample video clips included in the sample video as positive clips, and obtaining one or more data pairs from the positive clips and the negative clips; or
If the target attributes marked by the one or more sample video clips comprise attributes for representing the video clips as target video clips and attributes for representing the video clips as non-target video clips, taking the sample video clips marked with the attributes for representing the target video clips as positive clips, taking the sample video clips marked with the attributes for representing the non-target video clips as negative clips, extracting partial video clips from the non-sample video clips included in the sample video, obtaining data pairs from the positive clips and the negative clips, obtaining data pairs from the positive clips and the partial video clips, and obtaining data pairs from the negative clips and the partial video clips.
6. The method of claim 1, wherein extracting a target video segment from the video based on the scores of the N video frames comprises:
obtaining a plurality of video clips by sliding a fixed-length sliding window over the video in time order, wherein each sliding window corresponds to one video clip;
respectively determining the average score of the video frames included in the sliding window aiming at each sliding window, and taking the average score of the video frames as the score of the video segment corresponding to the sliding window;
extracting one or more target video segments from the plurality of video segments based on the scores of the plurality of video segments.
7. A video clip extraction apparatus, characterized in that the video clip extraction apparatus comprises:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a video and sampling the video to obtain N video frames, and N is a positive integer;
the feature extraction unit is used for inputting the N video frames into a pre-trained frame feature extraction model to obtain a feature vector of each video frame in the N video frames;
a determining unit, configured to determine scores of the N video frames based on a pre-trained scoring model, where for an ith frame of the N video frames, feature vectors of K video frames centered on the ith frame are input into the pre-trained scoring model, so as to obtain a score of the ith frame, where i is a positive integer less than or equal to N, and K is a positive integer;
and the extracting unit is used for extracting a target video clip in the video based on the scores of the N video frames.
8. The video segment extraction apparatus according to claim 7, further comprising a training unit;
the training unit is used for training to obtain the scoring model based on a multi-frame fusion layer and a data pair consisting of a positive segment and a negative segment, the data pair consisting of the positive segment and the negative segment is obtained based on a sample video segment marked with a target attribute, the target attribute comprises an attribute representing that the video segment is a target video segment or a non-target video segment, and the multi-frame fusion layer is used for fusing the feature vectors of K video frames into fixed-length vectors.
9. The apparatus according to claim 8, wherein said training unit trains the scoring model based on the data pair consisting of the multi-frame fusion layer, the positive segment and the negative segment in the following manner:
sampling K video frames in a positive segment, extracting feature vectors of the K video frames sampled in the positive segment based on a frame feature extraction model, sampling the K video frames in a negative segment, extracting the feature vectors of the K video frames sampled in the negative segment based on the frame feature extraction model, and
fusing the feature vectors of the K video frames sampled in the positive segment into a fixed-length positive segment feature vector based on the multi-frame fusion layer, and fusing the feature vectors of the K video frames sampled in the negative segment into a fixed-length negative segment feature vector based on the multi-frame fusion layer;
inputting the positive segment feature vector and the negative segment feature vector into a twin neural network to obtain the score of the positive segment and the score of the negative segment, and performing back propagation by using a ranking loss to obtain a trained twin neural network; wherein the twin neural network comprises two multi-layer perceptron models sharing parameters; and the scoring model is a multi-layer perceptron model of the trained twin neural network.
10. The apparatus according to claim 8 or 9, wherein the data pair consisting of the positive segment and the negative segment is obtained based on the sample video segment labeled with the target attribute in the following manner:
acquiring a sample video comprising one or more sample video clips;
and obtaining a data pair consisting of a positive segment and a negative segment based on the target attribute marked by the one or more sample video segments and the non-sample video segments included in the sample video, wherein the probability of the positive segment becoming the target video segment is greater than the probability of the negative segment becoming the target video segment.
11. The apparatus according to claim 10, wherein the data pair consisting of the positive segment and the negative segment is obtained based on the target attribute labeled by the one or more sample video segments and the non-sample video segments included in the sample video as follows:
if the target attributes marked by the one or more sample video clips comprise attributes representing that the video clips are target video clips, taking the one or more sample video clips as positive clips, extracting partial video clips from non-sample video clips included in the sample video as negative clips, and obtaining one or more data pairs from the positive clips and the negative clips; or
If the target attributes marked by the one or more sample video clips comprise attributes representing that the video clips are non-target video clips, taking the one or more sample video clips as negative clips, extracting partial video clips from the non-sample video clips included in the sample video as positive clips, and obtaining one or more data pairs from the positive clips and the negative clips; or
If the target attributes marked by the one or more sample video clips comprise attributes for representing the video clips as target video clips and attributes for representing the video clips as non-target video clips, taking the sample video clips marked with the attributes for representing the target video clips as positive clips, taking the sample video clips marked with the attributes for representing the non-target video clips as negative clips, extracting partial video clips from the non-sample video clips included in the sample video, obtaining data pairs from the positive clips and the negative clips, obtaining data pairs from the positive clips and the partial video clips, and obtaining data pairs from the negative clips and the partial video clips.
12. The apparatus according to claim 7, wherein the extracting unit extracts a target video segment from the video based on the scores of the N video frames in the following manner:
obtaining a plurality of video clips by sliding a fixed-length sliding window over the video in time order, wherein each sliding window corresponds to one video clip;
respectively determining the average score of the video frames included in the sliding window aiming at each sliding window, and taking the average score of the video frames as the score of the video segment corresponding to the sliding window;
extracting one or more target video segments from the plurality of video segments based on the scores of the plurality of video segments.
13. A video segment extraction apparatus, wherein the video segment extraction apparatus comprises:
a memory to store instructions; and
a processor for invoking the memory-stored instructions to perform the video segment extraction method of any of claims 1-6.
14. A computer readable storage medium having stored therein instructions which, when executed by a processor, perform a video segment extraction method as claimed in any one of claims 1-6.
CN202010866476.9A 2020-08-25 2020-08-25 Video clip extraction method, video clip extraction device, and storage medium Pending CN112069952A (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN202010866476.9A CN112069952A (en) 2020-08-25 2020-08-25 Video clip extraction method, video clip extraction device, and storage medium
US17/240,202 US11900682B2 (en) 2020-08-25 2021-04-26 Method and apparatus for video clip extraction, and storage medium
JP2021074892A JP7292325B2 (en) 2020-08-25 2021-04-27 Video clip extraction method, video clip extraction device and storage medium
EP21171229.4A EP3961490A1 (en) 2020-08-25 2021-04-29 Method and apparatus for video clip extraction, and storage medium
KR1020210056620A KR102456272B1 (en) 2020-08-25 2021-04-30 Video clip extraction method, video clip extraction apparatus and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010866476.9A CN112069952A (en) 2020-08-25 2020-08-25 Video clip extraction method, video clip extraction device, and storage medium

Publications (1)

Publication Number Publication Date
CN112069952A true CN112069952A (en) 2020-12-11

Family

ID=73660338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010866476.9A Pending CN112069952A (en) 2020-08-25 2020-08-25 Video clip extraction method, video clip extraction device, and storage medium

Country Status (5)

Country Link
US (1) US11900682B2 (en)
EP (1) EP3961490A1 (en)
JP (1) JP7292325B2 (en)
KR (1) KR102456272B1 (en)
CN (1) CN112069952A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022188563A1 (en) * 2021-03-10 2022-09-15 上海哔哩哔哩科技有限公司 Dynamic cover setting method and system
CN115708359A (en) * 2021-08-20 2023-02-21 小米科技(武汉)有限公司 Video clip intercepting method and device and storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069951A (en) * 2020-08-25 2020-12-11 北京小米松果电子有限公司 Video clip extraction method, video clip extraction device, and storage medium
US11468676B2 (en) * 2021-01-08 2022-10-11 University Of Central Florida Research Foundation, Inc. Methods of real-time spatio-temporal activity detection and categorization from untrimmed video segments
CN114666656A (en) * 2022-03-15 2022-06-24 北京沃东天骏信息技术有限公司 Video clipping method, video clipping device, electronic equipment and computer readable medium
WO2023182542A1 (en) * 2022-03-22 2023-09-28 엘지전자 주식회사 Display device and method for operating same
CN114979801A (en) * 2022-05-10 2022-08-30 上海大学 Dynamic video abstraction algorithm and system based on bidirectional convolution long-short term memory network
WO2024077463A1 (en) * 2022-10-11 2024-04-18 Intel Corporation Sequential modeling with memory including multi-range arrays

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8699852B2 (en) * 2011-10-10 2014-04-15 Intellectual Ventures Fund 83 Llc Video concept classification using video similarity scores
US8913835B2 (en) * 2012-08-03 2014-12-16 Kodak Alaris Inc. Identifying key frames using group sparsity analysis
US9639762B2 (en) 2014-09-04 2017-05-02 Intel Corporation Real time video summarization
US10572735B2 (en) * 2015-03-31 2020-02-25 Beijing Shunyuan Kaihua Technology Limited Detect sports video highlights for mobile computing devices
US10192117B2 (en) * 2015-06-25 2019-01-29 Kodak Alaris Inc. Graph-based framework for video object segmentation and extraction in feature space
US10803318B1 (en) * 2016-05-18 2020-10-13 Educational Testing Service Automated scoring of video clips using extracted physiological features
US10445582B2 (en) * 2016-12-20 2019-10-15 Canon Kabushiki Kaisha Tree structured CRF with unary potential function using action unit features of other segments as context feature
CN108399381B (en) 2018-02-12 2020-10-30 北京市商汤科技开发有限公司 Pedestrian re-identification method and device, electronic equipment and storage medium
CN110798752B (en) 2018-08-03 2021-10-15 北京京东尚科信息技术有限公司 Method and system for generating video summary
CN110996169B (en) * 2019-07-12 2022-03-01 北京达佳互联信息技术有限公司 Method, device, electronic equipment and computer-readable storage medium for clipping video
CN111225236B (en) * 2020-01-20 2022-03-25 北京百度网讯科技有限公司 Method and device for generating video cover, electronic equipment and computer-readable storage medium


Also Published As

Publication number Publication date
JP7292325B2 (en) 2023-06-16
EP3961490A1 (en) 2022-03-02
KR102456272B1 (en) 2022-10-19
US20220067383A1 (en) 2022-03-03
JP2022037876A (en) 2022-03-09
US11900682B2 (en) 2024-02-13
KR20220026471A (en) 2022-03-04

Similar Documents

Publication Publication Date Title
CN112069952A (en) Video clip extraction method, video clip extraction device, and storage medium
WO2018095252A1 (en) Video recording method and device
US11847818B2 (en) Method for extracting video clip, device for extracting video clip, and storage medium
CN110941966A (en) Training method, device and system of machine translation model
CN106547850B (en) Expression annotation method and device
CN111835739A (en) Video playing method and device and computer readable storage medium
CN110019897B (en) Method and device for displaying picture
CN113032627A (en) Video classification method and device, storage medium and terminal equipment
CN107229707B (en) Method and device for searching image
CN107122697A (en) Automatic obtaining method and device, the electronic equipment of photo
CN112087653A (en) Data processing method and device and electronic equipment
US20210377454A1 (en) Capturing method and device
CN115713641A (en) Video acquisition method, device and storage medium
CN105430260B (en) Obtain the method and device of video image
CN113761275A (en) Video preview moving picture generation method, device and equipment and readable storage medium
CN108769513B (en) Camera photographing method and device
CN112825544A (en) Picture processing method and device and storage medium
US11715234B2 (en) Image acquisition method, image acquisition device, and storage medium
CN111641754B (en) Contact photo generation method and device and storage medium
CN115460436B (en) Video processing method, storage medium and electronic device
CN112364247B (en) Information processing method and device
CN115514875A (en) Image processing method and device
CN112102843A (en) Voice recognition method and device and electronic equipment
CN114385838A (en) Information classification method, device, equipment and storage medium
CN114598923A (en) Video character removing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination