CN117710777A - Model training method, key frame extraction method and device - Google Patents

Model training method, key frame extraction method and device

Info

Publication number
CN117710777A
Authority
CN
China
Prior art keywords
video
sample
initial
model
key frame
Prior art date
Legal status
Granted
Application number
CN202410169860.1A
Other languages
Chinese (zh)
Other versions
CN117710777B (en)
Inventor
何俊烽
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410169860.1A priority Critical patent/CN117710777B/en
Publication of CN117710777A publication Critical patent/CN117710777A/en
Application granted granted Critical
Publication of CN117710777B publication Critical patent/CN117710777B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the disclosure provide a model training method, a key frame extraction method and a device, relating to the fields of artificial intelligence, machine learning, video processing and the like. The method comprises: acquiring a plurality of sample candidate key frames from a video sample; and performing at least one training operation on an initial key frame extraction model and at least one initial video understanding model based on the plurality of sample candidate key frames until a preset training end condition is met, and taking the initial key frame extraction model that meets the preset training end condition as the trained key frame extraction model. The embodiments of the disclosure enable the extracted sample key frames to better represent the characteristics of the video sample, so that the trained key frame extraction model can accurately extract representative key frames from a video, improving the accuracy of the extracted key frames.

Description

Model training method, key frame extraction method and device
Technical Field
The disclosure relates to the technical field of video processing, in particular to a model training method, a key frame extraction method and a device.
Background
A video is made up of successive frames. Adjacent video frames have temporal and spatial continuity and therefore contain a large amount of identical or similar content, so the video can be represented by extracting from it the most representative frames that reflect its main content; these frames, which represent the video semantics, are the key frames.
The existing key frame extraction method extracts frames from the video at preset intervals, and the accuracy of the key frames extracted in this way is low.
Disclosure of Invention
The embodiment of the disclosure provides a model training method, a key frame extraction method and a device, which can solve the problem of low key frame extraction accuracy. The technical scheme provided by the disclosure is as follows:
according to one aspect of an embodiment of the present disclosure, there is provided a method of model training, the method comprising:
acquiring a plurality of sample candidate key frames in a video sample;
performing at least one training operation on the initial key frame extraction model and at least one initial video understanding model based on the plurality of sample candidate key frames until a preset training ending condition is met, and taking the initial key frame extraction model meeting the preset training ending condition as a trained key frame extraction model;
wherein the training operation comprises:
inputting the plurality of sample candidate key frames into an initial key frame extraction model, and determining an evaluation value vector of each sample candidate key frame; each evaluation value in the evaluation value vector is used for representing the association degree of each sample candidate key frame and the video sample respectively;
Determining at least one sample key frame from the plurality of sample candidate key frames based on the evaluation value vector;
respectively inputting the at least one sample key frame into at least one initial video understanding model to obtain video prediction labels which are respectively output by the at least one initial video understanding model and are aimed at the video samples;
for each initial video understanding model, determining a first loss function corresponding to the initial video understanding model based on a video sample label and a video prediction label corresponding to the video sample;
determining a second loss function based on the first loss function corresponding to each initial video understanding model;
and adjusting the parameters of the initial key frame extraction model and the parameters of the at least one initial video understanding model based on the second loss function, taking the initial key frame extraction model after parameter adjustment as an initial key frame extraction model corresponding to the next training operation, and taking the at least one initial video understanding model after parameter adjustment as at least one initial video understanding model corresponding to the next training operation.
Optionally, the inputting the plurality of sample candidate key frames into an initial key frame extraction model, determining an evaluation value vector of each sample candidate key frame includes:
Respectively extracting features of the plurality of sample candidate key frames to obtain a plurality of sample candidate frame features respectively corresponding to the plurality of sample candidate key frames;
determining the evaluation value vector based on correlations between the plurality of sample candidate frame features and a reference vector; the reference vector is used for representing semantic features of the video sample;
the determining at least one sample key frame from the plurality of sample candidate key frames based on the evaluation value vector comprises:
and based on the evaluation values respectively corresponding to the sample candidate key frames in the evaluation value vector, taking a preset number of sample candidate key frames with the largest evaluation values in the sample candidate key frames as the at least one sample key frame.
Optionally, the determining the evaluation value vector based on correlation between the plurality of sample candidate frame features and a reference vector includes:
acquiring a reference vector corresponding to the current training operation through an initial semantic extraction module;
for each sample candidate frame feature, determining the similarity between the reference vector and the sample candidate frame feature to obtain a weight corresponding to the sample candidate frame feature;
The evaluation value vector is generated based on each sample candidate frame feature and its corresponding weight.
Optionally, the method further comprises:
and adjusting the parameters of the initial semantic extraction module based on the second loss function, and taking the initial semantic extraction module after the parameters are adjusted as an initial semantic extraction module corresponding to the next training operation.
Optionally, the feature extraction is performed on the plurality of sample candidate key frames to obtain a plurality of sample candidate frame features corresponding to the plurality of sample candidate key frames, respectively, including:
extracting features of the sample candidate key frames to obtain a plurality of initial sample candidate frame features corresponding to the sample candidate key frames respectively;
and determining time sequence information among a plurality of sample candidate key frames, and carrying out feature fusion on the time sequence information and the corresponding initial sample candidate frame features for each sample candidate key frame to obtain sample candidate frame features respectively corresponding to each sample candidate key frame.
Optionally, the method further comprises:
acquiring at least two video understanding tasks;
determining at least two different tag types based on the at least two video understanding tasks;
And acquiring at least two different initial video understanding models corresponding to the at least two different tag types respectively.
Optionally, the method further comprises:
when the training times corresponding to the current training operation are detected to be in accordance with the preset times, increasing the preset number based on the frame extraction step length; the frame-extracting step length is determined based on the duration of the video sample;
and taking the increased preset number as the preset number corresponding to the next training operation.
Optionally, the acquiring a plurality of sample candidate key frames in the video sample includes:
if the difference between the current video frame and the previous video frame in the video sample is detected to be larger than a preset threshold value, the current video frame is used as a sample candidate key frame;
or
And extracting frames from the video samples at preset time intervals to obtain the candidate key frames of the samples.
According to one aspect of an embodiment of the present disclosure, there is provided a method of key frame extraction, the method including:
acquiring a video to be processed, and performing frame extraction on the video to be processed to obtain a plurality of candidate key frames;
determining evaluation value vectors corresponding to the plurality of candidate key frames based on the plurality of candidate key frames through a trained key frame extraction model, and determining at least one key frame from the plurality of candidate key frames based on the evaluation value vectors corresponding to the plurality of candidate key frames;
The key frame extraction model is trained based on the model training method provided by any optional embodiment of the disclosure.
According to another aspect of an embodiment of the present disclosure, there is provided an apparatus for model training, the apparatus comprising:
the acquisition module is used for acquiring a plurality of sample candidate key frames in the video samples;
the training module is used for carrying out at least one training operation on the initial key frame extraction model and at least one initial video understanding model based on the plurality of sample candidate key frames until a preset training ending condition is met, and taking the initial key frame extraction model meeting the preset training ending condition as a trained key frame extraction model;
wherein the training operation comprises:
inputting the plurality of sample candidate key frames into an initial key frame extraction model, and determining an evaluation value vector of each sample candidate key frame; each evaluation value in the evaluation value vector is used for representing the association degree of each sample candidate key frame and the video sample respectively;
determining at least one sample key frame from the plurality of sample candidate key frames based on the evaluation value vector;
Respectively inputting the at least one sample key frame into at least one initial video understanding model to obtain video prediction labels which are respectively output by the at least one initial video understanding model and are aimed at the video samples;
for each initial video understanding model, determining a first loss function corresponding to the initial video understanding model based on a video sample label and a video prediction label corresponding to the video sample;
determining a second loss function based on the first loss function corresponding to each initial video understanding model;
and adjusting the parameters of the initial key frame extraction model and the parameters of the at least one initial video understanding model based on the second loss function, taking the initial key frame extraction model after parameter adjustment as an initial key frame extraction model corresponding to the next training operation, and taking the at least one initial video understanding model after parameter adjustment as at least one initial video understanding model corresponding to the next training operation.
According to another aspect of an embodiment of the present disclosure, there is provided an apparatus for key frame extraction, the apparatus including:
the candidate key frame acquisition module is used for acquiring a video to be processed, and extracting frames from the video to be processed to obtain a plurality of candidate key frames;
The key frame extraction module is used for determining evaluation value vectors corresponding to the plurality of candidate key frames based on the plurality of candidate key frames through the trained key frame extraction model, and determining at least one key frame from the plurality of candidate key frames based on the evaluation value vectors corresponding to the plurality of candidate key frames;
the key frame extraction model is trained based on the model training method provided by any optional embodiment of the disclosure.
According to another aspect of an embodiment of the present disclosure, there is provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any one of the model training methods or the keyframe extraction methods described above when executing the program.
According to yet another aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the model training methods or key frame extraction methods described above.
According to an aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the model training methods or keyframe extraction methods described above.
The technical solutions provided by the embodiments of the present disclosure have the following beneficial effects:
A plurality of sample candidate key frames in a video sample are acquired, an evaluation value vector is determined from the plurality of sample candidate key frames by the initial key frame extraction model, and at least one sample key frame is determined from the plurality of sample candidate key frames based on the evaluation value vector. The key frames in the video sample are thus pre-extracted once, the extracted sample candidate key frames are scored, the candidates are screened based on their scores, and the candidate key frames with a higher degree of association with the video sample are taken as the sample key frames, so that the extracted sample key frames can better represent the characteristics of the video sample. As a result, the trained key frame extraction model can accurately extract representative key frames from a video, improving the accuracy of the extracted key frames.
Further, through coupling training of the initial keyframe extraction model and at least one initial video understanding model, the initial keyframe extraction model can be adaptively adjusted according to the downstream video understanding task in the training process, so that the keyframes extracted by the trained keyframe extraction model can be better adapted to the downstream video understanding task, and the accuracy of the output result of the downstream video understanding task is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments of the present disclosure will be briefly introduced below.
Fig. 1 is a schematic view of an application environment of a model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a model training method according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of a model structure provided in an embodiment of the present disclosure;
fig. 4 is a flowchart of a key frame extraction method according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating another key frame extraction method according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a model training device according to an embodiment of the disclosure;
fig. 7 is a schematic structural diagram of a key frame extracting device according to an embodiment of the disclosure;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure are described below with reference to the drawings in the present disclosure. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present disclosure, and the technical solutions of the embodiments of the present disclosure are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, all of which may be included in the present specification. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term, e.g. "A and/or B" or "A, B" indicates implementation as "A", or as "B", or as "A and B".
For the purposes of clarity, technical solutions and advantages of the present disclosure, the following further details the embodiments of the present disclosure with reference to the accompanying drawings.
Technical terms to which the present disclosure relates are first introduced and explained:
and (5) frame extraction: and intercepting and exporting frame pictures of a video medium (such as MP4, mkv and the like) according to requirements, and outputting the frame pictures as a group of pictures.
Key frame: video frames that are representative of video semantics.
Transformer: the transducer is a neural network architecture based on a self-attention mechanism, and is suitable for sequence-to-sequence tasks. The method comprises an encoding module and a decoding module which are respectively responsible for extracting characteristics of an input sequence and generating an output sequence. Compared with the traditional cyclic neural network, the transducer can process sequence information in parallel, so that faster training speed and higher performance are realized. Application fields include machine translation, text summarization, semantic understanding, etc.
Position coding: position coding is a technique used in natural language processing tasks that adds position information to word-embedding representations. Because the Transformer model has no recurrent structure, it cannot by itself capture the order information in a sequence. Position coding represents the position of a word in a sentence in vector form and adds this position information to the word vector, so that the model can distinguish words at different positions. Common position coding methods are fixed position coding and learnable position coding.
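As an illustration of the fixed (sinusoidal) position coding mentioned above, a minimal sketch is given below. It is not part of the disclosure; the function name, the use of numpy and the 512-dimensional example are illustrative assumptions, and an even feature dimension is assumed.

    import numpy as np

    def sinusoidal_position_encoding(seq_len: int, dim: int) -> np.ndarray:
        # Fixed (non-learnable) position codes of shape (seq_len, dim); dim assumed even.
        positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
        div_terms = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))
        codes = np.zeros((seq_len, dim))
        codes[:, 0::2] = np.sin(positions * div_terms)                 # even dimensions
        codes[:, 1::2] = np.cos(positions * div_terms)                 # odd dimensions
        return codes

    # e.g. position codes for 16 candidate frames embedded in 512-dimensional features
    position_codes = sinusoidal_position_encoding(16, 512)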
With the rapid development of the internet and multimedia, a huge amount of video is uploaded to the internet every day.
Typically, adjacent video frames contain a large amount of the same or similar content, so a video can be represented by extracting from it the several most representative frames that reflect its main content; these frames are the key frames. By extracting key frames, the characteristics of a video can be represented with a small amount of data. Key frame extraction is typically the first step of a video understanding algorithm, and most video understanding algorithms are built on top of it.
The existing key frame extraction method is to extract frames of the video according to preset intervals, and the accuracy of the extracted key frames is low.
The model training method, the key frame extraction method and the key frame extraction device provided by the disclosure aim to solve the technical problems in the prior art.
Alternatively, model training according to embodiments of the present disclosure may be implemented based on Machine Learning (ML) in artificial intelligence (Artificial Intelligence, AI).
Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Artificial intelligence technology is a comprehensive subject involving a wide range of fields, covering both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and other directions.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching learning.
Optionally, the data processing involved in the method provided by the embodiment of the present disclosure may also be implemented based on cloud technology. For example, various calculations in the model training process may be implemented using cloud computing technology, and training samples may be stored in a cloud manner.
Cloud computing is a computing mode that distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can obtain computing power, storage space and information services as needed; the network providing the resources is called the "cloud". From the user's perspective, resources in the cloud can be expanded without limit, and can be acquired at any time, used on demand, expanded at any time and paid for according to use. Cloud storage is a new concept extended and developed from the concept of cloud computing: a large number of storage devices of different types in a network (also called storage nodes) are combined to work together through application software or application interfaces, jointly providing data storage and service access functions.
In specific embodiments of the present disclosure, when any object-related data, such as data related to video samples or videos to be processed, is involved in applying the embodiments of the present disclosure to a specific product or technology, the permission or consent of the object needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions. That is, in the embodiments of the present disclosure, if any of the above object-related data is involved, such data needs to be obtained with the object's authorization and consent and in compliance with the relevant national and regional laws, regulations and standards.
The technical solutions of the embodiments of the present disclosure and technical effects produced by the technical solutions of the present disclosure are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
Fig. 1 is an application environment schematic diagram of a model training method according to an embodiment of the disclosure. The application environment may include a server 101 and a terminal 102, among others. The server 101 acquires a plurality of sample candidate key frames in a video sample; and performing at least one training operation on the initial key frame extraction model and at least one initial video understanding model based on the plurality of sample candidate key frames until a preset training ending condition is met, and taking the initial key frame extraction model meeting the preset training ending condition as a trained key frame extraction model. The terminal 102 sends the video to be processed to the server 101, and the server 101 performs frame extraction on the video to be processed to obtain a plurality of candidate key frames; determining evaluation value vectors corresponding to a plurality of candidate key frames through the trained key frame extraction model, and determining at least one key frame from the plurality of candidate key frames based on the evaluation value vectors corresponding to the plurality of candidate key frames; the server 101 returns the obtained at least one key frame to the terminal 102.
The model training method provided by the embodiment of the disclosure may be performed by any electronic device, which may be a server or a terminal shown in fig. 1.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server or a server cluster for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligent platforms, and the like. The terminal may be a smart phone (e.g., android phone, iOS phone, etc.), a tablet computer, a notebook computer, a digital broadcast receiver, a MID (Mobile Internet Devices, mobile internet device), a PDA (personal digital assistant), a desktop computer, a smart home appliance, a vehicle-mounted terminal (e.g., a vehicle-mounted navigation terminal, a vehicle-mounted computer, etc.), a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, but are not limited thereto.
An embodiment of the present disclosure provides a model training method, as shown in fig. 2, including:
step S110, a plurality of sample candidate key frames in the video samples are acquired.
Specifically, the video sample may be a video used for model training. The video sample may be obtained through an image acquisition device such as a camera, a mobile phone, a video camera or a tablet computer, or may be collected from a network on the premise of complying with relevant regulations; the specific way of obtaining the video sample is not limited in the embodiments of the present disclosure.
After obtaining the video sample, frame extraction processing can be performed on the video sample to obtain a plurality of candidate sample key frames.
Optionally, frame extraction may be performed based on differences between adjacent frames in the video sample, or at preset time intervals, or a preset number of video frames may be extracted from the video sample as the plurality of sample candidate key frames; the specific frame extraction mode is not limited in the embodiments of the present disclosure.
Step S120, performing at least one training operation on the initial key frame extraction model and at least one initial video understanding model based on a plurality of sample candidate key frames until a preset training ending condition is met, and taking the initial key frame extraction model meeting the preset training ending condition as a trained key frame extraction model;
Wherein the training operation comprises:
(1) Inputting a plurality of sample candidate key frames into an initial key frame extraction model, and determining an evaluation value vector of each sample candidate key frame; each evaluation value in the evaluation value vector is used for representing the association degree of each sample candidate key frame and each video sample;
(2) Determining at least one sample key frame from a plurality of sample candidate key frames based on the evaluation value vector;
(3) Respectively inputting at least one sample key frame into at least one initial video understanding model to obtain video prediction labels which are respectively output by the at least one initial video understanding model and are aimed at video samples;
(4) Determining a first loss function corresponding to each initial video understanding model based on a video sample label and a video prediction label corresponding to the video sample;
(5) Determining a second loss function based on the first loss function corresponding to each initial video understanding model;
(6) And adjusting parameters of the initial key frame extraction model and parameters of at least one initial video understanding model based on the second loss function, taking the initial key frame extraction model after the parameters are adjusted as an initial key frame extraction model corresponding to the next training operation, and taking the at least one initial video understanding model after the parameters are adjusted as at least one initial video understanding model corresponding to the next training operation.
Specifically, after the plurality of sample candidate key frames of the video sample are obtained, they may be input into the initial key frame extraction model; the initial key frame extraction model determines an evaluation value vector over the sample candidate key frames, and the plurality of sample candidate key frames are screened based on the evaluation values corresponding to the individual sample candidate key frames in that vector, so as to determine at least one sample key frame from the plurality of sample candidate key frames.
Wherein, the evaluation value corresponding to each sample candidate key frame can be used for representing the association degree of the sample candidate key frame and the video sample. For example, if the association degree between the sample candidate key frame and the video sample is higher, the corresponding evaluation value is higher; and vice versa. The evaluation value vector can reflect the association degree between each sample candidate key frame and the video sample, and further the sample candidate key frame with higher association degree with the video sample can be used as the sample key frame through the evaluation value vector, so that the extracted sample key frame can better represent the characteristics of the video sample, and the extracted sample key frame has higher accuracy.
And respectively inputting at least one key frame into at least one initial video understanding model, and predicting a video understanding result of the video sample based on the at least one key frame by using the initial video understanding model according to each initial video understanding model to obtain a video prediction tag. A first loss function of the initial video understanding model is determined based on the video sample tags and the video prediction tags to which the video samples correspond, respectively. The video sample tag may be a true video understanding result of the video sample.
And determining a second loss function of the model of the whole obtained by coupling the initial key frame extraction model and at least one initial video understanding model based on the first loss function corresponding to each initial video understanding model.
Parameters of the initial key frame extraction model corresponding to the current training operation can be adjusted based on the second loss function, and parameters of the at least one initial video understanding model can be adjusted as well. The initial key frame extraction model and the at least one initial video understanding model after parameter adjustment then take part in the next training operation. By repeatedly performing the training operation and constraining the training with the loss function, the video prediction labels output by the initial video understanding models become closer and closer to the video sample labels of the video samples, until the preset training end condition is met, and the initial key frame extraction model that meets the preset training end condition is taken as the trained key frame extraction model.
The preset training ending condition may be that the loss function converges, for example, the loss function is smaller than a set value or the loss function is smaller than the set value obtained by calculating the continuous set times; the preset training ending condition may also be that the training number reaches a preset number, which is not limited in the embodiment of the present disclosure.
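For concreteness, a minimal PyTorch-style sketch of one training operation, following steps (1) to (6) above, is given below. It is a sketch under stated assumptions rather than the disclosed implementation: the function and attribute names (e.g. loss_fn), the plain summation used to combine the first loss functions into the second loss function, and the scaling of the selected frames by their evaluation values (so that the key frame extraction model stays in the computation graph) are assumptions not fixed by the disclosure. A training loop would call this function repeatedly until the preset training end condition is met.

    import torch

    def training_operation(candidate_frames, per_task_labels, extractor,
                           understanding_models, optimizer, k):
        # candidate_frames: (num_candidates, C, H, W) tensor of pre-extracted frames
        # (1)-(2): evaluation value vector and top-k sample key frames
        scores = extractor(candidate_frames)                  # (num_candidates,)
        top = torch.topk(scores, k=k)
        # scaling by the evaluation values keeps the extractor differentiable
        # (an assumption; the disclosure does not detail this step)
        key_frames = candidate_frames[top.indices] * top.values.view(-1, 1, 1, 1)

        # (3)-(5): one first loss per understanding model, combined into the second loss
        first_losses = []
        for model, labels in zip(understanding_models, per_task_labels):
            predictions = model(key_frames)                   # video prediction label
            first_losses.append(model.loss_fn(predictions, labels))
        second_loss = torch.stack(first_losses).sum()         # combination rule assumed

        # (6): adjust the extractor and every understanding model for the next operation
        optimizer.zero_grad()
        second_loss.backward()
        optimizer.step()
        return second_loss.item()

The optimizer is assumed to have been built over the parameters of the key frame extraction model and all video understanding models (and, where used, the semantic extraction module), so one backward pass adjusts them all.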
Fig. 3 is a schematic diagram of a model structure provided by an embodiment of the present disclosure, and as shown in fig. 3, in the model training method provided by the embodiment of the present disclosure, an initial keyframe extraction model and at least one initial video understanding model are coupled to train, where the number of initial video understanding models may be one or more, and when the number of initial video understanding models is multiple, multiple initial video understanding models may correspond to the same video understanding task, or may correspond to different video understanding tasks.
Optionally, when the trained keyframe extraction model needs to be applied to a specific video understanding task, at least one initial video understanding model corresponding to the video understanding task may be set, and by coupling training the initial keyframe extraction model with the at least one initial video understanding model, the initial keyframe extraction model may be adjusted for the at least one initial video understanding model in the downstream during the training, so that at least one keyframe obtained by the trained keyframe extraction model may be better adapted to the downstream video understanding model.
The number and types of the initial video understanding models can be specifically set according to actual application requirements, and the embodiment of the disclosure is not limited to this.
In the embodiments of the disclosure, a plurality of sample candidate key frames in a video sample are acquired, an evaluation value vector is determined from the plurality of sample candidate key frames by the initial key frame extraction model, and at least one sample key frame is determined from the plurality of sample candidate key frames based on the evaluation value vector. The key frames in the video sample are thus pre-extracted once, the extracted sample candidate key frames are scored, the candidates are screened based on their scores, and the candidate key frames with a higher degree of association with the video sample are taken as the sample key frames, so that the extracted sample key frames can better represent the characteristics of the video sample. As a result, the trained key frame extraction model can accurately extract representative key frames from a video, improving the accuracy of the extracted key frames.
And further, by coupling training the initial keyframe extraction model with at least one initial video understanding model, the initial keyframe extraction model can be adaptively adjusted according to the downstream video understanding task in the training process, so that the keyframes extracted by the trained keyframe extraction model can be better adapted to the downstream video understanding task, and the accuracy of the output result of the downstream video understanding task is improved.
As an alternative embodiment, inputting a plurality of sample candidate key frames into an initial key frame extraction model, determining an evaluation value vector for each sample candidate key frame, comprising:
respectively extracting features of the plurality of sample candidate key frames to obtain a plurality of sample candidate frame features respectively corresponding to the plurality of sample candidate key frames;
determining an evaluation value vector based on correlations between the plurality of sample candidate frame features and the reference vector; the reference vector is used for representing semantic features of the video sample;
determining at least one sample key frame from a plurality of sample candidate key frames based on the evaluation value vector, comprising:
and based on the evaluation values respectively corresponding to the sample candidate key frames in the evaluation value vector, taking a preset number of sample candidate key frames with the largest evaluation values in the sample candidate key frames as at least one sample key frame.
Specifically, after obtaining a plurality of sample candidate key frames, inputting the plurality of sample candidate key frames into an initial key frame extraction model, respectively extracting features of the plurality of sample candidate key frames through the initial key frame extraction model to obtain a plurality of corresponding sample candidate key frame features, and determining an evaluation value vector based on correlation between the plurality of sample candidate key frame features and a reference vector.
The reference vector may be used to characterize semantic features of a video sample, if the correlation between the sample candidate key frame feature and the reference vector is greater, that is, the degree of correlation between the sample candidate key frame and the video sample is greater, the sample candidate key frame is more capable of representing the video sample, and the evaluation value corresponding to the sample candidate key frame is higher, so that the possibility of selecting the sample candidate key frame as the sample key frame is greater.
After the evaluation value vector is obtained, a preset number of sample candidate key frames with the largest evaluation value can be used as at least one sample key frame according to the evaluation values respectively corresponding to the sample candidate key frames.
The preset number may be preset, and the corresponding preset number may be set according to different video understanding tasks, or may be set according to the length of the video sample, which is not specifically limited in the embodiments of the present disclosure.
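As a small illustration of this top-K selection, the following sketch picks the preset number of candidates with the largest evaluation values; the concrete values are made up for the example.

    import torch

    evaluation_value_vector = torch.tensor([0.91, 0.15, 0.78, 0.40, 0.88])
    preset_number = 3
    top = torch.topk(evaluation_value_vector, k=preset_number)
    sample_key_frame_indices = top.indices   # candidates 0, 4 and 2 become sample key frames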
As an alternative embodiment, determining an evaluation value vector based on correlation between a plurality of sample candidate frame features and a reference vector, comprises:
acquiring a reference vector corresponding to the current training operation through an initial semantic extraction module;
For each sample candidate frame feature, determining the similarity between the reference vector and the sample candidate frame feature to obtain the weight corresponding to the sample candidate frame feature;
and generating an evaluation value vector based on each sample candidate frame characteristic and the corresponding weight.
Specifically, an initial semantic extraction module may be configured to determine semantic features of the corresponding video sample based on the plurality of sample candidate key frames.
In the current training operation, a reference vector corresponding to the current training operation can be obtained through an initial semantic extraction module based on a plurality of sample candidate key frames. For each sample candidate frame feature, the similarity between the reference vector and the sample candidate frame feature may be calculated as the weight corresponding to the sample candidate frame feature.
The reference vector can represent the semantic features of the video sample; the greater the similarity between a sample candidate frame feature and the reference vector, the better that sample candidate frame feature can represent the video sample, and the greater the corresponding weight.
And after obtaining the weights respectively corresponding to the sample candidate frame features, weighting the sample candidate frame features based on the weights respectively corresponding to the sample candidate frame features, and determining an evaluation value vector according to the weighted results.
Alternatively, the evaluation value vector may be generated based on a self-attention mechanism. For each sample candidate Key frame feature, the reference vector may be referred to as a Query vector Q (i.e., query), and the sample candidate Key frame feature may be referred to as a Key vector K (i.e., key) and a Value vector V (i.e., value), respectively.
For each sample candidate key frame feature, calculating a dot product of the query vector Q and the key vector K, carrying out softmax normalization on the result of the dot product, and taking the normalized result as the weight of the corresponding value vector V. And carrying out weighted summation on each value vector V and the corresponding weight thereof to obtain the self-attention output vector.
After obtaining the self-attention output vector, it may be input to an FFN (feed-forward network), which may be composed of two MLP (multilayer perceptron) layers. When the self-attention output vector size is n×d, the vector output by the FFN has size n×1; then, according to the number K of sample candidate frames, the n×1 vector is truncated to obtain a K×1 vector, the K×1 vector is input to a sigmoid (activation function), and the output vector is used as the evaluation value vector.
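A minimal PyTorch-style sketch of the evaluation-value computation described above is given below. It keeps the attention weighting per candidate frame (no summation across frames) so that one score per candidate falls out of the sigmoid, and it does not reproduce the exact n×d / K×1 tensor bookkeeping of the disclosure; the class and parameter names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class EvaluationHead(nn.Module):
        # Scores each sample candidate frame feature against the reference vector.
        def __init__(self, dim: int):
            super().__init__()
            self.ffn = nn.Sequential(                          # two MLP layers, as in the text
                nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

        def forward(self, frame_features, reference_vector):
            # frame_features: (K, d) sample candidate frame features (keys/values)
            # reference_vector: (d,) semantic feature of the video sample (query)
            attn_logits = frame_features @ reference_vector / frame_features.shape[-1] ** 0.5
            weights = torch.softmax(attn_logits, dim=0)        # softmax-normalized dot products
            attended = weights.unsqueeze(-1) * frame_features  # weight each value vector
            scores = self.ffn(attended).squeeze(-1)            # (K,)
            return torch.sigmoid(scores)                       # evaluation value vector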
In the embodiments of the disclosure, the similarity between the reference vector and each sample candidate frame feature is calculated as the weight of that sample candidate frame feature, and the evaluation value vector is generated from the sample candidate frame features and their corresponding weights. Sample candidate frame features that better represent the characteristics of the video sample thus receive larger weights, so the calculated evaluation value vector accurately reflects the degree of association between each sample candidate key frame and the video sample, and more representative sample key frames can subsequently be selected based on the evaluation value vector.
As an alternative embodiment, the method further comprises:
and adjusting parameters of the initial semantic extraction module based on the second loss function, and taking the initial semantic extraction module after the parameters are adjusted as an initial semantic extraction module corresponding to the next training operation.
Specifically, in the current training operation, parameters of the initial semantic extraction module can be adjusted based on the second loss function, and the initial semantic extraction module after parameter adjustment participates in the next training operation. By repeatedly performing the training operation, the initial semantic extraction module keeps adjusting its parameters through continuous learning, so that it can extract the semantic features of the plurality of sample candidate key frames more accurately and the output reference vector can express the semantic features of the video sample more accurately, thereby improving the accuracy of the generated evaluation value vector.
As an optional embodiment, performing feature extraction on a plurality of sample candidate key frames to obtain a plurality of sample candidate frame features corresponding to the plurality of sample candidate key frames, respectively, including:
extracting features of the sample candidate key frames to obtain a plurality of initial sample candidate frame features corresponding to the sample candidate key frames respectively;
and determining time sequence information among a plurality of sample candidate key frames, and for each sample candidate key frame, carrying out feature fusion on the time sequence information and the corresponding initial sample candidate frame features to obtain sample candidate frame features respectively corresponding to each sample candidate key frame.
Specifically, to obtain sample candidate frame features corresponding to each sample candidate key frame, for each sample candidate key frame, feature extraction may be performed on the sample candidate key frame once, and the extracted features are used as corresponding initial sample candidate frame features.
After obtaining the initial sample candidate frame features corresponding to each sample candidate key frame, time sequence information between each sample candidate key frame may be determined, where the time sequence information may include a time sequence of each sample candidate key frame in the video sample.
For each sample candidate key frame, the time sequence information and the corresponding initial sample candidate frame feature can be subjected to feature fusion, and the fused feature is used as the corresponding sample candidate frame feature.
Optionally, the time sequence information and the corresponding initial sample candidate frame feature can be subjected to feature fusion based on the attention mechanism, so that the fused feature not only contains the information of the corresponding sample candidate key frame, but also contains the information of the video frame related to the sample candidate key frame, and the feature expression capability of the candidate frame feature is improved.
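A minimal sketch of the attention-based fusion of timing information with the initial sample candidate frame features is given below; the learnable position coding, the single Transformer encoder layer and the hyper-parameters are illustrative assumptions (the feature dimension is assumed divisible by the number of attention heads).

    import torch
    import torch.nn as nn

    class TemporalFusion(nn.Module):
        # Fuses the order of the candidate frames in the video into their features.
        def __init__(self, dim: int, max_frames: int = 256):
            super().__init__()
            self.position_codes = nn.Embedding(max_frames, dim)      # learnable position coding
            self.encoder = nn.TransformerEncoderLayer(
                d_model=dim, nhead=8, batch_first=True)              # attention-based fusion

        def forward(self, initial_features):
            # initial_features: (batch, K, dim), ordered by appearance in the video sample
            k = initial_features.shape[1]
            positions = torch.arange(k, device=initial_features.device)
            with_timing = initial_features + self.position_codes(positions)
            return self.encoder(with_timing)                         # sample candidate frame features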
As an alternative embodiment, the method further comprises:
acquiring at least two video understanding tasks;
determining at least two different tag types based on the at least two video understanding tasks;
and acquiring at least two different initial video understanding models corresponding to the at least two different tag types respectively.
In particular, a video understanding task refers to extracting and reasoning about semantic information in a video, such as video classification, video tags, video searching, video recommendation, and the like, by analyzing and understanding the content of the video.
At least two video understanding tasks may be determined based on the actual application scenario, and at least two corresponding different tag types may be determined based on the at least two video understanding tasks.
The tag type may be a type of a result output by the video understanding task, for example, when the video understanding task is a type of outputting video, the tag type is a type corresponding to the video; when the video understanding task is outputting a brief introduction of the video, then the tag type is a brief introduction of the video content.
For example, in an application scenario where video understanding is performed on film-and-television videos, a video tag of the video needs to be output. The video tag may include keywords related to the video, such as the type of the video (e.g., costume drama, modern drama, etc.), the subject matter of the video (e.g., family, comedy, time travel, etc.), the names of characters appearing in the video, the names of actors appearing in the video, etc.; the video tag may also be a content profile of the video.
In this application scenario, extracting video keywords may be taken as a first video understanding task and extracting a video profile as a second video understanding task, with the video keywords as one tag type and the video profile as another tag type.
For each tag type, a corresponding initial video understanding model can be set, and the initial video understanding model is trained, so that the initial video understanding model is continuously learned in the training process, and the capability of extracting the tag type from the video is achieved.
In the embodiment of the disclosure, the initial keyframe extraction model is coupled with at least two different initial video understanding models, and in the training process, parameters of the initial keyframe extraction model can be adjusted according to loss functions respectively corresponding to the different initial video understanding models, so that more general sample keyframes are learned to be extracted, the keyframes extracted by the trained keyframe extraction model can be suitable for different video understanding tasks, and the universality of the keyframe extraction model is improved.
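To make the idea of one initial video understanding model per tag type concrete, a minimal sketch of two task heads with different label types is given below. The keyword-tagging and category-classification tasks, the 512-dimensional pooled feature and the loss choices are illustrative assumptions; a shared backbone that pools the selected sample key frames into a single feature vector is assumed upstream.

    import torch.nn as nn

    class KeywordTagger(nn.Module):
        # Multi-label head: one independent score per candidate keyword tag.
        def __init__(self, feature_dim: int, num_keywords: int):
            super().__init__()
            self.head = nn.Linear(feature_dim, num_keywords)
            self.loss_fn = nn.BCEWithLogitsLoss()

        def forward(self, pooled_key_frame_features):
            return self.head(pooled_key_frame_features)

    class CategoryClassifier(nn.Module):
        # Single-label head: one class per video (e.g. its genre).
        def __init__(self, feature_dim: int, num_categories: int):
            super().__init__()
            self.head = nn.Linear(feature_dim, num_categories)
            self.loss_fn = nn.CrossEntropyLoss()

        def forward(self, pooled_key_frame_features):
            return self.head(pooled_key_frame_features)

    # two initial video understanding models, one per tag type
    understanding_models = [KeywordTagger(512, 2000), CategoryClassifier(512, 30)]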
As an alternative embodiment, the method further comprises:
when the training times corresponding to the current training operation are detected to be in accordance with the preset times, increasing the preset number based on the frame extraction step length; the frame extraction step length is determined based on the duration of the video sample;
and taking the increased preset number as the preset number corresponding to the next training operation.
Specifically, after the evaluation value vector is obtained, a preset number of sample candidate key frames with the largest evaluation value in the plurality of sample candidate key frames may be used as at least one sample key frame according to the evaluation value vector, that is, the preset number is the number of extracted sample key frames.
When the preset number is too small, the sampling rate of key frame extraction is too low, too few key frames are extracted, and part of the information in the video sample may be missed; when the preset number is too large, redundancy may exist between the extracted key frames and the amount of data the model has to process becomes large, reducing the model training efficiency.
To address these problems, the number of executions of the training operation can be recorded; when it is detected that the number of training times corresponding to the current training operation reaches the preset number of times, the preset number can be increased according to the frame-extraction step, and the increased preset number participates in the next training operation. For example, the sum of the frame-extraction step and the preset number may be used as the increased preset number.
The frame extraction step length may be determined based on the duration of the video sample, and when the duration of the video sample is longer, a larger number of key frames are required to represent the information of the whole video sample, and a larger frame extraction step length may be set. For example, when the video duration is 10 minutes, the frame-extraction step size may be set to 1; when the video duration is 60 minutes, the frame-extraction step size can be set to 2. The frame extraction step length is adaptively determined through the time length of the video sample, and the preset number can be more accurately adjusted for videos with different time lengths.
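A minimal sketch of this schedule is given below; the 30-minute boundary between the two step values is an assumption (the disclosure only gives 10 minutes with step 1 and 60 minutes with step 2 as examples), as are the function names.

    def frame_step_from_duration(duration_minutes: float) -> int:
        # Longer videos get a larger frame-extraction step (threshold is illustrative).
        return 1 if duration_minutes <= 30 else 2

    def update_preset_number(preset_number: int, training_step: int,
                             preset_interval: int, frame_step: int) -> int:
        # Every preset_interval training operations, enlarge the number of sample
        # key frames taken from the candidates by the frame-extraction step.
        if training_step > 0 and training_step % preset_interval == 0:
            preset_number += frame_step
        return preset_number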
The preset number of times may be set according to actual application requirements; for example, it may be determined according to the number and types of the at least one video understanding task, or according to the software resources and/or hardware resources available for model training, which is not limited in the embodiments of the present disclosure.
It should be noted that, during the first training operation, an initial preset number may be set as the preset number to perform the key frame extraction.
In the embodiment of the disclosure, the preset number is increased when the number of training times is detected to reach the preset number of times. Although increasing the preset number increases the amount of data the model processes, after being trained for the preset number of times the parameters of the model are already relatively close to those of the trained model, so the impact of the larger data volume on training efficiency is limited. Adjusting the number of extracted sample key frames during training thus allows the extracted sample key frames to effectively represent the information of the whole video sample while avoiding a noticeable loss of model training efficiency.
As an alternative embodiment, acquiring a plurality of sample candidate key frames in a video sample includes:
If the difference between the current video frame and the previous video frame in the video sample is detected to be larger than a preset threshold value, the current video frame is used as a sample candidate key frame;
or
And extracting frames from the video samples at preset time intervals to obtain a plurality of sample candidate key frames.
Specifically, in order to pre-extract key frames from the video sample, frame extraction may be performed based on pixel differences between adjacent video frames in the video sample: one video frame in the video sample is taken as the current video frame, the pixel difference between the current video frame and the previous video frame is calculated, and if the pixel difference is greater than the preset threshold, the current video frame is used as a sample candidate key frame. This judgment is carried out over the video frames in the video sample, thereby screening out a plurality of sample candidate key frames.
Alternatively, frames can be extracted from the video sample at preset time intervals: one frame is extracted every preset time interval, and the plurality of extracted video frames are used as the sample candidate key frames.
Optionally, frame extraction may be performed on the video sample based on FFmpeg to obtain the plurality of sample candidate key frames.
For example, pixel differences between adjacent frames in the video sample may be used for frame extraction. For each frame in the video sample, its pixel values are compared with those of the previous frame and the mean absolute difference (Mean Absolute Difference, MAD) is calculated; the larger the MAD, the larger the image difference between adjacent frames. When the MAD exceeds a set threshold, a new scene may be considered to have appeared, and the frame is extracted as a sample candidate key frame.
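As one possible illustration of this MAD-based pre-extraction (using OpenCV and NumPy; the threshold value of 20 on 8-bit grayscale pixels is an assumption, not a value given in the disclosure):

```python
import cv2
import numpy as np

def extract_candidates_by_mad(video_path, mad_threshold=20.0):
    # Keep a frame as a sample candidate key frame when its mean absolute
    # difference (MAD) to the previous frame exceeds the threshold.
    cap = cv2.VideoCapture(video_path)
    candidates, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            mad = float(np.mean(cv2.absdiff(gray, prev_gray)))
            if mad > mad_threshold:        # a large MAD suggests a new scene
                candidates.append(frame)
        prev_gray = gray
    cap.release()
    return candidates
```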
For video samples in which the picture remains largely unchanged for most of the time, extracting frames only through scene differences would yield too few frames. A frame extraction time interval can therefore also be set, that is, one sample candidate key frame is extracted per unit time interval, so as to guarantee the sampling rate of the frame extraction.
A specific frame extraction command may be constructed based on the select filter and the scene score in FFmpeg; for example, the command may be set to extract a frame when the scene difference is greater than 0.3, or to extract at least one frame every 5 seconds.
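A possible form of such a command, invoked here from Python, is sketched below; the input and output file names are placeholders, and the isnan term is only an assumed way of making sure the very first frame can be selected.

```python
import subprocess

# Frame extraction with FFmpeg's select filter: keep a frame when the scene
# difference exceeds 0.3, or when at least 5 seconds have passed since the
# previously selected frame.
command = [
    "ffmpeg", "-i", "input.mp4",
    "-vf", "select='gt(scene,0.3)+isnan(prev_selected_t)+gte(t-prev_selected_t,5)'",
    "-vsync", "vfr",
    "candidate_%04d.jpg",
]
subprocess.run(command, check=True)
```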
FFmpeg is an open-source audio and video processing tool library that can convert, edit and process audio, video and image files in various formats. FFmpeg supports numerous audio, video and image formats, including common ones such as MP3, AAC, H.264, MPEG-4, JPEG and PNG. Owing to its powerful functions and open-source nature, FFmpeg is widely used in audio and video processing software, streaming media servers, media players and the like. FFmpeg is adapted to a wide range of platforms and hardware, and can extract frames from videos in various formats with high performance and at high speed.
FFmpeg mainly comprises the following three parts:
FFmpeg library: the FFmpeg library provides core functions of audio and video processing, including encoding and decoding, format conversion, filter processing and the like;
FFmpeg command line tool: the FFmpeg command line tool can directly call the FFmpeg library to convert, edit and process the audio and video files. Through the FFmpeg command line tool, a user can perform various complex audio and video operations in the terminal, such as format conversion, clipping, merging, watermarking and the like;
FFprobe: FFprobe is a tool for analyzing information of audio and video files, and can display detailed information such as metadata, codec information, bit rate and the like of the files.
Fig. 4 is a flow chart of a key frame extraction method according to an embodiment of the disclosure, as shown in fig. 4, where the method includes:
step S210, obtaining a video to be processed, and performing frame extraction on the video to be processed to obtain a plurality of candidate key frames;
step S220, determining evaluation value vectors corresponding to the candidate key frames based on the candidate key frames through the trained key frame extraction model, and determining at least one key frame from the candidate key frames based on the evaluation value vectors corresponding to the candidate key frames;
The key frame extraction model is trained based on the model training method provided by any optional embodiment of the disclosure.
Specifically, the at least one key frame extracted by the key frame extraction method provided by the embodiment of the present disclosure may be directly applied to scenarios such as video review and video deduplication, or may be combined with a video understanding algorithm, with the at least one key frame serving as input data of the video understanding algorithm; this is not limited in the embodiments of the present disclosure.
In order to extract the key frames, the video to be processed can be acquired through image acquisition equipment such as a camera, a mobile phone, a video camera or a tablet computer, or can be collected from the network on the premise of complying with relevant regulations; the specific manner of obtaining the video to be processed is not limited.
After the video to be processed is obtained, frames can be extracted from it to obtain a plurality of candidate key frames; for the manner of obtaining the candidate key frames, reference may be made to the manner of obtaining the sample candidate key frames during training, which is not repeated here.
After obtaining the plurality of candidate key frames, the plurality of candidate key frames can be input into a trained key frame extraction model, evaluation value vectors corresponding to the plurality of candidate key frames are determined based on the plurality of candidate key frames through the key frame extraction model, and a preset number of candidate key frames with the largest evaluation value are taken as at least one key frame from the plurality of candidate key frames based on the evaluation value vectors corresponding to the plurality of candidate key frames. The preset number can be specifically set according to different application requirements.
The specific processing procedure of the key frame extraction model can be referred to the description in the training procedure above, and will not be repeated here.
According to the key frame extraction method provided by the embodiment of the disclosure, during training of the key frame extraction model the key frames in a video sample are pre-extracted once, the extracted sample candidate key frames are scored, and the candidate key frames are screened based on their scores, so that the sample candidate key frames with a higher degree of association with the video sample are used as sample key frames. The extracted sample key frames therefore better represent the characteristics of the video sample, which ensures that the trained key frame extraction model can accurately extract representative key frames from a video and improves the accuracy of the extracted key frames.
Further, through coupling training of the initial keyframe extraction model and at least one initial video understanding model, the initial keyframe extraction model can be adaptively adjusted according to the downstream video understanding task in the training process, so that the keyframes extracted by the trained keyframe extraction model can be better adapted to the downstream video understanding task, and the accuracy of the output result of the downstream video understanding task is improved.
As an alternative embodiment, fig. 5 is a flowchart of a key frame extraction method according to an embodiment of the disclosure, as shown in fig. 5, where the method includes:
and acquiring a plurality of candidate key frames of the video to be processed through FFmpeg.
Inputting the plurality of candidate key frames into a feature extraction module of the key frame extraction model, and extracting features of the candidate key frames through the feature extraction module to obtain a plurality of corresponding initial candidate key frame features. The feature extraction module may be constructed based on a CNN (Convolutional Neural Network) model, or based on a Transformer model, for example ViT (Vision Transformer) or Swin-T (Swin Transformer).
After obtaining the plurality of initial candidate key frame features, a time coding vector is determined based on the time sequence information among the initial candidate key frames. For each initial candidate key frame feature, the feature and the time coding vector are added, the summed vector is input into an encoding module, and the encoding module performs attention transformation among the initial candidate key frame features in combination with the time coding vector to obtain a plurality of fused candidate key frame features. The encoding module may be the encoder module of a Transformer, and each fused candidate key frame feature has the same size as the corresponding initial candidate key frame feature.
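A minimal sketch of this fusion step, assuming a learned time encoding, PyTorch's built-in Transformer encoder, and arbitrary sizes K and d:

```python
import torch
import torch.nn as nn

K, d = 32, 512                                    # number of candidates and feature size (assumed)
frame_features = torch.randn(K, d)                # initial candidate key frame features
time_encoding = nn.Embedding(K, d)                # one time coding vector per temporal position
encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

x = frame_features + time_encoding(torch.arange(K))   # add the time coding vector to each feature
fused_features = encoder(x.unsqueeze(0)).squeeze(0)   # fused candidate key frame features, (K, d)
```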
After obtaining the plurality of candidate key frame features, the plurality of candidate key frame features may be input to a decoding module, together with an obtained Query vector Query. The decoding module may be the decoder module of a Transformer, or may be constructed based on 3D convolution. The Query vector Query may be a reference vector generated by the initial semantic extraction module, or a learnable embedding of size N×d, where N is set to be much larger than the number K of candidate key frames.
For each candidate key frame feature, the decoding module takes the candidate key frame feature as both the Key vector Key and the Value vector Value, computes the dot product of the Query and the Key based on the attention mechanism, performs softmax normalization on the result to obtain weight coefficients, and multiplies the weight coefficients by the Value and sums them to obtain the attention output, namely a result vector, whose size is also N×d.
The result vector is input into an FFN; after passing through the FFN, an N×1 vector is obtained, which is truncated according to the number K of candidate key frames to obtain a K×1 vector, and the K×1 vector is input into a sigmoid function to obtain the evaluation value vector.
And taking a preset number of candidate key frames with the largest evaluation value from the candidate key frames as a plurality of key frames based on the evaluation value vector.
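The decoding and scoring steps described above can be sketched as follows. The sizes, the scaled dot product, and the choice of keeping the first K entries of the N×1 vector are assumptions made for illustration; the description above only specifies the dot product, softmax, weighted sum, FFN, truncation to K×1 and sigmoid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, N, d, preset_number = 32, 256, 512, 8             # N is set much larger than K (values assumed)
fused_features = torch.randn(K, d)                    # Key and Value: candidate key frame features
queries = nn.Parameter(torch.randn(N, d))             # learnable Query embedding of size N x d
ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

weights = F.softmax(queries @ fused_features.t() / d ** 0.5, dim=-1)  # (N, K) weight coefficients
result = weights @ fused_features                     # attention output (result vector), size N x d
scores = torch.sigmoid(ffn(result)[:K].squeeze(-1))   # N x 1 truncated to K x 1, then sigmoid
key_frame_indices = torch.topk(scores, k=preset_number).indices  # preset number of key frames
```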
As an alternative embodiment, after obtaining the trained keyframe extraction model, the method further includes:
taking the feature extraction module as a teacher model, and performing model distillation on the feature extraction module to obtain a student model corresponding to the feature extraction module;
and replacing the feature extraction module in the key frame extraction model with the student model corresponding to the feature extraction module.
Specifically, to ensure performance during training, the feature extraction module in the key frame extraction model is often built from a model with a relatively large amount of computation. However, if such a complex module is still used in actual application, considerable computing resources are wasted.
To address this problem, the feature extraction module may be made lightweight through model distillation.
Model distillation is a knowledge distillation technique for migrating the knowledge of a large neural network (the teacher model) to a smaller neural network (the student model). The student model is trained to mimic the behavior of the teacher model, thereby reducing computational cost and memory footprint while maintaining relatively high accuracy. The distillation process typically involves letting the student model learn both the soft targets (probability distributions) of the teacher model and the original hard targets (true labels), so that the student model can capture the knowledge of the teacher model and improve its generalization ability.
In the embodiment of the disclosure, a large number of service video frames can be used as distillation data, with the distillation target being the frame features output by the original feature extraction module. A lightweight feature extraction module can be obtained by distilling the original feature extraction module, and the original module is then replaced with the lightweight one, which reduces the required computing resources, lowers the computing cost and improves the computing efficiency.
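A minimal sketch of such feature-level distillation is given below; the function and variable names, the MSE objective, and the Adam optimizer are illustrative assumptions rather than the specific distillation recipe of the disclosure.

```python
import torch
import torch.nn as nn

def distill_feature_extractor(teacher, student, frame_loader, epochs=1, lr=1e-4):
    # The student regresses the frame features produced by the original
    # (teacher) feature extraction module on batches of service video frames.
    teacher.eval()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for frames in frame_loader:
            with torch.no_grad():
                target_features = teacher(frames)    # distillation target: teacher frame features
            loss = mse(student(frames), target_features)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```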
The disclosed embodiments provide a model training apparatus, as shown in fig. 6, which may include:
an obtaining module 310, configured to obtain a plurality of sample candidate key frames in a video sample;
the training module 320 is configured to perform at least one training operation on the initial keyframe extraction model and at least one initial video understanding model based on the plurality of sample candidate keyframes until a preset training end condition is met, and take the initial keyframe extraction model that meets the preset training end condition as a trained keyframe extraction model;
wherein the training operation comprises:
inputting the plurality of sample candidate key frames into an initial key frame extraction model, and determining an evaluation value vector of each sample candidate key frame; each evaluation value in the evaluation value vector is used for representing the association degree of each sample candidate key frame and the video sample respectively;
Determining at least one sample key frame from the plurality of sample candidate key frames based on the evaluation value vector;
respectively inputting the at least one sample key frame into at least one initial video understanding model to obtain video prediction labels which are respectively output by the at least one initial video understanding model and are aimed at the video samples;
for each initial video understanding model, determining a first loss function corresponding to the initial video understanding model based on a video sample label and a video prediction label corresponding to the video sample;
determining a second loss function based on the first loss function corresponding to each initial video understanding model;
and adjusting the parameters of the initial key frame extraction model and the parameters of the at least one initial video understanding model based on the second loss function, taking the initial key frame extraction model after parameter adjustment as an initial key frame extraction model corresponding to the next training operation, and taking the at least one initial video understanding model after parameter adjustment as at least one initial video understanding model corresponding to the next training operation.
As an alternative embodiment, the training module comprises:
The feature extraction sub-module is used for respectively extracting features of the plurality of sample candidate key frames to obtain a plurality of sample candidate frame features respectively corresponding to the plurality of sample candidate key frames;
an evaluation value vector determination submodule for determining the evaluation value vector based on correlation between the plurality of sample candidate frame features and a reference vector; the reference vector is used for representing semantic features of the video sample;
and the sample key frame extraction sub-module is used for taking a preset number of sample candidate key frames with the largest evaluation values in the sample candidate key frames as the at least one sample key frame based on the evaluation values respectively corresponding to the sample candidate key frames in the evaluation value vector.
As an alternative embodiment, the evaluation value vector determination submodule is specifically configured to:
acquiring a reference vector corresponding to the current training operation through an initial semantic extraction module;
for each sample candidate frame feature, determining the similarity between the reference vector and the sample candidate frame feature to obtain a weight corresponding to the sample candidate frame feature;
the evaluation value vector is generated based on each sample candidate frame feature and its corresponding weight.
As an alternative embodiment, the apparatus further comprises:
and the parameter updating module is used for adjusting the parameters of the initial semantic extraction module based on the second loss function, and taking the initial semantic extraction module after the parameters are adjusted as the initial semantic extraction module corresponding to the next training operation.
As an alternative embodiment, the feature extraction submodule is specifically configured to:
extracting features of the sample candidate key frames to obtain a plurality of initial sample candidate frame features corresponding to the sample candidate key frames respectively;
and determining time sequence information among a plurality of sample candidate key frames, and carrying out feature fusion on the time sequence information and the corresponding initial sample candidate frame features for each sample candidate key frame to obtain sample candidate frame features respectively corresponding to each sample candidate key frame.
As an alternative embodiment, the apparatus further comprises an initial video understanding model acquisition module for:
acquiring at least two video understanding tasks;
determining at least two different tag types based on the at least two video understanding tasks;
and acquiring at least two different initial video understanding models corresponding to the at least two different tag types respectively.
As an alternative embodiment, the apparatus further comprises a preset number adjustment module for:
when it is detected that the number of training times corresponding to the current training operation reaches the preset number of times, increasing the preset number based on the frame extraction step length; the frame extraction step length is determined based on the duration of the video sample;
and taking the increased preset number as the preset number corresponding to the next training operation.
As an alternative embodiment, the obtaining module is specifically configured to:
if the difference between the current video frame and the previous video frame in the video sample is detected to be larger than a preset threshold value, the current video frame is used as a sample candidate key frame;
or
And extracting frames from the video sample at preset time intervals to obtain the plurality of sample candidate key frames.
Embodiments of the present disclosure provide a key frame extraction apparatus, as shown in fig. 7, which may include:
a candidate key frame obtaining module 410, configured to obtain a video to be processed, and extract frames from the video to be processed to obtain a plurality of candidate key frames;
a key frame extraction module 420, configured to determine, based on the plurality of candidate key frames through a trained key frame extraction model, an evaluation value vector corresponding to the plurality of candidate key frames, and determine at least one key frame from the plurality of candidate key frames based on the evaluation value vector corresponding to the plurality of candidate key frames;
The key frame extraction model is trained based on the model training method provided by any optional embodiment of the disclosure.
The device of the embodiment of the disclosure can execute the method provided by the embodiment of the disclosure; its implementation principle is similar and it achieves corresponding technical effects. Actions performed by each module in the apparatus of the embodiments of the present disclosure correspond to steps in the method of the embodiments of the present disclosure, and for detailed functional descriptions of each module of the apparatus, reference may be made to the corresponding method shown in the foregoing, which is not repeated here.
In the presently disclosed embodiments, the term "module" or "unit" refers to a computer program or a portion of a computer program having a predetermined function and working with other related portions to achieve a predetermined objective, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
An embodiment of the present disclosure provides an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the method provided in any of the alternative embodiments of the present disclosure. Compared with the prior art, the following can be realized: a plurality of sample candidate key frames in a video sample are acquired, an evaluation value vector is determined by an initial key frame extraction model based on the plurality of sample candidate key frames, and at least one sample key frame is determined from the plurality of sample candidate key frames based on the evaluation value vector. The key frames in the video sample are pre-extracted once, the extracted sample candidate key frames are scored, the candidate key frames are screened based on their scores, and the sample candidate key frames with a higher degree of association with the video sample are used as sample key frames, so that the extracted sample key frames better represent the characteristics of the video sample; this in turn ensures that the trained key frame extraction model can accurately extract representative key frames from a video and improves the accuracy of the extracted key frames.
And further, by coupling training the initial keyframe extraction model with at least one initial video understanding model, the initial keyframe extraction model can be adaptively adjusted according to the downstream video understanding task in the training process, so that the keyframes extracted by the trained keyframe extraction model can be better adapted to the downstream video understanding task, and the accuracy of the output result of the downstream video understanding task is improved.
In an alternative embodiment, there is provided an electronic device, as shown in fig. 8; the electronic device 4000 shown in fig. 8 includes: a processor 4001 and a memory 4003, wherein the processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as transmitting and/or receiving data. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present disclosure.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path for transferring information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 can be divided into an address bus, a data bus, a control bus and so on. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or only one type of bus.
Memory 4003 may be, but is not limited to, a ROM (Read Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 4003 is used for storing a computer program that executes an embodiment of the present disclosure, and is controlled to be executed by the processor 4001. The processor 4001 is configured to execute a computer program stored in the memory 4003 to realize the steps shown in the foregoing method embodiment.
Electronic devices include, but are not limited to: mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), wearable devices and other mobile terminals, as well as fixed terminals such as digital TVs and desktop computers.
The disclosed embodiments provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
The disclosed embodiments also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
It should be understood that, although various operational steps are indicated by arrows in the flowcharts of the disclosed embodiments, the order in which these steps are performed is not limited to the order indicated by the arrows. In some implementations of embodiments of the present disclosure, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the scenario that the execution time is different, the execution sequence of the sub-steps or stages can be flexibly configured according to the requirement, and the embodiment of the disclosure is not limited to this.
The foregoing is merely an optional implementation of some application scenarios of the disclosure. It should be noted that, for those skilled in the art, other similar implementations based on the technical ideas of the disclosure may be adopted without departing from those ideas, and such implementations also fall within the protection scope of the embodiments of the disclosure.

Claims (14)

1. A method of model training, comprising:
acquiring a plurality of sample candidate key frames in a video sample;
performing at least one training operation on the initial key frame extraction model and at least one initial video understanding model based on the plurality of sample candidate key frames until a preset training ending condition is met, and taking the initial key frame extraction model meeting the preset training ending condition as a trained key frame extraction model;
wherein the training operation comprises:
inputting the plurality of sample candidate key frames into an initial key frame extraction model, and determining an evaluation value vector of each sample candidate key frame; each evaluation value in the evaluation value vector is used for representing the association degree of each sample candidate key frame and the video sample respectively;
determining at least one sample key frame from the plurality of sample candidate key frames based on the evaluation value vector;
Respectively inputting the at least one sample key frame into at least one initial video understanding model to obtain video prediction labels which are respectively output by the at least one initial video understanding model and are aimed at the video samples;
for each initial video understanding model, determining a first loss function corresponding to the initial video understanding model based on a video sample label and a video prediction label corresponding to the video sample;
determining a second loss function based on the first loss function corresponding to each initial video understanding model;
and adjusting the parameters of the initial key frame extraction model and the parameters of the at least one initial video understanding model based on the second loss function, taking the initial key frame extraction model after parameter adjustment as an initial key frame extraction model corresponding to the next training operation, and taking the at least one initial video understanding model after parameter adjustment as at least one initial video understanding model corresponding to the next training operation.
2. The model training method of claim 1, wherein the inputting the plurality of sample candidate key frames into the initial key frame extraction model and determining an evaluation value vector of each sample candidate key frame comprises:
Respectively extracting features of the plurality of sample candidate key frames to obtain a plurality of sample candidate frame features respectively corresponding to the plurality of sample candidate key frames;
determining the evaluation value vector based on correlations between the plurality of sample candidate frame features and a reference vector; the reference vector is used for representing semantic features of the video sample;
the determining at least one sample key frame from the plurality of sample candidate key frames based on the evaluation value vector comprises:
and based on the evaluation values respectively corresponding to the sample candidate key frames in the evaluation value vector, taking a preset number of sample candidate key frames with the largest evaluation values in the sample candidate key frames as the at least one sample key frame.
3. The model training method of claim 2, wherein the determining the evaluation value vector based on correlation between the plurality of sample candidate frame features and a reference vector comprises:
acquiring a reference vector corresponding to the current training operation through an initial semantic extraction module;
for each sample candidate frame feature, determining the similarity between the reference vector and the sample candidate frame feature to obtain a weight corresponding to the sample candidate frame feature;
The evaluation value vector is generated based on each sample candidate frame feature and its corresponding weight.
4. A model training method as claimed in claim 3, further comprising:
and adjusting the parameters of the initial semantic extraction module based on the second loss function, and taking the initial semantic extraction module after the parameters are adjusted as an initial semantic extraction module corresponding to the next training operation.
5. The model training method of claim 2, wherein the performing feature extraction on the plurality of sample candidate key frames to obtain a plurality of sample candidate frame features respectively corresponding to the plurality of sample candidate key frames comprises:
extracting features of the sample candidate key frames to obtain a plurality of initial sample candidate frame features corresponding to the sample candidate key frames respectively;
and determining time sequence information among a plurality of sample candidate key frames, and carrying out feature fusion on the time sequence information and the corresponding initial sample candidate frame features for each sample candidate key frame to obtain sample candidate frame features respectively corresponding to each sample candidate key frame.
6. The model training method of claim 1, wherein the method further comprises:
Acquiring at least two video understanding tasks;
determining at least two different tag types based on the at least two video understanding tasks;
and acquiring at least two different initial video understanding models corresponding to the at least two different tag types respectively.
7. The model training method of claim 2, wherein the method further comprises:
when it is detected that the number of training times corresponding to the current training operation reaches a preset number of times, increasing the preset number based on a frame extraction step length; the frame extraction step length is determined based on the duration of the video sample;
and taking the increased preset number as the preset number corresponding to the next training operation.
8. The model training method of claim 1, wherein the obtaining a plurality of sample candidate key frames in a video sample comprises:
if the difference between the current video frame and the previous video frame in the video sample is detected to be larger than a preset threshold value, the current video frame is used as a sample candidate key frame;
or
And extracting frames from the video sample at preset time intervals to obtain the plurality of sample candidate key frames.
9. A key frame extraction method, comprising:
Acquiring a video to be processed, and performing frame extraction on the video to be processed to obtain a plurality of candidate key frames;
determining evaluation value vectors corresponding to the plurality of candidate key frames based on the plurality of candidate key frames through a trained key frame extraction model, and determining at least one key frame from the plurality of candidate key frames based on the evaluation value vectors corresponding to the plurality of candidate key frames;
wherein the key frame extraction model is trained based on the model training method of any one of claims 1-8.
10. A model training device, comprising:
the acquisition module is used for acquiring a plurality of sample candidate key frames in the video samples;
the training module is used for carrying out at least one training operation on the initial key frame extraction model and at least one initial video understanding model based on the plurality of sample candidate key frames until a preset training ending condition is met, and taking the initial key frame extraction model meeting the preset training ending condition as a trained key frame extraction model;
wherein the training operation comprises:
inputting the plurality of sample candidate key frames into an initial key frame extraction model, and determining an evaluation value vector of each sample candidate key frame; each evaluation value in the evaluation value vector is used for representing the association degree of each sample candidate key frame and the video sample respectively;
Determining at least one sample key frame from the plurality of sample candidate key frames based on the evaluation value vector;
respectively inputting the at least one sample key frame into at least one initial video understanding model to obtain video prediction labels which are respectively output by the at least one initial video understanding model and are aimed at the video samples;
for each initial video understanding model, determining a first loss function corresponding to the initial video understanding model based on a video sample label and a video prediction label corresponding to the video sample;
determining a second loss function based on the first loss function corresponding to each initial video understanding model;
and adjusting the parameters of the initial key frame extraction model and the parameters of the at least one initial video understanding model based on the second loss function, taking the initial key frame extraction model after parameter adjustment as an initial key frame extraction model corresponding to the next training operation, and taking the at least one initial video understanding model after parameter adjustment as at least one initial video understanding model corresponding to the next training operation.
11. A key frame extraction device, comprising:
The candidate key frame acquisition module is used for acquiring a video to be processed, and extracting frames from the video to be processed to obtain a plurality of candidate key frames;
the key frame extraction module is used for determining evaluation value vectors corresponding to the plurality of candidate key frames based on the plurality of candidate key frames through the trained key frame extraction model, and determining at least one key frame from the plurality of candidate key frames based on the evaluation value vectors corresponding to the plurality of candidate key frames;
wherein the key frame extraction model is trained based on the model training method of any one of claims 1-8.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-9.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1-9.
14. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1-9.
CN202410169860.1A 2024-02-06 2024-02-06 Model training method, key frame extraction method and device Active CN117710777B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410169860.1A CN117710777B (en) 2024-02-06 2024-02-06 Model training method, key frame extraction method and device

Publications (2)

Publication Number Publication Date
CN117710777A true CN117710777A (en) 2024-03-15
CN117710777B CN117710777B (en) 2024-06-04

Family

ID=90157499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410169860.1A Active CN117710777B (en) 2024-02-06 2024-02-06 Model training method, key frame extraction method and device

Country Status (1)

Country Link
CN (1) CN117710777B (en)

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190213474A1 (en) * 2018-01-09 2019-07-11 Adobe Inc. Frame selection based on a trained neural network
CN110119757A (en) * 2019-03-28 2019-08-13 北京奇艺世纪科技有限公司 Model training method, video category detection method, device, electronic equipment and computer-readable medium
CN111160191A (en) * 2019-12-23 2020-05-15 腾讯科技(深圳)有限公司 Video key frame extraction method and device and storage medium
CN111177460A (en) * 2019-12-20 2020-05-19 腾讯科技(深圳)有限公司 Method and device for extracting key frame
CN111783650A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Model training method, action recognition method, device, equipment and storage medium
CN111860237A (en) * 2020-07-07 2020-10-30 中国科学技术大学 Video emotion fragment identification method and device
US20210056348A1 (en) * 2019-08-19 2021-02-25 Neon Evolution Inc. Methods and systems for image and voice processing
CN113395584A (en) * 2020-10-10 2021-09-14 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium
CN114092849A (en) * 2020-08-03 2022-02-25 中移(苏州)软件技术有限公司 Method and device for determining and detecting classifier model, electronic equipment and storage medium
CN114283351A (en) * 2021-09-29 2022-04-05 腾讯科技(深圳)有限公司 Video scene segmentation method, device, equipment and computer readable storage medium
US20220301192A1 (en) * 2021-03-22 2022-09-22 Everypoint, Inc. Performing Object Modeling By Combining Visual Data From Images With Motion Data Of The Image Acquisition Device
CN115240099A (en) * 2022-06-21 2022-10-25 有米科技股份有限公司 Model training method and device based on multi-mode associated data
US20230059924A1 (en) * 2021-08-05 2023-02-23 Nvidia Corporation Selecting training data for neural networks
CN116310994A (en) * 2023-03-28 2023-06-23 北京奇树有鱼文化传媒有限公司 Video clip extraction method and device, electronic equipment and medium
US20230215174A1 (en) * 2021-12-31 2023-07-06 International Business Machines Corporation Dynamic network quantization for efficient video inference
CN116977884A (en) * 2022-11-08 2023-10-31 腾讯科技(深圳)有限公司 Training method of video segmentation model, video segmentation method and device
US20230353828A1 (en) * 2021-09-16 2023-11-02 Tencent Technology (Shenzhen) Company Limited Model-based data processing method and apparatus
CN117014693A (en) * 2022-10-26 2023-11-07 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN117011737A (en) * 2022-07-15 2023-11-07 腾讯科技(深圳)有限公司 Video classification method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
俞璜悦 et al.: "Video key frame extraction based on user interest semantics", Journal of Computer Applications (计算机应用), vol. 37, no. 11, 30 November 2017 (2017-11-30), pages 3139 *

Also Published As

Publication number Publication date
CN117710777B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN108882020B (en) Video information processing method, device and system
CN111581437A (en) Video retrieval method and device
EP3885966B1 (en) Method and device for generating natural language description information
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN113010703A (en) Information recommendation method and device, electronic equipment and storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN113255625B (en) Video detection method and device, electronic equipment and storage medium
CN114663798B (en) Single-step video content identification method based on reinforcement learning
CN116935170B (en) Processing method and device of video processing model, computer equipment and storage medium
CN115062709B (en) Model optimization method, device, equipment, storage medium and program product
CN113806588A (en) Method and device for searching video
CN112016406A (en) Video key frame extraction method based on full convolution network
Radarapu et al. Video summarization and captioning using dynamic mode decomposition for surveillance
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN113569068B (en) Descriptive content generation method, visual content encoding and decoding method and device
CN113297525A (en) Webpage classification method and device, electronic equipment and storage medium
CN116385946B (en) Video-oriented target fragment positioning method, system, storage medium and equipment
CN117453949A (en) Video positioning method and device
CN117710777B (en) Model training method, key frame extraction method and device
CN116977887A (en) Video aging classification model training method and video aging classification method
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
Shah et al. Video to text summarisation and timestamp generation to detect important events
CN117788842B (en) Image retrieval method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant