CN115984742A - Training method of video frame selection model, video processing method and device - Google Patents
Training method of video frame selection model, video processing method and device

- Publication number: CN115984742A
- Application number: CN202211696258.0A
- Authority: CN (China)
- Prior art keywords: video, frame selection, model, video frame, image feature
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The disclosure provides a training method of a video frame selection model, a video processing method and a video processing device, wherein the training method comprises the following steps: acquiring a sample video; determining a target frame selection result corresponding to the sample video based on a video frame selection model to be trained, the target frame selection result comprising a selected video frame, among a plurality of candidate video frames contained in the sample video, used for characterizing the features of the sample video; determining first image features corresponding to the candidate video frames and second image features of the selected video frame based on an image feature extraction model; and training the video frame selection model to be trained based on the first image features, the second image features and the target frame selection result.
Description
Technical Field
The disclosure relates to the technical field of image processing, in particular to a training method of a video frame selection model, a video processing method and a video processing device.
Background
The amount of information contained in a video grows with the video duration; correspondingly, the difficulty of processing the video and the computing resources required grow with the video duration as well. Reasonably screening out video frames is therefore very important when processing a video.
In the related art, video frames are generally sampled from a video at fixed intervals. However, with this approach the number of sampled video frames tends to grow with the video duration, and because different video frames of a video carry different amounts of information, this approach may sample many video frames that contain little information, resulting in low video processing precision.
Disclosure of Invention
The embodiment of the disclosure at least provides a training method of a video frame selection model, a video processing method and a video processing device.
In a first aspect, an embodiment of the present disclosure provides a training method for a video frame selection model, including:
acquiring a sample video;
determining a target frame selection result corresponding to the sample video based on a video frame selection model to be trained; the target frame selection result comprises a selected video frame, among a plurality of candidate video frames contained in the sample video, used for characterizing the features of the sample video;
determining first image features corresponding to the candidate video frames and second image features of the selected video frame based on an image feature extraction model;
and training the video frame selection model to be trained based on the first image features, the second image features and the target frame selection result.
In a possible implementation manner, the image feature extraction model is a model to be trained;
the training of the video frame selection model to be trained based on the first image feature, the second image feature and the target frame selection result includes:
and training the video frame selection model to be trained and the image feature extraction model to be trained on the basis of the first image feature, the second image feature and the target frame selection result.
In a possible embodiment, the training the video frame selection model to be trained based on the first image feature, the second image feature and the target frame selection result includes:
determining a distillation loss between the candidate video frame and the selected video frame based on the first image feature and the second image feature; determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result;
and training the video frame selection model to be trained based on the distillation loss and the frame selection loss.
In a possible implementation, the target frame selection result includes a probability that each candidate video frame is selected as the selected video frame;
the determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result includes:
and determining the frame selection loss corresponding to the video frame selection model based on the probability of each candidate video frame being selected as the selected video frame and the sum of the probabilities corresponding to the selected video frames.
In a possible implementation, the training the video frame selection model to be trained and the image feature extraction model to be trained based on the first image feature, the second image feature, and the target frame selection result includes:
determining a distillation loss between the candidate video frame and the selected video frame based on the first image feature and the second image feature; determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result; determining a feature extraction loss corresponding to the feature extraction model based on the supervision data corresponding to the sample video and the first image feature;
and training the video frame selection model to be trained and the image feature extraction model to be trained on the basis of the distillation loss, the frame selection loss and the feature extraction loss.
In a possible implementation manner, the determining, based on the video frame selection model to be trained, a target frame selection result corresponding to the sample video includes:
respectively inputting a plurality of candidate video frames contained in the sample video into the video frame selection model to be trained, and determining an initial frame selection result corresponding to the sample video;
and under the condition that the frame number of the selected video frame included in the initial frame selection result meets a preset condition, determining the initial frame selection result as the target frame selection result.
In a possible embodiment, the method further comprises:
constructing a video processing model comprising the trained video frame selection model and the trained image feature extraction model;
carrying out fine adjustment processing on the video processing model based on the sample video;
after the video to be processed is obtained, processing the video to be processed based on the video processing model after the fine tuning processing, and determining a processing result corresponding to the video to be processed.
In a second aspect, an embodiment of the present disclosure further provides a video processing method, including:
acquiring a video to be processed;
determining a target video frame contained in the video to be processed based on a video frame selection model obtained by training with the training method of the video frame selection model according to the first aspect or any one of the possible embodiments of the first aspect;
extracting target image features of the target video frame based on an image feature extraction model;
and determining a processing result corresponding to the video to be processed based on the target image features.
In a third aspect, an embodiment of the present disclosure further provides a training apparatus for a video frame selection model, including:
the first acquisition module is used for acquiring a sample video;
the frame selection module is used for determining a target frame selection result corresponding to the sample video based on a video frame selection model to be trained; the target frame selection result comprises a selected video frame, among a plurality of candidate video frames contained in the sample video, used for characterizing the features of the sample video;
the feature extraction module is used for determining first image features corresponding to the candidate video frames and second image features of the selected video frames based on an image feature extraction model;
and the training module is used for training the video frame selection model to be trained based on the first image features, the second image features and the target frame selection result.
In a possible implementation manner, the image feature extraction model is a model to be trained;
the training module, when training the video frame selection model to be trained based on the first image feature, the second image feature and the target frame selection result, is configured to:
and training the video frame selection model to be trained and the image feature extraction model to be trained on the basis of the first image feature, the second image feature and the target frame selection result.
In a possible embodiment, the training module, when training the video frame selection model to be trained based on the first image feature, the second image feature and the target frame selection result, is configured to:
determining a distillation loss between the candidate video frame and the selected video frame based on the first image feature and the second image feature; determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result;
and training the video frame selection model to be trained on the basis of the distillation loss and the frame selection loss.
In a possible implementation, the target frame selection result includes a probability that each candidate video frame is selected as the selected video frame;
the training module, when determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result, is configured to:
and determining the frame selection loss corresponding to the video frame selection model based on the probability of each candidate video frame being selected as the selected video frame and the sum of the probabilities corresponding to the selected video frames.
In a possible implementation manner, the training module, when training the video frame selection model to be trained and the image feature extraction model to be trained based on the first image feature, the second image feature and the target frame selection result, is configured to:
determining a distillation loss between the candidate video frame and the selected video frame based on the first image feature and the second image feature; determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result; determining a feature extraction loss corresponding to the feature extraction model based on the supervision data corresponding to the sample video and the first image feature;
and training the video frame selection model to be trained and the image feature extraction model to be trained on the basis of the distillation loss, the frame selection loss and the feature extraction loss.
In a possible implementation manner, when determining a target frame selection result corresponding to the sample video based on a video frame selection model to be trained, the frame selection module is configured to:
respectively inputting a plurality of candidate video frames contained in the sample video into the video frame selection model to be trained, and determining an initial frame selection result corresponding to the sample video;
and under the condition that the frame number of the selected video frame included in the initial frame selection result meets a preset condition, determining the initial frame selection result as the target frame selection result.
In a possible implementation, the apparatus further comprises an inference module configured to:
constructing a video processing model comprising the trained video frame selection model and the trained image feature extraction model;
carrying out fine adjustment processing on the video processing model based on the sample video;
after the video to be processed is obtained, processing the video to be processed based on the video processing model after the fine tuning processing, and determining a processing result corresponding to the video to be processed.
In a fourth aspect, an embodiment of the present disclosure further provides a video processing apparatus, including:
the second acquisition module is used for acquiring a video to be processed;
a first determining module, configured to determine a target video frame included in the video to be processed based on a video frame selection model obtained by training with the training method of a video frame selection model according to the first aspect or any one of the possible embodiments of the first aspect;
the extraction module is used for extracting the target image features of the target video frame based on an image feature extraction model;
and the second determining module is used for determining a processing result corresponding to the video to be processed based on the target image features.
In a fifth aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any one of the possible implementations of the first aspect, or the second aspect described above.
In a sixth aspect, this disclosure also provides a computer-readable storage medium, where a computer program is stored, and the computer program is executed by a processor to perform the steps in the first aspect, or any one of the possible implementation manners of the first aspect, or to perform the steps of the second aspect.
In the training method of a video frame selection model, the video processing method and the devices provided by the embodiments of the present disclosure, candidate video frames contained in a sample video can be screened through the video frame selection model to obtain a selected video frame, the image features of the selected video frame and of each candidate video frame are extracted through the image feature extraction model, and the video frame selection model is then trained based on the first image features, the second image features and the target frame selection result.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It is appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope, and that those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 shows a flowchart of a training method of a video frame selection model according to an embodiment of the present disclosure;
fig. 2 shows a flow chart of a video processing method provided by an embodiment of the present disclosure;
fig. 3 is a schematic overall flow chart illustrating a training method of a video frame selection model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating an architecture of a training apparatus for a video frame selection model according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram illustrating an architecture of a video processing apparatus provided in an embodiment of the present disclosure;
fig. 6 shows a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making any creative effort, shall fall within the protection scope of the disclosure.
In the related art, video frames are generally sampled at fixed intervals. However, with this approach the number of sampled video frames tends to grow with the video duration, and since different video frames of a video contain different amounts of information, sampling in this way may retain many video frames that contain little information.
Alternatively, a sampling method based on optical flow information assistance may be adopted in the related art, however, in this method, the optical flow information difference between every two adjacent video frames needs to be calculated, and when there are many video frames, the calculation amount is large.
Based on the research, the present disclosure provides a training method, a video processing method, and a device for a video frame selection model, in which candidate video frames included in a sample video may be screened through the video frame selection model to obtain a selected video frame, image features of the selected video frame and each candidate video frame may be extracted through an image feature extraction model, and then the video frame selection model may be trained based on a first image feature, a second image feature, and a target frame selection result.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of a, B, and C, and may mean including any one or more elements selected from the group consisting of a, B, and C.
To facilitate understanding of the present embodiment, a training method for a video frame selection model disclosed in an embodiment of the present disclosure is first described in detail. The execution subject of the training method provided in the embodiments of the present disclosure is generally a computer device with certain computing power, for example a terminal device, a server, or other processing device; the terminal device may be User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device. In some possible implementations, the training method of the video frame selection model may be implemented by a processor calling computer-readable instructions stored in a memory.
Referring to fig. 1, a flowchart of a training method for a video frame selection model provided in an embodiment of the present disclosure is shown, where the method includes steps 101 to 104:

101, acquiring a sample video.

102, determining a target frame selection result corresponding to the sample video based on a video frame selection model to be trained; the target frame selection result comprises a selected video frame, among a plurality of candidate video frames contained in the sample video, used for characterizing the features of the sample video.

103, determining first image features corresponding to the candidate video frames and second image features of the selected video frame based on an image feature extraction model.

104, training the video frame selection model to be trained based on the first image features, the second image features and the target frame selection result.
The following is a detailed description of the above steps:
for step 101,
The sample videos may comprise a plurality of videos that are not completely identical. The video lengths of the sample videos may be the same, for example, each video may be 10 minutes long; alternatively, the sample videos may have the same frame rate (Frames Per Second, FPS).
In a possible implementation, the sample video may be derived from a plurality of long videos, for example, when the sample video is obtained, for any long video, the plurality of sample videos may be cut from the long video according to a fixed video length, with different video time instants of the long video as starting points, respectively. Here, the fixed video length is a video length of the sample video.
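As a minimal illustration of this clipping scheme, a long video decoded into frames can be cut into fixed-length sample videos as follows (the function and parameter names below are assumptions for illustration, not from the patent):

```python
def cut_sample_videos(long_video_frames, fps, clip_seconds, stride_seconds):
    """Cut fixed-length clips from a long video, starting at different time instants."""
    clip_len = int(fps * clip_seconds)   # fixed video length, in frames
    stride = int(fps * stride_seconds)   # gap between consecutive starting points
    clips = []
    for start in range(0, len(long_video_frames) - clip_len + 1, stride):
        clips.append(long_video_frames[start:start + clip_len])
    return clips
```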
With respect to step 102,
The video frame selection model is a model for screening video frames from the sample video; optionally, the video frame selection model may have a multi-layer Transformer structure, for example, a 2-layer Transformer structure.
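A minimal PyTorch sketch of such a selector is given below. The patent only specifies a 2-layer Transformer structure; the feature dimension, head count, per-frame scoring head, and the assumption that the model receives per-frame features are illustrative choices, not from the patent:

```python
import torch.nn as nn

class FrameSelectionModel(nn.Module):
    """2-layer Transformer that scores each candidate frame for selection."""
    def __init__(self, feat_dim=256, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.score_head = nn.Linear(feat_dim, 1)      # per-frame selection logit

    def forward(self, frame_feats):                   # (batch, num_frames, feat_dim)
        encoded = self.encoder(frame_feats)
        return self.score_head(encoded).squeeze(-1)   # (batch, num_frames)
```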
The sample video itself is composed of a plurality of video frames, and the video features of the sample video are composed of the features of those video frames. Theoretically, the more video frames the sample video contains, the richer the features and the larger the amount of information; however, the more video frames the sample video contains, the more computing resources are required to process them. Therefore, when screening video frames from a sample video, the computing resources and the amount of information contained in the video frames need to be weighed against each other.
In step 102, the target frame selection result corresponding to the sample video is the minimal set of video frames, screened from the sample video, that best embodies the video features of the sample video.
In a possible implementation manner, the candidate video frames included in the sample video refer to all video frames included in the sample video; when determining the target frame selection result corresponding to the sample video based on the video frame selection model to be trained, the method may refer to inputting all video frames in the sample video into the video frame selection model, and determining the target frame selection result corresponding to the sample video.
However, since the sample video may include many video frames, directly inputting all of them into the video frame selection model would leave the model with a large amount of data to process and a low computation speed; moreover, the sample video may include many repeated video frames. Therefore, a preliminary screening may first be performed on the sample video, and the video frames retained by the preliminary screening serve as the candidate video frames.
For example, the preliminary screening of the sample video may be performed at a fixed time interval, or at a fixed FPS. For example, if the video frames are filtered according to FPS = 1, one video frame is retained per second; specifically, the video frame with the highest image quality (for example, the clearest one) in each second may be retained.
Here, it should be noted that since the main purpose of the preliminary screening is to screen out most of the repeated video frames, the candidate video frames retained after the screening still preserve the video features of the sample video. Because the video frame selection model further screens the candidate video frames, the number of candidate video frames retained by the preliminary screening may be relatively large; the number of candidate video frames retained after the preliminary screening may be the same for different sample videos.
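A minimal sketch of the FPS = 1 preliminary screening is shown below. The patent does not specify the image-quality measure; the variance of the Laplacian (a common sharpness proxy) and the use of OpenCV are assumptions here:

```python
import cv2
import numpy as np

def preliminary_screen(frames, fps):
    """Retain one candidate frame per second: the sharpest frame of that second."""
    candidates = []
    for start in range(0, len(frames), int(fps)):
        second = frames[start:start + int(fps)]
        sharpness = [
            cv2.Laplacian(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), cv2.CV_64F).var()
            for f in second]
        candidates.append(second[int(np.argmax(sharpness))])
    return candidates
```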
In practical applications, the number of selected video frames screened by the video frame selection model may be large due to a problem of model accuracy of the video frame selection model, and in this case, although most features of the sample video can be retained, the large number of selected video frames still occupies a large amount of computing resources.
Therefore, in a possible implementation manner, when determining the target frame selection result corresponding to the sample video based on the video frame selection model to be trained, a plurality of candidate video frames included in the sample video may be input into the video frame selection model to be trained, respectively, to determine an initial frame selection result corresponding to the sample video, and then, when the number of frames of the selected video frames included in the initial frame selection result satisfies a preset condition, the initial frame selection result is determined as the target frame selection result.
For example, the preset condition may be that the frame number of the selected video frame is less than the frame number of the candidate video frame, and/or that a ratio between the frame number of the selected video frame and the frame number of the candidate video frame is less than a preset ratio.
The frame number of the selected video frame contained in the target frame selection result is limited, so that the screening strength of the video frame selection model can be controlled, and the precision of the video frame selection model in frame selection is indirectly improved.
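A minimal sketch of such a preset condition follows; the 0.25 ratio is an illustrative assumption, since the patent leaves the threshold unspecified:

```python
def meets_preset_condition(num_selected, num_candidates, max_ratio=0.25):
    """Accept the initial frame selection result only if it is selective enough."""
    return (num_selected < num_candidates
            and num_selected / num_candidates < max_ratio)
```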
For steps 103 and 104,
In a possible implementation manner, when determining first image features corresponding to the candidate video frames based on the image feature extraction model, the candidate video frames may be input into the image feature extraction model, and the image feature extraction model may extract image features of the candidate video frames respectively first, and then fuse the image features of the candidate video frames to obtain the first image features.
Correspondingly, when the second image feature of the selected video frame is determined based on the image feature extraction model, the selected video frame may be input into the image feature extraction model, and the image feature extraction model may extract the image features of each selected video frame and then fuse the image features of each selected video frame to obtain the second image feature.
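A minimal sketch of this extract-then-fuse design is shown below; the backbone, the mean-pooling fusion, and the tensor shapes are assumptions, since the patent does not specify how the per-frame features are fused:

```python
import torch.nn as nn

class ImageFeatureExtractor(nn.Module):
    """Extracts per-frame features with a shared backbone and fuses them."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone           # maps (num_frames, C, H, W) -> (num_frames, feat_dim)

    def forward(self, frames):
        per_frame = self.backbone(frames)  # per-frame image features
        return per_frame.mean(dim=0)       # fused video-level feature

# Usage: first_feature = extractor(candidate_frames)   # F_n
#        second_feature = extractor(selected_frames)   # F_ns
```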
In a possible implementation manner, the image feature extraction model may be a pre-trained model, and when the video frame selection model to be trained is trained based on the first image feature, the second image feature and the target frame selection result, the video frame selection model to be trained may be trained by using the first image feature as supervision data.
Illustratively, the model parameter values of the video frame selection model can be adjusted in a gradient back propagation manner.
In another possible implementation, the image feature extraction model may be a model to be trained, and the image feature extraction model and the video frame selection model may be trained synchronously. Therefore, when the video frame selection model to be trained is trained based on the first image feature, the second image feature and the target frame selection result, the video frame selection model to be trained and the image feature extraction model to be trained may be trained based on the first image feature, the second image feature and the target frame selection result.
In the following, the specific training methods will be described for these two cases.
In case 1, the image feature extraction model is trained in advance, and only the video frame selection model is trained.
When the video frame selection model to be trained is trained based on the first image feature, the second image feature and the target frame selection result, an exemplary method may specifically include the following steps:
step a1, determining a distillation loss between the candidate video frame and the selected video frame based on the first image feature and the second image feature; and determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result.
And a2, training the video frame selection model to be trained based on the distillation loss and the frame selection loss.
Specifically, in step a1, the distillation loss is used to characterize the feature difference between the first image feature and the second image feature. Theoretically, since the number of frames of the selected video frames is less than that of the candidate video frames, the second image feature of the selected video frames is bound to differ from the first image feature of the candidate video frames, and the training goal of the video frame selection model should be to reduce the difference between the first image feature and the second image feature.
Illustratively, the distillation loss may be a Euclidean distance, which may be calculated, for example, by the following equation:

$L_1 = \lambda \sum \lVert F_n - F_{ns} \rVert_2$

wherein $L_1$ denotes the distillation loss, $\lambda$ denotes a hyperparameter, $F_n$ represents the first image feature, and $F_{ns}$ represents the second image feature.
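A one-line PyTorch sketch of this loss (with an assumed default for the hyperparameter $\lambda$):

```python
import torch

def distillation_loss(f_n: torch.Tensor, f_ns: torch.Tensor, lam: float = 1.0):
    """Scaled Euclidean distance between candidate and selected video features."""
    return lam * torch.norm(f_n - f_ns, p=2)
```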
The video frame selection model is used for screening selected video frames from the candidate video frames of the sample video; during training, no supervision data indicating which frames should be selected can be determined for the sample video, so the training process of the video frame selection model is an unsupervised training process.
Illustratively, the video frame selection model may perform frame selection by means of Gumbel-Softmax activation.
Specifically, the target frame selection result may include a probability that each candidate video frame is selected as the selected video frame, and the selected video frame is a candidate video frame whose corresponding probability is greater than a probability threshold.
When determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result, the frame selection loss may be determined based on the probability of each candidate video frame being selected as the selected video frame and the sum of the probabilities corresponding to the selected video frames.
Illustratively, the frame selection loss may be calculated by the following formula:
$L_2 = \sum_i \mathrm{softmax}(\log(Y_i) + g_i) + \lVert n - P_{ns} \rVert_2, \quad g_i \sim \mathrm{Gumbel}(0, 1)$

wherein $Y_i$ represents the probability that the i-th candidate video frame is selected as the selected video frame, $P_{ns}$ represents the sum of the probabilities corresponding to all selected video frames, $g_i$ represents noise sampled from the Gumbel(0, 1) distribution, and $n$ represents the number of selected video frames.
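A minimal PyTorch sketch following the structure of this formula. The two-class per-frame logits, the temperature, and the reduction are assumptions; `torch.nn.functional.gumbel_softmax` adds the Gumbel(0, 1) noise internally:

```python
import torch
import torch.nn.functional as F

def frame_selection_loss(logits: torch.Tensor, n: int, tau: float = 1.0):
    """logits: (num_frames, 2) per-frame [not-selected, selected] scores."""
    relaxed = F.gumbel_softmax(logits, tau=tau, hard=False)  # softmax(log-probs + Gumbel noise)
    selected_probs = relaxed[:, 1]      # relaxed probability Y_i of selecting each frame
    p_ns = selected_probs.sum()         # P_ns: sum of selected-frame probabilities
    count_penalty = torch.abs(float(n) - p_ns)   # || n - P_ns ||
    return selected_probs.sum() + count_penalty
```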
In step a2, when the video frame selection model to be trained is trained based on the distillation loss and the frame selection loss, for example, the distillation loss and the frame selection loss may be weighted and summed to obtain a total loss value of the current training, and the video frame selection model to be trained is trained based on the total loss value.
It should be noted that when the video frame selection model to be trained is trained, the frame selection loss is added in addition to the distillation loss, so as to strengthen the frame selection capability of the video frame selection model itself and further improve the model accuracy of the video frame selection model.
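A minimal training-step sketch for case 1, reusing the `FrameSelectionModel`, `ImageFeatureExtractor`, `distillation_loss` and `frame_selection_loss` sketches above. The loss weights, the 0.5 selection threshold, and the hard mask are assumptions; in practice a straight-through Gumbel-Softmax estimator would keep the selection step differentiable:

```python
import torch

def train_step(selector, extractor, optimizer, candidate_frames,
               candidate_feats, n_target, w_distill=1.0, w_select=0.5):
    logits = selector(candidate_feats.unsqueeze(0)).squeeze(0)  # per-frame logits
    mask = torch.sigmoid(logits) > 0.5          # frames chosen as "selected"
    if mask.sum() == 0:                         # keep at least one selected frame
        mask[logits.argmax()] = True
    f_n = extractor(candidate_frames)           # first image feature (frozen extractor)
    f_ns = extractor(candidate_frames[mask])    # second image feature
    two_class = torch.stack([-logits, logits], dim=-1)
    loss = (w_distill * distillation_loss(f_n, f_ns)
            + w_select * frame_selection_loss(two_class, n_target))
    optimizer.zero_grad()
    loss.backward()                             # gradient back-propagation
    optimizer.step()
    return loss.item()
```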
Case 2: both the image feature extraction model and the video frame selection model are models to be trained.
When the video frame selection model to be trained and the image feature extraction model to be trained are trained based on the first image feature, the second image feature and the target frame selection result, in one possible implementation, the distillation loss and the frame selection loss can be calculated through the above steps, a total loss value is then calculated based on the distillation loss and the frame selection loss, and the model parameter values of the image feature extraction model and the video frame selection model are adjusted based on the total loss value; alternatively, the model parameter value of the image feature extraction model is adjusted based on the distillation loss, and the model parameter value of the video frame selection model is adjusted based on the total loss value.
However, in order to further improve the model accuracy of the image feature extraction model, a loss term of the image feature extraction model itself may also be introduced.
Illustratively, when training the video frame selection model to be trained and the image feature extraction model to be trained, the following steps may be performed:
step b1, determining the distillation loss between the candidate video frame and the selected video frame based on the first image feature and the second image feature; determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result; and determining the feature extraction loss corresponding to the feature extraction model based on the supervision data corresponding to the sample video and the first image feature.
And b2, training the video frame selection model to be trained and the image feature extraction model to be trained based on the distillation loss, the frame selection loss and the feature extraction loss.
The method for calculating the distillation loss and the frame selection loss in step b1 is the same as the calculation method in case 1, and will not be described herein again.
The supervision data corresponding to the sample video is related to the downstream task of the feature extraction model. For example, if the downstream task of the feature extraction model is image retrieval, the supervision data corresponding to the sample video may be a manually annotated image retrieval result; if the downstream task of the feature extraction model is image classification, the supervision data corresponding to the sample video may be a manually annotated image classification result.
In calculating the feature extraction loss, a processing result of the downstream task of the feature extraction model may be determined based on the first image feature (for example, by a downstream processing unit), and the feature extraction loss corresponding to the image feature extraction model may then be determined based on the processing result and the supervision data.
In step b2, when the video frame selection model to be trained and the image feature extraction model to be trained are trained based on the distillation loss, the frame selection loss and the feature extraction loss, illustratively, the distillation loss, the frame selection loss and the feature extraction loss may be subjected to weighted summation to determine a total loss value of the current training, and then model parameter values of the feature extraction model and the video frame selection model are adjusted based on the total loss value.
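A minimal sketch of this weighted-summation strategy, assuming a classification downstream task whose feature extraction loss is a cross-entropy; the loss weights and the single shared optimizer are illustrative assumptions:

```python
import itertools
import torch
import torch.nn.functional as F

def joint_train_step(optimizer, l_distill, l_select, task_logits, labels,
                     w_select=0.5, w_feat=1.0):
    l_feat = F.cross_entropy(task_logits, labels)        # feature extraction loss
    total = l_distill + w_select * l_select + w_feat * l_feat
    optimizer.zero_grad()
    total.backward()                                     # updates both models
    optimizer.step()
    return total.item()

# Usage: one optimizer over both models' parameters, e.g.
# optimizer = torch.optim.Adam(
#     itertools.chain(selector.parameters(), extractor.parameters()), lr=1e-4)
```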
In addition, considering that the frame selection loss mainly aims at a video frame selection model, and the feature extraction loss mainly aims at the image feature extraction model, when the video frame selection model to be trained and the image feature extraction model to be trained are trained based on the distillation loss, the frame selection loss and the feature extraction loss, the distillation loss and the frame selection loss can be weighted and summed to obtain a first loss value, and a model parameter value of the video frame selection model is adjusted based on the first loss value; and carrying out weighted summation on the distillation loss and the feature extraction loss to obtain a second loss value, and adjusting a model parameter value of the image feature extraction model based on the second loss value.
It should be noted that the purpose of extracting the first image feature corresponding to the candidate video frames is only to compare it with the second image feature of the selected video frames, so as to perform distillation training on the video frame selection model. After the training of the video frame selection model is completed, the image feature extraction model may be used only to extract the features of the video frames selected by the video frame selection model.
The video frame selection model and the image feature extraction model may be used separately or jointly after training is completed, which is not limited in this disclosure.
Optionally, after the training of the video frame selection model is completed, a video processing model including the trained video frame selection model and the trained image feature extraction model may also be constructed; and then, carrying out fine adjustment processing on the video processing model based on the sample video, and after the video to be processed is obtained, processing the video to be processed based on the video processing model after the fine adjustment processing to determine a processing result corresponding to the video to be processed.
Illustratively, the video processing model may be a model associated with a downstream task, which may include, for example and without limitation, video quality detection, video classification, video segmentation, video retrieval, and the like.
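A minimal sketch of such a video processing model, combining the trained selector and extractor sketches above with an assumed video classification head; the feature dimension, class count, and 0.5 threshold are illustrative:

```python
import torch
import torch.nn as nn

class VideoProcessingModel(nn.Module):
    """Trained frame selector + feature extractor + downstream task head."""
    def __init__(self, selector, extractor, feat_dim=256, num_classes=10):
        super().__init__()
        self.selector = selector
        self.extractor = extractor
        self.task_head = nn.Linear(feat_dim, num_classes)

    def forward(self, candidate_frames, candidate_feats):
        logits = self.selector(candidate_feats.unsqueeze(0)).squeeze(0)
        mask = torch.sigmoid(logits) > 0.5        # keep only the selected frames
        video_feature = self.extractor(candidate_frames[mask])
        return self.task_head(video_feature)      # downstream processing result
```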
Based on the same concept, the present disclosure further provides a video processing method, as shown in fig. 2, which is a flowchart of the video processing method provided by the present disclosure, and the method includes the following steps:

201, acquiring a video to be processed.

202, determining a target video frame contained in the video to be processed based on the video frame selection model obtained by training with the training method described above.

203, extracting the target image features of the target video frame based on the image feature extraction model.

204, determining a processing result corresponding to the video to be processed based on the target image features.
Here, the image feature extraction model in step 203 may be a model trained along with the video frame selection model, or may be a model trained in advance; the processing result may include, for example, but not limited to, a video quality detection result, a video classification result, a video segmentation result, a video retrieval result, and the like.
In summary, the overall process of the training method for the video frame selection model may be as shown in fig. 3, where a solid line part in fig. 3 indicates a model inference process, and a dotted line part indicates a model training part, specifically:
in the model training process, candidate video frames of a sample video can be input into the video frame selection model to obtain selected video frames; the candidate video frames and the selected video frames are then input into the image feature extraction model to obtain the first image features and the second image features respectively; the distillation loss is calculated based on the first image features and the second image features, while the frame selection loss is calculated based on the target frame selection result; model training is then carried out based on the distillation loss and the frame selection loss;
in the model reasoning process, candidate video frames of the video to be processed can be input into the video frame selection model to obtain selected video frames, and then only the selected video frames are input into the image feature extraction model to output video features.
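A minimal sketch of this inference process, reusing the selector and extractor sketches above: only the selected video frames pass through the image feature extraction model, so the full candidate set never needs feature extraction at inference time:

```python
import torch

@torch.no_grad()
def infer_video_features(selector, extractor, candidate_frames, candidate_feats):
    logits = selector(candidate_feats.unsqueeze(0)).squeeze(0)
    mask = torch.sigmoid(logits) > 0.5           # the selected video frames
    return extractor(candidate_frames[mask])     # video features for downstream use
```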
For the detailed description of the above steps, reference is made to the above embodiments, and details are not repeated in the present disclosure.
According to the training method of the video frame selection model provided by the embodiments of the present disclosure, candidate video frames contained in a sample video can be screened through the video frame selection model to obtain a selected video frame, the image features of the selected video frame and of each candidate video frame are extracted through the image feature extraction model, and the video frame selection model is then trained based on the first image features, the second image features and the target frame selection result.
It will be understood by those skilled in the art that, in the above method, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation process; the specific order of execution of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a training apparatus for a video frame selection model corresponding to the training method for the video frame selection model, and as the principle of the apparatus in the embodiment of the present disclosure for solving the problem is similar to the training method for the video frame selection model in the embodiment of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are omitted.
Referring to fig. 4, a schematic diagram of an architecture of a training apparatus for a video frame selection model according to an embodiment of the present disclosure is shown, where the apparatus includes: a first obtaining module 401, a frame selecting module 402, a feature extracting module 403, a training module 404 and an inference module 405; wherein,
a first obtaining module 401, configured to obtain a sample video;
a frame selection module 402, configured to determine a target frame selection result corresponding to the sample video based on a video frame selection model to be trained; the target frame selection result comprises a selected video frame, among a plurality of candidate video frames contained in the sample video, used for characterizing the features of the sample video;
a feature extraction module 403, configured to determine, based on an image feature extraction model, first image features corresponding to the multiple candidate video frames and second image features of the selected video frame;
a training module 404, configured to train the video frame selection model to be trained based on the first image feature, the second image feature, and the target frame selection result.
In a possible implementation manner, the image feature extraction model is a model to be trained;
the training module 404, when training the video frame selection model to be trained based on the first image feature, the second image feature, and the target frame selection result, is configured to:
and training the video frame selection model to be trained and the image feature extraction model to be trained on the basis of the first image feature, the second image feature and the target frame selection result.
In a possible implementation, the training module 404, when training the video frame selection model to be trained based on the first image feature, the second image feature and the target frame selection result, is configured to:
determining a distillation loss between the candidate video frame and the selected video frame based on the first image feature and the second image feature; determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result;
and training the video frame selection model to be trained on the basis of the distillation loss and the frame selection loss.
In a possible implementation, the target frame selection result includes a probability that each candidate video frame is selected as the selected video frame;
the training module 404, when determining that the frame selection corresponding to the video frame selection model is lost based on the target frame selection result, is configured to:
and determining the frame selection loss corresponding to the video frame selection model based on the probability of each candidate video frame being selected as the selected video frame and the sum of the probabilities corresponding to the selected video frames.
In a possible implementation manner, the training module 404, when training the video frame selection model to be trained and the image feature extraction model to be trained based on the first image feature, the second image feature and the target frame selection result, is configured to:
determining a distillation loss between the candidate video frame and the selected video frame based on the first image feature and the second image feature; determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result; determining a feature extraction loss corresponding to the feature extraction model based on the supervision data corresponding to the sample video and the first image feature;
and training the video frame selection model to be trained and the image feature extraction model to be trained on the basis of the distillation loss, the frame selection loss and the feature extraction loss.
In a possible implementation manner, the frame selection module 402, when determining a target frame selection result corresponding to the sample video based on a video frame selection model to be trained, is configured to:
respectively inputting a plurality of candidate video frames contained in the sample video into the video frame selection model to be trained, and determining an initial frame selection result corresponding to the sample video;
and under the condition that the frame number of the selected video frame included in the initial frame selection result meets a preset condition, determining the initial frame selection result as the target frame selection result.
In a possible implementation, the apparatus further comprises an inference module 405 for:
constructing a video processing model comprising the trained video frame selection model and the trained image feature extraction model;
carrying out fine adjustment processing on the video processing model based on the sample video;
after the video to be processed is obtained, processing the video to be processed based on the video processing model after the fine tuning processing, and determining a processing result corresponding to the video to be processed.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Based on the same inventive concept, a video processing apparatus corresponding to the video processing method is also provided in the embodiments of the present disclosure, and since the principle of the apparatus in the embodiments of the present disclosure for solving the problem is similar to the video processing method described above in the embodiments of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 5, there is shown a schematic architecture diagram of a video processing apparatus according to an embodiment of the present disclosure, the apparatus includes: a second obtaining module 501, a first determining module 502, an extracting module 503 and a second determining module 504; wherein,
a second obtaining module 501, configured to obtain a video to be processed;
a first determining module 502, configured to determine a target video frame included in the video to be processed based on a video frame selection model obtained by training with the training method of a video frame selection model according to the first aspect or any one of the possible embodiments of the first aspect;
an extracting module 503, configured to extract a target image feature of the target video frame based on an image feature extraction model;
a second determining module 504, configured to determine, based on the target image feature, a processing result corresponding to the video to be processed.
Based on the same technical concept, an embodiment of the present disclosure further provides a computer device. Referring to fig. 6, a schematic structural diagram of a computer device 600 provided in the embodiment of the present disclosure includes a processor 601, a memory 602, and a bus 603. The memory 602 is used for storing execution instructions and includes an internal memory 6021 and an external memory 6022; the internal memory 6021 is used for temporarily storing operation data in the processor 601 and data exchanged with the external memory 6022 such as a hard disk, and the processor 601 exchanges data with the external memory 6022 through the internal memory 6021. When the computer device 600 operates, the processor 601 communicates with the memory 602 through the bus 603, so that the processor 601 executes the following instructions:
acquiring a sample video;
determining a target frame selection result corresponding to the sample video based on a video frame selection model to be trained; the target frame selection result comprises a selected video frame, among a plurality of candidate video frames contained in the sample video, used for characterizing the features of the sample video;
determining first image features corresponding to the candidate video frames and second image features of the selected video frame based on an image feature extraction model;
and training the video frame selection model to be trained based on the first image features, the second image features and the target frame selection result.
In a possible implementation manner, in the instructions executed by the processor 601, the image feature extraction model is a model to be trained;
the training of the video frame selection model to be trained based on the first image feature, the second image feature and the target frame selection result includes:
and training the video frame selection model to be trained and the image feature extraction model to be trained on the basis of the first image feature, the second image feature and the target frame selection result.
In a possible implementation, in the instructions executed by the processor 601, the training of the video frame selection model to be trained based on the first image feature, the second image feature and the target frame selection result includes:
determining a distillation loss between the candidate video frame and the selected video frame based on the first image feature and the second image feature; determining frame selection loss corresponding to the video frame selection model based on the target frame selection result;
and training the video frame selection model to be trained on the basis of the distillation loss and the frame selection loss.
In a possible embodiment, in the instructions executed by the processor 601, the target frame selection result includes a probability that each candidate video frame is selected as the selected video frame;
the determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result includes:
and determining the frame selection loss corresponding to the video frame selection model based on the probability of each candidate video frame being selected as the selected video frame and the sum of the probabilities corresponding to the selected video frames.
In one possible embodiment, in the instructions executed by the processor 601, the training the video frame selection model to be trained and the image feature extraction model to be trained based on the first image feature, the second image feature and the target frame selection result includes:
determining a distillation loss between the candidate video frame and the selected video frame based on the first image feature and the second image feature; determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result; determining a feature extraction loss corresponding to the feature extraction model based on the supervision data corresponding to the sample video and the first image feature;
and training the video frame selection model to be trained and the image feature extraction model to be trained on the basis of the distillation loss, the frame selection loss and the feature extraction loss.
In a possible implementation manner, in the instructions executed by the processor 601, the determining, based on the video frame selection model to be trained, a target frame selection result corresponding to the sample video includes:
respectively inputting a plurality of candidate video frames contained in the sample video into the video frame selection model to be trained, and determining an initial frame selection result corresponding to the sample video;
and under the condition that the frame number of the selected video frame included in the initial frame selection result meets a preset condition, determining the initial frame selection result as the target frame selection result.
In a possible implementation manner, the instructions executed by the processor 601 further include:
constructing a video processing model comprising the trained video frame selection model and the trained image feature extraction model;
performing fine-tuning processing on the video processing model based on the sample video;
and after a video to be processed is obtained, processing the video to be processed based on the fine-tuned video processing model, and determining a processing result corresponding to the video to be processed.
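A hedged sketch of how such a video processing model might be assembled for fine-tuning; the class name, the 0.5 threshold and the classification head are illustrative assumptions, not fixed by the disclosure:

```python
import torch
import torch.nn as nn

class VideoProcessingModel(nn.Module):
    """Wraps the trained frame selection and feature extraction models
    together with a downstream task head, so the whole pipeline can be
    fine-tuned end to end on the sample video."""

    def __init__(self, selector: nn.Module, feat_model: nn.Module,
                 head: nn.Module):
        super().__init__()
        self.selector = selector      # trained video frame selection model
        self.feat_model = feat_model  # trained image feature extraction model
        self.head = head              # task head, e.g. a video classifier

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        probs = self.selector(frames)                  # (N,) selection scores
        feats = self.feat_model(frames[probs > 0.5])   # features of kept frames
        return self.head(feats.mean(dim=0, keepdim=True))
```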
Alternatively, the processor 601 may be configured to execute the following instructions:
acquiring a video to be processed;
determining a target video frame contained in the video to be processed based on the video frame selection model obtained by training with the training method of the video frame selection model in the above embodiments;
extracting target image features of the target video frame based on an image feature extraction model;
and determining a processing result corresponding to the video to be processed based on the target image features.
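Hypothetical usage of the resulting pipeline on a video to be processed; decode_candidate_frames is an assumed helper (not from the disclosure) that samples candidate frames into an (N, C, H, W) tensor, and model is the fine-tuned VideoProcessingModel from the sketch above:

```python
import torch

# Assumed helper: decodes/samples N candidate frames from the raw video.
frames = decode_candidate_frames("video_to_process.mp4")

model.eval()  # the fine-tuned VideoProcessingModel from the earlier sketch
with torch.no_grad():
    result = model(frames)  # processing result for the video to be processed
```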
The embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the training method of the video frame selection model or of the video processing method in the above method embodiments are performed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product that carries program code; the instructions included in the program code may be used to execute the steps of the training method of the video frame selection model or of the video processing method in the foregoing method embodiments. For details, refer to the foregoing method embodiments, which are not described herein again.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the system and the apparatus described above, which are not repeated here.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative: the division of the units is only one logical division, and other divisions are possible in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-transitory computer-readable storage medium. Based on such understanding, the technical solutions of the present disclosure, in essence, or the part contributing to the prior art, or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present disclosure, used to illustrate the technical solutions of the present disclosure rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may still, within the technical scope of the present disclosure, modify the technical solutions described in the foregoing embodiments, easily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes and substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall all be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Claims (12)
1. A training method of a video frame selection model is characterized by comprising the following steps:
acquiring a sample video;
determining a target frame selection result corresponding to the sample video based on a video frame selection model to be trained; the target frame selection result comprises a selected video frame used for characterizing the characteristics of the sample video in a plurality of candidate video frames contained in the sample video;
determining first image features corresponding to the candidate video frames and second image features of the selected video frame based on an image feature extraction model;
and training the video frame selection model to be trained based on the first image features, the second image features and the target frame selection result.
2. The method of claim 1, wherein the image feature extraction model is a model to be trained;
the training of the video frame selection model to be trained based on the first image feature, the second image feature and the target frame selection result includes:
and training the video frame selection model to be trained and the image feature extraction model to be trained based on the first image feature, the second image feature and the target frame selection result.
3. The method of claim 1, wherein the training of the video frame selection model to be trained based on the first image feature, the second image feature and the target frame selection result comprises:
determining a distillation loss between the candidate video frames and the selected video frame based on the first image feature and the second image feature; determining a frame selection loss corresponding to the video frame selection model based on the target frame selection result;
and training the video frame selection model to be trained based on the distillation loss and the frame selection loss.
4. The method of claim 3, wherein the target frame selection result comprises a probability that each candidate video frame is selected as the selected video frame;
the determining the frame selection loss corresponding to the video frame selection model based on the target frame selection result includes:
and determining the frame selection loss corresponding to the video frame selection model based on the probability of each candidate video frame being selected as the selected video frame and the sum of the probabilities corresponding to the selected video frames.
5. The method of claim 2, wherein the training of the video frame selection model to be trained and the image feature extraction model to be trained based on the first image feature, the second image feature and the target frame selection result comprises:
determining a distillation loss between the candidate video frames and the selected video frame based on the first image feature and the second image feature; determining a frame selection loss corresponding to the video frame selection model based on the target frame selection result; determining a feature extraction loss corresponding to the image feature extraction model based on the supervision data corresponding to the sample video and the first image feature;
and training the video frame selection model to be trained and the image feature extraction model to be trained based on the distillation loss, the frame selection loss and the feature extraction loss.
6. The method according to claim 1, wherein the determining a target frame selection result corresponding to the sample video based on the video frame selection model to be trained comprises:
respectively inputting a plurality of candidate video frames contained in the sample video into the video frame selection model to be trained, and determining an initial frame selection result corresponding to the sample video;
and, when the number of selected video frames included in the initial frame selection result meets a preset condition, determining the initial frame selection result as the target frame selection result.
7. The method of claim 1, further comprising:
constructing a video processing model comprising the trained video frame selection model and the trained image feature extraction model;
performing fine-tuning processing on the video processing model based on the sample video;
and after a video to be processed is obtained, processing the video to be processed based on the fine-tuned video processing model, and determining a processing result corresponding to the video to be processed.
8. A video processing method, comprising:
acquiring a video to be processed;
determining a target video frame contained in the video to be processed based on a video frame selection model obtained by training with the training method of the video frame selection model according to any one of claims 1 to 7;
extracting target image features of the target video frame based on an image feature extraction model;
and determining a processing result corresponding to the video to be processed based on the target image features.
9. A training device for a video frame selection model is characterized by comprising:
the first acquisition module is used for acquiring a sample video;
the frame selection module is used for determining a target frame selection result corresponding to the sample video based on a video frame selection model to be trained; the target frame selection result comprises a selected video frame used for characterizing the characteristics of the sample video in a plurality of candidate video frames contained in the sample video;
the feature extraction module is used for determining first image features corresponding to the candidate video frames and second image features of the selected video frames based on an image feature extraction model;
and the training module is used for training the video frame selection model to be trained based on the first image features, the second image features and the target frame selection result.
10. A video processing apparatus, comprising:
the second acquisition module is used for acquiring a video to be processed;
the first determining module is used for determining a target video frame contained in the video to be processed based on a video frame selection model obtained by training with the training method of the video frame selection model according to any one of claims 1 to 7;
the extraction module is used for extracting target image features of the target video frame based on an image feature extraction model;
and the second determining module is used for determining a processing result corresponding to the video to be processed based on the target image features.
11. A computer device, comprising: a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor and the memory communicate via the bus; and the machine-readable instructions, when executed by the processor, perform the steps of the training method of a video frame selection model according to any one of claims 1 to 7, or the steps of the video processing method according to claim 8.
12. A computer-readable storage medium, characterized in that a computer program is stored thereon which, when executed by a processor, performs the steps of the training method of a video frame selection model according to any one of claims 1 to 7, or the steps of the video processing method according to claim 8.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211696258.0A CN115984742A (en) | 2022-12-28 | 2022-12-28 | Training method of video frame selection model, video processing method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211696258.0A CN115984742A (en) | 2022-12-28 | 2022-12-28 | Training method of video frame selection model, video processing method and device |
Publications (1)
| Publication Number | Publication Date |
| --- | --- |
| CN115984742A (en) | 2023-04-18 |
Family
ID=85967669
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202211696258.0A (Pending) | Training method of video frame selection model, video processing method and device | 2022-12-28 | 2022-12-28 |
Country Status (1)
| Country | Link |
| --- | --- |
| CN (1) | CN115984742A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN117132926A (en) * | 2023-10-27 | 2023-11-28 | 腾讯科技(深圳)有限公司 | Video processing method, related device, equipment and storage medium |
| CN117132926B (en) * | 2023-10-27 | 2024-02-09 | 腾讯科技(深圳)有限公司 | Video processing method, related device, equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |