CN114677623A - Model training method, video processing method, computer device, and medium - Google Patents

Model training method, video processing method, computer device, and medium

Info

Publication number
CN114677623A
CN114677623A (Application CN202210249167.6A)
Authority
CN
China
Prior art keywords
video
feature
model
consistency
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210249167.6A
Other languages
Chinese (zh)
Inventor
卿志武
张士伟
唐铭谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202210249167.6A priority Critical patent/CN114677623A/en
Publication of CN114677623A publication Critical patent/CN114677623A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The embodiments of the present application provide a model training method, a video processing method, a computer device, and a medium. In the embodiments, a plurality of video clips having visual consistency and a plurality of video clips having theme consistency are prepared, so that during model training of the feature extraction model, a self-supervised contrastive learning approach attends both to the visual consistency dimension (whether different video clips share the same visual features) and to the theme consistency dimension (whether different video clips share the same theme). Because video representations are learned at multiple levels in this self-supervised contrastive manner, the trained feature extraction model performs well, can accurately extract feature vectors of videos, generalizes well, and is applicable to feature extraction for both short videos and long videos. In addition, no large amount of manual labeling cost needs to be introduced during model training, the time consumption is low, and the model training efficiency is high.

Description

Model training method, video processing method, computer device, and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a model training method, a video processing method, a computer device, and a medium.
Background
A video is multimedia data formed by arranging a plurality of image frames in temporal order, so it carries one more dimension of temporal information than an image frame. Because video carries richer information, it has become a common form of information carrier, and video-based recognition processing is increasingly common. A typical video-based recognition pipeline first uses a feature extraction model to extract features from a video, obtaining a feature vector that represents the video, and then performs recognition processing such as behavior recognition, behavior localization, and anomaly analysis based on that feature vector. The model performance of the feature extraction model therefore determines whether the extracted feature vector can accurately represent the video, and directly affects the recognition effect of video-based recognition processing.
In the related art, a feature extraction model for extracting video features is usually trained by self-supervised contrastive learning. In each round of training, a plurality of different video clips are randomly sampled from a video, and the training target is that the feature extraction model should extract the same feature vector for these different video clips. However, it is difficult for a feature extraction model trained in this way to accurately extract feature vectors of a video, which in turn degrades the recognition effect of video-based recognition processing.
Disclosure of Invention
Aspects of the present application provide a model training method, a video processing method, a computer device, and a medium to improve model performance of a feature extraction model.
The embodiment of the application provides a model training method, which comprises the following steps: acquiring a group of video clips corresponding to each of a plurality of sample videos, wherein each group of video clips comprises two video clips having visual consistency and other video clips for which visual consistency is not required, and the video clips in the same group have theme consistency; performing the current round of model training on the feature extraction model by using the plurality of groups of video clips to obtain a plurality of groups of feature vectors generated by the feature extraction model in the current round of model training; generating, according to the plurality of groups of feature vectors, a first loss value corresponding to the current round of model training in the visual consistency dimension, and generating, according to a theme consistency prediction model and the plurality of groups of feature vectors, a second loss value corresponding to the current round of model training in the theme consistency dimension; and when it is judged, according to the first loss value and the second loss value, that the model training end condition is not met, continuing with the next round of model training of the feature extraction model.
An embodiment of the present application further provides a video processing method, including: acquiring a target video to be identified; performing feature extraction on the target video by using a feature extraction model to obtain a feature vector of the target video; and identifying the target video according to the feature vector of the target video to obtain a recognition result; wherein the feature extraction model is a model obtained by training according to the model training method provided by the embodiments of the present application.
An embodiment of the present application further provides a computer device, including: a memory and a processor; a memory for storing a computer program; a processor is coupled to the memory for executing a computer program for performing the model training method or the video processing method.
Embodiments of the present application also provide a computer storage medium storing a computer program that, when executed by a processor, enables the processor to implement a model training method or a video processing method.
In the embodiment of the application, a plurality of video clips with visual consistency and a plurality of video clips with theme consistency are prepared, so that in the model training process of the feature extraction model, a self-supervision contrast learning mode is adopted, whether different video clips have the same visual consistency dimensionality of visual features or not is concerned, whether different video clips have the same theme consistency dimensionality or not is concerned, the video representation is learned from multiple levels based on the self-supervision contrast learning mode, the model performance of the feature extraction model trained in the way is good, the feature vectors of videos can be accurately extracted, the generalization performance is good, and the feature extraction method can be applied to feature extraction of short videos and long videos. In addition, a large amount of manual labeling cost is not required to be introduced in the model training process, time consumption is low, and the model training efficiency is high.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of a model training method provided in an embodiment of the present application;
fig. 2 is an application scenario diagram applicable to a model training method provided in the embodiment of the present application;
fig. 3 is a flowchart of a video processing method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First, terms related to embodiments of the present application are explained:
Visual consistency: for a plurality of video clips sampled from a long video or a short video with a short time distance (i.e., time interval) between them, there is a high probability that they share the same visual information, for example, they all depict the same thing; such video clips are therefore said to have visual consistency.
Theme consistency: for a plurality of video clips sampled from the same long video or short video, because they all come from the same video, there is a high probability that they share the same theme reflecting the gist of the video content.
It is noted that if there is visual consistency between multiple video clips, there is also theme consistency between them. However, if there is theme consistency between multiple video clips, there is not necessarily visual consistency between them.
Self-supervised contrastive learning: the process of completing self-supervised learning through contrastive learning. No labels are used. To train a network, each picture or video is independently and randomly augmented several times, generating multiple samples that differ visually but keep the same semantics. After the network extracts features from these samples, the features of samples from the same picture or video are required to be close (i.e., high feature-vector similarity), and the features of samples from different pictures or videos are required to be far apart (i.e., low feature-vector similarity).
Whether for long videos or short videos, a feature extraction model trained in the existing way has difficulty accurately extracting feature vectors of the video, which makes it hard to guarantee the recognition effect of video-based recognition processing. This is especially true for long videos: because a long video has a long duration, the visual information of multiple video segments extracted from it is often inconsistent, so taking "the feature extraction model should extract the same feature vector for these different video segments" as the training target is not feasible. In addition, the existing model training approach requires manual labeling of videos, which introduces a large amount of labeling cost, is time-consuming, and lowers the model training efficiency.
In order to solve the above technical problem, embodiments of the present application provide a model training method, a video processing method, a computer device, and a medium. In these embodiments, a plurality of video clips having visual consistency and a plurality of video clips having theme consistency are prepared, so that during model training of the feature extraction model, a self-supervised contrastive learning approach attends both to the visual consistency dimension (whether different video clips share the same visual features) and to the theme consistency dimension (whether different video clips share the same theme). Because video representations are learned at multiple levels in this self-supervised contrastive manner, the trained feature extraction model performs well, can accurately extract feature vectors of videos, generalizes well, and is applicable to feature extraction for both short videos and long videos. In addition, no large amount of manual labeling cost needs to be introduced during model training, the time consumption is low, and the model training efficiency is high.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a model training method provided in an embodiment of the present application. Referring to fig. 1, the method may include the steps of:
101. Obtain a group of video clips corresponding to each of a plurality of sample videos, where each group of video clips includes two video clips having visual consistency and other video clips for which visual consistency is not required, and the video clips in the same group have theme consistency.
102. Perform the current round of model training on the feature extraction model using the plurality of groups of video clips, to obtain a plurality of groups of feature vectors generated by the feature extraction model in the current round of model training.
103. Generate, according to the plurality of groups of feature vectors, a first loss value corresponding to the current round of model training in the visual consistency dimension, and generate, according to the theme consistency prediction model and the plurality of groups of feature vectors, a second loss value corresponding to the current round of model training in the theme consistency dimension.
104. When it is judged, according to the first loss value and the second loss value, that the model training end condition is not met, continue with the next round of model training of the feature extraction model.
In this embodiment, a plurality of different sample videos are prepared first; the plurality of sample videos should include at least several long videos with a long duration and several short videos with a short duration. Of course, the duration threshold used to distinguish long videos from short videos can be set flexibly according to actual application requirements. For example, the plurality of sample videos may include a long video of 5 minutes, a long video of 3 minutes, and a short video of 30 seconds. It is noted that, because a long video lasts long, it is likely to contain the video content of multiple behaviors or multiple events, whereas a short video is brief and generally contains the video content of only one behavior or event.
In this embodiment, in order for both the visual consistency dimension and the theme consistency dimension to be considered in model training, for each round of model training, two video segments having visual consistency and other video segments for which visual consistency is not required are sampled from each sample video as that sample video's group of video segments. It should be noted that the two visually consistent video segments share the same visual information, while the other video segments, for which visual consistency is not limited, may or may not share the same visual information as those two segments; this is not limited here.
Further, optionally, when sampling the group of video clips corresponding to each sample video, the two visually consistent video clips can be sampled from the sample video according to a set sampling time interval, and the other video clips can be randomly sampled from the sample video. The set sampling time interval is chosen according to actual requirements and should ensure that the two sampled video clips share the same visual information. Unlike the sampling rule for the two visually consistent video clips, the other video clips are randomly sampled from the corresponding sample video.
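As an illustration only (not the patent's reference implementation), the sampling of one group of video clips could be sketched as follows in Python; the clip length, frame count, and set sampling interval are assumed hyper-parameters, and the helper names are hypothetical:

import random

def sample_clip(start, clip_len):
    # A clip is represented here simply by its frame indices.
    return list(range(start, start + clip_len))

def sample_group(num_frames, clip_len, interval):
    """Sample one group of clips from a video with num_frames frames:
    two visually consistent clips whose start frames differ by at most
    `interval`, plus one randomly positioned clip (visual consistency not
    required; theme consistency holds because all clips share the video)."""
    latest_start = max(0, num_frames - clip_len)
    s1 = random.randint(0, max(0, latest_start - interval))
    s2 = min(latest_start, s1 + random.randint(0, interval))  # close in time
    s3 = random.randint(0, latest_start)                      # anywhere in the video
    return sample_clip(s1, clip_len), sample_clip(s2, clip_len), sample_clip(s3, clip_len)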
In practical applications, the more rounds of training the model undergoes, the better the model performance obtained. In this embodiment, the feature extraction model needs to undergo multiple rounds of model training, and each round may use the same sampling time interval or a different one. Further, optionally, a progressive sampling strategy may be adopted: by gradually increasing the time distance between the two video segments during training, the training difficulty of the video segments is made to progress from easy to hard, which further improves the generalization performance of the feature extraction model. It should be noted that when the time distance between two video segments is short, they probably share more visual features, and the training difficulty in both the visual consistency dimension and the theme consistency dimension is relatively small; when the time distance is long, the two video segments share fewer visual features, and the training difficulty in both dimensions is relatively large.
Based on the above, further optionally, when the sampling time interval used in each round of model training is determined, the sampling time interval used in the round of model training may be determined according to the training progress and the maximum time interval of the round of model training.
Further optionally, when the sampling time interval used for the model training of the current round is determined according to the training progress and the maximum time interval of the model training of the current round, the sampling time interval used for the model training of the current round can be continuously increased along with the increase of the number of times of the model training, so that the training difficulty is improved. In practical application, the maximum sampling time interval used by the model training of the current round can be determined according to the training progress and the maximum time interval of the model training of the current round, and the sampling time interval used by the model training of the current round is set to be smaller than or equal to the maximum sampling time interval.
As an example, the maximum sampling time interval used for the current round of model training may be calculated according to equation (1).
[Equation (1) is rendered as an image in the original publication.]
Here, δmax denotes the maximum sampling time interval used by the current round of model training; its value is generally small and far less than the total duration of the video, for example 1 second. αmax denotes the total number of rounds of model training; α denotes the training progress, i.e., which round the current round of model training is; and Δ denotes the maximum time interval set for the model training process.
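Since equation (1) is only available as an image, the following is merely a plausible reconstruction consistent with the progressive sampling strategy described above (sampling interval growing with training progress toward the maximum interval Δ, with a small lower bound δ_0 of about 1 second); it is an assumption, not the patent's exact formula:

\delta_{\max} = \max\Big(\delta_{0},\ \frac{\alpha}{\alpha_{\max}}\,\Delta\Big)

Under this schedule the maximum sampling time interval starts near δ_0 in early rounds and approaches Δ as the training progress α approaches the total number of rounds α_max.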
In this embodiment, after a plurality of sets of video clips are acquired from a plurality of sample videos, the feature extraction model is subjected to the model training of the current round by using the plurality of sets of video clips. Optionally, the model structure of the feature extraction model may include, but is not limited to: convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long-Short Term Memory Networks (LSTM). Specifically, in each round of model training, a set of feature vectors corresponding to a plurality of sets of video segments are extracted by using a feature extraction model, and visual consistency learning and theme consistency learning are performed by using a set of feature vectors corresponding to each set of video segments.
In this embodiment, when a set of feature vectors corresponding to each set of video segments is extracted by using the feature extraction model, each video segment in each set of video segments may be directly and respectively input to the feature extraction model for feature extraction, so as to obtain a feature vector of each video segment as a set of feature vectors. Further optionally, in order to improve the performance of the model, enhancement processing is performed on each video clip in each group of video clips; and respectively inputting each video segment after enhancement processing into a feature extraction model for feature extraction to obtain a feature vector of each video segment as a group of feature vectors. The image enhancement algorithm may be used to perform image color enhancement, image contrast enhancement, or image sharpness enhancement on each video segment, but is not limited thereto. Of course, according to the actual application requirement, only the other video segments sampled randomly may be subjected to image enhancement processing, or only two video segments with visual consistency may be subjected to image enhancement processing, which is not limited in this embodiment.
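For illustration, assuming a PyTorch-style feature extraction model (an assumption; the patent does not prescribe a framework), extracting one group of feature vectors with optional enhancement processing could look like the following sketch:

import torch

def extract_group_features(feature_model, clips, augment=None):
    """clips: the video clips of one group, each a tensor of shape (C, T, H, W).
    augment: optional callable applying image color / contrast / sharpness enhancement."""
    feats = []
    for clip in clips:
        if augment is not None:
            clip = augment(clip)                    # enhancement processing before feature extraction
        feat = feature_model(clip.unsqueeze(0))     # feature vector of this clip, shape (1, D)
        feats.append(feat.squeeze(0))
    return feats                                    # one group of feature vectors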
In this embodiment, when performing the visual consistency learning by using a set of feature vectors corresponding to each set of video segments, after extracting a plurality of sets of feature vectors corresponding to a plurality of sets of video segments based on the feature extraction model, a first loss value corresponding to the current round of model training in the visual consistency dimension may be generated according to the plurality of sets of feature vectors. Further optionally, before determining the first loss value according to the plurality of sets of feature vectors, each feature vector in the plurality of sets of feature vectors may be input to a feature space transformation network, so as to perform a dimension reduction process on a feature dimension of each feature vector. The feature space transformation network may be a full connection layer, and is configured to perform dimension reduction processing on feature dimensions of the feature vector.
In this embodiment, the first loss value is a loss value calculated based on a loss function associated with the visual consistency dimension, and the first loss value can reflect a degree of difference between the feature vector output by the feature extraction model and the real visual information of the sample video. In practical applications, the loss function associated with the visual consistency dimension may be flexibly set, which is not limited in this embodiment.
In this embodiment, when theme consistency learning is performed using the group of feature vectors corresponding to each group of video clips, a second loss value corresponding to the current round of model training in the theme consistency dimension may be generated according to the theme consistency prediction model and the plurality of groups of feature vectors. The second loss value reflects the degree of difference between the predicted theme consistency of pairs of video segments and their actual theme consistency. The theme consistency prediction model may be a multi-layer perceptron capable of performing theme consistency prediction on two video segments, where the theme consistency prediction result includes, but is not limited to: whether the two video clips have the same theme, or the probability that the two video clips have the same theme. Optionally, the model structure of the theme consistency prediction model may include, but is not limited to: Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory networks (LSTM).
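For concreteness, a multi-layer perceptron of the kind that could serve as the theme consistency prediction model might be defined as follows; the layer sizes and the single-logit output are illustrative assumptions, not the patent's architecture:

import torch.nn as nn

class ThemeConsistencyPredictor(nn.Module):
    """Maps a stitching vector (two concatenated clip features) to one logit
    indicating whether the two clips share the same theme."""
    def __init__(self, feat_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, stitching_vector):
        return self.net(stitching_vector)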
In each round of model training, the first loss value corresponding to the current round in the visual consistency dimension and the second loss value corresponding to the current round in the theme consistency dimension are combined by numerical calculation such as averaging, accumulation, or weighted summation to obtain the total loss value of the current round of model training.
In some application scenarios, the model training end condition only considers the total loss value of the current round of model training. For the situation, if the total loss value of the model training of the current round is greater than the preset loss value, it is determined that the model training end condition is not met, model parameters of the feature extraction model and the theme consistency prediction model are adjusted, steps 101 to 104 are executed again, and the next round of model training of the feature extraction model is continued. If the total loss value of the model training of the current round is smaller than or equal to the preset loss value, the condition that the model training is finished is determined to be met, and the feature extraction model obtained by training can carry out model reasoning.
In other application scenarios, the model training end condition not only considers the total loss value of the model training of the current round, but also considers the current model training times. And under the condition that the current model training times are less than the preset maximum model training times, determining that the model training ending condition is not met no matter whether the total loss value of the current model training is greater than or equal to the preset loss value, adjusting model parameters of the feature extraction model and the theme consistency prediction model, re-executing the steps 101 to 104, and continuing to perform the next model training on the feature extraction model. And under the condition that the current model training times are greater than or equal to the preset maximum model training times, if the total loss value of the current model training is greater than the preset loss value, determining that the model training ending condition is not met, adjusting model parameters of the feature extraction model and the theme consistency prediction model, re-executing the steps 101 to 104, and continuing to perform the next model training on the feature extraction model. If the total loss value of the model training of the current round is smaller than or equal to the preset loss value, the condition that the model training is finished is determined to be met, and the feature extraction model obtained by training can carry out model reasoning. It is worth noting that the preset loss value and the preset maximum model training times are flexibly set according to the actual application requirements.
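A high-level sketch of one possible training loop implementing steps 101 to 104 is given below; the helpers sample_groups, extract_group_features, visual_consistency_loss, and theme_consistency_loss are hypothetical placeholders, the PyTorch-style optimizer is an assumption, and the end condition shown is only one of the scenarios described above:

def train_feature_model(feature_model, theme_model, sample_videos, optimizer,
                        preset_rounds, preset_loss):
    round_idx = 0
    while True:
        round_idx += 1
        groups = sample_groups(sample_videos, round_idx)                            # step 101
        group_feats = [extract_group_features(feature_model, g) for g in groups]    # step 102
        loss_visual = visual_consistency_loss(group_feats)                  # first loss value
        loss_theme = theme_consistency_loss(theme_model, group_feats)       # second loss value
        total = loss_visual + loss_theme                                    # e.g. accumulation
        if round_idx >= preset_rounds and total.item() <= preset_loss:
            break                                     # model training end condition met
        optimizer.zero_grad()
        total.backward()              # adjust parameters of both the feature extraction model
        optimizer.step()              # and the theme consistency prediction model
    return feature_model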
With the model training method provided by the embodiments of the present application, a plurality of video clips having visual consistency and a plurality of video clips having theme consistency are prepared, so that during model training of the feature extraction model, a self-supervised contrastive learning approach attends both to the visual consistency dimension (whether different video clips share the same visual features) and to the theme consistency dimension (whether different video clips share the same theme). Because video representations are learned at multiple levels in this self-supervised contrastive manner, the trained feature extraction model performs well, can accurately extract feature vectors of videos, generalizes well, and is applicable to feature extraction for both short videos and long videos. In addition, no large amount of manual labeling cost needs to be introduced during model training, the time consumption is low, and the model training efficiency is high.
In some embodiments, when generating the first loss values corresponding to the visual consistency dimension in the current round of model training according to the plurality of sets of feature vectors, the loss value corresponding to each set of feature vectors is first calculated, and for ease of understanding and distinction, the loss value corresponding to each set of feature vectors is referred to as a third loss value; and then, obtaining a first loss value corresponding to the current round of model training in the visual consistency dimension according to a plurality of third loss values corresponding to the plurality of groups of feature vectors. Further optionally, when the first loss value corresponding to the visual consistency dimension of the model training of the present round is obtained according to the plurality of third loss values corresponding to the plurality of groups of feature vectors, numerical calculations such as averaging, weighted summation, or accumulation may be performed on the plurality of third loss values corresponding to the plurality of groups of feature vectors to obtain the first loss value corresponding to the visual consistency dimension of the model training of the present round, but not limited thereto.
Further, optionally, when the third loss value corresponding to each group of feature vectors is determined, it may be obtained according to the similarity between the first feature vector and the second feature vector, which correspond to the two visually consistent video segments in that group, and the similarities between each of the first and second feature vectors and the other feature vectors in the group.
In an optional implementation manner, when determining the third loss value corresponding to each group of feature vectors, a fourth loss value may be calculated according to a similarity between the first feature vector and the second feature vector and a similarity between the first feature vector and another feature vector except the first feature vector in each group of feature vectors; calculating a fifth loss value according to the similarity between the first feature vector and the second feature vector and the similarity between the second feature vector and other feature vectors except the second feature vector in each group of feature vectors; and determining a third loss value corresponding to the group of eigenvectors according to the fourth loss value and the fifth loss value.
In this embodiment, the loss function for calculating the fourth loss value and the fifth loss value may be designed according to actual application requirements. Assume there are N sample videos, and that the group of video segments corresponding to each sample video includes two video segments having visual consistency and one randomly sampled video segment, so that the N sample videos correspond to 3N video segments and the 3N video segments correspond to 3N feature vectors, where N is a positive integer.
As examples, equations (2) to (5) give alternative formulations of the loss function used to calculate the fourth loss value and the fifth loss value. [Equations (2) to (5) are rendered as images in the original publication.]
It is noted that s_{i,j} denotes the similarity between the first feature vector and the second feature vector, which may be, for example but not limited to, the cosine similarity or the Euclidean distance between them; 1_{[k≠i]} is an indicator that takes the value 1 when k is not equal to i and 0 otherwise; τ is a constant coefficient; exp() denotes the exponential operation and log() denotes the logarithmic operation.
For the case where only the 2N feature vectors that exclude the randomly sampled feature vectors participate in calculating the fourth or fifth loss value: when i denotes the first feature vector, k ranges over the other feature vectors among the 2N feature vectors except the first feature vector, and s_{i,k} denotes the similarity between the first feature vector and each of those other feature vectors; when i denotes the second feature vector, k ranges over the other feature vectors among the 2N feature vectors except the second feature vector, and s_{i,k} denotes the similarity between the second feature vector and each of those other feature vectors.
For the case where all 3N feature vectors participate in calculating the fourth or fifth loss value: when i denotes the first feature vector, k ranges over the other feature vectors among the 3N feature vectors except the first feature vector, and s_{i,k} denotes the similarity between the first feature vector and each of those other feature vectors; when i denotes the second feature vector, k ranges over the other feature vectors among the 3N feature vectors except the second feature vector, and s_{i,k} denotes the similarity between the second feature vector and each of those other feature vectors.
For ease of understanding, the calculation of the first loss value corresponding to the visual consistency dimension in the current round of model training is described by taking equation (5) as the loss function for calculating the fourth loss value and the fifth loss value.
Assume that N sample videos are prepared and that N groups of video clips are sampled from them, each group including two visually consistent video clips and one randomly sampled video clip. As an example, the first loss value corresponding to the visual consistency dimension of the current round of model training can then be calculated using the loss function shown in equation (6). [Equation (6) is rendered as an image in the original publication.]
In equation (6), the (2n-1)-th feature vector corresponds to the first feature vector in each group, the 2n-th feature vector corresponds to the second feature vector in each group, and l(2n-1,2n) + l(2n,2n-1) represents the third loss value corresponding to each group of feature vectors. l(2n-1,2n) represents the fourth loss value corresponding to each group, and l(2n,2n-1) represents the fifth loss value corresponding to each group. When the fourth loss value is calculated using equation (5), i in equation (5) takes the value 2n-1 and j takes the value 2n; when the fifth loss value is calculated using equation (5), i takes the value 2n and j takes the value 2n-1.
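Equation (6) itself is only available as an image; based on the description above (the third loss value l(2n-1,2n) + l(2n,2n-1) of each group combined over the N groups), a plausible form, stated as an assumption rather than the patent's exact formula, is:

\mathcal{L}_{\mathrm{visual}} = \frac{1}{2N} \sum_{n=1}^{N} \big[\, \ell(2n-1,\,2n) + \ell(2n,\,2n-1) \,\big]

where the normalization constant (here 1/(2N)) is one common choice and may differ in the original.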
In some embodiments, when the second loss value corresponding to the current round of model training in the theme consistency dimension is generated according to the theme consistency prediction model and the plurality of groups of feature vectors, any two feature vectors in the plurality of groups of feature vectors can be concatenated to obtain a plurality of stitching vectors; each stitching vector is input into the theme consistency prediction model to perform theme consistency prediction on the two video segments corresponding to that stitching vector; and the second loss value corresponding to the current round of model training in the theme consistency dimension is determined according to the theme consistency prediction result and the classification label corresponding to each of the plurality of stitching vectors. The classification labels may include a first classification label and a second classification label: the first classification label indicates that the themes of the two video segments corresponding to the stitching vector are consistent, and the second classification label indicates that they are inconsistent.
Further optionally, before each stitching vector is input into the theme consistency prediction model, each stitching vector may also be input into the feature space transformation network, so as to perform dimension reduction processing on the feature dimension of the stitching vector. The feature space transformation network may be a full connection layer, and is configured to perform dimension reduction processing on feature dimensions of the feature vector.
In practical applications, when the second loss value corresponding to the theme consistency dimension in the current round of model training is determined, the loss value corresponding to each stitching vector (referred to as a sixth loss value) may first be determined according to the theme consistency prediction result and the classification label corresponding to that stitching vector. Then, the second loss value corresponding to the current round of model training in the theme consistency dimension is determined according to the plurality of sixth loss values corresponding to the plurality of stitching vectors, for example by averaging, weighted summation, or accumulation. Further, optionally, the sixth loss values having the first classification label are averaged to obtain a first average value, the sixth loss values having the second classification label are averaged to obtain a second average value, and the second loss value is determined according to the first average value and the second average value.
In an optional implementation, if a stitching vector has the first classification label indicating that the themes of its two corresponding video segments are consistent, the theme consistency prediction result of that stitching vector is used as the input parameter of a target loss function to calculate its sixth loss value; if a stitching vector has the second classification label indicating that the themes of its two corresponding video segments are inconsistent, the difference between the theme consistency prediction result of that stitching vector and a specified numerical value is used as the input parameter of the target loss function to calculate its sixth loss value. Optionally, the target loss function may include, for example but not limited to: the log loss function, the cross-entropy loss function, and the Focal loss function, which addresses the data imbalance problem.
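A minimal PyTorch-style sketch of the theme consistency branch under the assumptions above (pairwise concatenation of all feature vectors, a predictor producing one logit per pair, binary Focal-loss supervision with the specified numerical value taken as 1); all names and the focusing parameter gamma are illustrative, not the patent's API:

import torch

def theme_consistency_loss(predictor, feats, video_ids, gamma=2.0):
    """feats: tensor of shape (3N, D) holding the (dimension-reduced) feature vectors.
    video_ids: tensor of shape (3N,) giving the sample video each clip came from.
    predictor: module mapping a concatenated pair (2D,) to a single logit."""
    n = feats.size(0)
    left = feats.unsqueeze(1).expand(n, n, -1)
    right = feats.unsqueeze(0).expand(n, n, -1)
    pairs = torch.cat([left, right], dim=-1).reshape(n * n, -1)     # all 3N x 3N stitching vectors
    m = torch.sigmoid(predictor(pairs)).reshape(n, n)               # theme consistency prediction M
    g = (video_ids.unsqueeze(0) == video_ids.unsqueeze(1)).float()  # classification labels G
    p_true = torch.where(g > 0.5, m, 1.0 - m)              # probability assigned to the true label
    focal = -((1.0 - p_true) ** gamma) * torch.log(p_true.clamp_min(1e-8))  # sixth loss values
    first_avg = focal[g > 0.5].mean()                      # average over first-label pairs
    second_avg = focal[g <= 0.5].mean()                    # average over second-label pairs
    return first_avg + second_avg                          # second loss value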
Suppose that, for each of the N sample videos, the three sampled video segments v_i, v_j and v_k are processed in turn by the feature extraction model f and the feature space transformation network h according to equation (7) to obtain the feature vectors t_i, t_j and t_k:
{t_i, t_j, t_k} = h(f({v_i, v_j, v_k})) ……(7)
In equation (7), f({v_i, v_j, v_k}) denotes inputting the three video segments v_i, v_j and v_k into the feature extraction model f for feature extraction to obtain their respective feature vectors, and h(f({v_i, v_j, v_k})) denotes inputting those feature vectors into the feature space transformation network h for dimension reduction.
After the 3N video segments have been processed by the feature extraction model f and the feature space transformation network h, any two of the resulting 3N feature vectors can be concatenated, yielding 3N × 3N stitching vectors. Denoting the set containing the 3N × 3N stitching vectors by U, the set U can be expressed as equation (8). [Equation (8) is rendered as an image in the original publication.]
In equation (8), the superscript 1…N on a feature vector indicates which of the N sample videos it comes from; ⊕ denotes the vector concatenation operation; C_T denotes the feature dimension of t_i, t_j and t_k; R denotes the set of real numbers; and ∈ denotes membership in a set.
In this example, U is input into the theme consistency prediction model, which predicts whether the two video segments corresponding to each stitching vector have the same theme and outputs a theme consistency prediction result M, where M ∈ R^{3N×3N}. For supervised training, a classification label matrix G is defined for supervision, where G ∈ R^{3N×3N}; each classification label in G indicates whether the corresponding two video segments have the same theme. If the two video segments have the same theme, the classification label takes the value 1 (and may be referred to as the first classification label); if they have different themes, the classification label takes the value 0 (and may be referred to as the second classification label).
In order to make the network pay more attention to visual differences and better handle difficult samples in the theme consistency dimension, a Focal loss function can be used to supervise the theme consistency prediction result.
As an example, the second loss value may be calculated using the loss function shown in equation (9). [Equation (9) is rendered as an image in the original publication.]
In equation (9), γ_1 denotes the stitching vectors whose classification label in G is the first classification label (G_{i,j} = 1), and γ_2 denotes the stitching vectors whose classification label in G is the second classification label (G_{i,j} = 0). M_{i,j} denotes the theme consistency prediction result of any one of the 3N × 3N stitching vectors. For a stitching vector with the first classification label, M_{i,j} is used as the input parameter of the Focal loss function to calculate a loss value; for a stitching vector with the second classification label, the difference between the specified numerical value and M_{i,j} is used as the input parameter of the Focal loss function to calculate a loss value.
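Equation (9) is likewise only available as an image; a Focal-loss-based form consistent with the surrounding description (loss values averaged separately over the same-theme pairs γ_1 and the different-theme pairs γ_2, with the specified numerical value assumed to be 1) would be, as an assumption:

\mathcal{L}_{\mathrm{theme}} = \frac{1}{|\gamma_1|} \sum_{(i,j) \in \gamma_1} \mathrm{FL}\!\left(M_{i,j}\right) + \frac{1}{|\gamma_2|} \sum_{(i,j) \in \gamma_2} \mathrm{FL}\!\left(1 - M_{i,j}\right)

where FL(·) denotes the Focal loss function applied to the probability assigned to the correct label.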
For ease of understanding, the model training method provided in the embodiments of the present application is described with reference to the application scenario diagram shown in Fig. 2. As shown in Fig. 2, the overall model architecture includes the feature extraction model f, the theme consistency prediction model Φ, a feature space transformation network g related to visual consistency, and a feature space transformation network h related to theme consistency. The feature vectors output by the feature extraction model f are reduced in dimension by the feature space transformation network g and then used for visual consistency learning; the feature vectors output by the feature extraction model f are also reduced in dimension by the feature space transformation network h and then input to the theme consistency prediction model for theme consistency learning.
Specifically, assume that N sample videos are prepared. In each round of model training, the maximum sampling time interval δmax used by the current round is calculated from the training progress α, the maximum time interval Δ, and the total number of training rounds αmax. Then, for each sample video, two video segments with a short time distance are sampled from it using a sampling time interval no larger than δmax, and these two segments are denoted v_i and v_j; in addition, one video segment is randomly sampled from the sample video and denoted v_k.
Since there are N sample videos, for ease of understanding and distinction the N sample videos are numbered 1, 2, 3, …, N, and the three video segments v_i, v_j, v_k corresponding to each sample video carry a superscript indicating the number of the sample video they belong to. For example:
v_i^1, v_j^1 and v_k^1 denote the three video segments of the 1st sample video; v_i^2, v_j^2 and v_k^2 denote the three video segments of the 2nd sample video; and so on, v_i^n, v_j^n and v_k^n denote the three video segments of the n-th sample video, where n is any positive integer from 1 to N.
After N groups of video clips are sampled from the N sample videos, each group comprises the visually consistent v_i and v_j and the randomly sampled v_k. Each video clip in the N groups is input into the feature extraction model f for feature extraction, yielding N groups of feature vectors. Each feature vector in the N groups is reduced in dimension by the feature space transformation network g, and visual consistency learning is performed using the dimension-reduced N groups of feature vectors. Each feature vector in the N groups is also reduced in dimension by the feature space transformation network h, and the dimension-reduced N groups of feature vectors are input into the theme consistency prediction model for theme consistency learning.
In Fig. 2, the two video clips v_i^1 and v_j^1 from the 1st sample video are marked with gray circles, the randomly sampled video clip v_k^1 from the 1st sample video is marked with a white circle, and the video clips from the n-th sample video are marked with gray diamonds.
As can be seen from the theme consistency prediction result M in Fig. 2, for the stitching vectors marked by two gray circles, the corresponding theme consistency prediction result takes the value 1, indicating that the two video segments of the stitching vector come from the same sample video (both from the 1st sample video).
For the stitching vectors marked by one gray circle and one white circle, the corresponding theme consistency prediction result takes the value 1, indicating that the two video segments come from the same sample video (both from the 1st sample video).
For the stitching vectors marked by one gray circle and one gray diamond, the corresponding theme consistency prediction result takes the value 0, indicating that the two video segments come from different sample videos (one from the 1st sample video and one from the n-th sample video).
For the stitching vectors marked by one white circle and one gray diamond, the corresponding theme consistency prediction result takes the value 0, indicating that the two video segments come from different sample videos (one from the 1st sample video and one from the n-th sample video).
For the stitching vectors marked by two gray diamonds, the corresponding theme consistency prediction result takes the value 1, indicating that the two video segments come from the same sample video (both from the n-th sample video).
It is worth noting that Fig. 2 takes five video segments as an example, namely the three video segments v_i^1, v_j^1 and v_k^1 from the 1st sample video and video segments from the n-th sample video including v_j^n, to introduce the combinations of stitching vectors and the values of the theme consistency prediction result. By analogy, for stitching vectors whose two segments come from the same sample video, the corresponding theme consistency prediction result takes the value 1; for stitching vectors whose two segments come from different sample videos, the corresponding theme consistency prediction result takes the value 0.
The feature extraction model obtained by training with the model training method provided by the embodiment of the application can be applied to various application scenes needing video processing. For example, face recognition scenes, behavior recognition scenes, video recommendation scenes, video cover generation scenes, and the like. Therefore, the embodiment of the application also provides a video processing method based on the feature extraction model. Fig. 3 is a flowchart of a video processing method according to an embodiment of the present application. Referring to fig. 3, the method may include the steps of:
301. Acquire a target video to be identified.
302. Perform feature extraction on the target video using the feature extraction model to obtain a feature vector of the target video.
303. Identify the target video according to the feature vector of the target video to obtain a recognition result.
In this embodiment, the target video is a video that needs to be recognized, and its content differs across application scenarios. Because the feature extraction model trained with the model training method provided by the embodiments of the present application has good model performance, the feature vector it extracts from the target video represents the visual information of the target video more accurately, and performing recognition processing on such a feature vector yields a recognition result with higher precision.
It is noted that different recognition processing algorithms are used in different application scenarios. For example, a face recognition scenario uses a face recognition algorithm, a behavior recognition scenario uses a behavior recognition algorithm, a video recommendation scenario uses big-data recommendation techniques, and a video cover generation scenario uses video content understanding techniques to understand the video content and select the most appealing cover for it.
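A minimal sketch of steps 301 to 303, assuming a trained PyTorch-style feature extraction model and some downstream recognizer appropriate to the scenario (both names are placeholders, not the patent's API):

import torch

def process_video(feature_model, recognizer, target_video):
    """target_video: tensor of shape (C, T, H, W) for the video to be identified."""
    feature_model.eval()
    with torch.no_grad():
        feature = feature_model(target_video.unsqueeze(0))   # step 302: feature vector of the video
    result = recognizer(feature)     # step 303: e.g. behavior recognition or face recognition
    return result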
With the video processing method provided by the embodiments of the present application, the feature vector extracted by the feature extraction model represents the visual information of the target video more accurately, and recognition processing based on such a feature vector yields a recognition result with higher precision.
Further optionally, the video processing method may also be applied in an AR (Augmented Reality) scene and/or a VR (Virtual Reality) scene. Specifically, according to the recognition result, the augmented reality AR scene and/or the virtual reality VR scene may be triggered to execute corresponding operations.
It should be noted that the recognition result is different in different application scenarios, and accordingly, the operation triggering the execution of the augmented reality AR scenario and/or the virtual reality VR scenario is also different. Several exemplary application scenarios are described below.
Scene 1: the method can control the action linkage between a real target object in the real world and a virtual model in the virtual world. Specifically, if the identification result is the motion information of the target object in the target video, triggering the AR scene and/or the VR scene to perform corresponding operations specifically includes: and controlling the virtual model corresponding to the target object in the AR scene and/or the VR scene to execute corresponding action according to the motion information of the target object.
For example, when a target object in the real world performs an action such as running, walking, or sitting, a virtual model corresponding to the target object in the AR scene and/or the VR scene also performs an action such as running, walking, or sitting.
Scene 2: the method can control the real target object in the real world and the virtual model in the virtual world to perform expression linkage. Specifically, if the recognition result is expression information of a target object in the target video, triggering the AR scene and/or the VR scene to execute corresponding operations specifically includes: and controlling a virtual model corresponding to the target object in the AR scene and/or the VR scene to present a corresponding expression according to the expression information of the target object.
For example, when a target object in the real world presents a smiling face or a crying face, the virtual model corresponding to the target object in the AR scene and/or the VR scene also presents a smiling face or a crying face.
Scene 3: unlocking or locking of the AR device or the VR device may be controlled according to an action behavior of a real target object in the real world. Specifically, if the identification result indicates that the target object in the target video executes the unlocking operation, triggering AR equipment to which the AR scene belongs and/or VR equipment to which the VR scene belongs to execute the unlocking operation; and if the identification result indicates that the target object in the target video executes the locking operation, triggering the AR equipment to which the AR scene belongs and/or the VR equipment to which the VR scene belongs to execute the locking operation.
For example, while the AR device or the VR device is in the locked state, the user may input an unlocking gesture or unlocking voice information associated with the unlocking operation to trigger the AR device or the VR device to enter the unlocked state from the locked state. Likewise, while the AR device or the VR device is in the unlocked state, the user may input a locking gesture or locking voice information associated with the locking operation to trigger the AR device or the VR device to enter the locked state from the unlocked state.
Note that the target object may be a person, an animal, a plant, or the like; virtual models corresponding to target objects include, for example, but are not limited to: a three-dimensional model of the target object, or a three-dimensional model of another object associated with the target object.
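As a rough illustration of how the three scenes above map a recognition result onto an AR/VR operation, the following sketch dispatches on the type of the result; all type, field, and method names are hypothetical placeholders and not part of this application.

```python
# Schematic dispatch from recognition results to AR/VR operations
# (all names below are hypothetical placeholders).
def handle_recognition_result(result, virtual_model, device):
    if result.kind == "motion":          # Scene 1: action linkage
        virtual_model.play_action(result.action)
    elif result.kind == "expression":    # Scene 2: expression linkage
        virtual_model.set_expression(result.expression)
    elif result.kind == "unlock":        # Scene 3: unlock the AR/VR device
        device.unlock()
    elif result.kind == "lock":          # Scene 3: lock the AR/VR device
        device.lock()
```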
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps 101 to 104 may be device a; for another example, the execution subject of steps 101 and 102 may be device a, and the execution subject of steps 103 and 104 may be device B; and so on.
In addition, some of the flows described in the above embodiments and drawings include a plurality of operations in a specific order, but it should be clearly understood that these operations may be performed out of the order presented herein or in parallel; sequence numbers such as 101 and 102 are merely used to distinguish different operations and do not by themselves represent any execution order. The flows may also include more or fewer operations, and these operations may be performed sequentially or in parallel. It should be noted that terms such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they neither indicate a sequential order nor require that the "first" and "second" objects be of different types.
Fig. 4 is a schematic structural diagram of a model training device according to an embodiment of the present application. As shown in fig. 4, the model training apparatus may include: an acquisition module 41 and a training module 42.
The obtaining module 41 is configured to obtain a group of video clips corresponding to each of a plurality of sample videos, where each group of video clips includes two video clips having visual consistency and other video clips whose visual consistency is not restricted, and the video clips in the same group have theme consistency with one another.
The training module 42 is configured to: perform a current round of model training on the feature extraction model by using the plurality of groups of video clips to obtain a plurality of groups of feature vectors generated by the feature extraction model in the current round of model training; generate a first loss value corresponding to the current round of model training in the visual consistency dimension according to the plurality of groups of feature vectors, and generate a second loss value corresponding to the current round of model training in the theme consistency dimension according to the theme consistency prediction model and the plurality of groups of feature vectors; and when it is determined according to the first loss value and the second loss value that the model training end condition is not met, continue to perform a next round of model training on the feature extraction model.
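To make the flow concrete, the following is a minimal sketch of one such training round in PyTorch-style Python; the encoder, the theme prediction head, and the two loss helpers (sketched further below) are placeholders under stated assumptions, not an implementation disclosed by this application.

```python
# Minimal sketch of one training round (all names are illustrative).
def train_one_round(encoder, topic_head, groups, optimizer,
                    visual_consistency_loss, topic_consistency_loss):
    """groups: list of groups of clip tensors; by convention the first two
    clips of each group are the visually consistent pair."""
    # 1. One feature vector per clip, keeping the group structure.
    feature_groups = [[encoder(clip.unsqueeze(0)).squeeze(0) for clip in group]
                      for group in groups]
    # 2. First loss value: visual consistency dimension.
    loss_vis = visual_consistency_loss(feature_groups)
    # 3. Second loss value: theme consistency dimension, via the prediction head.
    loss_topic = topic_consistency_loss(feature_groups, topic_head)
    loss = loss_vis + loss_topic
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_vis.item(), loss_topic.item()
```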
Further optionally, when performing the current round of model training on the feature extraction model by using the plurality of groups of video segments to obtain the plurality of groups of feature vectors generated by the feature extraction model in the current round of model training, the training module 42 is specifically configured to: perform enhancement processing on each video segment in each group of video segments; and input each enhanced video segment into the feature extraction model for feature extraction, so as to obtain a feature vector of each video segment as a group of feature vectors.
Further optionally, when generating the first loss value corresponding to the current round of model training in the visual consistency dimension according to the plurality of groups of feature vectors, the training module 42 is specifically configured to: for each group of feature vectors, obtain a third loss value corresponding to the group of feature vectors according to the similarity between the first feature vector and the second feature vector corresponding to the two visually consistent video segments in the group, and the similarities between these two feature vectors and the other feature vectors in all groups that differ from them; and obtain the first loss value corresponding to the current round of model training in the visual consistency dimension according to the plurality of third loss values corresponding to the plurality of groups of feature vectors.
Further optionally, when obtaining, for each group of feature vectors, the third loss value corresponding to the group of feature vectors according to the similarity between the first feature vector and the second feature vector corresponding to the two visually consistent video segments in the group, and the similarities between these two feature vectors and the other feature vectors in all groups that differ from them, the training module 42 is specifically configured to: calculate a fourth loss value according to the similarity between the first feature vector and the second feature vector and the similarities between the first feature vector and the feature vectors other than the first feature vector in all groups; calculate a fifth loss value according to the similarity between the first feature vector and the second feature vector and the similarities between the second feature vector and the feature vectors other than the second feature vector in all groups; and determine the third loss value corresponding to the group of feature vectors according to the fourth loss value and the fifth loss value.
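The fourth and fifth loss values described above can be realized with a contrastive (NT-Xent-style) objective; the sketch below assumes that form and that the visually consistent pair occupies the first two positions of each group. The exact formula, the temperature value, and the averaging scheme are assumptions, not fixed by this application.

```python
# Sketch of the visual-consistency loss (first loss value); the NT-Xent-style
# form and the 0.5 weighting of the fourth/fifth losses are assumptions.
import torch
import torch.nn.functional as F

def visual_consistency_loss(feature_groups, temperature=0.1):
    feats = torch.stack([f for group in feature_groups for f in group])   # (N, D)
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t() / temperature                                  # (N, N)
    self_mask = torch.eye(feats.size(0), dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(self_mask, float('-inf'))   # a vector is never its own negative

    third_losses, offset = [], 0
    for group in feature_groups:
        i, j = offset, offset + 1                    # first / second feature vectors of the pair
        fourth = -F.log_softmax(sim[i], dim=0)[j]    # anchor: first vector
        fifth = -F.log_softmax(sim[j], dim=0)[i]     # anchor: second vector
        third_losses.append(0.5 * (fourth + fifth))  # third loss value of this group
        offset += len(group)
    return torch.stack(third_losses).mean()          # first loss value
```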
Further optionally, when generating the second loss value corresponding to the current round of model training in the theme consistency dimension according to the theme consistency prediction model and the plurality of groups of feature vectors, the training module 42 is specifically configured to: perform vector splicing on any two feature vectors among the plurality of groups of feature vectors to obtain a plurality of splicing vectors; input each splicing vector into the theme consistency prediction model to perform theme consistency prediction on the two video segments corresponding to the splicing vector; and determine the second loss value corresponding to the current round of model training in the theme consistency dimension according to the theme consistency prediction result and the classification label corresponding to each splicing vector.
Further optionally, when determining the second loss value corresponding to the current round of model training in the theme consistency dimension according to the theme consistency prediction result and the classification label corresponding to each of the plurality of splicing vectors, the training module 42 is specifically configured to: for each splicing vector, if the splicing vector has a first classification label indicating that the themes of its two corresponding video segments are consistent, take the theme consistency prediction result of the splicing vector as an input parameter of the target loss function and calculate a sixth loss value corresponding to the splicing vector; if the splicing vector has a second classification label indicating that the themes of its two corresponding video segments are inconsistent, take the difference between the theme consistency prediction result of the splicing vector and a specified value as the input parameter of the target loss function and calculate the sixth loss value corresponding to the splicing vector; and determine the second loss value corresponding to the current round of model training in the theme consistency dimension according to the plurality of sixth loss values corresponding to the plurality of splicing vectors.
Further optionally, when determining the second loss value corresponding to the current round of model training in the theme consistency dimension according to the plurality of sixth loss values corresponding to the plurality of splicing vectors, the training module 42 is specifically configured to: average the sixth loss values having the first classification label to obtain a first average value, and average the sixth loss values having the second classification label to obtain a second average value; and determine the second loss value corresponding to the current round of model training in the theme consistency dimension according to the first average value and the second average value.
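Putting the last three paragraphs together, the theme-consistency branch can be sketched as follows. The pairing rule (two segments from the same group are labelled consistent), the specified value of 1, the equal weighting of the two averages, and the focal_loss helper (sketched after the next paragraph) are all assumptions.

```python
# Sketch of the theme-consistency loss (second loss value); labelling rule,
# specified value, and averaging weights are assumptions.
import itertools
import torch

def topic_consistency_loss(feature_groups, topic_head, specified_value=1.0):
    feats, group_ids = [], []
    for gid, group in enumerate(feature_groups):
        for f in group:
            feats.append(f)
            group_ids.append(gid)

    consistent_losses, inconsistent_losses = [], []
    for i, j in itertools.combinations(range(len(feats)), 2):
        spliced = torch.cat([feats[i], feats[j]], dim=0)   # vector splicing
        pred = topic_head(spliced)                          # theme consistency prediction
        if group_ids[i] == group_ids[j]:
            # first classification label: themes consistent
            consistent_losses.append(focal_loss(pred))
        else:
            # second classification label: use the gap to the specified value
            inconsistent_losses.append(focal_loss(specified_value - pred))

    first_average = torch.stack(consistent_losses).mean()
    second_average = torch.stack(inconsistent_losses).mean()
    return 0.5 * (first_average + second_average)           # second loss value
```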
Further optionally, the target loss function is a Focal loss function for solving the data imbalance problem.
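A minimal form of such a Focal loss, applied to the probability-like score that the prediction model assigns to the correct outcome, might look as follows; the gamma value is illustrative.

```python
# Minimal binary focal-loss helper (assumed form; gamma is illustrative).
import torch

def focal_loss(p_correct, gamma=2.0, eps=1e-6):
    """p_correct: score assigned to the correct outcome, expected in (0, 1)."""
    p = p_correct.clamp(min=eps, max=1.0 - eps)
    # The (1 - p)^gamma factor down-weights easy examples, which counteracts the
    # imbalance between the few consistent pairs and the many inconsistent ones.
    return -((1.0 - p) ** gamma) * torch.log(p)
```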
Further optionally, before inputting each splicing vector into the theme consistency prediction model, the training module 42 is further configured to: input each splicing vector into a feature space conversion network to perform dimension reduction on the feature dimension of the splicing vector.
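The feature space conversion network could be as simple as a small MLP placed in front of the prediction head; the layer sizes below are illustrative only.

```python
# A possible feature space conversion network: reduces the dimension of each
# splicing vector before the theme consistency prediction head (sizes illustrative).
import torch.nn as nn

feature_space_converter = nn.Sequential(
    nn.Linear(2 * 2048, 512),   # splicing vector = two concatenated clip features
    nn.ReLU(inplace=True),
    nn.Linear(512, 128),
)
```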
Further optionally, when obtaining the group of video clips corresponding to each of the plurality of sample videos, the obtaining module 41 is specifically configured to: for each sample video, sample two visually consistent video segments from the sample video according to the sampling time interval used in the current round of model training, randomly sample other video segments from the sample video, and take the two visually consistent video segments and the other video segments as a group of video segments.
Further optionally, the obtaining module 41 is further configured to: determine the sampling time interval used in the current round of model training according to the training progress of the current round of model training and a maximum time interval.
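One plausible reading of this schedule is that the interval between the two visually consistent segments grows with training progress up to the maximum time interval, so the consistency constraint becomes gradually harder; the linear schedule and the sampling routine below are assumptions.

```python
# Assumed sampling schedule and per-video sampling routine (times in seconds).
import random

def sampling_interval(progress, max_interval):
    """progress in [0, 1] for the current round; linear growth is an assumption."""
    return progress * max_interval

def sample_group(video_duration, clip_len, interval, num_other=2):
    """Start times of one group: a visually consistent pair `interval` apart,
    plus `num_other` randomly placed segments from the same video."""
    latest_pair_start = max(video_duration - clip_len - interval, 0)
    t0 = random.uniform(0, latest_pair_start)
    pair = [t0, t0 + interval]
    others = [random.uniform(0, max(video_duration - clip_len, 0))
              for _ in range(num_other)]
    return pair + others
```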
The model training apparatus shown in fig. 4 may perform the model training method in the embodiment shown in fig. 1, and details of the implementation principle and the technical effect are not repeated. The specific manner in which each module and unit of the model training apparatus in the above embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
Fig. 5 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the video processing apparatus may include: an acquisition module 51, a feature extraction module 52 and a recognition processing module 53.
The acquiring module 51 is configured to acquire a target video to be identified;
the feature extraction module 52 is configured to perform feature extraction on the target video by using a feature extraction model to obtain a feature vector of the target video; the feature extraction model is a model obtained by training according to the model training method provided by the embodiment;
and the identification processing module 53 is configured to perform identification processing on the target video according to the feature vector of the target video to obtain an identification result.
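At inference time the three modules amount to a short pipeline; the sketch below assumes a trained encoder and a task-specific recognizer as placeholder models.

```python
# Minimal inference sketch for the video processing apparatus (placeholder models).
import torch

@torch.no_grad()
def process_video(target_video, encoder, recognizer):
    feature = encoder(target_video.unsqueeze(0))    # feature vector of the target video
    return recognizer(feature)                      # recognition result
```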
The video processing apparatus shown in fig. 5 can execute the video processing method in the embodiment shown in fig. 3, and details of implementation principles and technical effects are not repeated. The specific manner in which each module and unit of the video processing apparatus in the above embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 6, the computer apparatus may include: a memory 61 and a processor 62;
memory 61 is used to store computer programs and may be configured to store other various data to support operations on the computing platform. Examples of such data include instructions for any application or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 61 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 62, coupled to the memory 61, for executing computer programs in the memory 61 for: acquiring a group of video clips corresponding to a plurality of sample videos respectively, wherein each group of video clips comprises two video clips with visual consistency and other video clips without limiting the visual consistency, and the video clips in the same group of video clips have theme consistency; carrying out model training on the feature extraction model by using a plurality of groups of video clips to obtain a plurality of groups of feature vectors generated by the feature extraction model in the model training of the current round; generating a first loss value corresponding to the model training of the current round in the visual consistency dimension according to the plurality of groups of feature vectors, and generating a second loss value corresponding to the model training of the current round in the theme consistency dimension according to the theme consistency prediction model and the plurality of groups of feature vectors; and when judging that the model training end condition is not met according to the first loss value and the second loss value, continuing to perform next round of model training on the feature extraction model.
Further optionally, when performing the current round of model training on the feature extraction model by using the plurality of groups of video segments to obtain the plurality of groups of feature vectors generated by the feature extraction model in the current round of model training, the processor 62 is specifically configured to: perform enhancement processing on each video segment in each group of video segments; and input each enhanced video segment into the feature extraction model for feature extraction, so as to obtain a feature vector of each video segment as a group of feature vectors.
Further optionally, when generating the first loss value corresponding to the current round of model training in the visual consistency dimension according to the plurality of groups of feature vectors, the processor 62 is specifically configured to: for each group of feature vectors, obtain a third loss value corresponding to the group of feature vectors according to the similarity between the first feature vector and the second feature vector corresponding to the two visually consistent video segments in the group, and the similarities between these two feature vectors and the other feature vectors in all groups that differ from them; and obtain the first loss value corresponding to the current round of model training in the visual consistency dimension according to the plurality of third loss values corresponding to the plurality of groups of feature vectors.
Further optionally, when obtaining, for each group of feature vectors, the third loss value corresponding to the group of feature vectors according to the similarity between the first feature vector and the second feature vector corresponding to the two visually consistent video segments in the group, and the similarities between these two feature vectors and the other feature vectors in all groups that differ from them, the processor 62 is specifically configured to: calculate a fourth loss value according to the similarity between the first feature vector and the second feature vector and the similarities between the first feature vector and the feature vectors other than the first feature vector in all groups; calculate a fifth loss value according to the similarity between the first feature vector and the second feature vector and the similarities between the second feature vector and the feature vectors other than the second feature vector in all groups; and determine the third loss value corresponding to the group of feature vectors according to the fourth loss value and the fifth loss value.
Further optionally, when generating the second loss value corresponding to the current round of model training in the theme consistency dimension according to the theme consistency prediction model and the plurality of groups of feature vectors, the processor 62 is specifically configured to: perform vector splicing on any two feature vectors among the plurality of groups of feature vectors to obtain a plurality of splicing vectors; input each splicing vector into the theme consistency prediction model to perform theme consistency prediction on the two video segments corresponding to the splicing vector; and determine the second loss value corresponding to the current round of model training in the theme consistency dimension according to the theme consistency prediction result and the classification label corresponding to each splicing vector.
Further optionally, when determining the second loss value corresponding to the current round of model training in the theme consistency dimension according to the theme consistency prediction result and the classification label corresponding to each of the plurality of splicing vectors, the processor 62 is specifically configured to: for each splicing vector, if the splicing vector has a first classification label indicating that the themes of its two corresponding video segments are consistent, take the theme consistency prediction result of the splicing vector as an input parameter of the target loss function and calculate a sixth loss value corresponding to the splicing vector; if the splicing vector has a second classification label indicating that the themes of its two corresponding video segments are inconsistent, take the difference between the theme consistency prediction result of the splicing vector and a specified value as the input parameter of the target loss function and calculate the sixth loss value corresponding to the splicing vector; and determine the second loss value corresponding to the current round of model training in the theme consistency dimension according to the plurality of sixth loss values corresponding to the plurality of splicing vectors.
Further optionally, when determining the second loss value corresponding to the current round of model training in the theme consistency dimension according to the plurality of sixth loss values corresponding to the plurality of splicing vectors, the processor 62 is specifically configured to: average the sixth loss values having the first classification label to obtain a first average value, and average the sixth loss values having the second classification label to obtain a second average value; and determine the second loss value corresponding to the current round of model training in the theme consistency dimension according to the first average value and the second average value.
Further optionally, the objective loss function is a Focal loss function for solving the data imbalance problem.
Further optionally, before inputting each splicing vector into the theme consistency prediction model, the processor 62 is further configured to: input each splicing vector into a feature space conversion network to perform dimension reduction on the feature dimension of the splicing vector.
Further optionally, when obtaining the group of video clips corresponding to each of the plurality of sample videos, the processor 62 is specifically configured to: for each sample video, sample two visually consistent video segments from the sample video according to the sampling time interval used in the current round of model training, randomly sample other video segments from the sample video, and take the two visually consistent video segments and the other video segments as a group of video segments.
Further optionally, the processor 62 is further configured to: determine the sampling time interval used in the current round of model training according to the training progress of the current round of model training and a maximum time interval.
For details of the implementation process of the processor to perform each action, reference may be made to the related description in the foregoing method embodiment or apparatus embodiment, and details are not described herein again.
Further, as shown in fig. 6, the computer apparatus further includes: communication components 63, display 64, power components 65, audio components 66, and the like. Only some of the components are shown schematically in fig. 6, which does not mean that the computer device includes only the components shown in fig. 6. In addition, the components within the dashed box in fig. 6 are optional rather than mandatory components, which may be determined according to the product form of the computer device. The computer device of this embodiment may be implemented as a terminal device such as a desktop computer, a notebook computer, a smart phone, or an IoT device, or as a server device such as a conventional server, a cloud server, or a server array. If the computer device of this embodiment is implemented as a terminal device such as a desktop computer, a notebook computer, or a smart phone, it may include the components within the dashed box in fig. 6; if it is implemented as a server device such as a conventional server, a cloud server, or a server array, the components within the dashed box in fig. 6 may be omitted.
The embodiment of the present application further provides a computer device, which has the same structure as the computer device shown in fig. 6, but has different processing logic. Specifically, the computer device includes: a memory and a processor;
the memory may be configured to store a computer program and may be configured to store various other data to support operations on the computing platform. Examples of such data include instructions for any application or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory may be implemented by any type or combination of volatile and non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor, coupled to the memory, for executing the computer program in the memory to: acquiring a target video to be identified; extracting the features of the target video by using a feature extraction model to obtain a feature vector of the target video; identifying the target video according to the characteristic vector of the target video to obtain an identification result; the feature extraction model is a model obtained by training according to the model training method provided by the above embodiment.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program is capable of implementing the steps that can be executed by a computer device in the foregoing method embodiments when executed.
Accordingly, the present application also provides a computer program product, which includes a computer program/instruction, when the computer program/instruction is executed by a processor, the processor is enabled to implement the steps that can be executed by a computer device in the method embodiments described above.
The communication component is configured to facilitate communication between the device in which the communication component is located and other devices in a wired or wireless manner. The device where the communication component is located can access a wireless network based on a communication standard, such as a WiFi, a 2G, 3G, 4G/LTE, 5G and other mobile communication networks, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The display includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The power supply assembly provides power for various components of the equipment where the power supply assembly is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
The audio component may be configured to output and/or input an audio signal. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement the information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (15)

1. A method of model training, comprising:
acquiring a group of video clips corresponding to each of a plurality of sample videos, wherein each group of video clips comprises two video clips having visual consistency and other video clips whose visual consistency is not restricted, and the video clips in the same group of video clips have theme consistency;
carrying out model training on the feature extraction model by utilizing a plurality of groups of video clips to obtain a plurality of groups of feature vectors generated by the feature extraction model in model training of the current round;
generating a first loss value corresponding to the model training of the current round in the visual consistency dimension according to the plurality of groups of feature vectors, and generating a second loss value corresponding to the model training of the current round in the theme consistency dimension according to the theme consistency prediction model and the plurality of groups of feature vectors;
and when judging that the model training end condition is not met according to the first loss value and the second loss value, continuing to perform next round of model training on the feature extraction model.
2. The method of claim 1, wherein performing a current round of model training on the feature extraction model by using a plurality of groups of video segments to obtain a plurality of groups of feature vectors generated by the feature extraction model during the current round of model training comprises:
enhancing each video clip in each group of video clips;
and respectively inputting each video segment after the enhancement processing into the feature extraction model for feature extraction to obtain a feature vector of each video segment as a group of feature vectors.
3. The method of claim 1, wherein generating a first loss value corresponding to a current round of model training in a visual consistency dimension according to the plurality of sets of feature vectors comprises:
for each group of feature vectors, obtaining a third loss value corresponding to the group of feature vectors according to a similarity between a first feature vector and a second feature vector corresponding to two video segments with visual consistency in the group of feature vectors and the similarity between the first feature vector and the second feature vector and other feature vectors different from the first feature vector and the second feature vector in each group of feature vectors;
and obtaining a first loss value corresponding to the current round of model training in the visual consistency dimension according to a plurality of third loss values corresponding to the plurality of groups of feature vectors.
4. The method according to claim 3, wherein, for each group of feature vectors, obtaining the third loss value corresponding to the group of feature vectors according to the similarity between the first feature vector and the second feature vector corresponding to the two video segments with visual consistency in the group of feature vectors and the similarity between the first feature vector and the second feature vector and other feature vectors different from the first feature vector and the second feature vector in each group of feature vectors, comprises:
calculating a fourth loss value according to the similarity between the first feature vector and the second feature vector and the similarity between the first feature vector and other feature vectors except the first feature vector in each group of feature vectors;
calculating a fifth loss value according to the similarity between the first feature vector and the second feature vector and the similarity between the second feature vector and other feature vectors except the second feature vector in each group of feature vectors;
and determining the third loss value corresponding to the group of feature vectors according to the fourth loss value and the fifth loss value.
5. The method of claim 1, wherein generating second loss values corresponding to the current round of model training in the topic consistency dimension according to the topic consistency prediction model and the plurality of sets of feature vectors comprises:
carrying out vector splicing on any two feature vectors in the plurality of groups of feature vectors to obtain a plurality of splicing vectors;
inputting each splicing vector into a theme consistency prediction model so as to carry out theme consistency prediction on two video clips corresponding to each splicing vector;
and determining a second loss value corresponding to the training of the model in the current round on the theme consistency dimension according to the theme consistency prediction result and the classification label corresponding to each of the plurality of splicing vectors.
6. The method of claim 5, wherein determining a second loss value corresponding to the current round of model training in the topic consistency dimension according to the topic consistency prediction result and the classification label corresponding to each of the plurality of stitching vectors comprises:
for each splicing vector, if the splicing vector has a first classification label indicating that the topics of the two video segments corresponding to the splicing vector are consistent, taking the topic consistency prediction result of the splicing vector as an input parameter of a target loss function, and calculating a sixth loss value corresponding to the splicing vector; if the splicing vector has a second classification label indicating that the topics of the two video segments corresponding to the splicing vector are inconsistent, taking the difference between the topic consistency prediction result of the splicing vector and a specified value as the input parameter of the target loss function, and calculating the sixth loss value corresponding to the splicing vector;
And determining a second loss value corresponding to the training of the model in the current round on the theme consistency dimension according to a plurality of sixth loss values corresponding to the plurality of splicing vectors.
7. The method of claim 6, wherein determining a second loss value corresponding to the current round of model training in the topic consistency dimension according to a plurality of sixth loss values corresponding to a plurality of stitching vectors comprises:
averaging the sixth loss values with the first classification label to obtain a first average value, and averaging the sixth loss values with the second classification label to obtain a second average value;
and determining a second loss value corresponding to the training of the model in the current round on the theme consistency dimension according to the first average value and the second average value.
8. The method of claim 6, wherein the objective loss function is a Focal loss function for solving a data imbalance problem.
9. The method of claim 6, further comprising, prior to inputting each stitching vector into the topic consistency prediction model:
and inputting each splicing vector into a feature space conversion network so as to perform dimension reduction processing on the feature dimension of the splicing vector.
10. The method according to any one of claims 1 to 9, wherein obtaining a set of video segments corresponding to each of the plurality of sample videos comprises:
and for each sample video, sampling two video clips with visual consistency from the sample video according to the sampling time interval used by the current round of model training, randomly sampling other video clips from the sample video, and taking the two video clips with visual consistency and the other video clips as a group of video clips.
11. The method of claim 10, further comprising:
and determining the sampling time interval used by the model training according to the training progress and the maximum time interval of the model training of the round.
12. A video processing method, comprising:
acquiring a target video to be identified;
performing feature extraction on the target video by using a feature extraction model to obtain a feature vector of the target video;
identifying the target video according to the characteristic vector of the target video to obtain an identification result;
wherein the feature extraction model is a model trained according to the model training method of any one of claims 1 to 11.
13. The method of claim 12, further comprising:
and triggering the augmented reality AR scene and/or the virtual reality VR scene to execute corresponding operation according to the identification result.
14. A computer device, comprising: a memory and a processor; the memory for storing a computer program; the processor is coupled to the memory for executing the computer program for performing the steps of the method of any of claims 1-13.
15. A computer storage medium having a computer program stored thereon, which, when executed by a processor, causes the processor to carry out the steps of the method of any one of claims 1 to 13.
CN202210249167.6A 2022-03-14 2022-03-14 Model training method, video processing method, computer device, and medium Pending CN114677623A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210249167.6A CN114677623A (en) 2022-03-14 2022-03-14 Model training method, video processing method, computer device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210249167.6A CN114677623A (en) 2022-03-14 2022-03-14 Model training method, video processing method, computer device, and medium

Publications (1)

Publication Number Publication Date
CN114677623A true CN114677623A (en) 2022-06-28

Family

ID=82074758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210249167.6A Pending CN114677623A (en) 2022-03-14 2022-03-14 Model training method, video processing method, computer device, and medium

Country Status (1)

Country Link
CN (1) CN114677623A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115830489A (en) * 2022-11-03 2023-03-21 南京小网科技有限责任公司 Intelligent dynamic analysis system based on ai identification
CN115830489B (en) * 2022-11-03 2023-10-20 南京小网科技有限责任公司 Intelligent dynamic analysis system based on ai identification

Similar Documents

Publication Publication Date Title
CN109104620B (en) Short video recommendation method and device and readable medium
CN112740709B (en) Computer-implemented method, computing device, and computer-readable medium for performing gating for video analytics
US10789456B2 (en) Facial expression recognition utilizing unsupervised learning
US20190147105A1 (en) Partitioning videos
CN109308490B (en) Method and apparatus for generating information
CN112889108A (en) Speech classification using audiovisual data
CN113348486A (en) Image display with selective motion description
US11816876B2 (en) Detection of moment of perception
CN112804558B (en) Video splitting method, device and equipment
US20220101539A1 (en) Sparse optical flow estimation
US10853417B2 (en) Generating a platform-based representative image for a digital video
CN114677623A (en) Model training method, video processing method, computer device, and medium
CN112053366A (en) Model training method, sample generating method, electronic device and storage medium
CN114170425A (en) Model training method, image classification method, server and storage medium
CN112364933A (en) Image classification method and device, electronic equipment and storage medium
CN116704433A (en) Self-supervision group behavior recognition method based on context-aware relationship predictive coding
US20210166073A1 (en) Image generation method and computing device
CN116543351A (en) Self-supervision group behavior identification method based on space-time serial-parallel relation coding
CN114038067B (en) Coal mine personnel behavior detection method, equipment and storage medium
Lu et al. TURBO: Opportunistic enhancement for edge video analytics
CN114943336A (en) Model pruning method, device, equipment and storage medium
CN113326760B (en) Video classification method and device
CN113011320A (en) Video processing method and device, electronic equipment and storage medium
CN117495853B (en) Video data processing method, device and storage medium
Perochon et al. Unsupervised Action Segmentation of Untrimmed Egocentric Videos

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination