CN112153370B - Video action quality evaluation method and system based on group sensitivity contrast regression - Google Patents


Info

Publication number: CN112153370B
Application number: CN202010857886.7A
Authority: CN (China)
Prior art keywords: score, video, regression, example video, difference
Legal status: Active (granted; the listed status is an assumption, not a legal conclusion)
Priority/filing date: 2020-08-24
Other languages: Chinese (zh)
Other versions: CN112153370A (en)
Inventors: 鲁继文 (Jiwen Lu), 周杰 (Jie Zhou), 饶永铭 (Yongming Rao), 于旭敏 (Xumin Yu)
Current and original assignee: Tsinghua University
Application filed by Tsinghua University
Priority claimed from CN202010857886.7A
Publication of application: CN112153370A
Application granted; publication of grant: CN112153370B

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00Diagnosis, testing or measuring for television systems or their details

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video action quality evaluation method and system based on group-sensitive contrastive regression. The method comprises the following steps: selecting a corresponding example video and its score for the current video; extracting spatio-temporal features of the current video and the example video with a deep learning model and constructing a merged feature; and constructing a group-sensitive regression tree network that regresses the merged feature to a final difference score, which is combined with the example video score to obtain the current video score. By modeling the difference between the target action and the example action, the method obtains the final target action score and improves the action quality evaluation accuracy of the model.

Description

Video action quality evaluation method and system based on group sensitivity contrast regression
Technical Field
The invention relates to the technical field of computer vision and deep learning, in particular to a video action quality evaluation method and system based on group-sensitive contrastive regression.
Background
Action Quality Assessment (AQA) in videos, which aims to evaluate how well a particular action is performed, has received increasing attention in recent years because it plays a vital role in many real-world applications, including sports and healthcare. Unlike conventional action analysis tasks such as action detection and recognition, AQA is more challenging because it requires predicting fine-grained scores from videos containing the same category of action. Given that the differences among videos are tied to the differences in their action scores, the key to solving this problem is to find the differences between videos and predict scores from those differences.
Most recent methods are based on regression algorithms that predict a score directly from a single video. Despite some promising results, AQA still faces two challenges. First, since score labels are typically annotated by human judges (e.g., the score of a diving attempt is calculated by aggregating the marks of several judges), the subjectivity of these assessments makes scores difficult to predict accurately. Second, the differences between videos in AQA tasks are very small, as athletes typically perform the same actions in similar environments.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a video action quality evaluation method based on group-sensitive contrastive regression, which improves the action quality evaluation accuracy of the model.
Another object of the present invention is to provide a video action quality evaluation system based on group-sensitive contrastive regression.
In order to achieve the above objects, an embodiment of the present invention provides a video action quality evaluation method based on group-sensitive contrastive regression, including the following steps: step S1, selecting a corresponding example video and an example video score according to the current video; step S2, performing spatio-temporal feature extraction on the current video and the example video respectively by using a deep learning model, and constructing a merged feature; step S3, constructing a group-sensitive regression tree network, regressing the merged feature to obtain a final difference score, and combining the final difference score with the example video score to obtain the current video score.
The video action quality evaluation method based on group-sensitive contrastive regression provided by the embodiment of the invention proposes a contrastive regression learning method that models the action quality evaluation problem as regressing the score difference between the current video and an example video, thereby improving the action quality evaluation accuracy of the model. Meanwhile, a group-sensitive regression tree structure is constructed that converts the traditional score regression into two simpler sub-problems, coarse-to-fine classification and regression within small intervals, which improves the interpretability and evaluation capability of the regressor.
In addition, the video action quality evaluation method based on group-sensitive contrastive regression according to the above embodiment of the present invention may further have the following additional technical features:
Further, in an embodiment of the present invention, the step S2 includes: encoding the spatio-temporal information of the current video and the example video respectively through the deep learning model, concatenating the two encodings in the feature dimension, and appending the example video score to form the merged feature.
Further, in an embodiment of the present invention, each leaf node in the group-sensitive regression tree network represents a preset difference score interval, and the samples in each interval are balanced.
Further, in an embodiment of the present invention, in the group-sensitive regression tree network, a group sensitivity analysis is performed on each leaf node to obtain a classification probability and a relative position within the group.
In order to achieve the above objects, another embodiment of the present invention provides a video action quality evaluation system based on group-sensitive contrastive regression, including: a selection module for selecting a corresponding example video and an example video score according to a current video; an extraction module for performing spatio-temporal feature extraction on the current video and the example video respectively by using a deep learning model and constructing a merged feature; and a regression and score combination module for constructing a group-sensitive regression tree network, performing regression on the merged feature to obtain a final difference score, and combining the final difference score with the example video score to obtain the current video score.
According to the video action quality evaluation system based on group-sensitive contrastive regression of the embodiment of the invention, a contrastive regression learning method is provided that models the action quality evaluation problem as regressing the score difference between the current video and an example video, improving the action quality evaluation accuracy of the model. Meanwhile, a group-sensitive regression tree structure is constructed that converts the traditional score regression into two simpler sub-problems, coarse-to-fine classification and regression within small intervals, which improves the interpretability and evaluation capability of the regressor.
In addition, the video action quality evaluation system based on group-sensitive contrastive regression according to the above embodiment of the present invention may further have the following additional technical features:
Further, in an embodiment of the present invention, the extraction module is specifically configured to: encode the spatio-temporal information of the current video and the example video respectively through the deep learning model, concatenate the two encodings in the feature dimension, and append the example video score to form the merged feature.
Further, in an embodiment of the present invention, each leaf node in the group-sensitive regression tree network represents a preset difference score interval, and the samples in each interval are balanced.
Further, in an embodiment of the present invention, in the group-sensitive regression tree network, a group sensitivity analysis is performed on each leaf node to obtain a classification probability and a relative position within the group.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a video action quality evaluation method based on group-sensitive contrastive regression according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the detailed operation of a video action quality evaluation method based on group-sensitive contrastive regression according to an embodiment of the present invention;
FIG. 3 is a diagram of a group-sensitive regression tree structure, according to one embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a video action quality evaluation system based on group-sensitive contrastive regression according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Rather than directly learning to predict an unknown score, the present invention re-models the AQA problem as regressing a difference score with reference to another video sharing the same attributes (such as the same sport category or the same difficulty level). By introducing exemplars into score prediction, the regressor can reference the known scores given by human judges and is encouraged to predict the current video score from the subtle differences between the current video and the exemplar.
The following describes the video action quality evaluation method and system based on group-sensitive contrastive regression according to embodiments of the present invention with reference to the drawings, beginning with the method.
Fig. 1 is a flowchart of a video action quality evaluation method based on group-sensitive contrastive regression according to an embodiment of the present invention.
As shown in fig. 1, the video action quality evaluation method based on group-sensitive contrastive regression includes the following steps:
in step S1, the corresponding example video and example video score are selected according to the current video.
Specifically, a current input video is acquired, a corresponding example video and a score of the example video are selected for the current input video, and preparation is made for later calculation.
In step S2, spatio-temporal feature extraction is performed on the current video and the example video using the deep learning model, respectively, and merged features are constructed.
Further, in one embodiment of the present invention, step S2 includes:
The deep learning model encodes the spatio-temporal information of the current video and the example video respectively; the two encodings are concatenated in the feature dimension, and the example video score is appended to form the merged feature.
Specifically, as shown in fig. 2, to model the contrastive difference information between the current video and the example video, the embodiment of the present invention feeds the two video segments into a pre-trained deep learning model (e.g., I3D) for spatio-temporal encoding, extracting the spatio-temporal features f1 of the current input video and the spatio-temporal features f2 of the example video (the two videos share the backbone weights during extraction). The two feature vectors are then concatenated in the feature dimension, the example video score is appended during concatenation, and the merged feature is finally obtained.
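As a minimal sketch of the merging step (the feature width `FEAT_DIM` and the toy values below are hypothetical, and the backbone is assumed to have already produced fixed-length feature vectors):

```python
FEAT_DIM = 4  # hypothetical feature width; real I3D features are much wider

def build_merged_feature(feat_current, feat_exemplar, exemplar_score):
    """Concatenate the two spatio-temporal feature vectors along the
    feature dimension and append the exemplar's known score."""
    assert len(feat_current) == len(feat_exemplar)
    return list(feat_current) + list(feat_exemplar) + [exemplar_score]

# Toy example: two 4-dim feature vectors plus an exemplar score of 86.4.
f1 = [0.1, 0.2, 0.3, 0.4]   # current-video features
f2 = [0.5, 0.6, 0.7, 0.8]   # exemplar-video features (shared backbone)
merged = build_merged_feature(f1, f2, 86.4)
print(len(merged))  # 2 * FEAT_DIM + 1 = 9
```

In the actual network the concatenation would be done on batched tensors, but the layout of the merged feature is the same: both encodings side by side, with the exemplar score as one extra channel.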
It should be noted that, during training, for each current video, the present invention randomly selects one video from the eligible example videos for contrastive regression. During testing, N eligible example videos are randomly selected, contrastive regression is performed against each one, and the N evaluation results are averaged to obtain the final predicted score.
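The test-time multi-exemplar voting can be sketched as follows (the `regressor` here is a hypothetical stand-in for the trained network; only the averaging logic is shown):

```python
import random

def predict_score(regressor, current_video, exemplar_pool, n_exemplars):
    """Average N contrastive predictions: each exemplar contributes its
    known score plus the regressed difference to the current video."""
    chosen = random.sample(exemplar_pool, n_exemplars)
    preds = [score_e + regressor(current_video, video_e)
             for video_e, score_e in chosen]
    return sum(preds) / len(preds)

# Toy stand-in: a "regressor" that always predicts a difference of +2.0.
toy_regressor = lambda cur, ex: 2.0
pool = [("video_a", 80.0), ("video_b", 84.0), ("video_c", 88.0)]
print(predict_score(toy_regressor, "video_x", pool, n_exemplars=3))  # 86.0
```

Averaging over several exemplars reduces the variance introduced by any single exemplar's score, at the cost of N forward passes per test video.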
In step S3, a group-sensitive regression tree network is constructed, the merged features are regressed to obtain a final difference score, and the final difference score is combined with the example video score to obtain a current video score.
Further, in one embodiment of the present invention, each leaf node in the group-sensitive regression tree network represents a preset difference score interval, and the samples in each interval are balanced.
Further, in an embodiment of the present invention, a group sensitivity analysis is performed on each leaf node in the group-sensitive regression tree network to obtain a classification probability and a relative position within the group.
Specifically, as shown in fig. 3, to match the contrastive formulation and improve the interpretability of the deep learning model, the embodiment of the present invention designs a regression tree network in the form of a binary tree, namely the group-sensitive regression tree network, and inputs the merged feature from step S2 into it for regression to obtain the difference score.
Further, as shown in fig. 2, the full range of possible difference scores is first partitioned so that each leaf node of the regression tree represents a specific difference score interval, with the samples in each interval kept balanced. At each node of the regression tree, the difference score is compared once with the node's threshold, yielding a binary classification, i.e., "greater" or "less". The split probabilities along the layers are multiplied to obtain the probability of each leaf node. Taking the leaf node with the maximum probability constrains the difference score from the whole range to a specific subinterval. Finally, score regression within the subinterval yields the final difference score.
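The routing logic above can be sketched as follows (a simplified version in which one split probability is shared by all nodes at the same depth; in the actual network every internal node has its own learned split, and the intervals and positions below are illustrative values):

```python
def leaf_probabilities(split_probs):
    """Given the probability of taking the 'greater' branch at each depth,
    return the probability of each of the 2**depth leaves, obtained by
    multiplying the branch probabilities along every root-to-leaf path."""
    probs = [1.0]
    for p_greater in split_probs:
        probs = [q * branch for q in probs
                 for branch in (1.0 - p_greater, p_greater)]
    return probs

def predict_difference(split_probs, intervals, within_interval_pos):
    """Pick the most probable leaf, then map the regressed position in
    [0, 1] into that leaf's difference-score interval."""
    probs = leaf_probabilities(split_probs)
    k = max(range(len(probs)), key=probs.__getitem__)
    lo, hi = intervals[k]
    return lo + within_interval_pos[k] * (hi - lo)

# Depth-2 tree: 4 leaves covering difference scores in [-10, 10).
intervals = [(-10, -5), (-5, 0), (0, 5), (5, 10)]
probs = leaf_probabilities([0.9, 0.2])   # ~[0.08, 0.02, 0.72, 0.18]
# Most probable leaf is interval (0, 5); position 0.4 maps to 0 + 0.4*5.
print(predict_difference([0.9, 0.2], intervals, [0.5, 0.5, 0.4, 0.5]))  # 2.0
```

This shows the two sub-problems explicitly: the leaf-probability product performs the coarse-to-fine classification, and the within-interval position performs the fine regression inside the selected subinterval.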
Finally, the score difference regressed from the outputs of the leaf nodes of the regression tree is combined with the score of the example video to obtain an accurate score for the current video.
The video action quality evaluation method based on group-sensitive contrastive regression provided by the embodiment of the invention builds on the contrastive learning strategy from the metric learning literature to propose a contrastive regression learning method: the action quality evaluation problem is modeled as regressing the score difference between the current video and an example video, which improves the action quality evaluation accuracy of the model. Meanwhile, a group-sensitive regression tree structure is constructed that converts the traditional score regression into two simpler sub-problems, coarse-to-fine classification and regression within small intervals, improving the interpretability and evaluation capability of the regressor.
Next, a video action quality evaluation system based on group-sensitive contrastive regression according to an embodiment of the present invention will be described with reference to the drawings.
Fig. 4 is a schematic structural diagram of a video action quality evaluation system based on group-sensitive contrastive regression according to an embodiment of the present invention.
As shown in fig. 4, the system 10 includes: a selection module 100, an extraction module 200 and a regression and score combination module 300.
Wherein the selection module 100 is configured to select a corresponding example video and an example video score according to the current video. The extraction module 200 is configured to perform spatio-temporal feature extraction on the current video and the example video respectively by using a deep learning model, and construct a merged feature. The regression and score combination module 300 is configured to construct a group-sensitive regression tree network, perform regression on the merged feature to obtain a final difference score, and combine the final difference score with the example video score to obtain the current video score.
Further, in an embodiment of the present invention, the extraction module is specifically configured to: encode the spatio-temporal information of the current video and the example video respectively through the deep learning model, concatenate the two encodings in the feature dimension, and append the example video score to form the merged feature.
Further, in one embodiment of the present invention, each leaf node in the group-sensitive regression tree network represents a preset difference score interval, and the samples in each interval are balanced.
Further, in an embodiment of the present invention, a group sensitivity analysis is performed on each leaf node in the group-sensitive regression tree network to obtain a classification probability and a relative position within the group.
According to the video action quality evaluation system based on group-sensitive contrastive regression of the embodiment of the invention, a contrastive regression learning method is proposed based on the contrastive learning strategy from the metric learning literature: the action quality evaluation problem is modeled as regressing the score difference between the current video and an example video, which improves the action quality evaluation accuracy of the model. Meanwhile, a group-sensitive regression tree structure is constructed that converts the traditional score regression into two simpler sub-problems, coarse-to-fine classification and regression within small intervals, improving the interpretability and evaluation capability of the regressor.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (6)

1. A video action quality evaluation method based on group-sensitive contrastive regression, characterized by comprising the following steps:
step S1, selecting a corresponding example video and an example video score according to the current video;
step S2, performing spatio-temporal feature extraction on the current video and the example video respectively by using a deep learning model, and constructing a merged feature, which includes: encoding the spatio-temporal information of the current video and the example video respectively through the deep learning model, concatenating the two encodings in the feature dimension, and appending the example video score to form the merged feature; and
step S3, constructing a group-sensitive regression tree network and regressing the merged feature to obtain a difference score, wherein the full range of possible difference scores is first partitioned so that each leaf node of the regression tree represents a specific difference score interval and the samples in each interval are balanced; at each node of the regression tree, the difference score is compared once with the node's threshold to yield a binary classification; the split probabilities of the layers are multiplied to obtain the final probability of each leaf node; the leaf node with the maximum probability is taken, constraining the difference score from the whole range to a specific subinterval; score regression within the subinterval is performed to obtain the final difference score; and the final difference score is combined with the example video score to obtain a current video score.
2. The method of claim 1, wherein each leaf node in the group-sensitive regression tree network represents a preset difference score interval, and the samples in each interval are balanced.
3. The method of claim 2, wherein the group-sensitive regression tree network performs a group sensitivity analysis on each leaf node to obtain a classification probability and a relative position within the group.
4. A video action quality evaluation system based on group-sensitive contrastive regression, characterized by comprising:
a selection module for selecting a corresponding example video and an example video score according to a current video;
an extraction module for performing spatio-temporal feature extraction on the current video and the example video respectively by using a deep learning model and constructing a merged feature, wherein the extraction module encodes the spatio-temporal information of the current video and the example video respectively through the deep learning model, concatenates the two encodings in the feature dimension, and appends the example video score to form the merged feature; and
a regression and score combination module for constructing a group-sensitive regression tree network and regressing the merged feature to obtain a difference score, wherein the full range of possible difference scores is first partitioned so that each leaf node of the regression tree represents a specific difference score interval and the samples in each interval are balanced; at each node of the regression tree, the difference score is compared once with the node's threshold to yield a binary classification; the split probabilities of the layers are multiplied to obtain the final probability of each leaf node; the leaf node with the maximum probability is taken, constraining the difference score from the whole range to a specific subinterval; score regression within the subinterval is performed to obtain the final difference score; and the final difference score is combined with the example video score to obtain the current video score.
5. The system of claim 4, wherein each leaf node in the group-sensitive regression tree network represents a preset difference score interval, and the samples in each interval are balanced.
6. The system of claim 5, wherein the group-sensitive regression tree network performs a group sensitivity analysis on each leaf node to obtain a classification probability and a relative position within the group.
CN202010857886.7A 2020-08-24 2020-08-24 Video action quality evaluation method and system based on group sensitivity contrast regression Active CN112153370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010857886.7A CN112153370B (en) 2020-08-24 2020-08-24 Video action quality evaluation method and system based on group sensitivity contrast regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010857886.7A CN112153370B (en) 2020-08-24 2020-08-24 Video action quality evaluation method and system based on group sensitivity contrast regression

Publications (2)

Publication Number Publication Date
CN112153370A CN112153370A (en) 2020-12-29
CN112153370B true CN112153370B (en) 2021-12-24

Family

ID=73888289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010857886.7A Active CN112153370B (en) 2020-08-24 2020-08-24 Video action quality evaluation method and system based on group sensitivity contrast regression

Country Status (1)

Country Link
CN (1) CN112153370B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107318014A (en) * 2017-07-25 2017-11-03 西安电子科技大学 The video quality evaluation method of view-based access control model marking area and space-time characterisation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070283400A1 (en) * 2006-05-31 2007-12-06 Minkyu Lee Method and apparatus for performing real-time on-line video quality monitoring for digital cable and IPTV services
US9277208B2 (en) * 2013-11-12 2016-03-01 Oovoo, Llc System and method for estimating quality of video with frame freezing artifacts
CN108235001B (en) * 2018-01-29 2020-07-10 上海海洋大学 Deep sea video quality objective evaluation method based on space-time characteristics
CN108989802B (en) * 2018-08-14 2020-05-19 华中科技大学 HEVC video stream quality estimation method and system by utilizing inter-frame relation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107318014A (en) * 2017-07-25 2017-11-03 西安电子科技大学 The video quality evaluation method of view-based access control model marking area and space-time characterisation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Uncertainty-aware Score Distribution Learning for Action Quality Assessment; Yansong Tang; IEEE; 2020-08-05; pp. 1-10 *

Also Published As

Publication number Publication date
CN112153370A (en) 2020-12-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant