CN118015708A - Diving movement quality assessment method, device and equipment based on judge score learning - Google Patents

Diving movement quality assessment method, device and equipment based on judge score learning

Info

Publication number
CN118015708A
CN118015708A (application CN202410411050.2A)
Authority
CN
China
Prior art keywords
video
score
sequence
evaluated
feature
Prior art date
Legal status
Granted
Application number
CN202410411050.2A
Other languages
Chinese (zh)
Other versions
CN118015708B (en)
Inventor
张洪博 (Zhang Hongbo)
丘鸿铭 (Qiu Hongming)
雷庆 (Lei Qing)
徐威腾 (Xu Weiteng)
Current Assignee
Huaqiao University
Original Assignee
Huaqiao University
Priority date
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority claimed from application CN202410411050.2A
Publication of CN118015708A
Application granted
Publication of CN118015708B
Status: Active


Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/045: Combinations of networks
                  • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
              • G06N 3/08: Learning methods
                • G06N 3/0985: Hyperparameter optimisation; Meta-learning; Learning-to-learn
        • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00: Arrangements for image or video recognition or understanding
            • G06V 10/40: Extraction of image or video features
              • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
                • G06V 10/443: Local feature extraction by matching or filtering
                  • G06V 10/449: Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
                    • G06V 10/451: Filters with interaction between the filter responses, e.g. cortical complex cells
                      • G06V 10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
              • G06V 10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
            • G06V 10/70: Arrangements using pattern recognition or machine learning
              • G06V 10/764: Using classification, e.g. of video objects
              • G06V 10/766: Using regression, e.g. by projecting features on hyperplanes
              • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                  • G06V 10/806: Fusion of extracted features
              • G06V 10/82: Using neural networks
          • G06V 20/00: Scenes; Scene-specific elements
            • G06V 20/40: Scenes; Scene-specific elements in video content
              • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
                • G06V 20/42: Of sport video content
              • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
          • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
            • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • A: HUMAN NECESSITIES
      • A63: SPORTS; GAMES; AMUSEMENTS
        • A63B: APPARATUS FOR PHYSICAL TRAINING, GYMNASTICS, SWIMMING, CLIMBING, OR FENCING; BALL GAMES; TRAINING EQUIPMENT
          • A63B 24/00: Electric or electronic controls for exercising apparatus of preceding groups; Controlling or monitoring of exercises, sportive games, training or athletic performances
            • A63B 24/0003: Analysing the course of a movement or motion sequences during an exercise or trainings sequence, e.g. swing for golf or tennis
          • A63B 69/00: Training appliances or apparatus for special sports
          • A63B 2244/00: Sports without balls
            • A63B 2244/20: Swimming
              • A63B 2244/203: Diving

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Physical Education & Sports Medicine (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

A diving movement quality assessment method, device and equipment based on referee score learning, relating to the technical field of movement quality assessment. The method comprises: S1, obtaining a video to be evaluated; S2, sampling the video to be evaluated to obtain a video frame sequence to be evaluated; S3, selecting a comparison video frame sequence from the training set; S4, encoding both frame sequences with a spatial feature encoder ViT to obtain two image feature sequences; S5, encoding both image feature sequences with a temporal feature encoder TE to obtain a video-level feature sequence to be evaluated and a comparison video-level feature sequence; S6, inputting the two feature sequences into a referee score learning evaluation network to obtain an action quality score. The referee score learning evaluation network learns to generate differentiated referee score features using a cross feature fusion network based on a cross-attention mechanism and a contrastive action feature decoder based on a Transformer decoder, and then uses a score prediction network to predict the action quality score from the referee score features.

Description

Diving movement quality assessment method, device and equipment based on judge score learning
Technical Field
The invention relates to the technical field of action quality assessment, in particular to a diving movement quality assessment method, device and equipment based on judge score learning.
Background
Diving has a long history and is widely loved. Like many other action quality assessment tasks, scoring a dive relies on the judgment of human referees, which makes the scores subjective. Motion quality assessment for diving has therefore attracted close attention from both academia and industry.
As an extension of action recognition, action quality assessment aims not only to recognize actions but also to produce quality assessment results and feedback on action execution. Its difficulty lies in extracting very fine-grained difference information from a continuous video sequence, since this information jointly determines the final assessment. Although there have been many studies and attempts on this task, many difficulties remain unresolved, so research in this field is of great significance.
The existing action quality assessment methods fall mainly into three categories. The first treats motion quality assessment as a score regression problem and directly predicts the quality score, typically extracting key features from the video with backbone networks such as C3D, P3D and I3D. The second combines auxiliary tasks such as category classification and object detection to improve the evaluation. The third adopts a pairwise contrastive learning strategy, improving accuracy by learning relative scores between different video samples.
However, despite some progress in motion quality assessment, these methods have drawbacks. For example, existing methods tend to ignore the importance of small differences between videos, which are critical for assessing diving action quality.
In addition, the shortcomings of existing action quality assessment methods mainly appear in the following aspects:
Limitations of feature extraction: traditional 3D CNN methods extract features from sampled video segments, which may break the temporal completeness of the motion and thus affect the accuracy of the assessment.
Insufficient feature fusion: existing methods often mine differences between feature sequences through feature concatenation or fusion based on convolutional neural networks, but such methods cannot model the associations between the feature sequences well.
Mismatch with scoring rules: existing methods often fail to fully account for the specific rules of referee scoring, such as the refinement of quality scores and the prediction of relative scores, which may cause the assessment results to deviate from the actual scoring standard.
In view of the above, the applicant has studied the prior art and has made the present application.
Disclosure of Invention
The invention provides a diving movement quality assessment method, device and equipment based on referee score learning, so as to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for evaluating quality of diving sports based on referee score learning, which includes steps S1 to S6.
S1, acquiring a video to be evaluated of diving movement.
S2, sampling is carried out according to the original video frame sequence of the video to be evaluated, and the video frame sequence to be evaluated is obtained.
S3, selecting a comparison video frame sequence from the training set.
S4, encoding the video frame sequence to be evaluated and the contrast video frame sequence through a spatial feature encoder ViT respectively to extract spatial features and obtain an image feature sequence to be evaluated and a contrast image feature sequence.
S5, encoding the image feature sequence to be evaluated and the contrast image feature sequence through a time sequence feature encoder TE respectively to extract time sequence features, and obtaining a video level feature sequence to be evaluated and a contrast video level feature sequence.
S6, inputting the video-level feature sequence to be evaluated and the comparison video-level feature sequence into a pre-trained referee score learning evaluation network to obtain the action quality score of the video to be evaluated. The referee score learning evaluation network learns to generate differentiated referee score features using a cross feature fusion network based on a cross-attention mechanism and a contrastive action feature decoder based on a Transformer decoder, and then uses a score prediction network to predict the action quality score from the referee score features.
In an alternative embodiment, step S2 is specifically: sampling with a fixed-interval sampling strategy according to the original video frame sequence of the video to be evaluated to obtain the video frame sequence to be evaluated, wherein the video frame sequence to be evaluated comprises a complete motion sequence.
In an alternative embodiment, step S2 during model training is specifically: sampling the original frame sequence with the fixed-interval and off-center sampling strategies respectively according to the input video frame sequence to generate new samples, so as to expand the training set.
Preferably, the fixed-interval sampling strategy is: for an original video frame sequence of L frames, sample N frames at a fixed step s = L/N; the s possible start offsets yield s new samples.
Preferably, the off-center sampling strategy is: divide the complete sequence equally into three parts, taking fewer frames from the front and rear parts and more frames from the middle part; sampling each part on its odd and even frames then yields 2 new samples.
In an alternative embodiment, step S5 specifically includes steps S51 to S53.
S51, concatenating a class token to the head of the image feature sequence to be evaluated and of the comparison image feature sequence respectively, to represent the video-level features.
S52, embedding position encodings into the class-token-concatenated image feature sequences to be evaluated and comparison image feature sequences respectively, so as to retain the relative position information of the feature sequences.
S53, encoding, by the temporal feature encoder TE, the position-encoded image feature sequence to be evaluated and comparison image feature sequence respectively, so as to extract temporal features and obtain the video-level feature sequence to be evaluated and the comparison video-level feature sequence. The temporal feature encoder TE is composed of stacked Transformer blocks.
In an alternative embodiment, step S6 specifically includes steps S61 through S63.
S61, performing cross feature fusion on the video-level feature sequence to be evaluated and the comparison video-level feature sequence encoded by the temporal feature encoder TE, obtaining the relative representation of the video features. The cross feature fusion takes the temporally encoded comparison video-level feature sequence as the Query of the cross-attention operation and the temporally encoded video-level feature sequence to be evaluated as the Key and Value, iteratively updating to capture the difference between the target video feature sequence and the comparison video feature sequence. The cross-compared video-level features to be evaluated and the comparison video-level features are then concatenated to obtain the relative representation of the video features.
S62, decoding the relative representation of the video features with the contrastive action decoder to obtain the referee score features. The contrastive action decoder is composed of stacked Transformer blocks. The relative representation of the video features obtained through cross feature fusion serves as the Key and Value of the multi-head attention mechanism in each Transformer block, and a number of learnable Queries are set as the referee score features. The input of each Transformer block is the Query output by the previous block, so the Queries are updated iteratively, and the output of the last block is the generated referee score features.
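The query-based decoding in S62 can be sketched as follows. This is a minimal, attention-only stand-in (layer norms, feed-forward sublayers and learned projections are omitted), and the array shapes, the residual update rule and the layer count are illustrative assumptions, not the patented implementation:

```python
import numpy as np

def decoder_layer(queries, memory):
    """One Transformer-decoder-style layer, attention only: the learnable
    referee-score queries attend over the fused video representation
    (memory) as Key/Value, and are updated with a residual connection."""
    scale = np.sqrt(memory.shape[1])
    w = np.exp(queries @ memory.T / scale)   # unnormalized attention weights
    w /= w.sum(axis=1, keepdims=True)        # softmax over memory positions
    return queries + w @ memory              # residual update of the queries

def referee_score_features(queries, memory, n_layers=3):
    """Iteratively refine the queries; the last layer's output is taken as
    the generated referee score features (one per virtual referee)."""
    for _ in range(n_layers):
        queries = decoder_layer(queries, memory)
    return queries
```

For seven virtual referees one would use seven query vectors; each refined query then feeds the score prediction network.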
S63, predicting, by the score prediction network, with a coarse-to-fine score prediction method according to the referee score features, obtaining the action quality score of the video to be evaluated.
In an alternative embodiment, step S63 specifically includes steps S631 to S634.
S631, predicting a plurality of relative referee score intervals through a classification network according to the referee score features; predicting the score offset within each interval through a regression network; and obtaining a plurality of relative referee scores by calculation.
S632, obtaining a plurality of real referee scores of the comparison video frame sequence.
S633, adding the plurality of relative referee scores to the plurality of real referee scores respectively to obtain a plurality of predicted referee scores.
S634, obtaining the action quality score according to the plurality of predicted referee scores: the final action quality score is calculated from the predicted referee scores according to the actual scoring rule.
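As a concrete illustration of S634, one widely used diving rule (e.g. FINA-style individual events with seven judges) drops the two highest and two lowest awards, sums the middle three and multiplies by the dive's degree of difficulty. The patent does not spell out which rule it applies, so the following is only an assumed example:

```python
def final_action_score(predicted_referee_scores, degree_of_difficulty):
    """Aggregate predicted referee scores into a final action quality score
    under an assumed FINA-style 7-judge rule: drop the two highest and two
    lowest awards, sum the remaining three, multiply by the difficulty."""
    s = sorted(predicted_referee_scores)
    return sum(s[2:-2]) * degree_of_difficulty
```

For awards [5.0, 6.0, 6.5, 7.0, 7.0, 7.5, 8.0] and difficulty 3.0 this yields (6.5 + 7.0 + 7.0) × 3.0 = 61.5.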
In an alternative embodiment, the score prediction network consists of a classifier for predicting the relative referee score interval and a regressor for determining the final predicted relative referee score. The classifier comprises 4 fully connected layers with 1536, 512, 128 and 11 nodes respectively. The regressor comprises 4 fully connected layers with 1536, 512, 128 and 1 nodes respectively.
The prediction process is defined as:

c_i = Cls(q_i), δ_i = Reg(q_i), ŝ_i = l(c_i) + δ_i · (r(c_i) − l(c_i))

where c_i denotes the predicted score interval, Cls denotes the classifier, q_i is the i-th referee score feature, δ_i is the predicted offset within the interval, Reg denotes the regressor, ŝ_i is the i-th predicted relative referee score, and l(c_i) and r(c_i) denote the left and right endpoints of the predicted score interval.
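A minimal sketch of this coarse-to-fine prediction, with the neural classifier and regressor replaced by plain inputs (interval logits and a regressed offset). The unit-width intervals in the usage note are an assumption suggested by the 11-node classifier head, not a detail stated in the patent:

```python
def predict_relative_score(interval_logits, delta, intervals):
    """Coarse-to-fine score prediction: the classifier output picks a score
    interval (argmax over logits); the regressed offset delta in [0, 1]
    then locates the score inside that interval."""
    c = max(range(len(interval_logits)), key=interval_logits.__getitem__)
    left, right = intervals[c]
    return left + delta * (right - left)
```

Usage with an assumed set of 11 unit-width intervals: `intervals = [(i, i + 1) for i in range(11)]`; logits peaking at index 6 with delta = 0.5 give a score of 6.5.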
In a second aspect, an embodiment of the present invention provides a diving movement quality assessment device based on referee score learning, which includes:
and the initial video acquisition module is used for acquiring the video to be evaluated of the diving movement.
And the sampling module is used for sampling according to the original video frame sequence of the video to be evaluated to obtain the video frame sequence to be evaluated.
And the contrast video acquisition module is used for selecting a contrast video frame sequence from the training set.
And the spatial coding module is used for respectively coding the video frame sequence to be evaluated and the contrast video frame sequence through a spatial feature coder ViT so as to extract spatial features and obtain an image feature sequence to be evaluated and a contrast image feature sequence.
And the time coding module is used for respectively coding the image characteristic sequence to be evaluated and the contrast image characteristic sequence through a time sequence characteristic coder TE so as to extract time sequence characteristics and obtain a video level characteristic sequence to be evaluated and a contrast video level characteristic sequence.
A scoring module for inputting the video-level feature sequence to be evaluated and the comparison video-level feature sequence into a pre-trained referee score learning evaluation network to obtain the action quality score of the video to be evaluated. The referee score learning evaluation network learns to generate differentiated referee score features using a cross feature fusion network based on a cross-attention mechanism and a contrastive action feature decoder based on a Transformer decoder, and then uses a score prediction network to predict the action quality score from the referee score features.
In a third aspect, an embodiment of the present invention provides a diving sport quality assessment device based on referee score learning, which is characterized by comprising a processor, a memory, and a computer program stored in the memory. The computer program is executable by the processor to implement a diving sport quality assessment method based on referee score learning as described in any one of the paragraphs of the first aspect.
By adopting the technical scheme, the invention can obtain the following technical effects:
the diving movement quality assessment method based on referee score learning provided by the embodiment of the invention has the following advantages:
1. The video sequence is decoupled into a spatial stream and a temporal stream, and a spatio-temporal feature learning method based on the Transformer encoder learns video features over the complete motion sequence. This differs from the traditional practice of sampling the complete sequence into short video segments and conforms to the specific rules of the motion quality assessment task.
2. The traditional relative score prediction problem is converted into a relative referee score prediction problem, refining the prediction target and thereby alleviating the low classification accuracy and high misclassification cost of traditional methods. The invention further provides a referee score learning method based on the Transformer decoder to represent the differentiated referee scores observed in real-world evaluation, achieving end-to-end evaluation and improving the interpretability and rationality of the evaluation model.
3. In the test stage and in model application, the original frame sequence is sampled with a fixed-interval sampling strategy as model input, unlike the traditional use of the whole video frame sequence, which improves model inference speed to a certain extent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a diving movement quality assessment method based on referee score learning.
Fig. 2 is a process and scoring graph of diving sports.
Fig. 3 is a network structure of a diving movement quality assessment method based on referee score learning.
FIG. 4 is a graph comparing relative referee score predictions with relative score predictions.
Fig. 5 is a schematic diagram of fixed-interval frame sampling.
Fig. 6 is a schematic diagram of off-center frame sampling.
Fig. 7 is a network structure diagram of a space-time feature encoder.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1 to 7, a first embodiment of the present invention provides a diving movement quality assessment method based on referee score learning, which can be executed by a diving movement quality assessment device (hereinafter referred to as "assessment device") based on referee score learning. In particular, the steps S1 to S6 are implemented by one or more processors in the evaluation device.
It is understood that the evaluation device may be an electronic device with computing capabilities, such as a portable notebook computer, a desktop computer, a server, a smart phone, or a tablet computer.
The invention uses real diving competition video data for action quality assessment. The proposed method follows an action quality assessment pipeline based on pairwise contrastive learning: feature representation, construction of a regression network model, and prediction of the action quality score. The method comprises five steps: video data preprocessing, feature extraction, construction of the action quality assessment model based on referee score learning, model training, and action quality score prediction. The overall structure is shown in fig. 3, and each step is described in detail below:
S1, acquiring a video to be evaluated of diving movement.
S2, sampling according to the original video frame sequence of the video to be evaluated to obtain the video frame sequence to be evaluated. Preferably, step S2 is specifically: sampling with a fixed-interval sampling strategy according to the original video frame sequence of the video to be evaluated to obtain the video frame sequence to be evaluated, wherein the video frame sequence to be evaluated comprises a complete motion sequence.
Based on the above embodiment, step S2 during model training in an alternative embodiment of the invention is specifically: sampling the original frame sequence with the fixed-interval and off-center sampling strategies respectively according to the input video frame sequence to generate new samples, so as to expand the training set.
Preferably, in this embodiment, let the given video frame sequence be V = {v_1, v_2, …, v_L}. On the premise of ensuring the integrity of the video action, the invention samples the original video frames during model training by fixed-interval sampling and off-center sampling, so as to expand sample diversity and, to a certain extent, remove the influence of redundant frames on the model.
As shown in fig. 5, the fixed-interval sampling strategy is: for an original video frame sequence containing L frames, sample N frames at a fixed step s = L/N, obtaining s new samples. In fig. 5, L is set to 96 frames and N to 32 frames.
Considering that in a complete diving action sequence the middle part may contain more of the important action frames, the embodiment of the invention also samples the video frames with an off-center strategy. As shown in fig. 6, L and N are set as in fig. 5. The off-center sampling strategy is: divide the complete sequence equally into three parts, taking fewer frames from the front and rear parts and more frames from the middle part; sampling each part on its odd and even frames yields 2 new samples.
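The two training-time sampling strategies can be sketched as follows. The per-part frame budgets of the off-center strategy (a quarter of the sampled frames from each of the front and rear parts, half from the middle) are an assumption chosen to be consistent with the figures' L = 96, N = 32:

```python
def fixed_interval_samples(frames, n=32):
    """Fixed-interval sampling: from L frames take n frames at step
    s = L // n; the s possible start offsets give s new samples."""
    s = len(frames) // n
    return [frames[off::s][:n] for off in range(s)]

def off_center_samples(frames, n=32):
    """Off-center sampling (assumed budgets): split the sequence into three
    equal parts, take n//4 frames from the front and rear parts and n//2
    from the middle; the even-frame and odd-frame variants of each part
    give 2 new samples."""
    third = len(frames) // 3
    parts = [frames[:third], frames[third:2 * third], frames[2 * third:]]
    budgets = [n // 4, n // 2, n // 4]
    samples = []
    for phase in (0, 1):                 # even frames, then odd frames
        sample = []
        for part, b in zip(parts, budgets):
            step = max(1, len(part) // b)
            sample.extend(part[phase::step][:b])
        samples.append(sample)
    return samples
```

With a 96-frame clip and n = 32, fixed-interval sampling produces 3 new 32-frame samples and off-center sampling produces 2.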
S3, selecting a comparison video frame sequence from the training set.
Specifically, the training set contains a plurality of comparison videos, at least some of which show the same motion action as the video to be evaluated. One video is randomly selected from the comparison videos with the same motion action, yielding the comparison video frame sequence.
S4, encoding the video frame sequence to be evaluated and the contrast video frame sequence through a spatial feature encoder ViT respectively to extract spatial features and obtain an image feature sequence to be evaluated and a contrast image feature sequence.
Specifically, given a pair of input videos (X, Y), spatial features are extracted from each video frame sequence by a pre-trained spatial feature encoder ViT. Unlike conventional motion quality assessment methods that take sampled or segmented video clips as inputs, the embodiment of the invention takes a sampled video frame sequence containing the complete motion sequence as the input of the motion quality assessment network, obtaining a more reasonable and more robust video feature representation.
S5, encoding the image feature sequence to be evaluated and the contrast image feature sequence through a time sequence feature encoder TE respectively to extract time sequence features, and obtaining a video level feature sequence to be evaluated and a contrast video level feature sequence. Preferably, step S5 specifically includes steps S51 to S53.
S51, splicing class marks at the heads of the image feature sequence to be evaluated and the contrast image feature sequence respectively to represent video-level features.
S52, embedding position codes in the image feature sequences to be evaluated and the contrast image feature sequences of the splicing class marks respectively so as to keep the relative position information of the feature sequences.
S53, a time sequence feature encoder TE respectively encodes the image feature sequence to be evaluated and the contrast image feature sequence embedded with the position codes so as to extract time sequence features and obtain a video level feature sequence to be evaluated and a contrast video level feature sequence. Wherein the timing characteristic encoder TE is composed of stacked Transformer blocks.
Specifically, before the image feature sequence is input into the temporal feature encoder TE, a position code is embedded to retain the relative position information of the feature sequence, and a class token is spliced to the head of the image feature sequence to represent the video-level feature.

After image spatial feature extraction, the feature sequence carrying temporal information is input to the temporal feature encoder TE; the structure of the spatio-temporal feature encoder is shown in fig. 7. The temporal feature encoder consists of stacked Transformer blocks.
The video feature extraction process may be represented as follows:

f = T(V(X)), f_e = T(V(X_e))

where V(X) and V(X_e) respectively denote the image feature sequences of the input video pair (X, X_e), V denotes the image feature extraction model ViT, f and f_e respectively denote the resulting video-level features, and T denotes the temporal feature encoder TE.
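Steps S51 and S52 above (class-token splicing and position embedding) can be sketched in numpy as follows. The sinusoidal position code and the zero-initialized class token are illustrative assumptions; in the actual model both would be learned or chosen per the implementation.

```python
import numpy as np

def prepend_cls_and_add_positions(features, cls_token):
    """Splice a class token at the head of an image feature sequence and
    embed a position code so the temporal feature encoder TE can use the
    relative order of the frames."""
    t, d = features.shape
    x = np.concatenate([cls_token[None, :], features], axis=0)  # (t + 1, d)
    # Sinusoidal position code: sin on even channels, cos on odd channels.
    pos = np.arange(t + 1)[:, None] / (10000.0 ** (np.arange(d)[None, :] / d))
    pe = np.where(np.arange(d) % 2 == 0, np.sin(pos), np.cos(pos))
    return x + pe

frame_feats = np.zeros((16, 8))   # 16 ViT frame features of dimension 8
cls_token = np.zeros(8)           # video-level class token
encoded = prepend_cls_and_add_positions(frame_feats, cls_token)  # (17, 8)
```

The output sequence is one element longer than the input: position 0 carries the class token, whose final state after the TE serves as the video-level feature.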
S6, inputting the video-level feature sequence to be evaluated and the comparison video-level feature sequence into a pre-trained judge score learning evaluation network to obtain the action quality score of the video to be evaluated. The judge score learning evaluation network uses a cross feature fusion network based on a cross attention mechanism and a contrast action feature decoder based on a Transformer decoder to learn and generate judge score features with differences, and then uses a score prediction network to predict the action quality score according to the judge score features. Preferably, step S6 specifically includes steps S61 to S63.
S61, performing cross feature fusion according to the video-level feature sequence to be evaluated and the comparison video-level feature sequence after being encoded by the time sequence feature encoder TE, and obtaining relative representation of video features. The cross feature fusion takes a comparison video level feature sequence after time sequence coding as a Query in cross attention operation, and takes a video level feature sequence to be evaluated after time sequence coding as Key and Value in cross attention operation, and the Value is iteratively updated to capture the difference between a target video feature sequence and a comparison video feature sequence. And then splicing the cross-compared video-level features to be evaluated and the compared video-level features to obtain the relative representation of the video features.
Specifically, the invention provides a Transformer-based cross feature fusion method. The fusion method is based on a cross attention mechanism: the video pair features obtained through temporal encoding serve as the contrast video feature and the target video feature, respectively. The contrast video feature is used as the Query, the target video feature is used as the Key and Value, and the Value is iteratively updated to capture the action differences between the target video and the contrast video. Finally, the cross-compared target video feature is spliced with the contrast video feature to obtain the relative representation of the video features.
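The cross feature fusion step can be sketched with plain numpy attention. A single head with no learned projection matrices is a simplifying assumption here; the contrast video features supply the Query, the target features supply the Key and Value, and the output is spliced with the contrast features to form the relative representation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_fusion(contrast_feat, target_feat):
    """One cross-attention pass: Query from the contrast video,
    Key/Value from the target (to-be-evaluated) video; the attended
    Value captures the action difference between the two videos, and
    the result is concatenated with the contrast features to form the
    relative representation of the video features."""
    d = contrast_feat.shape[-1]
    attn = softmax(contrast_feat @ target_feat.T / np.sqrt(d))
    attended = attn @ target_feat
    return np.concatenate([attended, contrast_feat], axis=-1)

contrast = np.ones((4, 8))                 # contrast video-level features
target = np.random.default_rng(0).standard_normal((4, 8))
relative = cross_fusion(contrast, target)  # shape (4, 16)
```

Iterating this pass (feeding the attended output back as the next Value) would mirror the iterative Value update described above; a single pass is shown for brevity.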
S62, decoding, by the contrast action analysis decoder, the relative representation of the video features to obtain the referee score features. Wherein the contrast action analysis decoder is composed of stacked Transformer blocks. The relative representation of the video features obtained through cross feature fusion serves as the Key and Value of the multi-head attention mechanism in each Transformer block, and a plurality of learnable Queries are set as the referee score features. The input of each Transformer-block layer is the Query output by the previous layer, the Query is updated iteratively, and the output of the last layer is the generated referee score features.

Specifically, in a real scene, the execution score of a diving athlete is composed of a plurality of referee scores after excluding the highest and lowest scores. Even when different referees evaluate the same action performance, their scores differ slightly in most cases. Accordingly, the present invention proposes a referee score learning method based on the Transformer decoder to learn the feature representations of different referee scores. In this module, a plurality of learnable Queries are first set, and the relative representation of the video features obtained through cross feature fusion serves as the Key and Value in the cross attention calculation. The module is likewise composed of stacked Transformer blocks: the input of each layer is the Query output by the previous layer, the Query is updated iteratively, and the output of the last layer is the generated referee score features.
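A numpy sketch of the referee score learning module follows. The layer count, the dimensions, the referee count of 7, and the omission of feed-forward and normalization sub-layers are simplifying assumptions: K learnable query vectors attend to the fused video representation, and each updated query becomes one referee score feature.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_layer(queries, memory):
    """One Transformer-decoder-style layer: the K referee queries
    cross-attend to the fused video features (Key and Value) and are
    residually updated."""
    d = queries.shape[-1]
    attn = softmax(queries @ memory.T / np.sqrt(d))
    return queries + attn @ memory

def referee_score_features(queries, memory, num_layers=3):
    """Each layer consumes the Query output by the previous layer; the
    last layer's output is the set of referee score features."""
    for _ in range(num_layers):
        queries = decoder_layer(queries, memory)
    return queries

rng = np.random.default_rng(0)
queries = rng.standard_normal((7, 8))    # 7 referees (illustrative count)
memory = rng.standard_normal((16, 8))    # fused relative video representation
features = referee_score_features(queries, memory)   # (7, 8)
```

Because every referee query attends to the same memory but starts from a different learned vector, the K outputs differ slightly, matching the "judge scores with differences" behavior described above.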
S63, according to the referee score features, predicting through the score prediction network using a coarse-to-fine score prediction method to obtain the action quality score of the video to be evaluated.

Preferably, the embodiment of the invention adopts a coarse-to-fine score prediction method: the score prediction network consists of a classifier and a regressor, the classifier is used for predicting the relative referee score interval, and the regressor is used for determining the final predicted relative referee score.
The classifier and regressor use similar structures, each comprising 4 fully connected layers. The node numbers of the classifier's four fully connected layers are 1536, 512, 128 and 11, respectively; those of the regressor are 1536, 512, 128 and 1. It should be noted that the node number of the last classifier layer can be adjusted according to the dataset.
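The coarse-to-fine prediction head can be sketched as two small numpy MLPs with the layer widths stated above. Random weights stand in for trained parameters, and the ReLU activation between hidden layers is an assumption of this sketch.

```python
import numpy as np

def mlp(x, widths, seed):
    """Plain fully connected stack with ReLU between hidden layers;
    random weights stand in for trained parameters."""
    rng = np.random.default_rng(seed)
    for i, (fan_in, fan_out) in enumerate(zip(widths[:-1], widths[1:])):
        x = x @ (rng.standard_normal((fan_in, fan_out)) * 0.02)
        if i < len(widths) - 2:
            x = np.maximum(x, 0.0)   # ReLU on hidden layers only
    return x

feature = np.random.default_rng(42).standard_normal(1536)  # one referee score feature
interval_logits = mlp(feature, [1536, 512, 128, 11], seed=0)  # coarse: which interval
offset = mlp(feature, [1536, 512, 128, 1], seed=1)            # fine: shift inside it
coarse_interval = int(np.argmax(interval_logits))
```

The classifier output selects one of the 11 relative referee score intervals, and the regressor output refines the score within that interval.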
The prediction process is defined as:

p_k = C(q_k), I_k = argmax(p_k), δ_k = R(q_k), Δs_k = l_{I_k} + δ_k · (r_{I_k} − l_{I_k})

where I_k denotes the predicted score interval, argmax(·) denotes the function returning the index of the maximum element, p_k denotes the predicted class probability, C denotes the classifier, q_k is the k-th referee score feature, δ_k is the predicted magnitude shift, R denotes the regressor, Δs_k is the k-th predicted relative referee score, and l_{I_k} and r_{I_k} denote the left and right endpoints of the predicted score interval, respectively.
On the basis of the above embodiment, in an alternative embodiment of the present invention, step S63 specifically includes steps S631 to S634.
S631, predicting a plurality of relative referee score intervals through a classification network according to the referee score characteristics; and predicting the deviation of the scores in the interval through the regression network, and obtaining a plurality of relative referee scores through calculation.
S632, obtaining a plurality of real referee scores of the comparison video frame sequence.
S633, adding the multiple relative referee scores and the multiple real referee scores respectively to obtain multiple predicted referee scores.
S634, obtaining the action quality score according to the plurality of predicted referee scores, wherein the final action quality score is calculated from the predicted referee scores according to the actual scoring rule.
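Steps S631 to S634 can be sketched as follows. The diving-style aggregation rule used here (drop the single highest and lowest referee scores, sum the rest, scale by a difficulty degree) is an illustrative assumption based on the "excluding the highest score and the lowest score" description above; real competition rules may drop more scores.

```python
def predicted_referee_scores(relative_scores, exemplar_true_scores):
    """S633: each predicted referee score is the predicted relative
    referee score plus the corresponding true referee score of the
    comparison (exemplar) video."""
    return [r + t for r, t in zip(relative_scores, exemplar_true_scores)]

def action_quality_score(referee_scores, difficulty):
    """S634 sketch: drop the single highest and lowest referee scores,
    sum the remainder, and scale by the dive's difficulty degree."""
    kept = sorted(referee_scores)[1:-1]
    return sum(kept) * difficulty

relative = [0.5, -0.5, 1.0, 0.0, -1.0]   # predicted relative referee scores
exemplar = [8.0, 8.5, 7.5, 8.0, 9.0]     # true referee scores of the exemplar
preds = predicted_referee_scores(relative, exemplar)
final = action_quality_score(preds, difficulty=2.0)
```

Here `preds` plays the role of the plurality of predicted referee scores, and `final` the action quality score assembled from them.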
Specifically, in a real scene, the quality score of a diving athlete is jointly assessed by a plurality of referees. As shown in FIG. 2, the invention introduces referee scores, refines the quality score to the referee score level, defines the difference between a video pair's referee scores as the relative referee score, and converts the traditional relative quality score prediction problem into a relative referee score prediction problem.
In the prediction phase, the invention redefines the action quality assessment problem as follows:

Δs_k = F(X, X_e), ŝ_k = Δs_k + s_k^e, S = {ŝ_1, ŝ_2, …, ŝ_K}

where Δs_k denotes the predicted relative referee score, F is the prediction model, X is the video frame sequence to be evaluated, X_e is the comparison video frame sequence, ŝ_k denotes the k-th referee predicted score of the video frame sequence X to be evaluated, s_k^e is the k-th true referee score of the comparison video frame sequence X_e, S is the action quality score, composed of the K predicted referee scores, and K is the number of referees.
According to the invention, before model training, the relative referee scores of all possible training sample pairs are first counted according to the data distribution, and the relative referee score range (−10 to 10) is divided into a plurality of score intervals, taking the relative balance of the sample number in each interval as the division basis.
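The balance-based interval division can be sketched as an empirical-quantile split of the observed relative referee scores. The number of intervals and the uniform toy distribution below are illustrative assumptions; real edges come from the training-pair statistics described above.

```python
def balanced_intervals(relative_scores, num_bins):
    """Divide the relative referee score range into num_bins intervals
    so that each interval holds a roughly equal number of training
    samples (an empirical-quantile split)."""
    s = sorted(relative_scores)
    n = len(s)
    edges = [s[0]]
    for i in range(1, num_bins):
        edges.append(s[(i * n) // num_bins])
    edges.append(s[-1])
    return list(zip(edges[:-1], edges[1:]))

# Illustrative relative scores spanning [-10, 10].
scores = [(-10 + 20 * i / 99) for i in range(100)]
bins = balanced_intervals(scores, num_bins=5)
```

With a skewed score distribution the resulting intervals have unequal widths but near-equal sample counts, which is the stated division basis.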
Based on the above motion quality assessment model and model training objectives, the loss function may be defined as follows:

L_cls = −Σ_{c=1}^{C} y_c · log(p_c), L_reg = (δ − δ̂)², L = L_cls + L_reg

where C is the number of categories, y_c is the true category probability, δ̂ is the numerical offset label, L_cls is the classification loss, L_reg is the regression loss, and L is the total training loss.
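The combined training loss can be sketched in plain Python. The one-hot ground-truth interval label, the squared-error form of the offset term, and the equal weighting of the two terms are assumptions of this sketch.

```python
import math

def classification_loss(class_probs, true_class):
    """Cross-entropy against a one-hot ground-truth interval label."""
    return -math.log(class_probs[true_class])

def regression_loss(pred_offset, true_offset):
    """Squared error between predicted and ground-truth offsets."""
    return (pred_offset - true_offset) ** 2

def total_loss(class_probs, true_class, pred_offset, true_offset):
    """Total training loss: classification loss plus regression loss."""
    return (classification_loss(class_probs, true_class)
            + regression_loss(pred_offset, true_offset))

probs = [0.1, 0.7, 0.2]   # predicted interval probabilities (3 classes)
loss = total_loss(probs, true_class=1, pred_offset=0.4, true_offset=0.5)
```

Both terms shrink together during training: the classifier learns the correct interval while the regressor learns the offset inside it.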
According to the embodiment of the invention, quality score regression is performed by the trained referee-score-learning-based diving movement quality assessment model to obtain the action quality score. Aiming at the defects of existing action quality assessment methods, the invention designs a more reasonable and accurate action quality assessment model based on deep learning combined with the special rules of the diving sport scenario, so as to solve the following problems:
First, conventional motion quality assessment methods begin by extracting video features, and most use a 3D CNN as the feature extraction backbone to extract video features with temporal information. Limited by the fixed-length input requirement of 3D CNN feature learning, these methods first sample fixed-length video clips from the original video as inputs to the 3D CNN, and then train a regression network to predict the athlete's final score. Because the clip sampling strategy splits the complete action sequence into multiple clips, features extracted from the shorter clips ignore complete and coherent action timing information.
Therefore, the invention uses a feature extraction method based on a spatio-temporal feature encoder to decouple the input video into a spatial stream and a temporal stream: the spatial feature encoder first extracts the image feature of each video frame in the complete video, and the temporal feature encoder then extracts the timing information in the spatial feature sequence, so as to obtain higher-level and more stable video features.
Second, in the field of motion quality assessment, quality assessment methods based on pairwise contrast learning exhibit competitive performance. Pairwise contrast learning aims at capturing action differences between random video pairs, but most current quality assessment methods based on pairwise contrast learning mine differences between feature sequences through feature splicing or convolutional-neural-network-based feature fusion. The Transformer can analyze the association relationships within feature sequences well through its attention mechanism, but due to the characteristics of the action quality assessment task, the application of the Transformer in this task is still limited.

Therefore, the invention proposes a Transformer-based cross feature fusion network, in which the contrast video feature is used as the Query, the target video feature is used as the Key and Value in the cross attention calculation, and the Value is iteratively updated to obtain enhanced features carrying difference information.
Thirdly, on the basis of the diving scoring rules, the invention refines the quality score to the referee score level and converts the traditional relative score prediction problem into a relative referee score prediction problem. A contrast action feature decoder based on the Transformer decoder is designed to learn to generate referee score features with differences. On this basis, coarse-to-fine score prediction alleviates the low classification accuracy and high classification error cost of traditional related methods.
It should be noted that the embodiment of the invention provides a diving movement quality assessment method based on referee score learning (Learning Referee Evaluation for Action Quality Assessment in Diving Sport, abbreviated LRE-AQA). The method in the prior art most similar to the diving movement quality assessment method of the present invention is the Group-aware Contrastive Regression (CoRe) method.
The diving movement quality assessment method and the CoRe method of the embodiment of the invention both introduce a pairwise contrast learning strategy. The CoRe method takes two comparison videos as input, extracts video features by using an I3D feature extraction network, designs a group perception regression tree to capture the difference of actions between the two videos, converts an action quality score regression task into an action quality score classification and regression task through score interval division, and finally evaluates to obtain the relative scores of the two videos.
The diving movement quality assessment method of the embodiment of the invention is different from the CoRe method in the following steps:
1. The invention decouples the input video into a spatial stream and a temporal stream and sequentially obtains video-level features through a spatial feature encoder and a temporal feature encoder. The CoRe method employs an I3D video feature extraction network that uses a sliding window to sample the complete video into multiple clips and extract clip-level video features. Unlike action recognition, the action quality assessment task requires evaluating a complete action sequence, and the clip sampling strategy ignores this particular rule.
2. The CoRe method improves evaluation accuracy by reformulating the relative score regression problem as a relative score classification-and-regression problem, but suffers from low classification accuracy and high classification error cost. The invention refines the relative score in this strategy by introducing referee scores, defining the difference between two athletes' referee scores as the relative referee score, and thereby converting the relative score prediction problem into a relative referee score prediction problem; the difference is shown in figure 2.
3. The CoRe method designs a group-aware regression tree based on a convolutional neural network to capture contrast video action differences. The invention designs a Transformer-based cross feature fusion network to deeply mine video action differences through association relationships in the feature space.
The diving movement quality assessment method based on referee score learning has the following advantages.
1. The video sequence is decoupled into a spatial stream and a temporal stream, and a Transformer-encoder-based spatio-temporal feature learning method learns the video features of the complete action sequence. This differs from the traditional practice of sampling the complete sequence into video clips and conforms to the specific rule of the action quality assessment task. The visualization results confirm that the model can extract accurate and rich video features.
2. The traditional relative score prediction problem is converted into a relative referee score prediction problem, refining the prediction target and thereby alleviating the low classification accuracy and high classification error cost of traditional methods. The invention also provides a Transformer-decoder-based referee score learning method to represent the differing referee scores of a real-world evaluation, realizing end-to-end evaluation and improving the interpretability and rationality of the evaluation model.
3. In the test stage and in model application, the original frame sequence is sampled with the fixed-interval sampling strategy as the model input, unlike traditional methods that use the whole video frame sequence, which improves model inference speed to a certain extent.
The second embodiment of the invention provides a diving movement quality assessment device based on referee score learning, which comprises an initial video acquisition module, a sampling module, a comparison video acquisition module, a space coding module, a time coding module and a scoring module.
And the initial video acquisition module is used for acquiring the video to be evaluated of the diving movement.
And the sampling module is used for sampling according to the original video frame sequence of the video to be evaluated to obtain the video frame sequence to be evaluated.
And the contrast video acquisition module is used for selecting a contrast video frame sequence from the training set.
And the spatial coding module is used for respectively coding the video frame sequence to be evaluated and the contrast video frame sequence through a spatial feature coder ViT so as to extract spatial features and obtain an image feature sequence to be evaluated and a contrast image feature sequence.
And the time coding module is used for respectively coding the image characteristic sequence to be evaluated and the contrast image characteristic sequence through a time sequence characteristic coder TE so as to extract time sequence characteristics and obtain a video level characteristic sequence to be evaluated and a contrast video level characteristic sequence.
And the scoring module is used for inputting the video-level feature sequence to be evaluated and the comparison video-level feature sequence into a pre-trained judge score learning evaluation network to obtain the action quality score of the video to be evaluated. The judge score learning evaluation network uses a cross feature fusion network based on a cross attention mechanism and a contrast action feature decoder based on a Transformer decoder to learn and generate judge score features with differences, and then uses a score prediction network to predict the action quality score according to the judge score features.
In an alternative embodiment of the present invention based on the above embodiment, the sampling module is specifically configured to: and sampling by adopting a fixed-interval sampling strategy according to the original video frame sequence of the video to be evaluated, and obtaining the video frame sequence to be evaluated. Wherein the video frame sequence to be evaluated comprises a complete motion sequence.
In an alternative embodiment, during model training the sampling module is specifically configured to: sample the original frame sequence with the fixed-interval and off-center sampling strategies respectively according to the input video frame sequence to generate new samples, so as to expand the training set.
Preferably, the fixed-interval sampling strategy is: for an original video frame sequence containing M frames, N frames are sampled from it with a sampling step of L, obtaining L new samples.

Preferably, the off-center sampling strategy is: the complete sequence is divided equally into three parts, with the front and rear parts each sampling N/4 frames and the middle part sampling N/2 frames; each part is then sampled at its odd and even frame positions, obtaining 2 new samples.
In an alternative embodiment of the present invention, the time encoding module specifically includes a splicing unit, an embedding unit, and a timing feature extraction unit.
And the splicing unit is used for respectively splicing category marks at the heads of the image feature sequence to be evaluated and the contrast image feature sequence so as to represent video-level features.
And the embedding unit is used for respectively embedding position codes in the image feature sequences to be evaluated and the comparison image feature sequences of the splicing class marks so as to keep the relative position information of the feature sequences.
The time sequence feature extraction unit is used for respectively encoding, by the temporal feature encoder TE, the image feature sequence to be evaluated and the contrast image feature sequence embedded with position codes, so as to extract temporal features and obtain the video-level feature sequence to be evaluated and the comparison video-level feature sequence. Wherein the temporal feature encoder TE is composed of stacked Transformer blocks.
In an alternative embodiment of the present invention, the scoring module specifically includes a contrast fusion unit, a transform decoding unit, and a score prediction unit.
And the comparison and fusion unit is used for carrying out cross feature fusion on the video-level feature sequence to be evaluated and the comparison video-level feature sequence after being encoded by the time sequence encoder TE, and obtaining the relative representation of the video features. The cross feature fusion takes a comparison video level feature sequence after time sequence coding as a Query in cross attention operation, and takes a video level feature sequence to be evaluated after time sequence coding as Key and Value in cross attention operation, and the Value is iteratively updated to capture the difference between a target video feature sequence and a comparison video feature sequence. And then splicing the cross-compared video-level features to be evaluated and the compared video-level features to obtain the relative representation of the video features.
And the Transformer decoding unit is used for obtaining the referee score features by decoding, through the contrast action analysis decoder, the relative representation of the video features. Wherein the contrast action analysis decoder is composed of stacked Transformer blocks. The relative representation of the video features obtained through cross feature fusion serves as the Key and Value of the multi-head attention mechanism in each Transformer block, and a plurality of learnable Queries are set as the referee score features. The input of each Transformer-block layer is the Query output by the previous layer, the Query is updated iteratively, and the output of the last layer is the generated referee score features.
And the score prediction unit is used for predicting by adopting a score prediction method from coarse to fine through a score prediction network according to the judge score characteristics to obtain the action quality score of the video to be evaluated.
In an alternative embodiment of the present invention, the score prediction unit specifically includes a relative referee score prediction subunit, a true referee score acquisition unit, a predicted referee score calculation unit, and an action quality score acquisition unit.
A relative referee score predicting subunit, configured to predict a plurality of relative referee score intervals through a classification network according to the referee score feature; and predicting the deviation of the scores in the interval through the regression network, and obtaining a plurality of relative referee scores through calculation.
And the real referee score acquisition unit is used for acquiring a plurality of real referee scores of the comparison video frame sequence.
And the prediction referee score calculation unit is used for adding the multiple relative referee scores and the multiple real referee scores respectively to obtain multiple prediction referee scores.
An action quality score obtaining unit configured to obtain the action quality score according to the plurality of predicted referee scores, wherein the final action quality score is calculated from the predicted referee scores according to the actual scoring rule.
In an alternative embodiment, the score prediction network is composed of a classifier for predicting the relative referee score interval and a regressor for determining the final predicted relative referee score. The classifier comprises 4 fully connected layers with node numbers 1536, 512, 128 and 11, respectively. The regressor comprises 4 fully connected layers with node numbers 1536, 512, 128 and 1, respectively.
The prediction process is defined as:

p_k = C(q_k), I_k = argmax(p_k), δ_k = R(q_k), Δs_k = l_{I_k} + δ_k · (r_{I_k} − l_{I_k})

where I_k denotes the predicted score interval, C denotes the classifier, q_k is the k-th referee score feature, δ_k is the predicted magnitude shift, R denotes the regressor, Δs_k is the k-th predicted relative referee score, and l_{I_k} and r_{I_k} denote the left and right endpoints of the predicted score interval, respectively.
A third embodiment of the present invention provides a diving movement quality assessment device based on referee score learning, comprising a processor, a memory, and a computer program stored in the memory. The computer program is executable by the processor to implement the diving movement quality assessment method based on referee score learning described in any of the embodiments.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the part of the technical solution of the present invention that in essence contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, an electronic device, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein is merely one relationship describing the association of the associated objects, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if determined" or "if (a stated condition or event) is detected" may be interpreted as "when determined" or "in response to determination" or "when (the stated condition or event) is detected" or "in response to detection of (the stated condition or event)", depending on the context.
References to "first/second" in the embodiments merely distinguish similar objects and do not represent a particular ordering of the objects; it should be understood that "first/second" may be interchanged in a particular order or precedence where allowed, so that the embodiments described herein can be implemented in sequences other than those illustrated or described herein.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A diving movement quality assessment method based on referee score learning, comprising:
acquiring a video to be evaluated of diving movement;
Sampling according to the original video frame sequence of the video to be evaluated to obtain the video frame sequence to be evaluated;
selecting a contrast video frame sequence from the training set;
Encoding the video frame sequence to be evaluated and the contrast video frame sequence through a spatial feature encoder ViT respectively to extract spatial features and obtain an image feature sequence to be evaluated and a contrast image feature sequence;
encoding the image feature sequence to be evaluated and the contrast image feature sequence through a time sequence feature encoder TE respectively so as to extract time sequence features and obtain a video level feature sequence to be evaluated and a contrast video level feature sequence;
Inputting the video-level feature sequence to be evaluated and the comparison video-level feature sequence into a preselected and trained judge score learning evaluation network to obtain action quality scores of the video to be evaluated; the judge score learning evaluation network learns and generates judge score features with differences by using a cross feature fusion network based on a cross attention mechanism and a contrast action feature decoder based on a Transformer decoder, and then predicts the action quality score according to the judge score features by using a score prediction network.
2. The diving movement quality assessment method based on referee score learning according to claim 1, wherein sampling according to the original video frame sequence of the video to be evaluated to obtain the video frame sequence to be evaluated specifically comprises: sampling the original video frame sequence of the video to be evaluated with a fixed-interval sampling strategy to obtain the video frame sequence to be evaluated; wherein the video frame sequence to be evaluated contains a complete action sequence;
during model training, the original frame sequence of each input video is sampled with both the fixed-interval sampling strategy and an off-center sampling strategy to generate new samples and thereby expand the training set;
the fixed-interval sampling strategy is: for an original video frame sequence containing N frames, sample T frames from it with a sampling step of s, obtaining s new samples;
the off-center sampling strategy is: divide the complete sequence into three parts, sampling a frames from each of the front and rear parts and b frames from the middle part; sampling each part at its odd and even frames then yields 2 new samples.
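The two data-augmentation strategies of claim 2 can be sketched as follows. This is an illustrative sketch only: the symbols N, T, s and the example split sizes (96 frames, 24/48/24 split) are assumptions, since the claim's original formula symbols are not recoverable here.

```python
# Sketch of the fixed-interval and off-center sampling strategies (claim 2).
# Frame counts and split sizes are illustrative assumptions.

def fixed_interval_samples(frames, T):
    """For a sequence of N frames, sample T frames at step s = N // T.
    Shifting the start offset from 0 to s-1 yields s new samples."""
    N = len(frames)
    s = N // T
    return [frames[offset::s][:T] for offset in range(s)]

def off_center_samples(frames, front, middle):
    """Split the sequence into front / middle / rear parts, then take the
    even-indexed and odd-indexed frames of each part: 2 new samples."""
    rear = front  # front and rear parts sample the same number of frames
    a = frames[:front]
    b = frames[front:front + middle]
    c = frames[front + middle:front + middle + rear]
    even = a[0::2] + b[0::2] + c[0::2]
    odd = a[1::2] + b[1::2] + c[1::2]
    return [even, odd]

frames = list(range(96))
interval_samples = fixed_interval_samples(frames, 48)  # s = 2 new samples
even_half, odd_half = off_center_samples(frames, 24, 48)
```

With 96 frames and T = 48 the step is s = 2, so the fixed-interval strategy produces 2 shifted samples; the off-center strategy always produces exactly 2.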
3. The diving movement quality assessment method based on referee score learning according to claim 1, wherein encoding the image feature sequence to be evaluated and the contrast image feature sequence respectively through the temporal feature encoder TE to extract temporal features and obtain the video-level feature sequence to be evaluated and the contrast video-level feature sequence specifically comprises:
prepending a class token to the head of the image feature sequence to be evaluated and of the contrast image feature sequence respectively, to represent the video-level feature;
embedding a position encoding into each class-token-prepended image feature sequence, to retain the relative position information of the feature sequence;
encoding the position-encoded image feature sequence to be evaluated and the position-encoded contrast image feature sequence respectively through the temporal feature encoder TE to extract temporal features and obtain the video-level feature sequence to be evaluated and the contrast video-level feature sequence; wherein the temporal feature encoder TE consists of stacked Transformer blocks.
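The class-token and position-encoding steps of claim 3 can be sketched in numpy as below. The sequence length (9) and feature width (768), the sinusoidal form of the position encoding, and the identity stand-in for the stacked Transformer blocks are all illustrative assumptions.

```python
# Minimal sketch of the temporal-encoding step (claim 3): prepend a class
# token, add a position encoding, run the encoder, read off the class-token
# row as the video-level feature. The encoder is stubbed with identity.
import numpy as np

def sinusoidal_pos_encoding(length, d):
    """Standard sinusoidal position encoding (an assumed choice)."""
    pos = np.arange(length)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def encode_video_level(image_feats, cls_token, encoder):
    # image_feats: (T, d); cls_token: (1, d)
    x = np.concatenate([cls_token, image_feats], axis=0)       # prepend class token
    x = x + sinusoidal_pos_encoding(x.shape[0], x.shape[1])    # keep relative positions
    y = encoder(x)        # stacked Transformer blocks (TE), stubbed below
    return y[0]           # class-token output = video-level feature

identity_encoder = lambda x: x          # stand-in for the real TE
feats = np.zeros((9, 768))              # 9 hypothetical clip features
video_feat = encode_video_level(feats, np.ones((1, 768)), identity_encoder)
```

The same routine is applied to both the sequence under evaluation and the contrast sequence, giving one video-level feature vector each.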
4. The diving movement quality assessment method based on referee score learning according to claim 1, wherein inputting the video-level feature sequence to be evaluated and the contrast video-level feature sequence into the pre-selected and trained referee score learning evaluation network to obtain the action quality score of the video to be evaluated specifically comprises:
performing cross feature fusion on the video-level feature sequence to be evaluated and the contrast video-level feature sequence encoded by the temporal feature encoder TE to obtain a relative representation of the video features; the cross feature fusion takes the temporally encoded contrast video-level feature sequence as the Query of the cross-attention operation and the temporally encoded video-level feature sequence to be evaluated as the Key and Value, updating iteratively to capture the difference between the target video feature sequence and the contrast video feature sequence; the cross-attended video-level features to be evaluated and the contrast video-level features are then concatenated to obtain the relative representation of the video features;
decoding the relative representation of the video features with a contrast action analysis decoder to obtain referee score features; wherein the contrast action analysis decoder consists of stacked Transformer blocks; the relative representation obtained through cross feature fusion serves as the Key and Value of the multi-head attention mechanism in each Transformer block, and a plurality of learnable Queries are set as the referee score features; the input of each Transformer block layer is the Query output by the previous layer, the Query is updated iteratively, and the output of the last layer is the generated referee score features;
predicting with the score prediction network from the referee score features using a coarse-to-fine score prediction method to obtain the action quality score of the video to be evaluated.
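The cross feature fusion of claim 4 can be sketched in numpy as a single-head scaled dot-product cross-attention followed by concatenation. Projection matrices, multi-head splitting, and the decoder's learnable queries are omitted; the feature width 768 (concatenating to 1536, matching the classifier input width in claim 6) is an illustrative assumption.

```python
# Sketch of cross feature fusion (claim 4): contrast features are the Query,
# the evaluated video's features are Key and Value; the attended result is
# concatenated with the evaluated features into a relative representation.
import numpy as np

def cross_attention(query, key, value):
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ value

def relative_representation(eval_feats, contrast_feats):
    # contrast features query the evaluated video's features
    attended = cross_attention(contrast_feats, eval_feats, eval_feats)
    return np.concatenate([eval_feats, attended], axis=-1)

f_eval = np.random.default_rng(0).normal(size=(1, 768))
f_contrast = np.random.default_rng(1).normal(size=(1, 768))
rel = relative_representation(f_eval, f_contrast)    # shape (1, 1536)
```

In the full network this relative representation then serves as Key and Value for the contrast action analysis decoder, whose learnable Queries become the referee score features.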
5. The diving movement quality assessment method based on referee score learning according to claim 4, wherein predicting with the score prediction network from the referee score features using the coarse-to-fine score prediction method specifically comprises:
predicting a plurality of relative referee score intervals with a classification network according to the referee score features; predicting the score offset within each interval with a regression network, and computing a plurality of relative referee scores;
obtaining a plurality of real referee scores of the contrast video frame sequence;
adding the plurality of relative referee scores to the plurality of real referee scores respectively to obtain a plurality of predicted referee scores;
obtaining the action quality score according to the plurality of predicted referee scores, wherein the final action quality score is calculated from the predicted referee scores according to the actual scoring rule.
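The last step of claim 5 can be sketched under the standard competitive-diving scoring rule; treating that rule as the claim's "actual scoring rule" is an assumption, as are the seven-judge panel and the example scores below. With seven judge scores, the two highest and two lowest are dropped and the remaining three are summed and multiplied by the dive's degree of difficulty.

```python
# Sketch of composing the final action quality score from the predicted
# referee scores (claim 5), assuming the standard 7-judge diving rule.

def final_action_quality_score(referee_scores, difficulty):
    assert len(referee_scores) == 7
    kept = sorted(referee_scores)[2:-2]      # drop 2 lowest and 2 highest
    return sum(kept) * difficulty

predicted = [7.0, 7.5, 7.5, 8.0, 8.0, 8.5, 9.0]   # hypothetical predictions
score = final_action_quality_score(predicted, difficulty=3.0)
```

For the example panel above the kept scores are 7.5, 8.0, 8.0, so the final score is 23.5 × 3.0 = 70.5.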
6. The method according to claim 4, wherein the score prediction network comprises a classifier for predicting the relative referee score interval and a regressor for determining the final predicted relative referee score; the classifier comprises 4 fully connected layers with 1536, 512, 128 and 11 nodes respectively; the regressor comprises 4 fully connected layers with 1536, 512, 128 and 1 node respectively;
the prediction process is defined as:
I_i = C(f_i), δ_i = R(f_i), s_i = l(I_i) + δ_i · (r(I_i) − l(I_i))
where I_i denotes the predicted score interval, C the classifier, f_i the i-th referee score feature, δ_i the predicted magnitude offset, R the regressor, s_i the i-th predicted relative referee score, and l(I_i) and r(I_i) the left and right endpoints of the predicted score interval.
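The coarse-to-fine predictor of claim 6 can be sketched as below. The layer widths (1536-512-128-11 and 1536-512-128-1) follow the claim; the relative-score range [-10, 10] split into 11 equal bins, the sigmoid on the regressor output, and the random stand-in weights are all assumptions.

```python
# Sketch of the coarse-to-fine score prediction (claim 6): a classifier
# picks one of 11 relative-score intervals (coarse), a regressor predicts
# an offset delta inside it (fine); s = l(I) + delta * (r(I) - l(I)).
import numpy as np

def mlp(x, dims, rng):
    """4-layer fully connected stack; ReLU on hidden layers, linear output.
    Random weights stand in for the trained parameters."""
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        w = rng.normal(scale=0.02, size=(d_in, d_out))
        x = x @ w if d_out == dims[-1] else np.maximum(x @ w, 0)
    return x

def predict_relative_score(feature, rng, lo=-10.0, hi=10.0, bins=11):
    logits = mlp(feature, [1536, 512, 128, 11], rng)
    interval = int(np.argmax(logits))                    # coarse: interval index
    raw = mlp(feature, [1536, 512, 128, 1], rng)[0]
    delta = 1.0 / (1.0 + np.exp(-raw))                   # offset in (0, 1)
    width = (hi - lo) / bins
    l_i = lo + interval * width                          # left endpoint l(I)
    return l_i + delta * width                           # s = l + delta * (r - l)

rng = np.random.default_rng(42)
rel_score = predict_relative_score(np.ones(1536), rng)   # in (-10, 10)
```

Adding this relative score to a real referee score of the contrast video (claim 5) yields one predicted referee score.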
7. A diving movement quality assessment device based on referee score learning, comprising:
the initial video acquisition module is used for acquiring videos to be evaluated of diving motions;
The sampling module is used for sampling according to the original video frame sequence of the video to be evaluated to obtain the video frame sequence to be evaluated;
the contrast video acquisition module is used for selecting a contrast video frame sequence from the training set;
The spatial coding module is used for respectively coding the video frame sequence to be evaluated and the contrast video frame sequence through a spatial feature coder ViT so as to extract spatial features and obtain an image feature sequence to be evaluated and a contrast image feature sequence;
The temporal encoding module is used for encoding the image feature sequence to be evaluated and the contrast image feature sequence respectively through a temporal feature encoder TE to extract temporal features and obtain the video-level feature sequence to be evaluated and the contrast video-level feature sequence;
The scoring module is used for inputting the video-level feature sequence to be evaluated and the contrast video-level feature sequence into a pre-selected and trained referee score learning evaluation network to obtain the action quality score of the video to be evaluated; wherein the referee score learning evaluation network learns and generates differentiated referee score features using a cross feature fusion network based on a cross-attention mechanism and a contrast action feature decoder based on a Transformer decoder, and then predicts the action quality score from the referee score features using a score prediction network.
8. A diving sport quality assessment device based on referee score learning, characterized by comprising a processor, a memory, and a computer program stored in the memory; the computer program is executable by the processor to implement a diving sport quality assessment method based on referee score learning as claimed in any one of claims 1 to 6.
CN202410411050.2A 2024-04-08 2024-04-08 Diving movement quality assessment method, device and equipment based on judge score learning Active CN118015708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410411050.2A CN118015708B (en) 2024-04-08 2024-04-08 Diving movement quality assessment method, device and equipment based on judge score learning


Publications (2)

Publication Number Publication Date
CN118015708A true CN118015708A (en) 2024-05-10
CN118015708B CN118015708B (en) 2024-07-02

Family

ID=90947379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410411050.2A Active CN118015708B (en) 2024-04-08 2024-04-08 Diving movement quality assessment method, device and equipment based on judge score learning

Country Status (1)

Country Link
CN (1) CN118015708B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170255832A1 (en) * 2016-03-02 2017-09-07 Mitsubishi Electric Research Laboratories, Inc. Method and System for Detecting Actions in Videos
CN109045664A (en) * 2018-09-05 2018-12-21 山东大学 Diving scoring method, server and system based on deep learning
US20210322856A1 (en) * 2018-09-14 2021-10-21 Mirrorar Llc Systems and methods for assessing balance and form during body movement
CN113920584A (en) * 2021-10-15 2022-01-11 东南大学 Action quality evaluation method based on time perception feature learning
US20220292280A1 (en) * 2021-03-11 2022-09-15 Kemtai Ltd. Evaluating movements of a person
CN116259108A (en) * 2023-02-20 2023-06-13 光控特斯联(重庆)信息技术有限公司 Action quality assessment method and device and action quality assessment model training method
CN116510272A (en) * 2023-05-06 2023-08-01 北京奥邦菲特科技有限公司 Physical stamina event action quality assessment method based on mark distribution learning
CN117078976A (en) * 2023-10-16 2023-11-17 华南师范大学 Action scoring method, action scoring device, computer equipment and storage medium
CN117636467A (en) * 2023-11-29 2024-03-01 中国联合网络通信集团有限公司 Action quality assessment method and device, electronic equipment and storage medium
CN117671787A (en) * 2023-07-26 2024-03-08 安徽大学 Rehabilitation action evaluation method based on transducer

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FENG CHEN: "An efficient action quality assessment algorithm for physical education teaching", Fujian Computer, vol. 40, no. 1, 31 January 2024 (2024-01-31), pages 27-32 *
LIU TIANLIANG; QIAO QINGWEI; WAN JUNWEI; DAI XIUBIN; LUO JIEBO: "Human action recognition fusing spatial-temporal dual-network streams and visual attention", Journal of Electronics & Information Technology, vol. 40, no. 10, 15 August 2018 (2018-08-15), pages 2396-2401 *
ZHANG HONGBO et al.: "A survey of action quality assessment methods in video understanding", Computer Science, vol. 49, no. 7, 31 July 2022 (2022-07-31), pages 79-88 *

Also Published As

Publication number Publication date
CN118015708B (en) 2024-07-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant