CN114677765A - Interactive video motion comprehensive identification and evaluation system and method - Google Patents

Interactive video motion comprehensive identification and evaluation system and method

Info

Publication number
CN114677765A
Authority
CN
China
Prior art keywords
action
component
video
algorithm
recognition
Prior art date
Legal status
Pending
Application number
CN202210448232.8A
Other languages
Chinese (zh)
Inventor
罗明宇
易秋晨
Current Assignee
Dongyun Ruilian Wuhan Computing Technology Co ltd
Original Assignee
Dongyun Ruilian Wuhan Computing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Dongyun Ruilian Wuhan Computing Technology Co ltd filed Critical Dongyun Ruilian Wuhan Computing Technology Co ltd
Priority to CN202210448232.8A
Publication of CN114677765A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an interactive video motion comprehensive identification and evaluation system and method. Compared with the human-skeleton modeling algorithms of the prior art, the invention expands the scope of action description: it can model actions performed by a single person, by interaction between a person and an object, and by interaction among multiple persons. It provides a rich system of comprehensive evaluation indexes and forms a general solution for video action recognition and evaluation that can describe the difference between an action to be evaluated and the standard on multiple qualitative and quantitative levels, and that can incorporate the latest data, data processing techniques and state-of-the-art algorithm models as the technology develops, so as to achieve the best action recognition and evaluation effect.

Description

Interactive video motion comprehensive identification and evaluation system and method
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to an interactive video motion comprehensive identification and evaluation system and method.
Background
Recognition and evaluation of video actions are widely applied in industries such as security and sports. Typically, a static or moving camera collects video of people moving in a specific scene; artificial-intelligence algorithms then recognize and analyze limb angles, action categories and the like, and compare them with a set standard to judge whether an abnormal situation exists or to evaluate how well an action conforms to the norm.
In the prior art, as regards algorithms for video action recognition and evaluation, researchers have proposed a number of expert rules and artificial-intelligence algorithms for specific application scenarios.
One existing video action recognition method uses a Kinect sensor to collect the coordinates of human joint points, calculates the angles between joints with a series of manually designed expert rules, and compares the calculated angles with preset standard values to judge whether a sit-up is performed correctly.
Another approach is an artificial-intelligence-based action recognition method that mainly uses a classic human skeleton sequence feature classification model consisting of three sub-models, responsible respectively for single-frame human skeleton feature extraction, temporal feature encoding and action classification.
A third video action recognition method considers the features of the target object and of the background area at the same time: the influence value of the background area is determined through a series of steps and fused with the target-object features before action classification. Since human actions are often related to the background environment, this method extends video action recognition to information beyond the human body and uses the richer features to improve the recognition accuracy of the algorithm.
However, the definition of action in the above three prior-art techniques is still one-sided. They mostly consider only actions reflected by the human skeleton itself and its temporal changes (e.g. walking, running, sit-ups), or use loosely defined appearance cues (such as the background-area influence value mentioned above) merely to improve recognition accuracy. A broader definition of action includes not only actions reflected by the shape and changes of the human skeleton, but also actions reflected by local shapes and changes unrelated to the skeleton (such as facial expression and gaze direction), actions reflected by joint changes of the human body and external things (such as drinking water or raising a flag, whose meaning is uncertain if external objects such as the cup or the flag are ignored), and actions involving multiple people (such as fighting). Such scenes are very common in fields like sports and security, but the existing methods cannot handle them. In addition, existing action evaluation methods generally give only a similarity score or similarity level based on expert rules or an overall similarity calculation; the result is vague and cannot reveal detailed differences or directions for correction (such as action delay).
These defects mean that existing methods can only be elaborately designed for specific tasks and generalize poorly, so they cannot serve as a general solution for video action recognition and evaluation. Moreover, as business data keep accumulating and algorithm research keeps advancing, model accuracy can be improved and action recognition requirements may expand or change, which the existing methods cannot accommodate well.
The above is only for the purpose of assisting understanding of the technical solution of the present invention, and does not represent an admission that the above is the prior art.
Disclosure of Invention
In order to solve the above technical problems mentioned in the background art, the present invention provides an interactive video motion comprehensive identification and evaluation system and method, wherein the system comprises:
the system includes a data collection component 100, a data annotation component 200, an action recognition component 300, and an action evaluation component 400:
the data acquisition component 100 is used for acquiring original video data;
the action recognition component 300 is used for receiving a request of a user for adding an action recognition algorithm model component and adding the action recognition algorithm model component into an action recognition algorithm model library;
the action evaluation component 400 is used for receiving a request of a user for adding an action evaluation algorithm model component and adding the action evaluation algorithm model component into an action evaluation algorithm model library;
the action recognition component 300 is used for receiving the action recognition categories and the algorithm model combination configuration set by a user to form a video action comprehensive recognition method; for the algorithms that the action recognition component needs to train, it entrusts the data annotation component 200 to start the data annotation service of the corresponding task, so that the user can perform data annotation through an interactive interface to generate first annotation data; the action recognition component 300 uses the first annotation data to train the algorithms to be trained in the video action comprehensive recognition method, and obtains and stores the corresponding recognition models;
the action evaluation component 400 receives the action evaluation indexes and the algorithm model combination configuration set by the user to form a video action comprehensive evaluation method; for the algorithms that the action evaluation component 400 needs to train, it entrusts the data annotation component 200 to start the data annotation service of the corresponding task, so that the user can perform data annotation through an interactive interface to generate second annotation data; the action evaluation component 400 uses the second annotation data to train the algorithms to be trained in the video action comprehensive evaluation method, and obtains and stores the corresponding evaluation models;
the motion recognition component 300 is configured to perform inference on the video data acquired by the data acquisition component in real time based on the video motion comprehensive recognition method, and output a video motion comprehensive feature recognition result;
the action evaluation component 400 executes reasoning on the video data collected by the data collection component in real time and the recognition result of the video action comprehensive characteristic output by the action recognition component based on the video action comprehensive evaluation method, and outputs a video action comprehensive evaluation result.
Accordingly, the video motion comprehensive feature recognition result output by the motion recognition component 300 includes at least features of the human body itself and features of external objects whose changes are correlated with those of the human body.
In addition, in order to achieve the above object, the present invention further provides an interactive video motion comprehensive identification and evaluation method, including the following steps:
calling a data acquisition component to acquire original video data;
the action recognition component receives a user's request to add an action recognition algorithm model component, and adds the component to the action recognition algorithm model library;
the action evaluation component receives a user's request to add an action evaluation algorithm model component, and adds the component to the action evaluation algorithm model library;
the action recognition component receives the action recognition categories and the algorithm model combination configuration set by the user to form a video action comprehensive recognition method; for the algorithms that the action recognition component needs to train, it entrusts the data annotation component to start the data annotation service of the corresponding task, so that the user can perform data annotation through an interactive interface to generate first annotation data; the action recognition component uses the first annotation data to train the algorithms to be trained in the video action comprehensive recognition method, and obtains and stores the corresponding recognition models;
the action evaluation component receives the action evaluation indexes and the algorithm model combination configuration set by the user to form a video action comprehensive evaluation method; for the algorithms that the action evaluation component needs to train, it entrusts the data annotation component to start the data annotation service of the corresponding task, so that the user can perform data annotation through an interactive interface to generate second annotation data; the action evaluation component uses the second annotation data to train the algorithms to be trained in the video action comprehensive evaluation method, and obtains and stores the corresponding evaluation models;
the action recognition component executes reasoning on the video data collected by the data collection component in real time based on the video action comprehensive recognition method and outputs a video action comprehensive characteristic recognition result;
the action evaluation component executes reasoning on the video data acquired by the data acquisition component in real time and the identification result of the video action comprehensive characteristic output by the action identification component based on the video action comprehensive evaluation method, and outputs a video action comprehensive evaluation result.
Correspondingly, the video action comprehensive feature recognition result output by the action recognition component includes at least features of the human body itself and features of external objects whose changes are correlated with those of the human body.
Preferably, the step of the motion recognition component performing inference on the video data collected by the data collection component in real time based on the video motion comprehensive recognition method and outputting a recognition result of the video motion comprehensive characteristics includes:
the data acquisition component inputs acquired video data into the action recognition component, the action recognition component calls the video action comprehensive recognition method to perform reasoning to obtain a video frame pool and an action characteristic pool, and simultaneously outputs a recognition result;
correspondingly, the step of the action evaluation component executing reasoning on the video data collected by the data collection component in real time and the video action comprehensive feature recognition result output by the action recognition component based on the video action comprehensive evaluation method, and outputting a video action comprehensive evaluation result, comprises the following steps:
the action evaluation component analyzes the video frame pool and the action characteristic pool, carries out action evaluation algorithm reasoning based on the video action comprehensive evaluation method and combined with a preset standard action video, and outputs an evaluation result.
Preferably, the action recognition categories and algorithm model combination configuration set by the user, and the action evaluation categories and algorithm model combination configuration set by the user, are recorded in a project configuration file; the project configuration file contains the meta-information configuration file paths of all the algorithms or models in the algorithm model library that the user desires to run, together with the runtime configuration file paths corresponding to those algorithms or models.
Preferably, the video motion comprehensive recognition method uses, on the basis of single-frame images or video time sequences in the video, multiple feature detection algorithms and models to describe action definitions and feature depictions formed by the human body itself and by its interaction with the environment;
wherein the action definitions and feature types include, but are not limited to:
image features coded by various coding modes in a single-frame or multi-frame sequence;
the coordinate of key points of a single person in a single frame and a coordinate sequence formed in multiple frames;
coordinates of key points of a plurality of human bodies in a single frame and a coordinate sequence formed in a plurality of frames;
object attributes, such as category, number and color, of objects of interest in a single frame, their position information such as bounding-box coordinates and boundary-point coordinates, and the sequences formed by these features over multiple frames;
a composite attribute defined by characteristics of a human body and various things in a single frame;
comprehensive attributes defined by characteristic sequences of human bodies and various things in multiple frames;
the characteristics and comprehensive properties of the human body and various things which are possibly appeared in the future are determined by the characteristic sequence of the human body and various things at present.
Preferably, the algorithms adopted by the video motion comprehensive recognition method may include a target detection algorithm, a human key point detection algorithm, a target tracking algorithm, a skeleton modeling algorithm and a sequence classification algorithm, with combinations and nesting among the algorithms, used to describe action definitions and feature depictions formed by the interaction of the human body's own features and external object features in single frames and in video sequences.
Preferably, the action recognition algorithm model library and the action evaluation algorithm model library are collections of the code files, model files and other related files of the action recognition algorithms and models and of the action evaluation algorithms and models; each algorithm or model must include a meta-information configuration file, a runtime configuration file and an inference script, and each trainable algorithm must also include a training start script.
Preferably, the meta-information configuration file of an algorithm or model specifically includes: the algorithm or model name; the algorithm or model type; the training start script path and inference start script path of the algorithm or model; the data types of the algorithm or model's inference inputs; and the data types of the algorithm or model's inference outputs;
the data types of the inference inputs and outputs cover the following: the real type of the object in memory, the parameters describing the object's attributes, and the nesting structure between them. The runtime configuration file of an algorithm or model configures all the parameters that the training process or inference process of the algorithm or model involves or depends on.
The invention has the following beneficial effects: the interactive video motion comprehensive identification and evaluation system and method expand the scope of action description compared with the human-skeleton modeling algorithms of the prior art, and can model actions performed by a single person, by human-object interaction and by multi-person interaction; they provide a rich system of comprehensive evaluation indexes and form a general solution for video action recognition and evaluation, which can describe the difference between an action to be evaluated and the standard on multiple qualitative and quantitative levels and can incorporate the latest data, data processing techniques and state-of-the-art algorithm models as the technology develops, so as to achieve the best action recognition and evaluation effect.
Drawings
FIG. 1 is a schematic diagram of components of an interactive video motion integrated recognition and evaluation system according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a training process of a motion recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an inference flow of a motion recognition model according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a training flow of an action evaluation model according to an embodiment of the invention;
FIG. 5 is a schematic diagram of an inference flow of an action evaluation model according to an embodiment of the invention;
fig. 6 is a schematic diagram of a physical deployment of an interactive video motion integrated recognition evaluation system according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Interactive video motion comprehensive identification evaluation system and method embodiment
First, an interactive video motion integrated recognition evaluation system proposed according to an embodiment of the present invention will be described with reference to the accompanying drawings. Fig. 1 is a diagram illustrating an interactive video motion comprehensive recognition and evaluation system according to an embodiment of the present invention.
As shown in fig. 1, the system 10 includes: a data collection component 100, a data annotation component 200, an action recognition component 300, and an action evaluation component 400.
The data acquisition component 100 is used for acquiring original video data; in the model training phase, the original video data acquired by the data acquisition component 100 serve as the input of the data annotation component for generating annotation data, and in the inference phase they serve as the original video data to be recognized and evaluated.
The action recognition component 300 is used for receiving a request of a user for adding an action recognition algorithm model component and adding the action recognition algorithm model component into an action recognition algorithm model library;
the action evaluation component 400 is used for receiving a request of a user for adding an action evaluation algorithm model component and adding the action evaluation algorithm model component into an action evaluation algorithm model library;
It should be noted that a request to add an action recognition algorithm model component has a potential association with the original video data acquired by the data acquisition component 100, and likewise a request to add an action evaluation algorithm model component has a potential association with that original video data.
That is, if the system needs to recognize people in the video data collected by the data acquisition component 100, the action recognition component 300 needs an algorithm model capable of detecting people. However, this embodiment does not verify the relationship between a newly added action recognition or action evaluation algorithm model component request and the collected original video data, because the same algorithm model can be used for many different data types, and the same data type can also be handled by different algorithm models; therefore, the action recognition and action evaluation algorithm model components added in this embodiment can be used for many different data types. In a concrete implementation, take the video action recognition of railway signal semaphores as an example: the recognition task needs to detect all human bodies in a real-time video, recognize whether the actions they make are railway signal semaphores, and classify the specific semaphore category. The original video data are then the railway-semaphore videos, and the action recognition and action evaluation algorithm model components that the user requests to add are both potentially associated with those videos, i.e. the added algorithms for action recognition and action evaluation are potentially associated with railway-semaphore actions. The action recognition algorithm model components and action evaluation algorithm model components added by the user are collections of the algorithm model's code files and other related files; these files are added to the respective algorithm model libraries by the action recognition component or the action evaluation component.
The action recognition component 300 is used for receiving the action recognition categories and the algorithm model combination configuration set by a user to form a video action comprehensive recognition method; for the algorithms that the action recognition component needs to train, it entrusts the data annotation component 200 to start the data annotation service of the corresponding task, so that the user can perform data annotation through an interactive interface to generate first annotation data; the action recognition component uses the first annotation data to train the algorithms to be trained in the video action comprehensive recognition method, and obtains and stores the corresponding recognition models.
The algorithms adopted by the video action comprehensive recognition method of this embodiment may include a target detection algorithm, a human key point detection algorithm, a target tracking algorithm, a skeleton modeling algorithm and a sequence classification algorithm, used to describe action definitions and feature depictions formed by the interaction of the human body's own features and external object features in single frames and in video sequences.
It can be understood that the combined configuration of action recognition categories and algorithm models set by the user, and the combined configuration of action evaluation categories and algorithm models set by the user, refer to a project configuration file. This file contains the meta-information configuration file paths of all the algorithms/models in the algorithm model library that the user desires to run, together with the runtime configuration file paths corresponding to those algorithms/models.
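As an illustration only, such a project configuration file could be organized as a mapping from the selected recognition and evaluation algorithms to the paths of their meta-information and runtime configuration files. The keys, file names and paths below are assumptions made for this sketch and are not prescribed by the embodiment.

```python
# Hypothetical project configuration, loaded here as a plain Python dict.
# All key names and paths are illustrative assumptions.
project_config = {
    "action_recognition": [
        {"meta_info": "model_zoo/human_flag_detector/meta.yaml",
         "runtime":   "model_zoo/human_flag_detector/runtime_semaphore.yaml"},
        {"meta_info": "model_zoo/keypoint_detector/meta.yaml",
         "runtime":   "model_zoo/keypoint_detector/runtime_default.yaml"},
        {"meta_info": "model_zoo/sequence_classifier/meta.yaml",
         "runtime":   "model_zoo/sequence_classifier/runtime_semaphore.yaml"},
    ],
    "action_evaluation": [
        {"meta_info": "model_zoo/value_diff/meta.yaml",
         "runtime":   "model_zoo/value_diff/runtime_default.yaml"},
        {"meta_info": "model_zoo/exec_diff_predictor/meta.yaml",
         "runtime":   "model_zoo/exec_diff_predictor/runtime_semaphore.yaml"},
    ],
}
```

Under this reading, the recognition and evaluation components would resolve each entry against their algorithm model libraries and chain the selected components into the comprehensive recognition and evaluation methods.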
The action evaluation component 400 is used for receiving the action evaluation indexes and the algorithm model combination configuration set by the user to form a video action comprehensive evaluation method; for the algorithms that the action evaluation component 400 needs to train, it entrusts the data annotation component 200 to start the data annotation service of the corresponding task, so that the user can perform data annotation through an interactive interface to generate second annotation data; the action evaluation component 400 uses the second annotation data to train the algorithms to be trained in the video action comprehensive evaluation method, and obtains and stores the corresponding evaluation models.
The action recognition component 300 is configured to perform inference, based on the video action comprehensive recognition method, on the video data acquired in real time by the data acquisition component, and to output a video action comprehensive feature recognition result; this result includes at least features of the human body itself and features of external objects whose changes are correlated with those of the human body.
In a concrete implementation where the original video data are the railway-semaphore videos, the worker waving a flag in the video belongs to the features of the human body itself, while the flag being waved belongs to the features of external objects.
It should be noted that the video action comprehensive recognition method of this embodiment uses multiple feature detection algorithms and models, on the basis of single-frame images or video time sequences in the video, to describe action definitions and feature depictions formed by the human body itself and by its interaction with the environment. The action definitions and feature types include, but are not limited to:
image features coded by various coding modes in a single-frame or multi-frame sequence;
the key point coordinates of a single person in a single frame and a coordinate sequence formed in multiple frames;
coordinates of key points of a plurality of human bodies in a single frame and a coordinate sequence formed in a plurality of frames;
object attributes, such as category, number and color, of objects of interest in a single frame, their position information such as bounding-box coordinates and boundary-point coordinates, and the sequences formed by these features over multiple frames;
a composite attribute defined by characteristics of a human body and various things in a single frame;
comprehensive attributes defined by characteristic sequences of human bodies and various things in multiple frames;
features and composite attributes of the human body and of various things that may appear in the future, as determined from the current feature sequences of the human body and various things;
the action evaluation component 400 executes reasoning on the video data collected by the data collection component in real time and the recognition result of the video action comprehensive characteristic output by the action recognition component based on the video action comprehensive evaluation method, and outputs a video action comprehensive evaluation result.
In this embodiment, the video action comprehensive recognition method includes hand-crafted feature extraction algorithms that work directly on images and optical flow, as well as deep learning algorithms for image classification, target detection, key point detection, skeleton modeling, sequence classification and the like, and these algorithms can be combined and nested. The deep learning algorithms require supervised learning on annotated data in the training phase to fit the neural network parameters before they can perform feature extraction and prediction in the inference phase. It can be understood that a deep learning algorithm here is an algorithm that uses an artificial neural network as its framework and performs representation learning on data; its training iteratively optimizes the neural network with a set of hyper-parameters to obtain estimates of the network parameters. Different deep learning algorithms require data of different types and formats: for example, a target detection algorithm requires single-frame images together with the bounding-box coordinates of the objects of interest, while a skeleton modeling algorithm requires single-frame or sequential coordinates of human key points; such data may come from the original video frame sequence or from the outputs of other algorithms.
Specifically, in this embodiment, the interactive video motion comprehensive identification and evaluation process is mainly divided into two periods, namely training and reasoning. During the training period, the data collection component 100 collects or receives a large amount of image and video data, the action recognition component 300 and the action evaluation component 400 receive an algorithm model component and an algorithm model combination scheme added by a user, and the data annotation component 200 is entrusted to start a data annotation service. The motion recognition component 300 and the motion evaluation component 400 train on the labeled data corresponding to the respective algorithms, and generate and store models. In the inference period, the data acquisition component 100 acquires images or videos and inputs the images or videos into the action recognition component 300, and the action recognition component 300 performs inference of an action recognition algorithm, stores a video frame pool and an action feature pool with a certain length, and outputs a recognition result. The action evaluation component 400 analyzes the video frame pool and the action feature pool, performs action evaluation algorithm reasoning by combining evaluation criteria, and outputs an evaluation result.
Referring to fig. 6, fig. 6 shows a physical deployment diagram of the system, which is composed of an image/video capture device and a server. The server is responsible for the operation of all system components and can carry out data transmission and interaction with the data acquisition device. Specifically, the data acquisition component 100 in the server controls the data acquisition device to acquire and store image and video data in the server when receiving a data acquisition request, and other component operations in the system such as data annotation, action recognition and action evaluation are all operated in the server.
Correspondingly, based on the system of fig. 1, the present invention corresponds to a set of method embodiments, in which the method includes the following steps:
video acquisition and manual operation stage:
calling a data acquisition component to acquire original video data;
receiving a request of a user for adding a motion recognition algorithm model component by a motion recognition component, and adding the request into a motion recognition algorithm model library;
receiving a request of a user for adding a new action evaluation algorithm model component by an action evaluation component, and adding the request into an action evaluation algorithm model library;
and (3) a motion recognition model training stage:
the action recognition component receives the action recognition categories and the algorithm model combination configuration set by the user to form a video action comprehensive recognition method; for the algorithms that the action recognition component needs to train, it entrusts the data annotation component to start the data annotation service of the corresponding task, so that the user can perform data annotation through an interactive interface to generate first annotation data; the action recognition component uses the first annotation data to train the algorithms to be trained in the video action comprehensive recognition method, and obtains and stores the corresponding recognition models; the video action comprehensive recognition method preferably includes at least a target detection algorithm, a human key point detection algorithm, a target tracking algorithm, a skeleton modeling algorithm and a sequence classification algorithm, used to describe action definitions and feature depictions formed by the interaction of the human body's own features and external object features in single frames and in video sequences;
and (3) a motion evaluation model training stage:
the action evaluation component receives the action evaluation indexes and the algorithm model combination configuration set by the user to form a video action comprehensive evaluation method; for the algorithms that the action evaluation component needs to train, it entrusts the data annotation component to start the data annotation service of the corresponding task, so that the user can perform data annotation through an interactive interface to generate second annotation data; the action evaluation component uses the second annotation data to train the algorithms to be trained in the video action comprehensive evaluation method, and obtains and stores the corresponding evaluation models;
and (3) action identification reasoning process stage:
the action recognition component executes reasoning on the video data collected by the data collection component in real time based on the video action comprehensive recognition method and outputs a video action comprehensive characteristic recognition result;
and (3) action evaluation reasoning process stage:
and the action evaluation component executes reasoning on the video data acquired by the data acquisition component in real time and the identification result of the video action comprehensive characteristic output by the action identification component based on the video action comprehensive evaluation method, and outputs a video action comprehensive evaluation result.
Further, for each stage of the interactive video motion comprehensive identification and evaluation method of the present invention, detailed descriptions are respectively provided in different method embodiments:
method embodiment 1< motion recognition model training phase >
In the present embodiment, a motion recognition model training embodiment is provided, which can be performed by the motion recognition component 300 in the present embodiment. As a specific example, this embodiment takes the video motion recognition of a railway signal semaphore as an example, and the recognition task needs to recognize all human bodies in the real-time video, recognize whether the motion they do is a railway signal semaphore, and classify a specific semaphore category.
In this embodiment, the action recognition algorithm model components include: a target detection algorithm for detecting human bodies and flags (distinguished by color), a target tracking algorithm for tracking human bodies and flags, a key point detection algorithm for detecting key points of human body parts, a human skeleton modeling algorithm for skeleton modeling, a feature encoder for general image feature encoding, and a sequence feature classification algorithm for classifying action categories. Except for the tracking algorithm, the skeleton modeling algorithm and the general feature encoder, the other algorithms need to be trained. The algorithms listed here are designed for the recognition requirements of this embodiment; the invention limits neither the specific action recognition scenario nor the specific algorithms adopted.
As shown in fig. 2, the motion recognition model training process in this embodiment includes the following steps:
in step S101, the action recognition component 300 requests the data annotation component 200 to start a data annotation function for the above algorithm.
Step S102, the data annotation component 200 starts the data annotation service of the corresponding algorithm, and the user performs a specific annotation operation. In this embodiment, the specific contents to be labeled are as follows:
category labels and bounding-box coordinates of human bodies and flags (distinguished by color) in single-frame images;
the category and the coordinates of key points of the human body in the single-frame image;
an action category label corresponding to the information of each person in the continuous frame sequence;
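Purely to make the three kinds of labels above concrete, the records below show one possible in-memory layout of the annotation data; the field names, coordinate convention and values are assumptions of this sketch, not a format defined by the embodiment.

```python
# Hypothetical annotation records for the railway-semaphore example.
frame_annotation = {
    "frame_id": 120,
    "objects": [
        {"category": "person",   "bbox": [412.0, 96.0, 588.0, 540.0]},   # x1, y1, x2, y2 (assumed)
        {"category": "flag_red", "bbox": [560.0, 110.0, 640.0, 190.0]},
    ],
    "keypoints": [  # one list per person: (keypoint category, x, y)
        [("left_wrist", 575.0, 150.0), ("right_wrist", 430.0, 300.0)],
    ],
}
sequence_annotation = {
    "person_track_id": 3,
    "frame_range": [100, 220],
    "action_label": "semaphore_stop",  # action category of the whole continuous sequence
}
```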
Step S103, the action recognition component 300 trains each algorithm; the data used by each algorithm are as follows:
the human body and flag target detection algorithm is trained using single-frame images with the corresponding category labels and bounding-box coordinates;
the human key point detection algorithm uses image slices defined by the human bounding boxes in single-frame images, together with the human key point coordinates and category labels relative to each slice;
For the sequence feature classification algorithm, the original annotations are first processed as follows. The target tracking algorithm is used to extract, from any continuous frame sequence, the key point coordinate sequence of the same human body ID (the human body's own features); the skeleton modeling algorithm converts it into a skeleton feature sequence, recorded as the actual human skeleton feature sequence F_sk ∈ R^(L×D_sk), where L is the sequence length and D_sk is the feature dimension output by the skeleton modeling algorithm. The target tracking algorithm is also used to extract the flags (external object features) appearing in these frames, and only flags whose distance from the person (the human body's own features) is less than a preset threshold are considered. If a qualifying target flag exists, the absolute coordinates of the center point of its bounding box are converted into coordinates relative to the center point of that person's bounding box, and the width and height of the flag's bounding box are normalized by the width and height of the person's bounding box; the processed bounding-box sequence of target flag i is recorded as BBox_flag,i ∈ R^(L×4). The one-hot-encoded color class sequence of the target flag is recorded as Cls_flag,i ∈ R^(L×C_flag), where C_flag is the number of color categories. The general feature encoder encodes a depth feature vector for the image slice defined by the original bounding box of the target flag in each frame and compresses it to a fixed length; the resulting feature sequence is recorded as FE_flag,i ∈ R^(L×D_fe), where D_fe is the feature dimension output by the general feature encoder. BBox_flag,i, Cls_flag,i and FE_flag,i are concatenated along the feature dimension as F_flag,i = Concat(BBox_flag,i, Cls_flag,i, FE_flag,i), and the feature values corresponding to frames in which the flag does not appear are filled with zeros. All qualifying target-flag features are concatenated along the feature dimension into F_flag = Concat(F_flag,i); the number of flags is truncated according to a preset maximum threshold n, and the features corresponding to missing flags are filled with zeros. Finally F_sk and F_flag are concatenated along the feature dimension into F = Concat(F_sk, F_flag) ∈ R^(L×(D_sk + n·(4 + C_flag + D_fe))). The sequence is truncated to F' according to a preset sequence window length L_max; if the sequence exceeds this length, several sub-sequences F' ∈ R^(L_max×(D_sk + n·(4 + C_flag + D_fe))) are generated with a sliding window, and the action class label of the complete sequence is used as the action class label of each truncated sub-sequence. The sequence feature classification algorithm thus requires annotated data {F', Y} consisting of the sequence features of each sub-sequence and the action class label index Y.
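The feature assembly just described can be pictured with a small numpy sketch: skeleton features of one tracked person are concatenated with per-flag bounding-box, color and image features, missing flags are zero-padded up to the maximum count n, and the result is cut into windows of length L_max. Function and variable names, the stride of the sliding window, and the layout of the flag feature arrays are assumptions of this sketch, not mandated by the embodiment.

```python
import numpy as np

def build_training_windows(f_sk, flag_feats, n_max, l_max):
    """Assemble F' windows as sketched above.

    f_sk:       (L, D_sk) actual human skeleton feature sequence of one person.
    flag_feats: list of (L, 4 + C_flag + D_fe) arrays, one per qualifying flag,
                already zero-filled for frames in which the flag is absent.
    """
    L = f_sk.shape[0]
    flags = list(flag_feats[:n_max])                       # truncate to at most n_max flags
    if flags:
        d_flag = flags[0].shape[1]
        while len(flags) < n_max:                          # zero-pad missing flags
            flags.append(np.zeros((L, d_flag)))
    f = np.concatenate([f_sk] + flags, axis=1) if flags else f_sk
    if L <= l_max:
        return [f]
    # Sliding window with stride 1 (the stride is an assumption of this sketch).
    return [f[i:i + l_max] for i in range(L - l_max + 1)]
```

Each returned window would then be paired with the action class label Y of the full sequence to form one training sample {F', Y}.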
In step S104, the motion recognition component 300 stores each model in the computer storage after training.
Method embodiment 2< action recognition inference flow stage >
In the present embodiment, a data inference approach for a motion recognition model is provided, which can be performed by the motion recognition component 300. Taking the specific scenario in < method embodiment 1> as an example, all the outputs of each model are stored in the motion feature pool, and since the scenario only concerns the identified motion category, only the sequence feature classification result needs to be presented as the motion identification result.
As shown in fig. 3, the action recognition model inference process in this embodiment includes the following steps:
step S201, the action recognition component 300 initializes the video frame PoolvideoPool of action characteristics Poolrec_featAnd identification result Poolrec_result
Step S202, the action recognition component 300 receives the video stream provided by the data acquisition component 100;
step S203, traversing each frame of the video stream in sequence, stopping if the traversal is finished, or executing step S204;
Step S204, the traversed target frame image Frame_i is added to the video frame pool and marked with its frame number i;
Step S205, Frame_i is fed into the forward propagation Detect(·) of the target detection algorithm to obtain all human bounding-box coordinates bbox_human, flag categories cls_flag and flag bounding-box coordinates bbox_flag in the current frame, which are temporarily stored in the action feature pool;
Step S206, the image slices of all human regions bbox_human are extracted from Frame_i and fed into the forward propagation KPDetect(·) of the human key point detection algorithm to obtain all human key point information s_kp of the current frame, which is temporarily stored in the action feature pool;
Step S207, the forward propagation SK(·) of the skeleton modeling algorithm is run on all human key point detection results, and the skeleton feature results f_sk are temporarily stored in the action feature pool;
Step S208, the image slices of all flag regions bbox_flag are extracted from Frame_i and fed into the forward propagation FE(·) of the general feature extraction model to obtain flag image features f_flag compressed to a fixed length, which are temporarily stored in the action feature pool;
Step S209, with all bounding boxes bbox_human and bbox_flag detected in the current frame as input, the target tracking algorithm Track(·) is run to determine the identity indexes id_human and id_flag corresponding to each target detection result in the current frame; the identity index information is added to the recognition results of the corresponding target features in the action feature pool, marked with the frame number i;
Step S210, whenever the identity index id_human of some person has existed continuously in the action feature pool for the sequence window length L_max described in <Method embodiment 1>, all flag detection results bbox_flag,i, cls_flag,i appearing in those frames are collected; only the flags whose distance from the person is less than the threshold of <Method embodiment 1> are retained, and then, according to the preset maximum flag number threshold n of <Method embodiment 1>, the n flags appearing in the most frames are kept. For these flags, the absolute coordinates of the bounding-box center points are converted into coordinates relative to the center point of the person's bounding box, and the width and height of each flag bounding box are normalized by the width and height of the person's bounding box. The processed flag bounding-box sequence BBox_flag, flag color class sequence Cls_flag and flag image feature sequence FE_flag are then concatenated along the feature dimension into the overall flag feature sequence F_flag = Concat(BBox_flag, Cls_flag, FE_flag); the feature vectors corresponding to frames in which a flag does not appear, or to vacancies where the number of flags is less than the maximum truncation value n, are filled with zeros. The actual human skeleton feature sequence F_sk formed by the person's skeleton features f_sk output in step S207 and the overall flag feature sequence F_flag are concatenated along the feature dimension into F = Concat(F_sk, F_flag) and fed into the sequence feature classification model Action(·). The action classification result act is written into both the action feature pool and the recognition result pool, marked with the frame-number range covered by the sequence.
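A compressed sketch of the per-frame flow of steps S201-S210 is given below, assuming a `models` object that exposes the detection, key point, skeleton, tracking and sequence classification components as callables; this interface, and the omission of the flag-feature assembly inside the window step, are simplifications made for this sketch rather than a prescribed API.

```python
import numpy as np
from collections import deque

def recognition_inference(frames, models, l_max):
    """Hedged sketch of inference steps S201-S210 for one video stream."""
    pool_video, pool_feat, pool_result = [], [], []            # S201
    windows = {}                                               # per-person skeleton feature windows
    for i, frame in enumerate(frames):                         # S202-S203
        pool_video.append((i, frame))                          # S204
        persons, flags = models.detect(frame)                  # S205: Detect(.)
        kps = [models.kp_detect(frame, p) for p in persons]    # S206: KPDetect(.)
        sks = [models.sk(kp) for kp in kps]                    # S207: SK(.)
        person_ids = models.track(persons, flags)              # S209: Track(.)  (S208 omitted here)
        pool_feat.append({"frame": i, "persons": persons, "flags": flags,
                          "keypoints": kps, "skeleton": sks, "ids": person_ids})
        for pid, f_sk in zip(person_ids, sks):                 # S210
            w = windows.setdefault(pid, deque(maxlen=l_max))
            w.append(f_sk)
            if len(w) == l_max:
                # In the full method, nearby flag features would be concatenated
                # onto the skeleton window here before classification.
                act = models.action(np.stack(list(w)))         # Action(.)
                pool_result.append({"person": pid,
                                    "frames": (i - l_max + 1, i),
                                    "action": act})
    return pool_video, pool_feat, pool_result
```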
Method embodiment 3< action evaluation model training phase >
In the present embodiment, a manner of motion estimation model training is provided, which may be performed by the motion estimation component 400. Taking the specific scenario in < method embodiment 1> as an example, in addition to detecting the motion category, this embodiment introduces a new task of evaluating the degree of difference between the motion and the standard motion on the basis of recognizing that the motion performed by the human body is the motion of the railroad semaphore signal.
In this embodiment, the action evaluation algorithm model components include: a numerical difference calculation method and a sequence execution difference prediction algorithm. The numerical difference calculation method outputs, based on a fixed rule and without training, the numerical difference between the features of two sequences at the same time point; the sequence execution difference prediction algorithm is based on deep learning, predicts the frame-rate difference and time delay between the execution of two sequences, and needs training. In addition, the data processing procedure also uses the human target detection algorithm, the target tracking model and the human skeleton modeling algorithm of <Method embodiment 1>. The algorithms listed here are designed for the action evaluation requirements of this embodiment; the invention limits neither the specific action evaluation scenario nor the specific algorithms adopted.
As shown in fig. 4, the motion evaluation model training process in this embodiment includes the following steps:
in step S301, the action evaluation component 400 requests the data labeling component 200 to start a data labeling function for the above algorithm.
Step S302, the data annotation component 200 starts the data annotation service of the corresponding algorithm, and the user performs a specific annotation operation. In this embodiment, the specific contents to be labeled are as follows:
category labels and bounding-box coordinates of human bodies in single-frame images;
the category and the coordinates of key points of the human body in the single-frame image;
the above annotation can also be obtained by the same annotation in the data annotation step of the motion recognition model training in < method embodiment 1> of the present invention.
Step S303, the action evaluation component 400 trains each algorithm; the sequence execution difference prediction algorithm requires a special training data generation method, as follows:
First, the target tracking algorithm is used to extract, from any continuous frame sequence in the original video data, the key point coordinate sequence of the same human body ID (the human body's own features). In each training iteration, using the sequence window length L_max preset in <Method embodiment 1>, a sub-sequence of this sequence is first randomly intercepted and recorded as S_1, where C_kp denotes the number of key point categories. A frame-number interval δ between the start frame of a second sub-sequence and that of S_1 is then sampled from one random distribution, and a frame-rate scaling factor γ is sampled from another random distribution. Starting δ frames after the start frame of S_1, a sequence of length int(γ·L_max) is truncated and recorded as S_2. The sequence S_2 is adjusted to length L_max by interpolation (e.g. bilinear interpolation), giving S_2'. A jitter distance ε for the key point coordinates is sampled from a further random distribution, and S_2' is adjusted to S_2' + ε. Finally, the human key point sequences S_1 and S_2' are transformed into skeleton features F_1 and F_2 using the skeleton modeling algorithm SK(·) and concatenated along the time-series dimension into F = Concat(F_1, F_2), which serves as the training input of the model, with δ and γ as the supervised output values.
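A numpy sketch of the sample-pair construction described above follows. The particular random distributions, their parameter ranges and the additive jitter model are assumptions of this sketch; the embodiment only requires that δ, γ and the jitter be drawn from random distributions.

```python
import numpy as np

def make_exec_diff_sample(kp_seq, sk, l_max, rng=None):
    """Build one training sample for the sequence execution difference predictor.

    kp_seq: (T, C_kp, 2) key point coordinate sequence of one tracked person, T >= 2*l_max.
    sk:     skeleton modeling function mapping a key point sequence to skeleton features.
    """
    rng = rng or np.random.default_rng()
    start = int(rng.integers(0, max(1, kp_seq.shape[0] - 2 * l_max)))
    s1 = kp_seq[start:start + l_max]                     # reference sub-sequence S_1
    delta = int(rng.integers(1, max(2, l_max // 2)))     # start-frame offset (supervision target)
    gamma = float(rng.uniform(0.5, 1.5))                 # frame-rate scaling factor (supervision target)
    s2 = kp_seq[start + delta:start + delta + int(gamma * l_max)]
    # Resample s2 back to l_max frames by linear interpolation along the time axis.
    idx = np.linspace(0.0, s2.shape[0] - 1.0, l_max)
    lo, hi = np.floor(idx).astype(int), np.ceil(idx).astype(int)
    w = (idx - lo)[:, None, None]
    s2 = (1.0 - w) * s2[lo] + w * s2[hi]
    s2 = s2 + rng.normal(scale=0.01, size=s2.shape)      # key point coordinate jitter
    f = np.concatenate([sk(s1), sk(s2)], axis=0)         # concatenate along the time dimension
    return f, np.array([delta, gamma])                   # model input, supervised output values
```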
In step S304, after training is completed, the action evaluation component 400 stores each model in computer storage.
Method embodiment 4< action evaluation inference flow stage >
In the present embodiment, a data inference approach for an action evaluation model is provided, which can be performed by the action evaluation component 400. This embodiment evaluates and compares a video to be evaluated with preset standard action videos using the algorithm models proposed in <Method embodiment 3>. The standard action videos are one or more preset video segments covering all the railway semaphore action categories of <Method embodiment 1>; the same action category may have multiple standard videos.
According to fig. 5, the action evaluation model inference flow in the present embodiment includes the following steps:
step S401: action evaluation component 400 initializes a pool of standard action features Pooleval_std_featAnd action evaluation output Pooleval_result
Step S402: the action evaluation component 400 extracts human key points s corresponding to all standard actions in the marked action video by using a target detection algorithm Detect (-), a target tracking algorithm Track (-), and a skeleton modeling algorithm SK (-), wherein the human key points s correspond to all standard actions in the marked action videostd_kp(comprising the sequence S)std_kp) And human skeleton characteristics fstd_sk(comprising sequence F)std_sk) And storing the data in a standard action feature pool.
Step S403: the action evaluation component 400 receives the action feature Pool output by the action recognition component 300 in real timerec_feat
Step S404: the motion evaluation component 400 calculates the human body key points s actually detected in the motion feature pool frame by frame using a numerical difference calculation method ValueDiff (·)kp and Standard human Key points s in the Standard motion feature poolstd_kpDifference diff ofkpDifference diffkpAnd marking the corresponding frame number i in the write action evaluation output pool.
Step S405: whenever the action feature pool contains a human skeleton feature sequence F_sk whose length equals the sequence window length L_max preset in <Method embodiment 3>, the action evaluation component 400 concatenates the actual human skeleton feature sequence F_sk and the standard human skeleton feature sequence F_std_sk of the corresponding frames in the standard action feature pool along the time-series dimension to obtain F = Concat(F_std_sk, F_sk), feeds it into the sequence execution difference prediction model ExeDiff(·) to obtain an action delay prediction δ̂ and an action frame-rate difference prediction γ̂, writes the results into the action evaluation output pool, and marks the corresponding frame-number range.
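A sketch combining the frame-by-frame key point difference of step S404 with the windowed execution-difference prediction of step S405 is given below. The per-key-point Euclidean distance used for ValueDiff(·) is only one possible fixed rule, and `exe_diff` stands in for the trained sequence execution difference predictor; both, like the pool record layout, are assumptions of this sketch.

```python
import numpy as np

def evaluation_inference(rec_feat_pool, std_feat_pool, exe_diff, l_max):
    """Hedged sketch of steps S404-S405 for one tracked person."""
    eval_pool = []
    actual_window, standard_window = [], []
    for i, (rec, std) in enumerate(zip(rec_feat_pool, std_feat_pool)):
        # S404: frame-by-frame numerical difference of key points (ValueDiff).
        diff_kp = np.linalg.norm(np.asarray(rec["keypoints"]) - np.asarray(std["keypoints"]), axis=-1)
        eval_pool.append({"frame": i, "keypoint_diff": diff_kp})
        # S405: once L_max skeleton features have accumulated, predict delay and
        # frame-rate difference from the concatenated standard/actual sequences.
        actual_window.append(rec["skeleton"])
        standard_window.append(std["skeleton"])
        if len(actual_window) == l_max:
            f = np.concatenate([np.stack(standard_window), np.stack(actual_window)], axis=0)
            delta_hat, gamma_hat = exe_diff(f)             # ExeDiff(.)
            eval_pool.append({"frames": (i - l_max + 1, i),
                              "delay": float(delta_hat),
                              "frame_rate_diff": float(gamma_hat)})
            actual_window.clear()
            standard_window.clear()
    return eval_pool
```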
The embodiment of the invention has the beneficial effects that:
The comprehensive action recognition and evaluation method constructs a general video action recognition and evaluation framework that greatly expands the scope of action description of existing methods; it can recognize actions and features in a variety of situations such as single person, multiple persons, person-object interaction, static and dynamic scenes, it also expands the indexes used for action evaluation, and it has good extensibility.
Through the design of the action recognition/evaluation algorithm meta-information and runtime configuration, a user can add any specific algorithm to the preset algorithm library and complete data annotation and model training for it. In the inference phase, starting from the original video frames and following the input and output data types declared in the meta-information configuration of the algorithms the user has chosen, the algorithm models can be executed implicitly in an orderly sequence: each takes data of its required input types from the feature pool as input and puts its output features back into the feature pool for later use by other algorithm models. The expert rules, the human skeleton sequence feature classification model and the background feature fusion mentioned in the background section can all be regarded as special cases or specific implementations of algorithm model combinations of this method in particular applications; at the same time, the method can also be extended to scenarios such as person-object interaction recognition and evaluation, a capability the other methods do not have.
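One way to realize the implicit ordering described above (an assumption of this sketch rather than a mechanism fixed by the embodiment) is to repeatedly run every algorithm component whose declared input data types are already available in the feature pool, until no further component can make progress.

```python
def run_by_declared_types(components, feature_pool):
    """Execute algorithm components in an order implied only by the input/output
    data types declared in their meta-information configurations.

    components:   list of dicts with keys "inputs", "outputs" (lists of type tags)
                  and "run" (a callable producing one value per declared output).
    feature_pool: dict mapping type tags (e.g. "frame", "obj_id") to data.
    """
    pending = list(components)
    while pending:
        progressed = False
        for comp in list(pending):
            if all(tag in feature_pool for tag in comp["inputs"]):
                outputs = comp["run"](*(feature_pool[tag] for tag in comp["inputs"]))
                feature_pool.update(zip(comp["outputs"], outputs))
                pending.remove(comp)
                progressed = True
        if not progressed:      # remaining components can never be satisfied
            break
    return feature_pool
```

Under this reading, adding a new algorithm only requires declaring its input and output type tags in its meta-information configuration; the execution order then follows from the type dependencies.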
In addition, the comprehensive action evaluation and recognition system provided by the invention can dynamically extend its data and algorithm models: data can be added, configurations changed, or models retrained on newly added data as requirements change, so that the latest data, data processing techniques, and the most advanced algorithm models can be integrated as the technology develops to achieve the best action recognition and evaluation effect.
In addition, in the specific implementation of the above embodiments, the action recognition algorithm model library and the action evaluation algorithm model library refer to collections of the code files, model files, and other related files of the action recognition algorithms/models and the action evaluation algorithms/models. Each algorithm/model must contain a meta-information configuration file, a runtime configuration file, and an inference script, and a trainable algorithm must also contain a training startup script.
The meta-information configuration file of an algorithm/model specifically includes: the algorithm/model name; the algorithm/model type; the training startup script path (if any) and the inference startup script path of the algorithm/model; the data types of the algorithm/model's inference inputs; and the data types of the algorithm/model's inference outputs. The data types of inference inputs and outputs cover the following cases: the real type of the object in memory (such as integer, string, list, dictionary, numpy array, etc.) together with the parameters describing the object's properties (such as list length, array shape, element type, etc.) and the nesting structure between them; and artificially defined data type tags, for example the string "obj_id" denoting the target identity index (whose real data type is an integer or a list of integers) output by the target tracking algorithm.
The runtime configuration file of an algorithm/model contains all parameter configuration information that the algorithm/model's training or inference process relies on, for example dataset paths and hyper-parameters used in model training. The same algorithm/model may have multiple different runtime configuration files.
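For illustration, a hypothetical meta-information configuration and runtime configuration for one algorithm entry might look as follows, written here as Python dictionaries; the field names mirror the description above, while the concrete values and file paths are invented examples.

```python
# hypothetical meta-information for a target tracking algorithm entry (illustrative values)
meta_info = {
    "name": "object_tracker",
    "type": "target_tracking",
    "train_script": None,                    # not a trainable algorithm, so no training script
    "infer_script": "tracker/infer.py",
    "input_types": ["frame", "bbox_list"],   # data-type tags taken from the feature pool
    "output_types": ["obj_id"],              # "obj_id": integer or list of integers
}

# hypothetical runtime configuration for a trainable sequence classification model
runtime_config = {
    "dataset_path": "/data/actions/train",
    "hyperparameters": {"batch_size": 16, "learning_rate": 1e-4, "epochs": 50},
}
```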
It should be noted that the foregoing is only a description of preferred embodiments of the invention and of the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit; the scope of the invention is determined by the appended claims.

Claims (10)

1. An interactive video action comprehensive recognition and evaluation system is characterized by comprising a data acquisition component 100, a data annotation component 200, an action recognition component 300 and an action evaluation component 400:
the data acquisition component 100 is used for acquiring original video data;
the action recognition component 300 is used for receiving a request of a user for adding an action recognition algorithm model component and adding the action recognition algorithm model component into an action recognition algorithm model library;
the action evaluation component 400 is used for receiving a request of a user for adding an action evaluation algorithm model component and adding the action evaluation algorithm model component into an action evaluation algorithm model library;
the action recognition component 300 is used for receiving action recognition categories and algorithm model combination configuration set by a user to form a video action comprehensive recognition method, and, for an algorithm that the action recognition component needs to train, entrusts the data annotation component 200 to start a data annotation service of the corresponding task, so that the user can perform data annotation based on an interactive interface to generate first annotation data; the action recognition component uses the first annotation data to train the algorithms that need training in the video action comprehensive recognition method, and obtains and stores a corresponding recognition model;
the action evaluation component 400 receives action evaluation indexes and algorithm model combination configuration set by a user to form a video action comprehensive evaluation method, and for an algorithm to be trained by the action evaluation component 400, the data annotation component 200 is entrusted to start data annotation service of the corresponding task, so that the user can perform data annotation based on an interactive interface to generate second annotation data; the action evaluation component 400 uses the second labeled data to train an algorithm to be trained in the video action comprehensive evaluation method, and obtains and stores a corresponding evaluation model;
the motion recognition component 300 is configured to perform inference on the video data acquired by the data acquisition component in real time based on the video motion comprehensive recognition method, and output a video motion comprehensive feature recognition result;
the action evaluation component 400 is configured to perform inference on the video data acquired by the data acquisition component in real time and the recognition result of the video action comprehensive characteristic output by the action recognition component based on the video action comprehensive evaluation method, and output a video action comprehensive evaluation result.
2. The interactive video action comprehensive recognition and evaluation system according to claim 1, wherein the video action comprehensive feature recognition result output by the action recognition component 300 at least comprises the human body's own features and features of external objects whose changes are related to the human body's own features.
3. An interactive video motion comprehensive identification and evaluation method is characterized by comprising the following steps:
calling a data acquisition component to acquire original video data;
receiving, by an action recognition component, a user request to add an action recognition algorithm model component, and adding the component into an action recognition algorithm model library;
receiving, by an action evaluation component, a user request to add an action evaluation algorithm model component, and adding the component into an action evaluation algorithm model library;
receiving an action recognition category and algorithm model combination configuration set by a user by an action recognition component to form a video action comprehensive recognition method, entrusting a data annotation component to start a data annotation service of a corresponding task for an algorithm which needs to be trained by the action recognition component, and carrying out data annotation on the basis of an interactive interface by the user to generate first annotation data; the motion recognition component uses the first labeling data to train an algorithm needing to be trained in the video motion comprehensive recognition method, and obtains and stores a corresponding recognition model;
the action evaluation component receives action evaluation indexes and algorithm model combination configuration set by a user to form a video action comprehensive evaluation method, and for the algorithm which needs to be trained by the action evaluation component, the data annotation component is entrusted to start the data annotation service of the corresponding task, so that the user can perform data annotation based on an interactive interface to generate second annotation data; the action evaluation component uses the second labeling data to train an algorithm needing to be trained in the video action comprehensive evaluation method, and obtains and stores a corresponding evaluation model;
the action recognition component executes reasoning on the video data collected by the data collection component in real time based on the video action comprehensive recognition method and outputs a video action comprehensive characteristic recognition result;
the action evaluation component executes reasoning on the video data acquired by the data acquisition component in real time and the identification result of the video action comprehensive characteristic output by the action identification component based on the video action comprehensive evaluation method, and outputs a video action comprehensive evaluation result.
4. The method according to claim 3, wherein the video action comprehensive feature recognition result output by the action recognition component at least comprises the human body's own features and features of external objects whose changes are related to the human body's own features.
5. The method of claim 3, wherein the action recognition component performs inference on the video data collected by the data collection component in real time based on the video action comprehensive recognition method, and outputs a recognition result of the video action comprehensive characteristics, comprising:
the data acquisition component inputs acquired video data into the action recognition component, the action recognition component calls the video action comprehensive recognition method to perform reasoning to obtain a video frame pool and an action characteristic pool, and simultaneously outputs a recognition result;
correspondingly, the step of the action evaluation component executing reasoning on the video data collected by the data collection component in real time and the video action comprehensive characteristic representation output by the action recognition component based on the video action comprehensive evaluation method and outputting a video action comprehensive evaluation result comprises the following steps:
the action evaluation component analyzes the video frame pool and the action characteristic pool, carries out action evaluation algorithm reasoning based on the video action comprehensive evaluation method and combined with a preset standard action video, and outputs an evaluation result.
6. The method according to any one of claims 3-5, wherein the user-defined action recognition category and algorithm model combination configuration, and the user-defined action evaluation category and algorithm model combination configuration are configured as a project profile; the project configuration file comprises meta-information configuration file paths of all algorithms or models which a user desires to run in the algorithm model library and runtime configuration file paths corresponding to the algorithms or models.
7. The method according to any one of claims 3-5, wherein the video action comprehensive recognition method is based on single-frame images or video time sequences in a video, and uses a plurality of feature detection algorithms and models to describe the action definitions and feature depictions formed by the human body itself and by its interaction with the environment;
wherein the action definitions and feature types include, but are not limited to:
image features encoded by various encoding schemes in a single frame or a multi-frame sequence;
coordinates of the key points of a single person in a single frame and the coordinate sequence they form across multiple frames;
coordinates of the key points of multiple human bodies in a single frame and the coordinate sequence they form across multiple frames;
object attributes of the objects of interest in a single frame, such as category, number and color, position information such as bounding box coordinates and boundary point coordinates, and the sequences these features form across multiple frames;
comprehensive attributes defined by the features of the human body and various objects in a single frame;
comprehensive attributes defined by the feature sequences of the human body and various objects across multiple frames;
features and comprehensive attributes of the human body and various objects that may appear in the future, as determined by their current feature sequences.
8. The method according to claim 7, wherein the algorithms preferably adopted by the video action comprehensive recognition method can include a target detection algorithm, a human body key point detection algorithm, a target tracking algorithm, a skeleton modeling algorithm and a sequence classification algorithm, and the various algorithms can be combined and nested with one another to describe the action definitions and feature depictions formed by the interaction of the human body's own features and external object features in a single frame and in a video sequence.
9. The method according to any one of claims 3-5, wherein the action recognition algorithm model library and the action evaluation algorithm model library are a collection of code files, model files and other related files of action recognition algorithms and models and action evaluation algorithms and models; each algorithm/model must include a meta-information configuration file, a runtime configuration file, and an inference script, and the trainable algorithm must also include a training start script.
10. The method according to any one of claims 3 to 5, wherein the meta-information profiles of the algorithms and models specifically include: algorithm and model name; algorithm and model type; training starting script path and reasoning starting script path of the algorithm and the model; the algorithm model infers the input data type; the data type of the algorithm and the model inference output;
the data types of the algorithm and model inference inputs/outputs cover the following cases: the real type of the object in memory, the parameters describing the object's attributes, and the nesting structure between them; the runtime configuration file of the algorithm or model contains all parameter configuration information that the training or inference process of the algorithm or model relies on.
CN202210448232.8A 2022-04-24 2022-04-24 Interactive video motion comprehensive identification and evaluation system and method Pending CN114677765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210448232.8A CN114677765A (en) 2022-04-24 2022-04-24 Interactive video motion comprehensive identification and evaluation system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210448232.8A CN114677765A (en) 2022-04-24 2022-04-24 Interactive video motion comprehensive identification and evaluation system and method

Publications (1)

Publication Number Publication Date
CN114677765A true CN114677765A (en) 2022-06-28

Family

ID=82079772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210448232.8A Pending CN114677765A (en) 2022-04-24 2022-04-24 Interactive video motion comprehensive identification and evaluation system and method

Country Status (1)

Country Link
CN (1) CN114677765A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115935008A (en) * 2023-02-16 2023-04-07 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment


Similar Documents

Publication Publication Date Title
CN109919031B (en) Human behavior recognition method based on deep neural network
CN110147743B (en) Real-time online pedestrian analysis and counting system and method under complex scene
CN109146921B (en) Pedestrian target tracking method based on deep learning
Avola et al. 2-D skeleton-based action recognition via two-branch stacked LSTM-RNNs
CN107423398B (en) Interaction method, interaction device, storage medium and computer equipment
Lao et al. Automatic video-based human motion analyzer for consumer surveillance system
Hongeng et al. Video-based event recognition: activity representation and probabilistic recognition methods
CN111738218B (en) Human body abnormal behavior recognition system and method
CN111079658A (en) Video-based multi-target continuous behavior analysis method, system and device
KR20200075114A (en) System and Method for Matching Similarity between Image and Text
CN110991278A (en) Human body action recognition method and device in video of computer vision system
Gunawardena et al. Real-time automated video highlight generation with dual-stream hierarchical growing self-organizing maps
Hammam et al. Real-time multiple spatiotemporal action localization and prediction approach using deep learning
CN114677765A (en) Interactive video motion comprehensive identification and evaluation system and method
CN112101154B (en) Video classification method, apparatus, computer device and storage medium
CN113298015A (en) Video character social relationship graph generation method based on graph convolution network
CN115018215B (en) Population residence prediction method, system and medium based on multi-modal cognitive atlas
Kumar et al. Detection and Content Retrieval of Object in an Image using YOLO
Marchellus et al. Deep learning for 3d human motion prediction: State-of-the-art and future trends
CN115457620A (en) User expression recognition method and device, computer equipment and storage medium
CN111753657B (en) Self-training-based text detector training method and system
Zhao et al. Research on human behavior recognition in video based on 3DCCA
Bai et al. Continuous action recognition and segmentation in untrimmed videos
al Atrash et al. Detecting and Counting People's Faces in Images Using Convolutional Neural Networks
Indhuja et al. Suspicious Activity Detection using LRCN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination