CN114067429B - Action recognition processing method, device and equipment - Google Patents

Action recognition processing method, device and equipment

Info

Publication number
CN114067429B
CN114067429B
Authority
CN
China
Prior art keywords
action
motion
model
data
adaptive anchor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111291956.8A
Other languages
Chinese (zh)
Other versions
CN114067429A (en)
Inventor
王小娟
金磊
何明枢
初佳明
阳柳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202111291956.8A priority Critical patent/CN114067429B/en
Publication of CN114067429A publication Critical patent/CN114067429A/en
Application granted granted Critical
Publication of CN114067429B publication Critical patent/CN114067429B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of this specification relate to a motion recognition processing method, apparatus, and device. An adaptive anchor point is added to the motion recognition model, and the motion to be recognized is identified by computing the distance between the motion data to be recognized and each adaptive anchor point in the model. The adaptive anchors achieve more accurate and more effective spatial clustering of each human motion, reducing the intra-class distance in the embedding layer and increasing the inter-class distance, which enhances the model's ability to recognize human motions and improves the accuracy and efficiency of motion recognition.

Description

Action recognition processing method, device and equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for motion recognition processing, and an electronic device.
Background
With the development of computer and Internet technology, more and more products use technical means to assist people's exercise; such means can help users correct their movements and thus exercise or train scientifically.
Generally, to assist people's exercise, the user's motion must first be recognized. One approach uses an image sensor or depth camera to visually capture human motion; image data carries sufficient information but also much noise, and its range of application is limited because the shooting angle of the camera is fixed or restricted. Another approach uses a wearable acceleration sensor, which, compared with image data, has low cost, strong portability, and high flexibility. Both schemes generally use intelligent learning models, but the recognition models of general human actions may not be accurate.
Therefore, how to improve the accuracy of human motion recognition is a technical problem that needs to be solved at present.
Disclosure of Invention
In view of the foregoing problems of the prior art, an object of the present invention is to provide a motion recognition processing method, apparatus, and electronic device, which can improve accuracy and efficiency of motion recognition.
In order to solve the technical problems, the specific technical scheme is as follows:
in one aspect, provided herein is a method of action recognition processing, the method comprising:
collecting action data to be identified;
inputting the motion data to be identified into a motion identification model, wherein the motion identification model is a neural network model, and an adaptive anchor point is arranged in an embedded layer of the motion identification model, and each adaptive anchor point corresponds to one type of motion;
and calculating Euclidean distances between the mapping vector of the motion data to be identified and each self-adaptive anchor point in the motion identification model by using the motion identification model, and taking the motion category corresponding to the self-adaptive anchor point with the smallest Euclidean distance as the motion category of the motion data to be identified.
Further, the training method of the action recognition model comprises the following steps:
Collecting known action sample data of known action categories;
initializing and setting a self-adaptive anchor point corresponding to a known action type according to the known action sample data, and setting model parameters of the action recognition model;
and taking the known action sample data as input data of the action recognition model, taking the known action category as output data, and performing model training on the action recognition model until the accuracy of the action recognition model reaches a preset threshold value or the training times reach preset times so as to adjust the initialized self-adaptive anchor points to target positions, so that the self-adaptive anchor points of the same action category are gathered in a designated range, and the self-adaptive anchor points of different categories are gathered outside the designated range.
Further, the method further comprises:
when an unknown action is identified, collecting action data of a specified number of the unknown actions as unknown action data samples;
and inputting the unknown action data sample into the action recognition model, calculating the average value of mapping vectors of the unknown action data sample in an embedding layer of the action recognition model by using the action recognition model, and taking coordinates corresponding to the average value as an adaptive anchor point of the unknown action type.
Further, before calculating, using the motion recognition model, euclidean distances between the mapping vectors of the motion data to be recognized and the respective adaptive anchor points in the motion recognition model, the method further comprises:
calculating rejection scores between the motion data to be identified and each adaptive anchor point in the motion recognition model;
and if the rejection score is larger than a preset score threshold, treating the motion data to be identified as a new unknown action; if the rejection score is smaller than the preset score threshold, calculating, using the motion recognition model, the Euclidean distances between the mapping vector of the motion data to be identified and each adaptive anchor point in the model.
Further, the rejection score is calculated according to the following formulas:

$c_i = \mathrm{Softmin}(d)_i = \dfrac{e^{-d_i}}{\sum_{k=1}^{N} e^{-d_k}}, \qquad s = \sum_{i=1}^{N} c_i \, d_i$

where $c_i$ represents the confidence that the mapping vector of the motion data to be identified is nearest to the $i$-th adaptive anchor, $d_i$ represents the Euclidean distance between the motion data to be identified and the $i$-th adaptive anchor in the motion recognition model, $k$ indexes the $k$-th adaptive anchor, $N$ represents the total number of adaptive anchors in the motion recognition model, $c$ represents the confidence, and $s$ represents the rejection score.
Further, the loss function of the motion recognition model is a Euclidean distance-based loss function.
In another aspect, the present disclosure provides an action recognition processing apparatus, comprising: a data acquisition module, configured to collect action data to be recognized;
the data input module is used for inputting the motion data to be identified into a motion identification model, the motion identification model is a neural network model, self-adaptive anchor points are arranged in an embedded layer of the motion identification model, and each self-adaptive anchor point corresponds to one category of motion;
and the action recognition module is used for calculating Euclidean distances between the mapping vector of the action data to be recognized and each self-adaptive anchor point in the action recognition model by using the action recognition model, and taking the action category corresponding to the self-adaptive anchor point with the smallest Euclidean distance as the action category of the action data to be recognized.
Further, the device also comprises a model training module for training to obtain the action recognition model by adopting the following method:
collecting known action sample data of known action categories;
initializing and setting a self-adaptive anchor point corresponding to a known action type according to the known action sample data, and setting model parameters of the action recognition model;
And taking the known action sample data as input data of the action recognition model, taking the known action category as output data, and performing model training on the action recognition model until the accuracy of the action recognition model reaches a preset threshold value or the training times reach preset times so as to adjust the initialized self-adaptive anchor points to target positions, so that the self-adaptive anchor points of the same action category are gathered in a designated range, and the self-adaptive anchor points of different categories are gathered outside the designated range.
Further, the apparatus further comprises a model training module for:
when an unknown action is identified, collecting action data of a specified number of the unknown actions as unknown action data samples;
and inputting the unknown action data sample into the action recognition model, calculating the average value of mapping vectors of the unknown action data sample in an embedding layer of the action recognition model by using the action recognition model, and taking coordinates corresponding to the average value as an adaptive anchor point of the unknown action type.
In another aspect, there is provided an electronic device including a processor and a memory, where the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the above-described action recognition processing method.
According to the motion recognition processing method, apparatus, and electronic device provided herein, an adaptive anchor point is added to the motion recognition model, and the motion to be recognized is identified by computing the distance between the motion data to be recognized and each adaptive anchor point in the model. The adaptive anchors achieve more accurate and more effective spatial clustering of each human motion, reducing the intra-class distance in the embedding layer and increasing the inter-class distance, which enhances the model's ability to recognize human motions and improves the accuracy and efficiency of motion recognition.
The foregoing and other objects, features and advantages will be apparent from the following more particular description of preferred embodiments, as illustrated in the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments herein or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments herein and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow diagram of a method of action recognition processing in one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an embedded layer structure of an action recognition model in one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of training and recognition principle of an action recognition model in different scenarios in one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the overall framework of an action recognition model in one embodiment of the present disclosure;
fig. 5 is a schematic diagram of a configuration of an action recognition processing device in one embodiment of the present specification;
fig. 6 shows a schematic structural diagram of an electronic device for an action recognition process provided in an embodiment herein.
Detailed Description
The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the disclosure. All other embodiments, based on the embodiments herein, which a person of ordinary skill in the art would obtain without undue burden, are within the scope of protection herein.
It should be noted that the terms "first," "second," and the like in the description and claims herein and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or device.
Along with the progress of science and technology, various products for assisting people's exercise have become popular. These products need to collect action data and recognize it in order to judge whether an action meets the requirements. Actions are generally recognized with an intelligent learning model: the model first learns from action data of known actions, and the trained model is then used to recognize people's action data. However, general motion recognition can only recognize the motions the model has learned and basically does not consider any others. In the practical problem of human motion recognition, human motion is continuous and not all application scenarios require the human body to keep the same motion, so conventional methods may make recognition errors when facing human motions that have not occurred before. For example:
When a human body switches from one action to another, a transition between the two actions appears in the middle. Such transition actions are usually neither learned nor classified into either of the two actions during model training; because previous methods basically target closed-set classification, the model classifies the transition action into whichever learned action has the highest similarity or confidence, and in practice the probability of misclassification is high. Alternatively, when the model is required to recognize new human actions after training is completed, it often must be retrained. Retraining means that a certain amount of training data has to be provided, and it takes time and cost to refit the model and obtain a new embedding space. For a sufficiently generalizable model, retraining wastes time and space resources.
In most human motion recognition scenarios, the model maps human motions into an embedding-layer space through learning. Adaptive anchor points are introduced, with different anchors representing different categories of motion, and the Euclidean distances between motions representing their similarity. In addition, the boundary between known and unknown human motions can be drawn by introducing a rejection-score mechanism. Meanwhile, using the trained embedding-layer space, the model can acquire the ability to recognize a new human action simply by setting an anchor point for it.
Fig. 1 is a flow chart of an action recognition processing method in an embodiment of the present disclosure. As shown in fig. 1, the method may be applied to servers, clients, and terminal devices such as computers, smartphones, smart wearable devices, and tablet computers, and includes the following steps:
step 102, collecting action data to be identified.
In a specific implementation, the motion data to be identified can be acquired through an image acquisition device or a smart wearable device. The motion data to be identified is the motion data that needs to be recognized; it may be pictures or data collected by sensors. In some embodiments of the present disclosure, a human motion recognition method based on wearable acceleration sensors may be adopted: after a participant wears acceleration sensors on key points of the body, acceleration data of the different nodes are collected in real time in the background, then sampled and cleaned to obtain the motion data to be identified. Of course, other acquisition methods can be adopted according to actual application requirements.
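The patent does not prescribe a concrete preprocessing routine; the following is a minimal sketch of one plausible way to clean a raw 3-axis accelerometer stream and slice it into fixed-length windows for a model. The window length, stride, and NaN-based cleaning rule are illustrative assumptions, not values from this patent.

```python
import numpy as np

def window_accel(stream, win_len=128, stride=64):
    """Clean a raw accelerometer stream of shape (T, 3) and slice it into
    overlapping windows shaped (n_windows, 3, win_len) for the model.
    win_len and stride are illustrative choices, not patent values."""
    stream = stream[~np.isnan(stream).any(axis=1)]   # drop incomplete readings
    starts = range(0, len(stream) - win_len + 1, stride)
    # transpose each window to (channels, time) for a Conv1d-style model
    return np.stack([stream[s:s + win_len].T for s in starts])
```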
Step 104, inputting the motion data to be identified into a motion identification model, wherein the motion identification model is a neural network model, and an adaptive anchor point is arranged in an embedded layer of the motion identification model, and each adaptive anchor point corresponds to one type of motion.
In a specific implementation, the motion data to be identified can be input into a pre-trained motion recognition model, and the model is used to recognize the motion. The motion recognition model in the embodiments of the present description may be trained on historical motion data and may be a neural network model such as a CNN (Convolutional Neural Network) or an LSTM (Long Short-Term Memory) network. Fig. 2 is a schematic diagram of the embedding-layer structure of the motion recognition model in an embodiment of the present disclosure. As shown in fig. 2, the model may include an embedding layer mainly composed of a convolution layer with its activation function, a pooling layer, an LSTM block, and a fully connected layer; different combinations yield different embedding-layer structures, and the commonly used Hybrid embedding-layer model combining the convolution layer, LSTM block, and fully connected layer achieves a good recognition effect. In addition, in the embodiments of the present disclosure, adaptive anchor points are set in the embedding layer, each corresponding to one category of action. During model training, each adaptive anchor may first be initialized to one row of an orthogonal matrix, so that the correlation between the adaptive anchors is minimal; the anchors are then allowed to change as the model trains, so their positions can be adjusted through the model's loss function and back-propagation, thereby classifying actions of different categories.
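As a concrete illustration of this structure, the following PyTorch sketch builds one possible Hybrid embedding layer (convolution with activation, pooling, LSTM, fully connected layer) with one learnable anchor per known class, initialized as rows of an orthogonal matrix. All layer sizes, the kernel size, and the class count are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class HybridEmbedding(nn.Module):
    """Hybrid embedding layer: Conv1d + activation -> pooling -> LSTM -> FC,
    plus learnable adaptive anchors (one per known action category)."""
    def __init__(self, in_channels=3, embed_dim=64, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, embed_dim)
        # Initialize anchors as rows of an orthogonal matrix so their mutual
        # correlation is minimal; as nn.Parameters they are moved during
        # training by the loss function and back-propagation.
        anchors = torch.empty(num_classes, embed_dim)
        nn.init.orthogonal_(anchors)
        self.anchors = nn.Parameter(anchors)

    def forward(self, x):                    # x: (batch, channels, time)
        h = self.features(x)                 # (batch, 32, time // 2)
        h, _ = self.lstm(h.transpose(1, 2))  # (batch, time // 2, 64)
        return self.fc(h[:, -1])             # mapping vector: (batch, embed_dim)
```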
In some embodiments of the present disclosure, the training method of the motion recognition model includes:
collecting known action sample data of known action categories;
initializing and setting a self-adaptive anchor point corresponding to a known action type according to the known action sample data, and setting model parameters of the action recognition model;
and taking the known action sample data as input data of the action recognition model, taking the known action category as output data, and performing model training on the action recognition model until the accuracy of the action recognition model reaches a preset threshold value or the training times reach preset times so as to adjust the initialized self-adaptive anchor points to target positions, so that the self-adaptive anchor points of the same action category are gathered in a designated range, and the self-adaptive anchor points of different categories are gathered outside the designated range.
In a specific implementation process, motion data of a known motion class can be collected as known motion sample data, and then an adaptive anchor point corresponding to the known motion class is initialized and set according to the known motion sample data, for example: initializing each adaptive anchor point to one row in an orthogonal matrix and the like according to known action sample data, and setting model parameters of the action recognition model such as: parameters in the loss function, structural parameters of the embedded layer, etc. And taking the known action sample data as input data of the action recognition model, taking the known action types corresponding to the known action sample data as output data of the model, and performing model training on the action recognition model until the accuracy of the model reaches a preset threshold value or the training times meet preset times, and the like, thereby completing the model training. In the model training process, the initialized self-adaptive anchor points can be used for adjusting positions along with model training, so that the self-adaptive anchor points of the same action category are gathered into a specified range, and the self-adaptive anchor points of different categories are gathered out of the specified range.
Besides the convolution layer, LSTM block, and fully connected layer used for feature extraction and matrix operations in the embedding-layer model, the motion recognition model in the embodiments of the present description applies anchor points related to metric learning. Metric learning maximizes the inter-class distance and minimizes the intra-class distance by learning loss functions that depend on some distance function, such as the Euclidean, cosine, or Mahalanobis distance. This lets metric learning build an embedding-layer space that follows the semantic logic of the labels when solving classification problems. As shown in Fig. 2, the embedding layer can be regarded as a black-box mapping from input vectors to output vectors, and the recognition and classification of human motion depend on the mapping vector output by this black box. To train an embedding-layer model that cleanly separates known human actions from unknown ones, the model is trained in two main directions: first, the embedding layer should gather human actions of the same category as much as possible, i.e., reduce the intra-class distance; second, it should keep different categories of human actions away from each other, i.e., increase the inter-class distance. Training of the embedding model in the embodiments of the present description introduces adaptive anchor points: centers set for the known human actions, which exist in the embedding space as learnable points, each representing a different category of action; as the model trains, the distances between them come to reflect the similarity between the corresponding actions. During training, the model back-propagates with a Euclidean-distance-based loss function and adjusts the weights of each layer; the loss value computed by the loss function is directly proportional to the Euclidean distance between a human action and its own-class anchor and inversely proportional to the Euclidean distances to anchors of other classes. Training the model toward a smaller loss value therefore maps human actions closer to their same-class anchor and farther from anchors of other classes, while the anchors themselves are adjusted to more suitable spatial positions.
In the embodiment of the present disclosure, the loss function of the motion recognition model may be a loss function based on the euclidean distance, such as: class center clustering Loss (class center cluster loss function):
$L_A(x, y) = d_y = \|f(x) - C_y\|_2$

$L_T(x, y) = \log\Big(1 + \sum_{j \neq y} e^{\,d_y - d_j}\Big)$

$L_{CAC}(x, y) = L_T(x, y) + \lambda L_A(x, y)$

In the above formulas, $x$ represents the input motion data, $y$ the output motion category, $f(x)$ the mapping vector of the input motion data, $C$ an anchor, $C_y$ the anchor of action $y$, $d_y$ the Euclidean distance between the mapping vector of the input data and the anchor of category $y$, $\lambda$ a weight parameter between the two loss terms, and $M$ the dimension of the anchors, i.e., the anchor coordinate dimension. $L_A$ is the Anchor Loss, representing the distance between the mapping vector and the correct anchor; $L_T$ is a tuplet Loss, negatively correlated with the sum of the distances between the mapping vector and the other anchors. $L_{CAC}$, the class center clustering Loss, is a weighted combination of the two: it is positively correlated with the distance between the mapping vector and the correct anchor and negatively correlated with the sum of the distances between the mapping vector and the other anchors.
The summation in the above formula computes, for each other anchor, the difference between that anchor's distance to the mapping vector and the nearest (correct-anchor) distance. As the model trains, the loss $L_T$ and its inner sum become smaller; as $L_T$ shrinks, each $d_y - d_j$ shrinks, i.e., $d_j$ grows, so the mapping vector moves farther from the other anchors.
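A minimal sketch of this loss, assuming the log-sum-exp form of $L_T$ reconstructed above and the `HybridEmbedding` sketch introduced earlier, might look as follows (`z` is the batch of mapping vectors, `anchors` the anchor matrix, `y` the class labels):

```python
import torch
import torch.nn as nn

class ClassCenterClusteringLoss(nn.Module):
    """L_CAC = L_T + lambda * L_A, averaged over a batch of mapping vectors."""
    def __init__(self, lam=0.1):
        super().__init__()
        self.lam = lam

    def forward(self, z, anchors, y):
        d = torch.cdist(z, anchors)              # (batch, N) Euclidean distances
        d_y = d.gather(1, y.unsqueeze(1))        # distance to the correct anchor
        # L_T: sum over the other anchors of exp(d_y - d_j); shrinking it
        # pushes d_j above d_y, i.e. away from the wrong anchors.
        diff = d_y - d                           # (batch, N)
        diff = diff.scatter(1, y.unsqueeze(1), float('-inf'))  # drop j == y
        l_t = torch.log1p(torch.exp(diff).sum(dim=1))
        l_a = d_y.squeeze(1)                     # pull toward the correct anchor
        return (l_t + self.lam * l_a).mean()
```

A training step would then compute `loss = loss_fn(model(x), model.anchors, y)` and call `loss.backward()` followed by an optimizer step; because the anchors are parameters, back-propagation updates both the network weights and the anchor positions.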
Of course, other loss functions based on the Euclidean or Mahalanobis distance can be used according to actual needs, provided the adaptive anchor points are retained. Such loss functions have similar effects in reducing the intra-class distance, but may be less effective at increasing the inter-class distance and at handling the open-set problem.
In addition, in the embodiments of the present disclosure, the anchor dimension is set equal to the dimension of the embedding layer's final output, and the adaptive anchors are initialized from a diagonal or orthogonal matrix so that the correlation between them is as small as possible. The embodiments of the present disclosure adopt a high-dimensional embedding layer. In common human motion recognition classification models, the final output dimension of the model often equals the number of human actions the model can recognize. The common classification problem is usually a closed-set one, i.e., the labels seen when testing the model are the same labels learned during training. Real-world classification problems are often not like this: they frequently face many labels never learned during training, and are referred to as open-set classification problems. The open-set classification problem has high research value in human motion recognition applications, such as judging suspected dangerous human actions and assisting with human action standards.
In the closed-set human motion recognition and classification problem, the occurrence of unknown human actions need not be considered, so the model can recognize the input human motion as the learned motion with the highest confidence; when the dimension of the model's output layer equals the number of human action categories, the embedding-layer vector space can provide enough mutually orthogonal positions for the anchors to solve the closed-set classification problem. In the open-set problem, however, the number of human actions involved is not fixed, so the training goal of the embedding layer changes from distinguishing each human action as far as possible to mapping each known human action as reasonably as possible while leaving spatial positions for unknown human actions that may exist. This requires the embedding layer to output a higher-dimensional vector space: a higher-dimensional embedding layer can offer more candidate anchor positions for unknown human actions, so the trained embedding-layer vector space has better universality. The specific number of dimensions may be set according to actual usage requirements and is not specifically limited in the embodiments of the present disclosure.
And 106, calculating Euclidean distances between the mapping vector of the motion data to be identified and each self-adaptive anchor point in the motion identification model by using the motion identification model, and taking the motion category corresponding to the self-adaptive anchor point with the smallest Euclidean distance as the motion category of the motion data to be identified.
In a specific implementation, after the motion to be identified is input into the motion recognition model, the model can be used to calculate the Euclidean distance between the mapping vector of the motion data to be identified and each adaptive anchor in the model, and the action category of the adaptive anchor with the smallest Euclidean distance is taken as the action category of the motion data. The smaller the Euclidean distance, the closer the motion data to be identified is to that adaptive anchor, and thus to the action category the anchor represents.
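As a sketch of this step (continuing the assumed `HybridEmbedding` above), nearest-anchor classification is one distance computation and an argmin:

```python
import torch

@torch.no_grad()
def classify(model, x):
    """Return the index of the adaptive anchor nearest to each mapping
    vector; that anchor's action category is the recognition result."""
    z = model(x)                            # mapping vectors: (batch, embed_dim)
    d = torch.cdist(z, model.anchors)       # Euclidean distances: (batch, N)
    return d.argmin(dim=1)                  # smallest distance -> predicted class
```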
According to the action recognition processing method provided by the embodiment of the specification, the Class center clustering Loss or other Euclidean distance-related loss function is utilized to train the embedded layer space of the action recognition model, the self-adaptive anchor point is introduced, more accurate and more effective spatial clustering is realized for each human action by utilizing the self-adaptive anchor point, the intra-class distance of the embedded layer is reduced, the inter-class distance is increased, and the recognition capability of the model on the human action is enhanced.
In some embodiments of the present description, the method further comprises:
when an unknown action is identified, collecting action data of a specified number of the unknown actions as unknown action data samples;
And inputting the unknown action data sample into the action recognition model, calculating the average value of mapping vectors of the unknown action data sample in an embedding layer of the action recognition model by using the action recognition model, and taking coordinates corresponding to the average value as an adaptive anchor point of the unknown action type.
In a specific implementation, after training of the motion recognition model is completed, the embedding layer is highly universal for human motion recognition; this universality derives from the training of the embedding-layer vector space. An embedding-layer space trained with a Euclidean-distance-related loss function arranges different human actions in the vector space according to their mutual correlation, and the distances between adaptive anchors represent the correlation between the corresponding human actions. When the model is expected to quickly learn an unknown human action and gain the ability to recognize it, a very small number of data samples of that action can, based on this strong universality, be input into the model, and the center of the samples' mapping vectors used as the anchor of the new human action, so that the model quickly gains the ability to recognize it.
A specified small number of action data samples of the unknown action, e.g., 10 or 50, can be collected as unknown action data samples and input into the trained motion recognition model. The model calculates the average of the mapping vectors of these samples in its embedding layer, and the coordinates corresponding to that average in the embedding layer are taken as the adaptive anchor of the new action category. The whole process needs neither a large amount of sample data nor retraining of the model, improving the efficiency of learning new actions and the adaptability of the model.
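A sketch of this registration step under the same assumptions: the new anchor is simply the mean of the few unknown-action mapping vectors, appended to the anchor matrix without any retraining.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def register_unknown_action(model, samples):
    """Register a new action category from a small batch of samples
    (e.g. 10-50): the mean of their mapping vectors becomes its anchor."""
    z = model(samples)                                  # (k, embed_dim)
    new_anchor = z.mean(dim=0, keepdim=True)            # (1, embed_dim)
    model.anchors = nn.Parameter(
        torch.cat([model.anchors.detach(), new_anchor], dim=0))
    return model.anchors.shape[0] - 1                   # index of the new category
```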
In some further embodiments of the present disclosure, before calculating, using the motion recognition model, euclidean distances between the mapping vectors of the motion data to be recognized and the respective adaptive anchor points in the motion recognition model, the method further comprises:
calculating rejection scores between the motion data to be identified and each adaptive anchor point in the motion recognition model;
and if the rejection score is larger than a preset score threshold, treating the motion data to be identified as a new unknown action; if the rejection score is smaller than the preset score threshold, calculating, using the motion recognition model, the Euclidean distances between the mapping vector of the motion data to be identified and each adaptive anchor point in the model.
In a specific implementation, when recognizing the motion data to be identified, the rejection scores between the motion data and each adaptive anchor in the motion recognition model can be computed first. The rejection score determines whether the motion data is far from all adaptive anchors: if it exceeds a preset score threshold, the motion data is far from all anchors and belongs to a new unknown action; otherwise the motion data is considered to belong to a known action category, the Euclidean distance between the motion data and each adaptive anchor is computed, and the action category of the anchor with the smallest Euclidean distance is selected as the category of the motion data.
The rejection score may be understood as a comprehensive distance, representing the overall relationship between the mapping vector and all the anchors. When unknown human action data samples cannot be obtained, the model can still recognize unknown human actions by mining the information carried by the vector space: unknown human actions tend to lie far from the anchors of all known human actions, so introducing the rejection score makes it possible to compute the comprehensive distance between a human action sample and each anchor, draw the boundary between known and unknown human actions, and thereby distinguish whether the motion data to be identified is a new unknown action, realizing recognition of unknown actions.
In some embodiments of the present disclosure, the rejection score is calculated according to the following formulas:

$c_i = \mathrm{Softmin}(d)_i = \dfrac{e^{-d_i}}{\sum_{k=1}^{N} e^{-d_k}}, \qquad s = c \cdot d = \sum_{i=1}^{N} c_i \, d_i$

In the above formulas, $c_i$ represents the confidence that the mapping vector of the motion data to be identified is nearest to the $i$-th adaptive anchor, i.e., the result of mapping $d_i$ through the Softmin function; $d_i$ represents the Euclidean distance between the motion data to be identified and the $i$-th adaptive anchor in the motion recognition model; $k$ indexes the $k$-th adaptive anchor; $N$ represents the total number of adaptive anchors in the model; $c$ represents the confidence; and $s$ represents the rejection score. $d = [d_1, d_2, \ldots, d_N]$ is the set of distances from the mapping vector to each adaptive anchor, and $c$ is correspondingly the set of confidences that each adaptive anchor is the nearest anchor to the mapping vector. The dot operation on the two sets multiplies elements at the same position and finally sums the products.

Here $\mathrm{Softmin}(x) = \mathrm{Softmax}(-x)$. Both Softmax and Softmin can be regarded as normalized probability functions: they scale the elements of the set $x$ so that the scaled elements sum to 1. For example, [2, 3, 4] is scaled by Softmax to [0.0903, 0.2447, 0.665]. Softmax scales large elements to large confidences, while Softmin, conversely, scales small elements to large confidences; e.g., Softmin scales [2, 3, 4], equivalently Softmax scales [-2, -3, -4], to [0.665, 0.2447, 0.0903]. For distances, the smaller the distance, the greater the confidence, and the greater the probability that the anchor is the nearest one. The numerator of $c_i$ involves only the $i$-th term, while the denominator traverses the whole set and sums, so $c_i$ can be intuitively interpreted as the normalized probability, i.e., the confidence, of the event that the mapping vector is closest to the $i$-th adaptive anchor.
By introducing the rejection score, the embodiments of the present description gain the ability to measure the comprehensive distance from a mapping vector to all anchors. The model first feeds the distance vector into the Softmin function, mapping larger Euclidean distances to smaller confidences; but Softmin discards the magnitude information of the Euclidean distances, so the confidences are multiplied by the distances again to avoid this information loss.
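A sketch of the rejection test under the same assumptions follows; the score threshold is a tuning parameter chosen on validation data, not a value given in the patent.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rejection_score(model, x):
    """s = sum_i Softmin(d)_i * d_i: Softmin turns small distances into
    large confidences, and re-multiplying by d restores the distance
    magnitude that Softmin alone would discard."""
    d = torch.cdist(model(x), model.anchors)   # (batch, N)
    c = F.softmin(d, dim=1)                    # confidence per anchor
    return (c * d).sum(dim=1)                  # (batch,) rejection scores

# judged unknown when the score exceeds a preset threshold:
# is_unknown = rejection_score(model, x) > score_threshold
```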
Fig. 3 is a schematic diagram of the training and recognition principles of the motion recognition model in different scenarios in an embodiment of the present disclosure. As shown in fig. 3, motion recognition can be divided into three main scenarios: the closed-set scenario, the pseudo-open-set scenario, and the full-open-set scenario; the latter two differ mainly in how an unknown new human action is recognized. One is a scenario in which a certain number of unknown-action data samples can be collected and used, which we call the pseudo-open-set scenario; the other, in which the model must judge whether an action is unknown without the support of external unknown-action data samples, is called the full-open-set scenario. To solve the pseudo-open-set and full-open-set problems, an embedding-layer space with good universality and robustness must be trained, so that the distances between mapping vectors of different human actions in that space match real-world logic; for example, the Euclidean distance between one walking sample and another should be smaller than that between walking and rope skipping.
As shown in fig. 3, motion recognition can be divided into a model training stage and a model testing stage. For the closed-set problem, the purpose of training is to gather motions around their same-class anchors: as shown in the upper-left corner of fig. 3, after training, standing motion data and sitting motion data map to tighter clusters in the model space than the clusters formed at initialization. At recognition time, the action category of the anchor nearest to the mapping vector of the input motion data is taken as the recognition result, as shown in the upper-right corner of fig. 3.
For the pseudo-open-set problem, the model is trained only on known-action training data to update its parameters; data of the action treated as unknown (outside the training set) is then input directly into the model and mapped to embedding-layer vectors, and those vectors are averaged to obtain a new anchor, as shown in the lower-left corner of fig. 3. Specifically, after model training is completed, a small amount of data of the unknown human action category is collected, input into the model, and mapped into the embedding-layer space; the mapped vectors are averaged to obtain their center vector, which is used as the anchor of the new human action. In subsequent human motion classification, the distances between a human motion sample's mapping vector and each anchor are compared, and the human action of the nearest anchor is the category of the motion data, as shown in the middle right part of fig. 3.
For the full-open-set problem, after model training is completed, the rejection scores of the mapping vectors of known- and unknown-category human motion samples are calculated. If the rejection score exceeds a preset score threshold, the sample is far from all known human actions, its similarity and correlation with them are small, and it can be judged to be an unknown human action. The rejection score measures the combined distance between the mapped vector and all anchors; a suitable threshold chosen through repeated tests draws the boundary between known and unknown actions, and a human motion sample whose distance from the known human actions far exceeds the preset score threshold is judged to be an unknown action, as shown in the lower-right corner of fig. 3.
FIG. 4 is a schematic diagram of the overall framework of the motion recognition model in one embodiment of the present disclosure. As shown in fig. 4, motion data, such as acceleration data collected by a wearable device, first passes through the embedding layer for feature extraction; the embedding layer can adopt various deep-learning network structures such as MLP, CNN, LSTM, and Hybrid. The human motion data is mapped by the embedding layer to a mapping vector, and the distance between this vector and each anchor can be calculated with the L2 norm. When the anchors are initialized, the correlation between them is reduced as much as possible, for example by initializing them from a diagonal or orthogonal matrix. The human action category of the original data can then be judged from the computed distances between the vector and each anchor.
By introducing the rejection score, the model gains the ability to measure the comprehensive distance from a vector to all anchors. The model first feeds the distance vector into the Softmin function, mapping larger Euclidean distances to smaller confidences; because Softmin discards the magnitude information of the Euclidean distances, the confidences are multiplied by the distances again to avoid this information loss. When the rejection score of a vector exceeds the threshold, the model judges the motion to be an unknown human action; otherwise the model considers it to belong to one of the known human actions, and recognition then proceeds as in traditional closed-set recognition: the human action represented by the anchor with the smallest Euclidean distance to the vector is the model's prediction.
According to the motion recognition processing method provided by the embodiments of the present specification, the embedding-layer space is trained with the class center clustering Loss or another Euclidean-distance-related loss function, and the adaptive anchors cluster each human motion in space more accurately and effectively, reducing the intra-class distance of the embedding layer, increasing the inter-class distance, and enhancing the model's ability to recognize human motions. In addition, recognition and judgment of unknown human actions are introduced, widening the application scenarios of the technique. Moreover, unlike methods that train the embedding-layer space with cosine-similarity-related loss functions such as ArcFace Loss, human actions here are mapped into Euclidean space, preserving the length information of the vectors; Euclidean space makes anchor placement more convenient, and with the high-dimensional embedding layer introduced, the method achieves better classification performance and universality when recognizing unknown human actions in the open-set problem. Furthermore, in scenarios such as the pseudo-open-set problem and fast learning and registration of unknown actions, the method relies on fewer unknown-action samples while achieving better classification accuracy.
On the other hand, based on the content of the above embodiment, the present disclosure further provides an action recognition processing apparatus, and fig. 5 is a schematic structural diagram of the action recognition processing apparatus in one embodiment of the present disclosure, as shown in fig. 5, where the apparatus includes:
the data acquisition module 501 is used for acquiring action data to be identified;
the data input module 502 is configured to input the motion data to be identified into a motion identification model, where the motion identification model is a neural network model, and an adaptive anchor point is disposed in an embedded layer of the motion identification model, and each adaptive anchor point corresponds to a type of motion;
and the motion recognition module 503 is configured to calculate, using the motion recognition model, euclidean distances between the mapping vector of the motion data to be recognized and each adaptive anchor point in the motion recognition model, and use a motion class corresponding to the adaptive anchor point with the smallest euclidean distance as the motion class of the motion data to be recognized.
In addition, in some embodiments of the present disclosure, the apparatus further includes a model training module configured to train to obtain the motion recognition model by:
collecting known action sample data of known action categories;
Initializing and setting a self-adaptive anchor point corresponding to a known action type according to the known action sample data, and setting model parameters of the action recognition model;
and taking the known action sample data as input data of the action recognition model, taking the known action category as output data, and performing model training on the action recognition model until the accuracy of the action recognition model reaches a preset threshold value or the training times reach preset times so as to adjust the initialized self-adaptive anchor points to target positions, so that the self-adaptive anchor points of the same action category are gathered in a designated range, and the self-adaptive anchor points of different categories are gathered outside the designated range.
In some embodiments of the present description, the apparatus further comprises a model training module for:
when an unknown action is identified, collecting action data of a specified number of the unknown actions as unknown action data samples;
and inputting the unknown action data sample into the action recognition model, calculating the average value of mapping vectors of the unknown action data sample in an embedding layer of the action recognition model by using the action recognition model, and taking coordinates corresponding to the average value as an adaptive anchor point of the unknown action type.
The embodiments of the device portion may also have other embodiments with reference to the embodiments of the method portion, which are not described herein in detail.
In another aspect, embodiments of the present disclosure provide a computer readable storage medium having at least one instruction or at least one program stored therein, where the at least one instruction or at least one program is loaded and executed by a processor to implement the action recognition processing method as described above.
In yet another aspect, an embodiment of the present disclosure provides an electronic device for action recognition processing, and fig. 6 is a schematic structural diagram of an electronic device for action recognition processing provided in the embodiment of the present disclosure, where, as shown in fig. 6, the device includes a processor, a memory, a communication interface, and a bus, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or at least one program is loaded and executed by the processor to implement a method for action recognition processing according to any one of the foregoing embodiments.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be cross-referenced. The method provided by the embodiments of the invention has the same implementation principle and technical effects as the system embodiments; for brevity, where the method embodiments are silent, refer to the corresponding content of the system embodiments.
It should be understood that, in the various embodiments herein, the sequence number of each process described above does not mean the sequence of execution, and the execution sequence of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments herein.
It should also be understood that in embodiments herein, the term "and/or" is merely one relationship that describes an associated object, meaning that three relationships may exist. For example, a and/or B may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided herein, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments herein.
In addition, each functional unit in the embodiments herein may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions herein, in essence, or the portions contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments herein. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Specific examples are set forth herein to illustrate the principles and implementations of this document; the description of the above embodiments is intended only to aid understanding of the methods herein and their core ideas. At the same time, those of ordinary skill in the art may, based on the ideas herein, make changes to the specific implementations and to the scope of application; accordingly, nothing in this specification should be construed as a limitation on the invention.

Claims (7)

1. A method of action recognition processing, the method comprising:
collecting action data to be identified;
inputting the action data to be identified into an action recognition model, wherein the action recognition model is a neural network model, adaptive anchor points are arranged in an embedding layer of the action recognition model, and each adaptive anchor point corresponds to one action category;
the training method of the action recognition model comprises: collecting known action sample data of known action categories; initializing the adaptive anchor points corresponding to the known action categories according to the known action sample data, and setting model parameters of the action recognition model, the model parameters comprising parameters of the loss function and structural parameters of the embedding layer; initializing each adaptive anchor point to a row of an orthogonal matrix according to the known action sample data, so as to reduce the mutual correlation among the adaptive anchor points; and taking the known action sample data as input data of the action recognition model and the known action categories as output data, training the action recognition model until the accuracy of the action recognition model reaches a preset threshold or the number of training iterations reaches a preset number, so that the initialized adaptive anchor points are adjusted to target positions, whereby adaptive anchor points of the same action category gather within a designated range and adaptive anchor points of different categories remain outside the designated range;
calculating a rejection score between the action data to be identified and the adaptive anchor points in the action recognition model; if the rejection score is greater than a preset score threshold, treating the action data to be identified as a new, unknown action; if the rejection score is less than the preset score threshold, using the action recognition model to calculate the Euclidean distance between the mapping vector of the action data to be identified and each adaptive anchor point in the action recognition model; the rejection score is calculated according to the following formula:
c_i = exp(-d_i) / Σ_{k=1}^{N} exp(-d_k),    s = c · d = Σ_{i=1}^{N} c_i · d_i

where c_i denotes the confidence that the mapping vector of the action data to be identified is nearest to the i-th adaptive anchor point; d_i denotes the Euclidean distance between the mapping vector of the action data to be identified and the i-th adaptive anchor point in the action recognition model; k indexes the adaptive anchor points; N denotes the total number of adaptive anchor points in the action recognition model; c denotes the confidence vector [c_1, c_2, ..., c_N]; d denotes the set of distances from the mapping vector to the respective adaptive anchor points, [d_1, d_2, ..., d_N]; s denotes the rejection score; and "·" denotes the point multiplication operation, in which elements at the same position in the two sets are multiplied and the products are summed;
and calculating, by using the action recognition model, the Euclidean distance between the mapping vector of the action data to be identified and each adaptive anchor point in the action recognition model, and taking the action category corresponding to the adaptive anchor point with the smallest Euclidean distance as the action category of the action data to be identified.
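To make the procedure of claim 1 concrete, a minimal NumPy sketch follows. It is illustrative only, not the patented implementation: the names init_adaptive_anchors, rejection_score, and classify are hypothetical, the QR decomposition is just one way to obtain orthogonal rows, and the softmin form of the confidences c_i is an assumption; the claim itself fixes only the orthogonal initialization, the Euclidean distances, and the summed element-wise product s = c · d.

import numpy as np

def init_adaptive_anchors(num_classes: int, embed_dim: int) -> np.ndarray:
    # One adaptive anchor per known action category, each a row of an
    # orthogonal matrix so the anchors start out mutually uncorrelated.
    assert num_classes <= embed_dim
    gauss = np.random.default_rng(0).normal(size=(embed_dim, embed_dim))
    q, _ = np.linalg.qr(gauss)          # q is orthogonal; its rows are orthonormal
    return q[:num_classes]

def rejection_score(embedding: np.ndarray, anchors: np.ndarray) -> float:
    # s = sum_i c_i * d_i: confidences (assumed softmin over distances)
    # weighted by the Euclidean distances to the anchors.
    d = np.linalg.norm(anchors - embedding, axis=1)
    c = np.exp(-d) / np.exp(-d).sum()
    return float(np.dot(c, d))

def classify(embedding: np.ndarray, anchors: np.ndarray, threshold: float):
    # Reject as an unknown action, or return the nearest anchor's class index.
    s = rejection_score(embedding, anchors)
    if s > threshold:
        return None, s
    d = np.linalg.norm(anchors - embedding, axis=1)
    return int(np.argmin(d)), s

With, say, anchors = init_adaptive_anchors(5, 64), an embedding far from all five anchors yields a large s and is rejected as a new action, while an embedding close to one anchor yields a small s and is assigned that anchor's category.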
2. The action recognition processing method according to claim 1, characterized in that the method further comprises:
when an unknown action is identified, collecting a specified number of action data samples of the unknown action as unknown action data samples;
and inputting the unknown action data samples into the action recognition model, calculating, by using the action recognition model, the mean of the mapping vectors of the unknown action data samples in the embedding layer of the action recognition model, and taking the coordinates corresponding to the mean as the adaptive anchor point of the unknown action category.
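The anchor-registration step of claim 2 can be sketched in the same vein; register_unknown_action is a hypothetical name, and embed stands in for the trained model's embedding layer:

import numpy as np

def register_unknown_action(samples, embed, anchors: np.ndarray) -> np.ndarray:
    # Mapping vectors of the collected unknown-action samples in the
    # embedding layer; their mean becomes the new category's anchor.
    vectors = np.stack([embed(x) for x in samples])
    new_anchor = vectors.mean(axis=0)
    return np.vstack([anchors, new_anchor])   # the model now covers N + 1 categories

This incremental step is what lets the model absorb a new action category without retraining from scratch: only the anchor set grows.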
3. The action recognition processing method according to claim 1, wherein the loss function of the action recognition model is a Euclidean-distance-based loss function.
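Claim 3 fixes only that the loss is built on Euclidean distance. One plausible reading, sketched below under that assumption, pulls each training embedding toward the adaptive anchor of its labelled category and pushes it away from the other anchors; the squared pull term and the margin are illustrative choices, not the patent's definition:

import numpy as np

def anchor_loss(embedding: np.ndarray, label: int, anchors: np.ndarray,
                margin: float = 1.0) -> float:
    # Distance from the embedding to every adaptive anchor.
    d = np.linalg.norm(anchors - embedding, axis=1)
    pull = d[label] ** 2                           # attract to the labelled anchor
    others = np.delete(d, label)
    push = np.maximum(0.0, margin - others).sum()  # repel anchors of other categories
    return float(pull + push)

Minimizing such a loss over the training set is what drives the behaviour recited in claim 1: mapping vectors of one category gather within a designated range of their anchor, while other categories stay outside it.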
4. An action recognition processing apparatus, the apparatus comprising:
the data acquisition module is used for acquiring action data to be identified;
the data input module is used for inputting the action data to be identified into an action recognition model, wherein the action recognition model is a neural network model, adaptive anchor points are arranged in an embedding layer of the action recognition model, and each adaptive anchor point corresponds to one action category;
the model training module is used for training the action recognition model, the training comprising: collecting known action sample data of known action categories; initializing the adaptive anchor points corresponding to the known action categories according to the known action sample data, and setting model parameters of the action recognition model, the model parameters comprising parameters of the loss function and structural parameters of the embedding layer; initializing each adaptive anchor point to a row of an orthogonal matrix according to the known action sample data, so as to reduce the mutual correlation among the adaptive anchor points; and taking the known action sample data as input data of the action recognition model and the known action categories as output data, training the action recognition model until the accuracy of the action recognition model reaches a preset threshold or the number of training iterations reaches a preset number, so that the initialized adaptive anchor points are adjusted to target positions, whereby adaptive anchor points of the same action category gather within a designated range and adaptive anchor points of different categories remain outside the designated range;
the rejection score judging and calculating module is used for calculating a rejection score between the action data to be identified and the adaptive anchor points in the action recognition model; if the rejection score is greater than a preset score threshold, the action data to be identified is treated as a new, unknown action; if the rejection score is less than the preset score threshold, the action recognition model is used to calculate the Euclidean distance between the mapping vector of the action data to be identified and each adaptive anchor point in the action recognition model; the rejection score is calculated according to the following formula:
c_i = exp(-d_i) / Σ_{k=1}^{N} exp(-d_k),    s = c · d = Σ_{i=1}^{N} c_i · d_i

where c_i denotes the confidence that the mapping vector of the action data to be identified is nearest to the i-th adaptive anchor point; d_i denotes the Euclidean distance between the mapping vector of the action data to be identified and the i-th adaptive anchor point in the action recognition model; k indexes the adaptive anchor points; N denotes the total number of adaptive anchor points in the action recognition model; c denotes the confidence vector [c_1, c_2, ..., c_N]; d denotes the set of distances from the mapping vector to the respective adaptive anchor points, [d_1, d_2, ..., d_N]; s denotes the rejection score; and "·" denotes the point multiplication operation, in which elements at the same position in the two sets are multiplied and the products are summed;
and the action recognition module is used for calculating, by using the action recognition model, the Euclidean distance between the mapping vector of the action data to be identified and each adaptive anchor point in the action recognition model, and taking the action category corresponding to the adaptive anchor point with the smallest Euclidean distance as the action category of the action data to be identified.
5. The action recognition processing apparatus according to claim 4, further comprising a model training module for obtaining the action recognition model by training with the following method:
collecting known action sample data of known action categories;
initializing the adaptive anchor points corresponding to the known action categories according to the known action sample data, and setting model parameters of the action recognition model;
and taking the known action sample data as input data of the action recognition model and the known action categories as output data, training the action recognition model until the accuracy of the action recognition model reaches a preset threshold or the number of training iterations reaches a preset number, so that the initialized adaptive anchor points are adjusted to target positions, whereby adaptive anchor points of the same action category gather within a designated range and adaptive anchor points of different categories remain outside the designated range.
6. The action recognition processing apparatus according to claim 4, further comprising a model training module configured to:
when an unknown action is identified, collect a specified number of action data samples of the unknown action as unknown action data samples;
and input the unknown action data samples into the action recognition model, calculate, by using the action recognition model, the mean of the mapping vectors of the unknown action data samples in the embedding layer of the action recognition model, and take the coordinates corresponding to the mean as the adaptive anchor point of the unknown action category.
7. An electronic device comprising a processor and a memory, wherein at least one instruction, at least one program, a code set, or an instruction set is stored in the memory and is loaded and executed by the processor to implement the method of any one of claims 1 to 3.
CN202111291956.8A 2021-11-02 2021-11-02 Action recognition processing method, device and equipment Active CN114067429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111291956.8A CN114067429B (en) 2021-11-02 2021-11-02 Action recognition processing method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111291956.8A CN114067429B (en) 2021-11-02 2021-11-02 Action recognition processing method, device and equipment

Publications (2)

Publication Number Publication Date
CN114067429A CN114067429A (en) 2022-02-18
CN114067429B true CN114067429B (en) 2023-08-29

Family

ID=80236587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111291956.8A Active CN114067429B (en) 2021-11-02 2021-11-02 Action recognition processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN114067429B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9135554B2 (en) * 2012-03-23 2015-09-15 Irobot Corporation Robot controller learning system
US20210287382A1 (en) * 2020-03-13 2021-09-16 Magic Leap, Inc. Systems and methods for multi-user virtual and augmented reality

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027442A (en) * 2019-12-03 2020-04-17 腾讯科技(深圳)有限公司 Model training method, recognition method, device and medium for pedestrian re-recognition
CN111834004A (en) * 2020-05-25 2020-10-27 杭州深睿博联科技有限公司 Unknown disease category identification method and device based on centralized space learning
CN113158032A (en) * 2021-03-18 2021-07-23 北京京东乾石科技有限公司 Information pushing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Han Tingting. Fine-grained representation and application of human actions in video understanding. China Doctoral Dissertations Full-text Database, Information Science and Technology Section, 2020. *

Also Published As

Publication number Publication date
CN114067429A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN109902546B (en) Face recognition method, face recognition device and computer readable medium
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
CN111274916B (en) Face recognition method and face recognition device
Sminchisescu et al. Learning joint top-down and bottom-up processes for 3D visual inference
WO2019227479A1 (en) Method and apparatus for generating face rotation image
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
WO2021232985A1 (en) Facial recognition method and apparatus, computer device, and storage medium
CN111368672A (en) Construction method and device for genetic disease facial recognition model
WO2021218238A1 (en) Image processing method and image processing apparatus
CN110097029B (en) Identity authentication method based on high way network multi-view gait recognition
CN113870254B (en) Target object detection method and device, electronic equipment and storage medium
CN112712068B (en) Key point detection method and device, electronic equipment and storage medium
CN117576781A (en) Training intensity monitoring system and method based on behavior recognition
CN111382791A (en) Deep learning task processing method, image recognition task processing method and device
CN112463999A (en) Visual position identification method and device, computer equipment and readable storage medium
CN114067429B (en) Action recognition processing method, device and equipment
CN114360058B (en) Cross-view gait recognition method based on walking view prediction
CN115862112A (en) Target detection model for facial image acne curative effect evaluation
CN111931670B (en) Depth image head detection and positioning method and system based on convolutional neural network
CN114913860A (en) Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product
Sun et al. Method of analyzing and managing volleyball action by using action sensor of mobile device
CN114078268A (en) Training method and device for lightweight face recognition model
CN114283460A (en) Feature extraction method and device, computer equipment and storage medium
CN113780066A (en) Pedestrian re-identification method and device, electronic equipment and readable storage medium
CN111275183A (en) Visual task processing method and device and electronic system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant