WO2021057810A1 - Data processing method, data training method, data identifying method and device, and storage medium - Google Patents


Info

Publication number
WO2021057810A1
WO2021057810A1 PCT/CN2020/117226 CN2020117226W WO2021057810A1 WO 2021057810 A1 WO2021057810 A1 WO 2021057810A1 CN 2020117226 W CN2020117226 W CN 2020117226W WO 2021057810 A1 WO2021057810 A1 WO 2021057810A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
target
preset
angle
trained
Prior art date
Application number
PCT/CN2020/117226
Other languages
French (fr)
Chinese (zh)
Inventor
沈凌浩
吴新
Original Assignee
深圳数字生命研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳数字生命研究院
Publication of WO2021057810A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Definitions

  • the present invention relates to the application field of computer technology, in particular to a data processing, training, and identification method, device and storage medium.
  • Pose estimation technology, i.e., key point detection technology;
  • Top-down method: two-step framework;
  • Bottom-up method: part-based framework.
  • The top-down method first detects the position of the rectangular frame of every person in the picture (2D/3D), with each person completely contained in the rectangular frame, and then independently detects the skeleton key point coordinates of the person inside each rectangular frame and connects them into a human skeleton; it is characterized by high data processing accuracy.
  • However, the accuracy of the pose estimation depends heavily on the detection quality of the rectangular frame of the person's position.
  • The bottom-up method first detects the skeleton key point coordinates of all the persons in the picture, then handles the allocation of the skeleton key points, assigning each key point to a different person, and connects the human skeletons; it is characterized by fast data processing speed, but if the crowd is dense or persons occlude each other, errors easily occur at the stage of assigning key points to individuals.
  • In existing body-recognition implementations, a Kinect device is mainly used to obtain the key points of the person, but the device is expensive and not portable.
  • In the related technology, the sampling and calculation model enlarges the error of the data source itself.
  • The related technologies have low accuracy in recognizing human body postures.
  • the embodiments of the present invention provide a data processing, training, and recognition method, device, and storage medium to at least solve the technical problem of low data processing efficiency in the process of recognizing human posture due to related technologies.
  • A data processing method, including: inputting first feature data with a first number of channels into a first-type convolutional layer having a second number of filters for calculation, and outputting second feature data with the second number of channels, where the first number is greater than the second number; inputting the second feature data with the second number of channels into a second-type convolutional layer having the second number of filters, and generating, through a neural network and according to the learnable mask parameters in the second-type convolutional layer, a mask over the weights of each filter in the second-type convolutional layer; determining, according to the mask, the connection mode between each filter in the second-type convolutional layer and each channel of the second feature data, and performing convolution calculation on the second feature data according to the mapping relationship obtained from the connection mode to obtain third feature data; and inputting the third feature data with the second number of channels into a third-type convolutional layer having the first number of filters for calculation, and outputting fourth feature data with the first number of channels.
  • the data processing method is applied to deep learning in artificial intelligence.
  • the data processing method is applied to recognize the posture or action of the target in the picture/video.
  • Generating a mask over the weights of each filter in the second-type convolutional layer through a neural network includes: generating, according to the mask parameters in the second-type convolutional layer, the mask over the weights of each filter in the second-type convolutional layer through the fully connected layer.
  • A data training method, including: obtaining a weight classification model to be trained, wherein the weight classification model is a neural network model for obtaining image features of image data; and training the weight classification model to be trained to obtain a weight classification model, wherein the method used in training the weight classification model to be trained includes the above data processing method.
  • Training the weight classification model to be trained to obtain the weight classification model includes: inputting the data in the first preset data set into the weight classification model to be trained to obtain a category prediction result; obtaining, according to the category prediction result and the label category of the data in the first preset data set, the error between the category prediction result and the label category of the data in the first preset data set; and training the weight classification model to be trained with the back-propagation algorithm according to the error until the weight classification model to be trained converges, so as to obtain a converged weight classification model.
  • Training the weight classification model to be trained with the back-propagation algorithm according to the error until it converges includes repeated iterations of excitation propagation and weight update until the weight classification model to be trained converges.
  • In the case that the weight classification model to be trained includes a residual structure, a pooling structure and a fully connected structure, the repeated iterations of excitation propagation and weight update until convergence include: in the excitation propagation stage, passing the image through the convolutional layers of the weight classification model to be trained to obtain features, obtaining the category prediction result at the fully connected layer, and then computing the difference between the category prediction result and the label category of the data in the first preset data set to obtain the response errors of the hidden layers and the output layer; in the weight update stage, multiplying the error by the derivative of the current layer's response with respect to the previous layer's response to obtain the gradient of the weight matrix between the two layers, and adjusting the weight matrix along the opposite direction of this gradient with the set learning rate; the gradient matrix is then taken as the error of the previous layer to calculate the weight matrix of the previous layer, and the weight classification model to be trained is updated through iterative calculation until it converges.
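  • For illustration, the following is a minimal PyTorch-style sketch of the training loop just described (forward pass, error against the label category, back-propagation, weight update); the model, data loader, optimizer choice and convergence check are placeholders rather than elements defined in the patent.

```python
# Minimal sketch of the training procedure described above, under assumed names.
import torch
import torch.nn as nn

def train_until_convergence(model, train_loader, max_epochs=90, lr=0.1):
    criterion = nn.CrossEntropyLoss()                       # error between prediction and label category
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # placeholder optimizer; the embodiments later use Adam
    for epoch in range(max_epochs):
        for images, labels in train_loader:
            logits = model(images)             # excitation propagation: conv layers + fully connected layer
            loss = criterion(logits, labels)   # error with respect to the label category
            optimizer.zero_grad()
            loss.backward()                    # back-propagate the error layer by layer
            optimizer.step()                   # adjust weights against the gradient direction
        # a convergence check (e.g. a validation-loss plateau) would terminate training here
    return model
```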
  • A data training method, which includes: initializing the feature extraction module in a target detection model with the converged weight classification model to obtain a target detection model to be trained, wherein the converged weight classification model is trained by the above data training method; training the target detection model to be trained with the target location frame label information in a second preset data set to obtain a trained target detection model; training the network parameters of a single-person pose estimation model to be trained according to the target key point label information in a third preset data set to obtain a trained single-person pose estimation model; and obtaining a weighted attention neural network model according to the trained target detection model and the trained single-person pose estimation model.
  • Training the target detection model to be trained with the target location frame label information in the second preset data set to obtain the trained target detection model includes the following, where the target detection model comprises a feature extraction module, a suggestion frame generation module, and a target classifier and position frame regression prediction module: the feature extraction module and the suggestion frame generation module are first trained to obtain a first parameter value of the feature extraction module and a first parameter value of the suggestion frame generation module; the target classifier and position frame regression prediction module is then trained according to the first parameter value of the feature extraction module and the first parameter value of the suggestion frame generation module, to obtain a first parameter value of the target classifier and position frame regression prediction module and a second parameter value of the feature extraction module; the suggestion frame generation module is then trained according to the first parameter value of the target classifier and position frame regression prediction module and the second parameter value of the feature extraction module, to obtain a second parameter value of the suggestion frame generation module; finally, the target classifier and position frame regression prediction module is trained according to the second parameter value of the suggestion frame generation module and the second parameter value of the feature extraction module, to obtain a second parameter value of the target classifier and position frame regression prediction module.
  • The feature extraction module is used to extract the features of each piece of data in the second preset data set; the suggestion frame generation module is used to generate candidate target frames for each piece of data according to those features; the target classifier and position frame regression prediction module is used to obtain, according to the features of each piece of data in the second preset data set and its candidate target frames, the detection frame of each target in that data and the category of the corresponding detection frame.
  • When the suggestion frame generation module includes a convolutional layer with a sliding window followed by two parallel convolutional layers, namely a regression layer and a classification layer, generating the candidate target frames of each piece of data according to its features includes: obtaining, through the regression layer and according to the features of each piece of data in the second preset data set, the coordinates of the center anchor point of each candidate target frame and the width and height of the corresponding candidate target frame.
  • When the target classifier and position frame regression prediction module consists of a pooling layer, three fully connected layers and two parallel fully connected layers connected in sequence, obtaining the detection frame of each target of each piece of data in the second preset data set and the corresponding detection frame category includes: converting, through the pooling layer, the variable-length features of each piece of data output by the feature extraction module into fixed-length features; and passing the fixed-length features through the three fully connected layers and then the two parallel fully connected layers, to output the detection frame of each target of each piece of data in the second preset data set and the category of the corresponding detection frame.
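  • The two heads described above can be sketched as follows in PyTorch; the channel counts, the number of anchors per location and the number of classes are illustrative assumptions, not values taken from the patent.

```python
# Sketch of a suggestion-frame (region-proposal) head and a detection head of the
# kind described above. Dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class SuggestionFrameHead(nn.Module):
    """Sliding-window conv followed by two parallel convs: regression and classification."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.sliding = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.regression = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)      # anchor cx, cy, w, h
        self.classification = nn.Conv2d(in_channels, num_anchors * 2, kernel_size=1)  # object / background

    def forward(self, features):
        x = torch.relu(self.sliding(features))
        return self.regression(x), self.classification(x)

class DetectionHead(nn.Module):
    """Pooling to a fixed length, three FC layers, then two parallel FC layers."""
    def __init__(self, in_channels=256, pooled=7, num_classes=2):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(pooled)   # stand-in for pooling variable-sized region features
        flat = in_channels * pooled * pooled
        self.fc = nn.Sequential(
            nn.Linear(flat, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.box_regression = nn.Linear(1024, num_classes * 4)  # detection frame coordinates per class
        self.box_classification = nn.Linear(1024, num_classes)  # detection frame category

    def forward(self, region_features):
        x = self.pool(region_features).flatten(1)
        x = self.fc(x)
        return self.box_regression(x), self.box_classification(x)
```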
  • Training the network parameters of the single-person pose estimation model to be trained according to the target key point label information in the third preset data set to obtain the trained single-person pose estimation model includes: training the network parameters of the single-person pose estimation model to be trained according to the target key point label information in the third preset data set, and iteratively updating those network parameters through the forward propagation and back-propagation algorithms; training the network parameters includes expanding the height or width of the input single-person image according to a preset aspect ratio and cropping the single-person image to a preset size.
  • The method used in training the network parameters of the single-person pose estimation model to be trained includes the above data processing method.
  • The method further includes: collecting the samples required for training the target detection model to be trained and the single-person pose estimation model to be trained; and preprocessing the samples, where the preprocessing includes data set division and preprocessing operations. Training the weight classification model to be trained to obtain the converged weight classification model includes: inputting the data in the first preset data set into the weight classification model to be trained to obtain a category prediction result; obtaining, according to the category prediction result and the label category of the data in the first preset data set, the error between the category prediction result and the label category of the data in the first preset data set; and training the weight classification model to be trained with the back-propagation algorithm according to the error until it converges, to obtain the converged weight classification model.
  • The first preset data set includes a first type of image data set, which has predefined training and validation sets; the second preset data set includes the data labeled with position frame information in a second type of image data set and a third type of image data set; the second type of image data set has predefined training and validation sets, while the third type of image data set is randomly divided into a training set and a validation set according to a preset ratio.
  • The training set of the second type of image data set together with the training set of the third type of image data set forms the training set of the second preset data set, and the validation set of the second type of image data set together with the validation set of the third type of image data set forms the validation set of the second preset data set.
  • The third preset data set includes the data labeled with key point information in the second type of image data set and the third type of image data set.
  • The preprocessing operation includes: the data in the first preset data set and the third preset data set are processed separately, and the data in the second preset data set is processed through a random mixing operation.
  • The random geometric transformation includes random cropping, random rotation by a preset angle, and/or random scaling by a preset zoom ratio.
  • The random mixing operation includes superimposing at least two pieces of data according to preset weights; specifically, the products of the pixel values at preset positions in the different pieces of data and the preset weights are added.
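  • A minimal sketch of this random mixing operation (two images superimposed by summing the products of corresponding pixel values and preset weights); the weight value used here is a hypothetical example, not one from the patent.

```python
# Random mixing of two same-shaped images with preset weights that sum to 1.
import numpy as np

def random_mix(image_a, image_b, weight_a=0.6):
    """Superimpose two images: weighted sum of corresponding pixel values."""
    weight_b = 1.0 - weight_a
    mixed = weight_a * image_a.astype(np.float32) + weight_b * image_b.astype(np.float32)
    return np.clip(mixed, 0, 255).astype(np.uint8)
```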
  • A data recognition method, which includes: inputting feature data to be recognized into the weighted attention neural network model, and recognizing the two-dimensional coordinates of the key points of at least one target in the feature data to be recognized, where the weighted attention neural network model is used to estimate the pose of at least one person in a top-down manner, detect the position rectangle of at least one target in the feature data to be recognized, and detect the two-dimensional coordinates of the key points of the target within the position rectangle; calculating, from the two-dimensional coordinates of the key points of the target, the included angle between the line of a first preset key point combination and the line of a second preset key point combination, or the included angle between the line of the first preset key point combination and a first preset line; and matching the obtained included angle in a first preset database to obtain the recognition result of the target.
  • Matching the included angle in the first preset database to obtain the recognition result of the target includes: in the case that the feature data to be recognized includes image data, matching the obtained angle value of at least one included angle against the angle values of the corresponding included angle types in the first preset database to obtain the recognition result of the image data.
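  • A minimal sketch of the included-angle calculation and database matching for the image-data case; the key point names, the angle type and the threshold table stand in for the first preset database and are hypothetical examples.

```python
# Included angle between two key-point lines, and a simple range-based match.
import numpy as np

def line_angle_deg(p1, p2, q1, q2):
    """Included angle (degrees) between the line p1-p2 and the line q1-q2."""
    v1 = np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float)
    v2 = np.asarray(q2, dtype=float) - np.asarray(q1, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical "first preset database": allowed angle range per included-angle type.
ANGLE_DATABASE = {"shoulder_line_vs_horizontal": (0.0, 3.0)}  # degrees

def match_angle(angle_type, angle_value, database=ANGLE_DATABASE):
    low, high = database[angle_type]
    return low <= angle_value <= high  # True: posture considered normal for this angle type
```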
  • Matching the included angle between the line of the first preset key point combination and the line of the second preset key point combination, or between the line of the first preset key point combination and the first preset line, in the first preset database to obtain the recognition result of the target includes: in the case that the feature data to be recognized includes video data, obtaining, for each frame or for specified frames, the two-dimensional coordinate information of the key points of at least one target in each corresponding frame of the video data; obtaining, from this coordinate information, the angle-time variation curve of at least one specific included angle of the at least one target; and comparing and analyzing this curve with the angle-time variation curve of at least one included angle of at least one standard motion to obtain the recognition result.
  • The comparison and analysis of the angle-time variation curves to obtain the recognition result includes: comparing, for similarity, the angle-time variation curve of at least one specific included angle of the at least one target with the angle-time variation curve of at least one included angle obtained in advance for at least one standard motion; in the case that it is determined that the corresponding target of each corresponding frame in the video data is performing the corresponding standard motion type, further comparing the angle-time variation curve of at least one specific included angle of the target with the angle-time variation curve of the corresponding specific included angle of the standard motion; if the difference between adjacent maxima on the angle-time variation curve of at least one specific included angle of the target and the adjacent maxima on the angle-time variation curve of the corresponding specific included angle of the standard motion falls within a second preset threshold interval, it is determined that the joint motion corresponding to the specific included angle of the target in each corresponding frame of the video data is standard; otherwise the joint motion corresponding to the specific included angle of the target in each corresponding frame of the video data is not standard; in addition, it is judged, for the angle-time variation curve of at least one specific included angle of the target, whether the difference between adjacent peaks on the angle-time variation curve of the
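  • A minimal sketch of the curve comparison for the video-data case: a similarity check between the target's angle-time curve and a standard motion's curve, followed by a check that the differences between corresponding maxima fall within a threshold interval; the similarity measure and the threshold values are assumptions.

```python
# Compare a target's angle-time curve against a standard-motion curve.
import numpy as np
from scipy.signal import argrelextrema

def curve_similarity(curve_a, curve_b):
    """Negative mean absolute difference as a simple similarity score (an assumption)."""
    n = min(len(curve_a), len(curve_b))
    a = np.asarray(curve_a[:n], dtype=float)
    b = np.asarray(curve_b[:n], dtype=float)
    return -float(np.mean(np.abs(a - b)))

def adjacent_maxima_within(curve_target, curve_standard, threshold_interval=(0.0, 10.0)):
    """Check whether differences between corresponding local maxima fall in the interval."""
    t = np.asarray(curve_target, dtype=float)
    s = np.asarray(curve_standard, dtype=float)
    max_t = t[argrelextrema(t, np.greater)[0]]
    max_s = s[argrelextrema(s, np.greater)[0]]
    n = min(len(max_t), len(max_s))
    if n == 0:
        return False
    diffs = np.abs(max_t[:n] - max_s[:n])
    low, high = threshold_interval
    return bool(np.all((diffs >= low) & (diffs <= high)))
```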
  • the method further includes: performing matching in a second preset database according to the recognition result to obtain a posture evaluation result corresponding to the recognition result.
  • the method further includes: matching in a third preset database according to the posture evaluation result to obtain suggestion information corresponding to the posture evaluation result.
  • A data recognition device, including: a coordinate recognition module, configured to input feature data to be recognized into the weighted attention neural network model and recognize the two-dimensional coordinates of the key points of at least one target in the feature data to be recognized, where the weighted attention neural network model is set to estimate the pose of at least one person in a top-down manner, detect the position rectangle of at least one target in the feature data to be recognized, and detect the two-dimensional coordinates of the key points of the target within the position rectangle; a calculation module, set to calculate, from the two-dimensional coordinates of the key points of the target, the included angle between the line of the first preset key point combination and the line of the second preset key point combination, or the included angle between the line of the first preset key point combination and the first preset line; and a matching module, set to match the included angle between the line of the first preset key point combination and the line of the second preset key point combination, or between the line of the first preset key point combination and the first preset line, in the first preset database to obtain the recognition result of the target.
  • The matching module includes: a first matching unit, configured to, when the feature data to be recognized includes image data, match the obtained angle value of at least one included angle against the angle values of the corresponding included angle types in the first preset database to obtain the recognition result of the image data.
  • The matching module includes: an acquiring unit, configured to, when the feature data to be recognized includes video data, acquire, for each frame or for specified frames, the two-dimensional coordinate information of the key points of at least one target in each corresponding frame of the video data, where the specified frames are frames at fixed time intervals and/or key frames; and a second matching unit, set to obtain, according to the key point two-dimensional coordinate information of at least one target of each corresponding frame in the video data, the angle-time variation curve of at least one specific included angle of the at least one target, and to compare and analyze it with the angle-time variation curve of at least one included angle of at least one standard motion to obtain the recognition result.
  • The second matching unit includes: a first judging subunit, configured to compare, for similarity, the angle-time variation curve of at least one specific included angle of the at least one target with the angle-time variation curve of at least one included angle obtained in advance for at least one standard motion; a comparison subunit, set to, in the case of determining that the corresponding target of each corresponding frame in the video data is performing the corresponding standard motion type, further compare the angle-time variation curve of at least one specific included angle of the target with the angle-time variation curve of the corresponding specific included angle of the standard motion; a second judging subunit, set to determine that the joint motion corresponding to the specific included angle of the target of each corresponding frame in the video data is standard if the difference between adjacent maxima on the angle-time variation curve of at least one specific included angle of the target and the adjacent maxima on the angle-time variation curve of the corresponding specific included angle of the standard motion falls within the second preset threshold interval, and that it is not standard otherwise; and a third judging subunit, set to judge whether the distance between adjacent peaks on the angle-time variation curve
  • the device further includes: an evaluation module configured to perform matching in a second preset database according to the recognition result to obtain a posture evaluation result corresponding to the recognition result.
  • The device further includes: a suggestion module, configured to, after the posture evaluation result corresponding to the recognition result is obtained, perform matching in a third preset database according to the posture evaluation result to obtain the suggestion information corresponding to the posture evaluation result.
  • a non-volatile storage medium includes a stored program, wherein the device where the non-volatile storage medium is located is controlled to execute the above method when the program is running.
  • A data recognition device, including a non-volatile storage medium and a processor configured to run a program stored in the non-volatile storage medium, where the above method is executed when the program runs.
  • A weighted attention mechanism in which, by introducing a learnable mask mechanism, the grouped convolution mode of the network is not artificially fixed, so that the network itself learns the convolution groups and selects the filters useful to the network for the convolution operation, improving the performance of the network; the weight classification model to be trained is trained on data based on the weighted attention mechanism to obtain the weight classification model, and the initial parameters of the feature extraction module in the target detection model are initialized through the weight classification model, so that, in the process of obtaining the weighted attention neural network model, the weight classification model is used to improve the accuracy of the target detection model and accelerate the convergence of model training.
  • The top-down multi-person pose estimation method is adopted: the feature data to be recognized is input into the weighted attention neural network model to recognize the two-dimensional coordinates of the key points of at least one target in the feature data to be recognized, where the weighted attention neural network model is used to estimate the pose of at least one person in a top-down manner, detect the position rectangle of at least one target in the feature data to be recognized, and detect the two-dimensional coordinates of the key points of the target within the position rectangle; the two-dimensional coordinates of the key points of the target are used to calculate the included angle between the line of the first preset key point combination and the line of the second preset key point combination, or the included angle between the line of the first preset key point combination and the first preset line; the included angle is matched in the first preset database to obtain the target recognition result, which achieves the technical effect of improving data processing efficiency in the process of recognizing human posture.
  • Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a weighted attention mechanism in a data processing method according to an embodiment of the present invention
  • FIG. 3 is a schematic flowchart of a data training method according to an embodiment of the present invention.
  • Fig. 4 is a network structure diagram of a weight classification model in a data training method according to an embodiment of the present invention.
  • Fig. 5 is a schematic flowchart of a data training method according to an embodiment of the present invention.
  • Fig. 6 is a schematic diagram of a target detection model in a data training method according to an embodiment of the present invention.
  • Fig. 7 is a schematic diagram of a single pose estimation model in a data training method according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of key point positions and skeleton connections in a data training method according to an embodiment of the present invention.
  • Fig. 9a is a schematic diagram of the effect before labeling the key point positions and the skeleton connection in the data training method according to the embodiment of the present invention.
  • Fig. 9b is a schematic diagram of the effect of labeling key point positions and skeleton connections in the data training method according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of the effect of mix-up in the data training method according to an embodiment of the present invention.
  • FIG. 11 is a schematic flowchart of a data identification method according to an embodiment of the present invention.
  • FIG. 12 is a schematic flowchart of a posture risk assessment based on deep learning in a data recognition method according to an embodiment of the present invention.
  • Fig. 13a is a schematic diagram of a front view in a method for assessing posture risk according to an embodiment of the present invention
  • FIG. 13b is a schematic diagram of a side view in a method for assessing a posture risk according to an embodiment of the present invention
  • FIG. 14 is a schematic diagram showing the evaluation result of posture risk in the data recognition method according to an embodiment of the present invention.
  • Fig. 15 is a schematic diagram of a data recognition device according to an embodiment of the present invention.
  • Posture evaluation: using certain technical methods to evaluate the posture of the persons in a picture, for example whether they have O-shaped/X-shaped legs, or postural problems such as hunchback or uneven shoulders, and further grading the severity of various posture conditions;
  • Action recognition: recognizing, through certain technical methods, the action category of the persons in a picture or video, such as walking, raising hands, applauding and other gesture or action category names;
  • Key point detection: identifying, through certain technical methods, the key point coordinates of a single target or multiple targets in a picture/video; if the target is a person, the key point coordinates are the skeleton key point coordinates.
  • FIG. 1 is a schematic flowchart of a data processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step S102: input the first feature data with the first number of channels into the first-type convolutional layer with the second number of filters for calculation, and output the second feature data with the second number of channels, where the first number is greater than the second number;
  • Step S104: input the second feature data with the second number of channels into the second-type convolutional layer with the second number of filters, and generate, through a neural network and according to the learnable mask parameters in the second-type convolutional layer, the mask over the weights of each filter in the second-type convolutional layer;
  • Step S106: determine, according to the mask, the connection mode between each filter in the second-type convolutional layer and each channel of the second feature data;
  • Step S108: perform convolution calculation on the second feature data according to the mapping relationship obtained from the connection mode to obtain the third feature data;
  • Step S110: input the third feature data with the second number of channels into the third-type convolutional layer with the first number of filters for calculation, and output the fourth feature data with the first number of channels.
  • FIG. 2 is a schematic diagram of a weighted attention mechanism in a data processing method according to an embodiment of the present invention.
  • The first feature data with the first number of channels can be feature map data with 256 channels, and the first-type convolutional layer with the second number of filters can be a 1×1 convolutional layer with 128 filters. Therefore, based on FIG. 2, step S102 inputs the feature map data with 256 channels into the 1×1 convolutional layer with 128 filters for calculation, and outputs feature map data with 128 channels.
  • In step S104, the feature map data with 128 channels is input into the 3×3 convolutional layer with 128 filters, that is, the second-type convolutional layer with the second number of filters in the embodiment of the present application.
  • In step S106, the connection mode between each of the 128 filters in the 3×3 convolutional layer and each of the 128 channels of the feature map data is determined (see the mask diagram on the left of Figure 2). According to this connection mode, in step S108 the convolution calculation is performed on the feature map data with 128 channels according to the mapping relationship of the connection mode to obtain the third feature data, that is, feature map data with 128 channels. Finally, in step S110, the feature map data with 128 channels is input into the 1×1 convolutional layer with 256 filters for calculation, and feature map data with 256 channels is obtained.
  • The second-type convolutional layer has 128 channels in total, and each 3×3 convolution kernel contains 10 weights, so the total number of weights is 128×10 (see Figure 2); the corresponding output also has 128 channels, and the corresponding mask matrix size is 128×128. Therefore, a fully connected layer with 10 input channels, 128 output channels and a sigmoid activation function can be used to generate, from the 10 weights of each convolution kernel, a weight mask corresponding to the 128 output channels.
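  • A minimal PyTorch sketch of the 256 to 128 to 128 to 256 block described above is given below. For simplicity the 128×128 mask is produced directly from a learnable parameter matrix passed through a sigmoid (the text also describes producing it with a small fully connected layer with 10 inputs, 128 outputs and a sigmoid); the binarization threshold of 0.5 is an assumption.

```python
# Sketch of the weighted-attention bottleneck: 1x1 reduce, masked 3x3, 1x1 restore.
import torch
import torch.nn as nn

class MaskedConv3x3(nn.Module):
    def __init__(self, channels=128, threshold=0.5):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.mask_params = nn.Parameter(torch.zeros(channels, channels))  # learnable mask parameters
        self.threshold = threshold

    def forward(self, x):
        mask = torch.sigmoid(self.mask_params)        # soft mask during training
        if not self.training:
            mask = (mask > self.threshold).float()    # binarize to 0/1 at prediction time
        # mask filter/channel connections: conv weight has shape [out, in, 3, 3]
        weight = self.conv.weight * mask[:, :, None, None]
        return nn.functional.conv2d(x, weight, padding=1)

class WeightAttentionBottleneck(nn.Module):
    """Steps S102-S110 with the channel counts from the example (256 -> 128 -> 128 -> 256)."""
    def __init__(self, in_channels=256, mid_channels=128):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)   # S102
        self.masked = MaskedConv3x3(mid_channels)                           # S104-S108
        self.restore = nn.Conv2d(mid_channels, in_channels, kernel_size=1)  # S110

    def forward(self, x):
        return self.restore(self.masked(self.reduce(x)))
```

  During back-propagation the mask parameters receive gradients together with the filter weights, so the network itself learns which input channels each filter uses, which is the behaviour of the weighted attention mechanism described here.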
  • Generating the mask over the weights of each filter in the second-type convolutional layer through a neural network includes: generating the mask over the weights of each filter in the second-type convolutional layer with the fully connected layer in the second-type convolutional layer.
  • A fully connected layer is used to generate the filter mask, as shown in the mask on the left side of Figure 2. During back-propagation, the network is allowed to learn the mask, and the positions where the mask is set to 1 correspond to the filters selected by the network. Since the filter connection mode is selected by learning a weight mask, the method of filter selection and of convolution calculation between filters and channels used in steps S102 to S110 in the embodiment of the present application is called the weighted attention mechanism (weight attention).
  • A learnable mask mechanism is introduced, so that the grouped convolution mode of the network is not artificially fixed (the grouped convolution mode corresponds to the same type of lines between the input and output shown in Figure 2, meaning that each output convolution is calculated only from the connected input channels); the network learns the convolution groups by itself and selects the filters useful to it for the convolution operation, thereby improving the performance of the network.
  • the fully connected layer is used to generate the filter mask, which can also be implemented by the following scheme:
  • In step S104, the feature map data with 128 channels is input into the 3×3 convolutional layer with 128 filters, that is, the second-type convolutional layer with the second number of filters in the embodiment of the present application; the connection mode between each filter in the 3×3 convolutional layer and each of the 128 channels of the feature map data is determined (see the mask diagram on the left of Figure 2); according to this connection mode, in step S108 the convolution calculation is performed on the feature map data with 128 channels according to the mapping relationship of the connection mode to obtain the third feature data, that is, feature map data with 128 channels; finally, in step S110, the feature map data with 128 channels is input into the 1×1 convolutional layer with 256 filters for calculation, and feature map data with 256 channels is obtained.
  • The mask is a 128×128 mask matrix generated from learnable parameters, a differentiable transformation and a sigmoid activation function; multiplying the mask with the filter weights makes the different output filters of the second-type convolutional layer selectively use different input features. At prediction time, the mask is binarized to 0 or 1 according to a preset threshold.
  • group convolution can be performed based on the specific connection mode to optimize the calculation efficiency.
  • The number of parameters used by the fully connected layer to generate the mask is 10 (inputs) × 128 + 128 = 1,408.
  • Any method of generating a 128×128 mask based on trainable parameters can be used.
  • the embodiments of the present application only take the foregoing examples as examples for description, and implementation of the data processing methods provided in the embodiments of the present application shall prevail, and the specifics are not limited.
  • The WeightNet network introduces a learnable mask mechanism, which does not artificially fix the grouped convolution mode of the network (the grouped convolution mode corresponds to the same type of lines between the input and output shown in Figure 2, meaning that each output convolution is calculated only from the connected input channels); the network learns the convolution groups by itself and selects the filters useful to it for the convolution operation, improving the performance of the network.
  • the data processing method provided in the embodiment of the present application is applied to deep learning in artificial intelligence.
  • The convolution algorithm based on step S102 to step S104 can apply this weighted attention mechanism to artificial intelligence technology, especially deep neural network learning, so that the network can learn the grouping between filters and channels by itself and then perform the convolution calculation, thereby improving the data processing ability of deep neural network learning.
  • the data processing method provided in the embodiment of the present application is applied to recognize the posture or action of the target in the picture/video.
  • The target may be a human, an animal, etc., that is, humans or animals in pictures or videos; in the extension of artificial intelligence (AI) computation this is generally applicable. Based on the convolution algorithm of step S102 to step S104, the weighted attention mechanism can specifically be applied to recognize the posture of a target in a picture/video. For example, in a security monitoring environment, based on targets such as people, cars, animals and insects in the acquired pictures/videos, the behaviour and movement trajectories of the people, vehicles, animals and insects can be predicted.
  • It may also be preferable to apply this technology to medical diagnosis, for example by recognizing the person in a picture/video, taking the recognized person as the target, obtaining the key points of the target from the target's shape, performing posture evaluation according to the key points, and further evaluating the bone health of the target according to the posture.
  • the image calculation process used to identify the person in the picture/video can be the data processing method described in step S102-step S110.
  • the applicable convolution algorithm can be shown in Figure 2.
  • the application of the convolution algorithm shown in FIG. 2 to the data model training in AI technology is detailed in the data training method in the second embodiment.
  • a WeightNet network is obtained based on the convolution algorithm shown in FIG. 2
  • FIG. 3 is a schematic flowchart of a data training method according to an embodiment of the present invention. As shown in FIG. 3, the method includes the following steps:
  • Step S302 Obtain a weight classification model to be trained, where the weight classification model is a neural network model for acquiring image features of the image data;
  • FIG. 4 is a network structure diagram of the weight classification model in the data training method according to the embodiment of the present invention.
  • The WeightNet-50 classification model is taken as an example for description; based on this classification model, image feature extraction can be performed on the image to be processed.
  • The WeightNet-101 classification model is also applicable to the data training method provided in the embodiment of this application; the embodiment of this application only takes the WeightNet-50 classification model as an example for illustration, subject to the realization of the data training method provided in the embodiment of this application, and no specific limitation is imposed.
  • In step S304, the weight classification model to be trained is trained to obtain the weight classification model, wherein the method used in training the weight classification model to be trained includes the data processing method in Embodiment 1 above.
  • the WeightNet classification model (ie, the weight classification model provided in this embodiment of the application) is finally obtained.
  • training the weight classification model to be trained in step S304 to obtain the weight classification model includes:
  • Step S3041 Input the data in the first preset data set into the weight classification model to be trained to obtain the category prediction result;
  • The first preset data set is an image data set covering all object categories, and the object categories include natural categories such as people, dogs and horses. In step S3041, the image data set is taken as the first preset data set, that is, pictures of various categories, such as people, dogs and horses, are input into the WeightNet-50 classification model to be trained, and the category prediction result of each picture is obtained.
  • The WeightNet-50 classification model consists of a residual structure, a pooling structure and a fully connected structure. The residual structure is built from three convolutional layers: the first layer has n 1×1 convolution kernels with a stride of 1, the second layer has n 3×3 convolution kernels with a stride of 1, and the third layer has 2n 1×1 convolution kernels with a stride of 1.
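  • A minimal PyTorch sketch of this three-layer residual structure (n 1×1 kernels, n 3×3 kernels, then 2n 1×1 kernels, all with stride 1); the batch-normalization/ReLU placement and the projection on the skip path are assumptions, not details given in the patent.

```python
# Sketch of the residual structure described above.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, n):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, n, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(n), nn.ReLU(inplace=True),
            nn.Conv2d(n, n, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(n), nn.ReLU(inplace=True),
            nn.Conv2d(n, 2 * n, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(2 * n),
        )
        # 1x1 projection so the skip connection matches the 2n output channels
        self.skip = (nn.Identity() if in_channels == 2 * n
                     else nn.Conv2d(in_channels, 2 * n, kernel_size=1, bias=False))

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))
```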
  • the data in the first preset data set is input into the weight classification model to be trained, and the obtained category prediction result can be the category prediction result of the picture or the category of the image in the video.
  • the preset data set used in the embodiment of the present application only uses a picture-like data set as a preferred example for description, and in addition, it may also include a video image-like data set.
  • the implementation of the data training method provided in the embodiment of the present application shall prevail, which is not specifically limited.
  • Step S3042 Obtain the error between the category prediction result and the label category of the data in the first preset data set according to the category prediction result and the label category of the data in the first preset data set;
  • the image data of the labeled category based on the first preset data set is input into the WeightNet-50 classification model to be trained, and the feature is extracted through forward propagation, and the category prediction result is obtained.
  • the category prediction result is compared with the label category of the data in the first preset data set to obtain an error between the category prediction result and the label category of the data in the first preset data set.
  • Step S3043 Perform a back propagation algorithm to train the weight classification model to be trained according to the error, until the weight classification model to be trained converges, and a converged weight classification model is obtained.
  • the error back propagation algorithm is used to train the model until the model converges, and the WeightNet-50 classification model is obtained.
  • the first preset data set in this embodiment of the application may be the ImageNet data set.
  • The WeightNet classification model is pre-trained using millions of ImageNet classification images, and the feature extraction module of the target detection model is initialized with the converged weight classification model, which improves the accuracy of the final target detection model and speeds up the convergence of model training.
  • The ImageNet data set is used because ImageNet contains 1.2 million images in 1,000 categories, and such a huge amount of training sample data can meet the needs of deep neural network learning in AI technology.
  • the first preset data set provided in the embodiment of the present application only uses the ImageNet data set as an example for description, and the data training method provided in the embodiment of the present application shall prevail, and the specifics are not limited.
  • performing a back-propagation algorithm to train the weight classification model to be trained in step S3044 according to the error until the weight classification model to be trained converges includes:
  • Step S30441: through repeated iterations of excitation propagation and weight update, until the weight classification model to be trained converges; wherein, in the case that the weight classification model to be trained includes a residual structure, a pooling structure and a fully connected structure, the repeated iterations of excitation propagation and weight update until convergence include: in the excitation propagation stage, passing the image through the convolutional layers of the weight classification model to be trained to obtain features, obtaining the category prediction result at the fully connected layer, and computing the difference between the category prediction result and the label category of the data in the first preset data set to obtain the response errors of the hidden layers and the output layer; in the weight update stage, multiplying the error by the derivative of the current layer's response with respect to the previous layer's response to obtain the gradient of the weight matrix between the two layers.
  • Taking the ImageNet data set as an example, ImageNet's labeled category data is used to train the network parameters: features are extracted through forward propagation, the error between the category prediction result (one-hot) output by the network and the true label category is computed, and the error back-propagation algorithm is used to train the model until it converges, yielding the WeightNet-50 classification model.
  • The error back-propagation algorithm is used to train the convolutional neural network model, specifically through repeated iterations of the two phases of excitation propagation and weight update until the convergence condition is reached.
  • In the excitation propagation phase, the image is passed through the convolutional layers of the WeightNet-50 classification model to obtain features, the prediction result is obtained at the last fully connected layer of the network, and the prediction result is then compared with the ground-truth result to obtain the response errors of the hidden layers and the output layer.
  • In the weight update phase, the known error is first multiplied by the derivative of the current layer's response with respect to the previous layer's response to obtain the gradient of the weight matrix between the two layers, and the weight matrix is adjusted along the opposite direction of this gradient with the set learning rate; the gradient matrix is then used as the error of the previous layer to calculate the weight matrix of the previous layer, and so on, completing the update of the entire model.
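  • In the standard notation for this update rule (the patent gives no explicit formulas, so the symbols below are an assumption), with a^(l) the response of layer l, z^(l) its pre-activation, W^(l) the weight matrix between layers l-1 and l, E the error, and eta the learning rate:

```latex
% delta^(l) is the error assigned to layer l; the last two terms are the gradient
% of the weight matrix and the update against the gradient direction.
\delta^{(l)} = \left( W^{(l+1)} \right)^{\top} \delta^{(l+1)} \odot f'\!\left( z^{(l)} \right),
\qquad
\nabla_{W^{(l)}} E = \delta^{(l)} \left( a^{(l-1)} \right)^{\top},
\qquad
W^{(l)} \leftarrow W^{(l)} - \eta \, \nabla_{W^{(l)}} E
```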
  • Adam can be used as the optimizer for training the WeightNet-50 classification model.
  • In the parameter settings, the basic learning rate can be set to 0.1 and divided by 10 at the 32,000th and 48,000th iterations, and training is terminated at the 64,000th iteration; the weight decay value is set to 0.0001 and the batch size is set to 128.
  • The training of the WeightNet-50 classification model in the embodiments of this application takes Adam as the optimizer as an example, and the above parameter settings are only a preferred example, subject to the realization of the data training method provided in the embodiments of this application, with no specific limitation.
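  • The schedule above can be expressed as the following PyTorch configuration sketch (stepping the scheduler once per iteration reproduces the divide-by-10 milestones); the model is a placeholder.

```python
# Optimizer and learning-rate schedule matching the settings described above.
import torch

def build_training(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1, weight_decay=1e-4)
    # Divide the learning rate by 10 at iterations 32,000 and 48,000
    # (scheduler.step() is called once per iteration).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[32_000, 48_000], gamma=0.1)
    return optimizer, scheduler

MAX_ITERATIONS = 64_000
BATCH_SIZE = 128
```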
  • FIG. 5 is a schematic flowchart of the data training method according to an embodiment of the present invention, as shown in FIG. 5, including:
  • In step S502, the feature extraction module in the target detection model is initialized by the converged weight classification model to obtain the target detection model to be trained, wherein the converged weight classification model is obtained by training with the method in Embodiment 2.
  • The data training method provided in the embodiments of the present application is suitable for training a weighted attention neural network model, where the weighted attention neural network model includes a target detection model (Faster-RCNN); the Faster-RCNN is used to extract the position frame information of each person in the input image for pose estimation by the single-person pose estimation model.
  • The Faster-RCNN includes: a feature extraction module (WeightNet), a suggestion frame generation module (RPN), and a target classifier and position frame regression prediction module (Fast-RCNN).
  • The feature extraction module in step S502 is the feature extraction module of the Faster-RCNN; based on the weight classification model obtained in Embodiment 2, the feature extraction module is initialized from the weight classification model, excluding the output layer parameters.
  • the weights of the feature extraction module that obtains the image features in the first preset data set can be initialized by the weight classification model as follows:
  • The WeightNet-50 classification model is pre-trained on the ImageNet data set for the classification task, and the final converged weights are used as the initial weights of the feature extraction module in the person detection model, to improve the accuracy of the final person detection model and speed up the convergence of model training.
  • Adam is used as the optimizer (Adam: Adaptive Moment Estimation, a method for stochastic optimization); the basic learning rate is set to 0.1 and divided by 10 at the 32,000th and 48,000th iterations, and training is terminated at the 64,000th iteration; the weight decay value is 0.0001 and the batch size is set to 128.
  • the preprocessing operation of the image in the first preset data set adopts a preset probability to randomly flip horizontally.
  • the preset probability can be set to 50%.
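  • A minimal sketch of this preprocessing step, assuming torchvision (an assumption, not part of this application), is as follows:

```python
from torchvision import transforms

# Randomly flip each input image horizontally with the preset probability (50% here).
preprocess = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```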
  • Step S504 Train the target detection model to be trained by using the target location frame label information in the second preset data set to obtain the trained target detection model;
  • FIG. 6 is a schematic diagram of the target detection model in the data training method according to an embodiment of the present invention.
  • the second preset data set may be a data set containing the target location frame label information; for example, it may be composed of the target location frame label information in the COCO and Kinetics-14 data sets.
  • the target detection model is trained by using the data set composed of the target location frame label information in the COCO and Kinetics-14 data sets to improve the recognition effect of the final overall architecture on the location of characters in similar scenes.
  • the feature extraction module in the embodiment of the present application is obtained based on the weight classification model trained in Embodiment 2; the weight classification model and the feature extraction module differ in structure and function;
  • the weight classification model refers to the WeightNet-50 classification model pre-trained on the ImageNet data set for the classification task; its final convergent weights are used as the initial weights of the feature extraction module in the target detection model, so as to improve the accuracy of the final target detection model and speed up the convergence of model training; the structure of the weight classification model is: weight classification network + classifier;
  • the feature extraction module is obtained by initializing its weights from the weight classification model; structurally, the feature extraction module is the weight classification model with the classifier removed, that is, it contains only the weight classification network part.
  • Step S506 training the network parameters of the single-person pose estimation model to be trained according to the target key point label information in the third preset data set to obtain the trained single-person pose estimation model;
  • the third preset data set may be a data set containing target key point label information; for example, it may be composed of the target key point label information in the COCO and Kinetics-14 data sets.
  • a single-person pose estimation model is trained by using a data set composed of target key point tag information in the COCO and Kinetics-14 data sets to improve the recognition effect of the final overall architecture on the key points of character bones in similar scenes.
  • FIG. 7 is a schematic diagram of the single-person pose estimation model in the data training method according to an embodiment of the present invention. As shown in FIG. 7, the model is based on the HRNet algorithm and the data set constructed above, and a single-person pose model suited to this scenario is retrained. The HRNet model connects high-resolution and low-resolution subnetworks in parallel, which differs from the serial connection used in related technologies, and the HRNet model maintains high resolution throughout instead of restoring it through a low-to-high process. Its fusion scheme also differs from that of related technologies, which combine low-level and high-level representations: in the embodiment of this application, the HRNet model uses repeated multi-scale fusion, exploiting low-resolution representations of the same depth and similar level to improve the high-resolution representation.
  • step S508 a weighted attention neural network model is obtained according to the trained target detection model and the trained single-person pose estimation model.
  • a weighted attention neural network model is thereby obtained; that is, the combination of the Faster-RCNN model and the HRNet model constitutes the weighted attention neural network model.
  • the first preset data set in the data training method provided by the embodiments of the present application is used to train the weight classification model, and then the convergent weight classification model is used to initialize the feature extraction module in the target detection model; the second preset data set Used to train the target detection model; the third preset data set is used to train the single-person pose estimation model.
  • step S504 training the target detection model to be trained based on the target location frame label information in the second preset data set, and obtaining the trained target detection model includes:
  • Step S5041: in the case where the target detection model includes a feature extraction module, a suggestion box generation module, and a target classifier and position box regression prediction module, respectively train the feature extraction module and the suggestion box generation module to obtain the first parameter value of the feature extraction module and the first parameter value of the suggestion box generation module;
  • as described in step S502, the target detection model includes: a feature extraction module, a suggestion box generation module (RPN), and a target classifier and location box regression prediction module (Fast-RCNN); the parameters of the feature extraction module and the RPN module are trained first;
  • RPN: the suggestion box generation module;
  • Fast-RCNN: the target classifier and location box regression prediction module.
  • the details of the training are as follows: separately train the feature extraction module and the RPN module parameters to obtain rpn1 (that is, the first parameter value of the suggestion box generation module in the embodiment of this application) and weightnet1 (that is, the first parameter value of the feature extraction module in the embodiment of this application).
  • the suggestion box generation module in the target detection model, and the target classifier and position box regression prediction module can be initialized by different data distribution methods (commonly used initialization methods are: 1. Initialize to 0, 2. Random initialization, 3. Xavier initialization, 4. He initialization; in the embodiment of the present application, 3 or 4 is preferred).
  • Step S5042: train the target classifier and position box regression prediction module according to the first parameter value of the feature extraction module and the first parameter value of the suggestion box generation module, to obtain the first parameter value of the target classifier and position box regression prediction module and the second parameter value of the feature extraction module;
  • Fast-RCNN: that is, the target classifier and position box regression prediction module in this embodiment of the application.
  • Fast-RCNN is trained according to the first parameter value of the feature extraction module and the first parameter value of the suggestion box generation module to obtain fast-rcnn1 (ie , The first parameter value of the target classifier and the position box regression prediction module in the embodiment of the present application), WeightNet2 (ie, the second parameter value of the feature extraction module in the embodiment of the present application).
  • Step S5043 training the suggestion box generation module according to the first parameter value of the target classifier and the position box regression prediction module and the second parameter value of the feature extraction module to obtain the second parameter value of the suggestion box generation module;
  • the RPN (ie, the suggestion box generation module in the embodiment of the present application) is trained in combination with fast-rcnn1 and WeightNet2 to obtain rpn2 (ie, the second parameter value of the suggestion box generation module in the embodiment of the present application).
  • Step S5044 training the target classifier and the position box regression prediction module according to the second parameter value of the suggestion box generation module and the second parameter value of the feature extraction module to obtain the second parameter value of the target classifier and the position box regression prediction module.
  • the Fast-RCNN module is trained according to the second parameter value of the feature extraction module and the second parameter value of the suggestion box generation module to obtain fast-rcnn2 (that is, the second parameter value of the target classifier and position box regression prediction module in this embodiment of the application).
  • the input image preprocessing operation can use mix-up and random horizontal flip (50%), and the process of training the target detection model can take Adam as an optimizer as an example.
  • the parameters can be set as follows: the basic learning rate is 0.001, the weight decay value is 0.0001, the batch size is set to 32, and the numbers of iteration steps of the 4 training stages are 80000, 40000, 80000 and 40000 respectively.
  • the feature extraction module is used to extract the features of each data in the second preset data set;
  • the suggestion frame generation module is used to generate candidate target frames of each data according to the features of each data in the second preset data set ;
  • the target classifier and position frame regression prediction module is used to obtain, according to the features of each data in the second preset data set and the candidate target frames of each data, the detection frame of each target of each data in the second preset data set and the category of the corresponding detection frame; when the suggestion frame generation module includes a convolutional layer with a sliding window followed by two parallel convolutional layers, namely a regression layer and a classification layer, the suggestion frame generation module being used to generate candidate target frames of each data according to the features of each data in the second preset data set includes: obtaining, through the regression layer and according to the features of each data in the second preset data set, the coordinates of the center anchor point of each candidate target frame of each data and the width and height of the corresponding candidate target frame, and determining, through the classification layer, whether each candidate target frame is foreground or background.
  • the feature extraction module in the target detection model provided by the embodiment of the present application is used to extract a feature map of the input image
  • the proposal frame generation module inputs the feature map extracted by the feature extraction module, and outputs a series of candidate target rectangular frame coordinates, which are used to generate the candidate target frame of the input image.
  • the main inputs of the target classifier and location box regression prediction module are the feature map extracted by the feature extraction module and the candidate boxes generated by the suggestion box generation module, and its outputs are the accurate location regression and category prediction results.
  • the RPN network structure includes: a convolution layer using a 3 ⁇ 3 sliding window, followed by two parallel 1 ⁇ 1 convolution layers, which are a regression layer (reg_layer) and a classification layer (cls-layer).
  • the regression layer (reg_layer) is used to predict the coordinates x, y of the center anchor point of the window and the width w and height h of the candidate box on the original image;
  • the classification layer (cls_layer) is used to determine whether the candidate box is foreground or background.
  • the target classifier and the position box regression prediction module is a pooling layer, three fully connected layers and two parallel fully connected layers connected in sequence
  • the target classifier and position frame regression prediction module being used to obtain, according to the features of each data in the second preset data set and the candidate target frames of each data, the detection frame of each target of each data in the second preset data set and the category of the corresponding detection frame includes: converting, through the pooling layer, the variable-length features of each data output by the feature extraction module into fixed-length features of each data; and, according to the fixed-length features of each data, outputting the detection frame of each target of each data in the second preset data set and the category of the corresponding detection frame through the three fully connected layers followed by the two parallel fully connected layers.
  • the target classifier and position box regression prediction module includes an ROI pooling layer, three fully connected layers and two fully connected layers in parallel.
  • the main function of the ROI pooling layer is to convert inputs of different sizes into outputs of a fixed length, and the two parallel fully connected layers are mainly used to predict the category and regress the person detection frame.
  • step S506, training the network parameters of the single-person pose estimation model to be trained according to the target key point label information in the third preset data set to obtain the trained single-person pose estimation model, includes: training the network parameters of the single-person pose estimation model to be trained according to the target key point label information in the third preset data set, and iteratively updating these network parameters through forward propagation and backpropagation algorithms; wherein this training and iterative updating include: expanding the height or width of the input single-person image according to a preset aspect ratio, and cropping the single-person image to a preset size.
  • the input of the HRNet single-person pose estimation network is a single-person image
  • the output is the two-dimensional coordinates of the key points of the human skeleton in the single-person image
  • the structure diagram of the HRNet single-person pose estimation network is shown in Figure 7
  • at each stage a new sub-network branch is added in parallel, whose resolution is half that of the previous branch and whose width (the number of channels C) is doubled.
  • each stage (except the first) contains several exchange blocks, and each exchange block contains a basic unit on each branch (consisting of 4 WeightNet residual units, each WeightNet residual unit being as shown in FIG. 2) and an exchange unit that spans resolutions; the function of the exchange unit is to fuse, through up-sampling and down-sampling, the current outputs of the parallel sub-networks at different resolutions and use the result as the next input of each branch, achieving multi-scale fusion in the model; specifically, the first stage includes a basic unit and a 3×3 convolutional layer, the main function of the 3×3 convolutional layer being to reduce the number of feature map channels output by the basic unit to 32 as the next high-resolution branch; the second, third and fourth stages include 1, 4 and 3 exchange blocks respectively, so HRNet contains 8 exchange blocks in total and performs 8 multi-scale fusions.
  • the number of each branch channel is 32, 64, 128, and 256, respectively.
  • the HRNet network expands the height or width of the input single-person image to a fixed aspect ratio (height to width equal to 4:3) and then crops the image to a fixed size of 384×288; data augmentation (preprocessing) includes random rotation (±45 degrees), random scaling (0.65–1.35) and/or random horizontal flipping; the Adam optimizer is used during training, the basic learning rate is set to 0.001 and drops to 0.0001 and 0.00001 at the 170th and 200th epochs respectively, the batch size is set to 16, and the total number of training epochs is set to 210 (a minimal sketch of the input normalization is given below).
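  • A minimal sketch of the input normalization mentioned above, assuming OpenCV and approximating the expansion step by zero-padding (an illustrative choice, not specified in this application), is as follows:

```python
import cv2
import numpy as np

def normalize_person_crop(image, out_h=384, out_w=288):
    """Pad the single-person crop to a 4:3 height:width ratio, then resize to 384x288."""
    h, w = image.shape[:2]
    target_ratio = out_h / out_w                      # 4:3
    if h / w < target_ratio:                          # too wide: expand the height
        pad = int(round(w * target_ratio)) - h
        image = cv2.copyMakeBorder(image, pad // 2, pad - pad // 2, 0, 0,
                                   cv2.BORDER_CONSTANT, value=0)
    else:                                             # too tall: expand the width
        pad = int(round(h / target_ratio)) - w
        image = cv2.copyMakeBorder(image, 0, 0, pad // 2, pad - pad // 2,
                                   cv2.BORDER_CONSTANT, value=0)
    return cv2.resize(image, (out_w, out_h))

person = np.zeros((500, 220, 3), dtype=np.uint8)      # placeholder person crop
print(normalize_person_crop(person).shape)            # (384, 288, 3)
```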
  • the target detection model and the single-person pose estimation model both use the forward propagation algorithm to obtain the model prediction output and compute its mean square error against the true labels, as in formula (1): MSE = (1/n)·∑_{i=1..n} (y_i − y′_i)², where y_i is the model's prediction for the i-th data, y′_i is the true label of the i-th data, and n is the batch size value.
  • the training goal is that the mean square error of the model on the training data set is minimized/converges (when, during training, the training accuracy and error no longer change as the training iteration steps increase and tend to be stable, the model is said to have converged and the error to be minimized); the optimal model is then selected through the validation set as the detection model for the test phase (during training, the model is tested on the validation set at fixed training intervals, and the model with the highest accuracy or smallest error on the validation set is finally selected).
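  • A minimal numpy sketch of formula (1), the batch mean square error between the model predictions and the true labels, is as follows:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean square error over a batch of n predictions, as in formula (1)."""
    y_pred, y_true = np.asarray(y_pred, dtype=float), np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

print(mse([0.2, 0.8, 0.1], [0.0, 1.0, 0.0]))   # 0.03
```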
  • the method used in training the network parameters of the single-person pose estimation model to be trained includes the data processing method in Embodiment 1.
  • the data training method provided in this embodiment of the application further includes: collecting the samples required for training the target detection model to be trained and the single-person pose estimation model to be trained; and preprocessing the samples, where the preprocessing includes data set division and preprocessing operations. Training the weight classification model to be trained to obtain a convergent weight classification model includes: inputting the data in the first preset data set into the weight classification model to be trained to obtain a category prediction result; obtaining, according to the category prediction result and the label categories of the data in the first preset data set, the error between the category prediction result and the label categories; and training the weight classification model to be trained with the backpropagation algorithm according to this error until the weight classification model to be trained converges, so as to obtain the convergent weight classification model.
  • samples in the embodiments of this application may be derived from open source data sets, such as: the Microsoft COCO 2017 Keypoint Detection Dataset, Kinetics-600 and ImageNet (Large Scale Visual Recognition Challenge);
  • the preprocessing in the embodiment of the present application includes the division of data sets and preprocessing operations, where the division of data sets is a data processing step performed before the data are input into the model; the above three data sets are each divided in a preset way so that the optimal data model can be obtained by screening.
  • the preprocessing operations include mixing operations and random geometric transformations.
  • new training data are obtained by synthesizing different pictures and geometrically transforming the pictures, so that situations in which people are occluded, which are common in practice, are also reflected in the training data.
  • the preprocessing operation enriches the diversity of the training data, makes the model more robust, and can effectively reduce the impact of adversarial images.
  • the first preset data set includes a first type of image data set, which has a predefined training set and validation set; the second preset data set includes the data labeled with position box information in a second type of image data set and a third type of image data set; the second type of image data set has a predefined training set and validation set, while the third type of image data set is randomly divided into a training set and a validation set according to a preset ratio;
  • the training sets of the second and third types of image data sets together form the training set of the second preset data set, and their validation sets together form the validation set of the second preset data set;
  • the third preset data set includes the data labeled with key point information in the second type of image data set and the third type of image data set;
  • the preprocessing operation includes: the data in the first preset data set and the third preset data set are processed through random geometric transformation, and the data in the second preset data set is processed through the random mixing operation and random geometric transformation.
  • the first preset data set includes a first type of image data set.
  • the first type of image data set can be described by taking the ImageNet data set as an example;
  • the second type of image data set included in the second preset data set can be illustrated by taking the data labeled with position box information in the Microsoft COCO 2017 Keypoint Detection Dataset (hereinafter referred to as the COCO data set) as an example;
  • the third type of image data set included in the second preset data set can be illustrated by taking the data labeled with position frame information in Kinetics-14 as an example; the data labeled with key point information in the second type of image data set and the third type of image data set included in the third preset data set can be illustrated by taking the data labeled with key point information in the COCO data set and in Kinetics-14 as examples.
  • the COCO data set contains more than 200,000 images and a total of 250,000 instances labeled with two-dimensional key point information (in this data set, the people in the pictures are mostly of medium and large scale); the training set and the validation set together contain more than 150,000 people and 1.7 million labeled key points.
  • the annotation information is mainly recorded in corresponding .json files, which record the detailed information of each picture, including: the download URL of the picture, the picture name, the picture resolution, the time when the picture was collected, the index (ID) of the picture, and the number of visible bone key points of each person in the picture (a complete annotation in the COCO data set comprises 17 bone key points).
  • FIG. 8 shows the key point positions and skeleton connections in the data training method according to the embodiment of the present invention: the left picture in FIG. 8 is a schematic diagram of the key point positions and skeleton connections of the COCO data set, and the right picture in FIG. 8 is a schematic diagram of the key point positions and skeleton connections obtained based on the COCO data set in the data training method provided by the embodiment of the application.
  • FIGs. 9a and 9b are schematic diagrams of the pre- and post-labeling effect of the key point positions and the skeleton connection in the data training method according to the embodiment of the present invention.
  • the labeling process is as follows: using the annotation tool, the 17 specific visible key points are manually marked on each picture; the left side shows the original picture, and the right side shows the visualized effect after labeling.
  • because the existing human body detection models and pose estimation models are mainly trained on images of natural scenes, they perform poorly for target detection and pose estimation in sports scenes; this is because the body poses of people in sports scenes differ considerably from those in natural scenes, so the existing target detection models and pose estimation models give poor detection and pose estimation results for people in sports scenes;
  • the embodiment of the application additionally collects 14 sports categories from the Kinetics-600 open source data set, including: bench press, clean and jerk, rope climbing, deadlift, lunge, boxing, running, sit-ups, rope skipping, deep squatting and leg stretching, for a total of more than 10,000 pictures of sports scenes, and labels them using the open source software Visipedia Annotation Toolkit (an image key point annotation tool) in the same annotation format as the COCO data set; this data set is called Kinetics-14 in the embodiment of this application. Based on the target position frame label information and the target key point label information in Kinetics-14 (that is, the third type of image data set in the embodiment of this application) and the COCO data set (that is, the second type of image data set in the embodiment of this application), the target detection model and the single-person pose estimation model are trained respectively, to improve the final overall framework's recognition of the positions of persons and of the skeleton key points in similar scenes.
  • the data set composed of the target location frame label information in Kinetics-14 and the COCO data set is the second preset data set in the embodiment of the application, and the data set composed of the target key point label information in Kinetics-14 and the COCO data set is the third preset data set in the embodiment of this application.
  • the first preset data set needs to undergo random geometric transformation while being input into the above model for training;
  • the second preset data set needs to undergo the data mixing operation and random geometric transformation while being input into the above model for training;
  • the third preset data set needs to undergo random geometric transformation while being input into the above model for training.
  • the random geometric transformation includes random cropping, random rotation according to a preset angle, and/or random scaling according to a preset zoom ratio;
  • the random mixing operation includes superimposing at least two data items according to preset weights; specifically, the pixel values at corresponding positions in the different data items are each multiplied by a preset weight and then added.
  • FIG. 10 is a schematic diagram of the effect of mix-up in the data recognition method according to an embodiment of the present invention; as shown in FIG. 10, the mix-up operation process is as follows:
  • the two input images are merged into a new image according to a certain weight, and the merged image is used as the new input training data; since the target detection model is very sensitive to geometric transformation of the image, geometric alignment is used when performing the mix-up operation to avoid image distortion, that is, the images are not cropped or scaled, and the pixel values at corresponding positions are directly multiplied by the respective weights and then added; the specific expression is formula (2).
  • x_i and x_j represent two different images, x̃ represents the image synthesized by the mix-up operation, and λ represents the weight of the mix-up; for each synthesis, λ is randomly sampled from a Beta random distribution, as expressed in formula (4).
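  • A minimal numpy sketch of a mix-up operation of this kind, assuming two input images of identical size and using the standard weighted-sum form (the value alpha=1.5 is an illustrative assumption, not a parameter stated in this application), is as follows:

```python
import numpy as np

def mix_up(x_i, x_j, alpha=1.5):
    """Blend two same-sized images pixel-wise without cropping or scaling."""
    lam = np.random.beta(alpha, alpha)        # lambda sampled from a Beta distribution
    return lam * x_i + (1.0 - lam) * x_j      # weighted sum of corresponding pixel values

img_a = np.random.rand(480, 640, 3)           # placeholder input images
img_b = np.random.rand(480, 640, 3)
mixed = mix_up(img_a, img_b)
```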
  • the random geometric transformation includes random cropping to 256×256 (multiple cropping sizes are possible; the size is generally set to a power of 2, with the shortest side not less than 128 and the longest side not greater than 512), random rotation within the range of (−45°, 45°) (that is, the preset rotation angle in the embodiment of this application), random horizontal flipping with 50% probability, and random scaling within the range of (0.65, 1.35).
  • random cropping means that the size of the original picture is randomly cropped to 256x256 (the cropping size used in the embodiment of this application), and the channel size is unchanged;
  • the random rotation operation means that the image angle is randomly rotated within plus or minus 45 degrees to change the image content.
  • random flip operation means to flip the image at a random level with a probability of 50%;
  • the random zoom operation means enlarging or reducing the image within a ratio of 0.65 to 1.35; when training the classification network and the pose estimation network, random geometric transformation not only augments the data but is also a method to reduce data noise and increase model stability.
  • the random geometric transformation in the embodiments of the present application may include one or a combination of at least two of random cropping, random rotation according to a preset angle and random scaling according to a preset zoom ratio, and the execution sequence is adjusted according to the actual needs of the pictures (a minimal sketch is given below): for example, if the size of a picture already meets the data training requirement, random cropping or scaling is not required; or, if the display angle of a picture already meets the data training requirement, random rotation is not required; in the same way, random geometric transformation is performed on each picture according to the actual demand for that picture.
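  • A minimal sketch of such a random geometric transformation pipeline, assuming torchvision (the ordering and the pad_if_needed choice are illustrative assumptions), is as follows:

```python
from torchvision import transforms

geometric_augment = transforms.Compose([
    transforms.RandomAffine(degrees=45, scale=(0.65, 1.35)),   # random rotation within +/-45 degrees, random scaling in (0.65, 1.35)
    transforms.RandomCrop(256, pad_if_needed=True),            # random cropping to 256x256
    transforms.RandomHorizontalFlip(p=0.5),                    # random horizontal flip with 50% probability
])
```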
  • the preprocessing operation in the embodiment of this application preprocesses, in the above-mentioned manner and during each round of model training, the part of the data used for training in that round, and these preprocessed data are then used for training;
  • the data selected in different rounds differ, as do the actual training data obtained after preprocessing, in order to achieve the effect of gradual convergence.
  • FIG. 11 is a schematic flowchart of a data identification method according to an embodiment of the present invention, as shown in FIG. 11, including:
  • Step S1102 input the feature data to be recognized into the weighted attention neural network model, and identify the two-dimensional coordinates of the key points of at least one target in the feature data to be recognized.
  • the weighted attention neural network model is used to estimate the posture of at least one person in a top-down manner: the position rectangle of at least one target in the feature data to be recognized is detected, and the two-dimensional coordinates of the key points of the target within the position rectangle are then detected;
  • Step S1104: calculate, from the two-dimensional coordinates of the key points of the target, the angle between the line of the first preset key point combination and the line of the second preset key point combination, or the angle between the line of the first preset key point combination and the first preset line;
  • the first preset line can be a horizontal line or a vertical line, etc.; there are two key points in the first preset key point combination; and there are two key points in the second preset key point combination.
  • the angle between the line of the first preset key point combination and the line of the second preset key point combination, or between the line of the first preset key point combination and the first preset line, can arise in the following scenarios:
  • Scenario 1: the included angle between the two specific lines determined by three specific key points;
  • Scenario 2 The angle between the connection of two specific key points and the environmental line (for example, a horizontal line or a vertical line, that is, the first preset line in the embodiment of the present application);
  • for example, when the two key points obtained are the two key points located at the shoulders of the human body target, a line segment connecting them is required; since no further connection is available, the angle is formed between this connection and the horizontal line or the vertical line.
  • Scenario 3 The angle between the connection of two specific key points and the connection of the other two key points;
  • Step S1106: the angle between the line of the first preset key point combination and the line of the second preset key point combination, or the angle between the line of the first preset key point combination and the first preset line, is matched in the first preset database to obtain the recognition result of the target.
  • FIG. 12 is a schematic diagram of the evaluation process of the posture risk based on deep learning in the data recognition method according to the embodiment of the present invention.
  • the characteristic data in the embodiment of the present application may include pictures and/or videos; that is, in the embodiment of the present application, the input form of the feature data may include: form one: picture; form two: video; form three: picture and video.
  • the data recognition method provided by the embodiment of the present application also includes data sample collection and neural network learning before inputting the characteristic data into the end-to-end model.
  • the posture risk assessment method provided by the embodiment of the present application is as follows:
  • Step 1: data collection; samples are collected according to the acquired data sets;
  • Step2 Based on the sample collection of Step1, preprocess the data in the data set to obtain the training set and the test set respectively;
  • Step3 Input feature data to the end-to-end model to obtain the two-dimensional coordinates of the key points of the target;
  • Step4 According to the data type of the characteristic data, the angle is calculated according to the two-dimensional coordinates of the key points of the target, and the assessment result of the posture risk is generated.
  • the image to be evaluated is input into the end-to-end model, and the output is the two-dimensional coordinates of the human skeleton key points recognized by the model (that is, the two-dimensional coordinates of the key points of the target in the embodiment of the application).
  • the input can also be a sports video; the continuous change curve information of each joint angle of each athlete in the video stream (frames) is acquired as described above, compared with the standard sports library, and targeted exercise improvement guidance is then provided.
  • the picture and video are processed separately in the multi-person pose estimation module.
  • when a picture is input, the two-dimensional coordinates of the human skeleton key points in the picture are obtained; when a video is input, the continuous change curve information of each joint angle of each athlete is obtained from each frame of the video, or frame images are extracted from the video at a preset time interval and the continuous change curve information of each joint angle of each athlete is obtained from the extracted frame images; extracting frame images at preset time intervals reduces the image recognition load on the computer, reduces the amount of calculation and improves recognition efficiency; the posture risk assessment results for each person in the picture and each person in the video are then obtained from the two-dimensional key point coordinates and the continuous change curve information respectively (a minimal sketch of frame extraction at a preset interval is given below).
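  • A minimal sketch of extracting frames at a preset time interval, assuming OpenCV (the one-second interval is an illustrative assumption), is as follows:

```python
import cv2

def sample_frames(video_path, interval_seconds=1.0):
    """Return frames sampled from the video at the preset time interval."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0            # fall back if the FPS metadata is missing
    step = max(1, int(round(fps * interval_seconds)))  # number of frames to skip between samples
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```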
  • the top-down multi-person pose estimation method is adopted: by inputting the feature data to be recognized into the weighted attention neural network model, the two-dimensional coordinates of the key points of at least one target in the feature data to be recognized are recognized, where the weighted attention neural network model is used to estimate the pose of at least one person in a top-down manner, detecting the position rectangle of at least one target in the feature data to be recognized and then detecting the two-dimensional coordinates of the key points of the target within the position rectangle; from the two-dimensional coordinates of the key points of the target, the angle between the line of the first preset key point combination and the line of the second preset key point combination, or the angle between the line of the first preset key point combination and the first preset line, is calculated; this angle is matched in the first preset database to obtain the recognition result of the target.
  • step S1106, in which the angle between the line of the first preset key point combination and the line of the second preset key point combination, or the angle between the line of the first preset key point combination and the first preset line, is matched in the first preset database and the recognition result of the target is obtained, includes: in the case that the feature data to be recognized include picture data, matching the obtained angle value of at least one included angle with the angle value of the corresponding included angle type in the first preset database to obtain the recognition result of the picture data.
  • the included angle includes: the angle between the line between the eyes and the horizontal line, the angle between the shoulder line and the horizontal line, the angle between the crotch line and the horizontal line, the angle between the center line of the head and the vertical line , The angle between the midline of the torso and the vertical straight line, the joint angle between the upper arm and the lower arm, the joint angle between the thigh and the calf, the angle between the line between the ear and the shoulder and the vertical straight line, the joint angle between the midline of the trunk and the midline of the thigh , The joint angle between the upper arm and the lower arm and the joint angle between the thigh and the calf.
  • FIGS. 13a and 13b are schematic diagrams of front and side shots in the data recognition method according to an embodiment of the present invention. As shown in FIGS. 13a and 13b, they show the 13 specific joint angles calculated by the angle calculation module, including: the angle between the line between the eyes and the horizontal line (front view/1), the angle between the shoulder line and the horizontal line (front view/2), the angle between the crotch line and the horizontal line (front view/3), the angle between the midline of the head and the vertical line (front view/4), the angle between the midline of the torso and the vertical line (front view/5), the joint angle between the upper arm and the lower arm (front view/left 6, right 7), the joint angle between the thigh and the calf (front view/left 8, right 9), the angle between the line between the ear and the shoulder and the vertical line (side view/10), the joint angle between the midline of the torso and the midline of the thigh (side view/11), the joint angle between the upper arm and the lower arm (side view/12), and the joint angle between the thigh and the calf (side view/13).
  • suppose A, B and C are three points on a two-dimensional plane (that is, any three points on the two-dimensional plane where the feature data in the embodiment of the present application are located), and the angle at A relative to the straight line AB is required; the slope angles of the line AC and the line AB can be calculated first and then converted into the corresponding angles, and the difference between the angles of the two lines is the angle to be obtained; considering the direction of the angle, the clockwise direction is taken as positive.
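  • A minimal numpy sketch of this angle calculation (image coordinates with the y axis pointing downwards are assumed, so the clockwise direction is positive) is as follows:

```python
import numpy as np

def signed_angle(a, b, c):
    """Signed angle at A between line AB and line AC, in degrees, clockwise positive."""
    a, b, c = map(np.asarray, (a, b, c))
    ang_ab = np.arctan2(b[1] - a[1], b[0] - a[0])   # slope angle of line AB
    ang_ac = np.arctan2(c[1] - a[1], c[0] - a[0])   # slope angle of line AC
    diff = np.degrees(ang_ac - ang_ab)              # difference between the two angles
    return (diff + 180.0) % 360.0 - 180.0           # wrap into [-180, 180)

print(signed_angle((0, 0), (1, 0), (0, 1)))         # 90.0
```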
  • step S1106, in which the angle between the line of the first preset key point combination and the line of the second preset key point combination, or the angle between the line of the first preset key point combination and the first preset line, is matched in the first preset database and the recognition result of the target is obtained, further includes:
  • Step S11061: in the case that the feature data to be recognized include video data, obtain, for each frame or specified frame, the two-dimensional coordinate information of the key points of at least one target in each corresponding frame of the video data, where the specified frames are fixed-time-interval frames and/or key frames;
  • processing each frame or only specified frames is implemented as follows: since consecutive frames often contain repeated images, in order to improve the efficiency of data processing, the two-dimensional coordinate information of the key points of at least one target can be collected only from frames at preset time intervals (fixed time intervals) or from key frames, which reduces the data processing load compared with processing every frame.
  • the key frame can be obtained through the relevant function flags of the software.
  • a frame in which a person or animal is detected is regarded as a key frame, and/or a frame whose motion change exceeds a preset amplitude is determined as a key frame;
  • Obtaining two-dimensional coordinate information of key points of at least one target in the specified frame of the video data can be applied to the uploaded video data that has been shot.
  • acquiring the two-dimensional coordinate information of the key points of at least one target in specified frames of the video data, using fixed-time-interval frames and key frames, can also be implemented simultaneously on multiple computing devices with data processing capabilities.
  • Step S11062: obtain the angle-time variation curve of at least one specific included angle of at least one target according to the two-dimensional coordinate information of the key points of the at least one target in each corresponding frame of the video data, compare and analyze this curve with the angle-time variation curve of at least one included angle of at least one standard motion, and obtain the recognition result.
  • obtaining the angle-time variation curve of at least one specific included angle of the at least one target, comparing and analyzing it with the angle-time variation curve of at least one included angle of at least one standard motion, and obtaining the recognition result includes: comparing, for similarity, the angle-time variation curve of at least one specific included angle of the at least one target with the angle-time variation curve of at least one included angle obtained in advance for at least one standard motion.
  • if the similarity falls within the first preset threshold interval, it is determined that the corresponding target in each corresponding frame of the video data is performing the corresponding standard motion type; after it is determined that the target is performing the corresponding standard motion type, the angle-time variation curve of at least one specific included angle of the target is further compared with the angle-time variation curve of the corresponding specific included angle of the standard motion; if the difference between adjacent maximum and minimum values on the angle-time variation curve of at least one specific included angle of the target and the corresponding difference on the angle-time variation curve of the corresponding specific included angle of the standard motion falls within the second preset threshold interval, it is determined that the joint motion corresponding to the specific included angle of the target in the video data is standardized; otherwise the joint motion corresponding to the specific included angle of the target in each corresponding frame of the video data is not standardized; it is also judged whether the difference between the spacing of adjacent peaks on the angle-time variation curve of at least one specific included angle of the target and the spacing of adjacent peaks on the corresponding curve of the standard motion falls within the third, fourth or fifth preset threshold interval, so as to determine the exercise intensity.
  • for example, consider the change curve of the arm angle of a person exercising in a video: when the person lifts or lowers a barbell, the coordinates of the arm key points in the image change, so a curve is obtained by connecting the angle values as they change over time; this angle-time variation curve is then compared and analyzed with the angle-time variation curve of at least one included angle of the corresponding standard motion type of at least one standard motion, and the recognition result is obtained.
  • the angle-time variation curve can be obtained from the angle changes in every frame of the video, or from the angle changes in the frame images extracted at the preset time interval;
  • the difference between adjacent maximum values on the two angle-time variation curves is computed, and whether the joint motion corresponding to each specific included angle of the person is standardized is judged from this difference; further, by calculating the difference between the spacings of adjacent peaks on the two angle-time variation curves and judging whether it falls within the third, fourth or fifth preset threshold interval, it is determined whether the person's exercise intensity is too low, appropriate or too high (a minimal sketch of these curve statistics is given below).
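  • A minimal sketch of these curve statistics, assuming numpy/scipy (the placeholder sine curves and the reduction of each curve to two summary numbers are illustrative assumptions; the actual comparisons use the preset threshold intervals described above), is as follows:

```python
import numpy as np
from scipy.signal import find_peaks

def curve_statistics(curve):
    """Amplitude (mean peak minus mean trough) and mean spacing between adjacent peaks."""
    curve = np.asarray(curve, dtype=float)
    peaks, _ = find_peaks(curve)                    # adjacent maxima
    troughs, _ = find_peaks(-curve)                 # adjacent minima
    amplitude = np.mean(curve[peaks]) - np.mean(curve[troughs])
    peak_spacing = np.mean(np.diff(peaks))          # frames between adjacent peaks
    return amplitude, peak_spacing

t = np.linspace(0, 4 * np.pi, 200)
user_curve = 90 + 40 * np.sin(t)                    # placeholder angle-time curve of the tester
standard_curve = 90 + 45 * np.sin(t)                # placeholder curve of the standard motion

amp_user, spacing_user = curve_statistics(user_curve)
amp_std, spacing_std = curve_statistics(standard_curve)
amplitude_difference = amp_user - amp_std           # compared with the second preset threshold interval (standardization)
spacing_difference = spacing_user - spacing_std     # compared with the third/fourth/fifth preset threshold intervals (intensity)
```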
  • the first preset threshold interval in the embodiment of the present application is used to determine the movement type of the target in the video;
  • the second preset threshold interval is used to determine whether the movement posture of the target in the video is standardized;
  • the third preset threshold interval, the fourth preset threshold interval and the fifth preset threshold interval are used to determine the exercise intensity of the target in the video;
  • the setting of the third preset threshold interval, the fourth preset threshold interval or the fifth preset threshold interval can also be realized by setting a threshold interval, and the corresponding exercise intensity is set through the sub-intervals in each threshold interval.
  • alternatively, this embodiment may not perform action type recognition and may instead directly obtain the movement type of the feature data to be recognized (for example, the corresponding movement type is entered together with the video or image); at least one angle-time variation curve obtained from the feature data to be recognized is then directly compared with the corresponding angle-time variation curve of the standard action for the entered movement type, using the comparison method described above.
  • the main input of the motion guidance module is the motion video of a single person or multiple people.
  • the two-dimensional coordinate information of the key points of each human body in the motion video stream (frame) is obtained through the multi-person pose estimation model.
  • from the two-dimensional coordinates in the video stream (frames), the continuously changing value of each specific joint angle of each person in the video stream (frames) is obtained through the angle calculation module (each frame of the video stream can be regarded as a time point, and the line connecting the angle values at each time point is the angle variation curve of angle value y against frame x), and this is compared and analyzed with the corresponding standard motion curve; the standard motion curve is obtained by identifying the key points and the variation values of each joint angle through the model of this application, and motion correction guidance is then given.
  • the specific implementation is as follows: as the video stream (frames) is input, a continuous angle variation curve is recorded for each specific angle of each person; the first preset database stores, for each type of standard action (including different stances and orientations of the same action), the pre-computed angle-time variation curve of each specific joint angle; once the angle-time variation curve of each specific joint angle of each person in the video stream (frames) is obtained, it is matched and compared with the angle-time variation curve of the corresponding standard action; the difference between adjacent extreme values (the lowest value and the highest value) of the angle-time variation curve can be used to judge whether the motion amplitude of the specific joint under test is standardized: if the difference between the distance between adjacent maximum and minimum values of the angle-time variation curve of the tested person's joint and the distance at the corresponding position in the standard motion video is greater than a specified threshold (that is, outside the second preset threshold interval in the embodiment of the present application), the motion is judged to be non-standard; the distance between every two peaks of the angle variation curve (the distance between two adjacent maximum or minimum values) can be used to measure the intensity of the exercise at the specific angle: if the difference between the peak spacing of the tested person's joint curve and that at the corresponding position in the standard motion video is greater than a specified threshold and lies in the interval where that threshold is located (that is, the third preset threshold interval in the embodiment of the present application), it can be concluded that the joint exercise intensity is too high; if it lies in the interval of a specified threshold (that is, the fourth preset threshold interval in the embodiment of the present application), it can be concluded that the exercise intensity is moderate; if it is less than a specified threshold and lies in the interval where that threshold is located (that is, the fifth preset threshold interval in the embodiment of the present application), it can be concluded that the exercise intensity is too low; the standardization values and intensity values can then be integrated to give overall guidance.
  • the data identification method provided in the embodiment of the present application further includes:
  • step S1109 matching is performed in the second preset database according to the recognition result to obtain a posture evaluation result corresponding to the recognition result.
  • the angle value-posture knowledge base is the second preset database in the embodiment of the application.
  • the posture assessment risk of each part is divided into three levels: low risk, potential risk, and high risk.
  • the specific matching process is:
  • Head roll risk assessment (0-4 degrees: low risk; 4-9 degrees: potential risk; above 9 degrees: high risk)
  • the main matching angle is 1;
  • Pelvic roll risk assessment (0-2 degrees: low risk; 2-4 degrees: potential risk; above 4 degrees: high risk)
  • the main matching angle is 6;
  • Knee hyperextension risk assessment (179-180 degrees: low risk; 177-179 degrees: potential risk; below 177 degrees: high risk)
  • the main matching angle is 13.
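  • A minimal sketch of matching measured angle values against this angle value-posture knowledge base (the second preset database), using the example thresholds listed above, is as follows:

```python
def head_roll_risk(angle_1):
    """Head roll assessment from angle 1 (line between the eyes vs. the horizontal line)."""
    if angle_1 < 4:
        return "low risk"
    if angle_1 < 9:
        return "potential risk"
    return "high risk"

def knee_hyperextension_risk(angle_13):
    """Knee hyperextension assessment from angle 13."""
    if angle_13 >= 179:
        return "low risk"
    if angle_13 >= 177:
        return "potential risk"
    return "high risk"

print(head_roll_risk(6.5), knee_hyperextension_risk(176.0))   # potential risk / high risk
```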
  • FIG. 14 is a schematic diagram of the evaluation result of the posture risk in the data recognition method according to the embodiment of the present invention.
  • the embodiment of the present application summarizes 7 common unhealthy postures, namely head tilt, uneven (high and low) shoulders, spinal displacement, pelvic tilt, abnormal leg shape, forward head with round shoulders, and knee hyperextension.
  • the data recognition method provided in the embodiment of the present application further includes:
  • step S1110 matching is performed in the third preset database according to the posture evaluation result to obtain suggestion information corresponding to the posture evaluation result.
  • the first preset database, the second preset database and the third preset database may be three independent databases, databases located on different servers, or three storage areas on one server.
  • the advice information includes but is not limited to the potential diseases corresponding to the posture, suggestions for improvement, and so on. For example, when the assessment result includes the risk of forward head and round shoulders in the target, the suggestion information corresponding to the assessment result may include: this posture will cause cervical vertebra displacement and protrusion; such postural changes will cause dizziness, neurological headaches and head pain; it is recommended to avoid playing with mobile phones, facing the computer or TV, or reading for long periods, and to take part in more physical exercise, especially ball sports. As another example, the suggestion information corresponding to an assessment result may include: this posture will cause legs of unequal length and lumbar disc protrusion; with such postural changes, the lengths of the two legs will differ and, when standing, the body weight will load the two legs unevenly; if lumbar disc herniation occurs, it will cause uneven force on the lumbar spine and a risk of being bedridden; recommendations for unequal leg length: avoid crossing the legs, supporting the sitting posture with one leg, or bearing the weight on one leg when standing; recommendations for lumbar disc herniation: avoid sitting for long periods, take part in more physical exercise, exercise the lumbar spine appropriately, and cooperate with regular massage.
  • the data recognition method provided by the embodiments of the present application can also be applied to online shopping. Taking shopping for clothes online as an example, a user uploads a selfie photo or selfie video, identification is performed through steps S1102 to S1106 to obtain an identification result, the identification result is compared with that of the model wearing product A stored on the server, and shopping suggestions are provided based on the comparison result. For example, the sizes of product A are: S, M, L, XL and XXL; if steps S1102 to S1106 identify that the user's figure is the same as the model's, and the size of product A worn by the model is M, the user is advised to buy product A in size M; if the user is slimmer than the model, the user is advised to buy product A in size S; conversely, if the user is larger than the model, the user is advised to buy product A in size L, XL or XXL.
  • FIG. 15 is a schematic diagram of the data recognition device according to an embodiment of the present invention.
  • the weighted attention neural network model is configured to perform top-down pose estimation of at least one person, detecting the position rectangle of at least one target in the feature data to be recognized and detecting the two-dimensional coordinates of the key points of the target within the position rectangle;
  • the calculation module 1504 is configured to calculate, from the two-dimensional coordinates of the key points of the target, the angle between the line of the first preset key point combination and the line of the second preset key point combination, or the angle between the line of the first preset key point combination and the first preset line;
  • the matching module 1506 is configured to match the angle between the line of the first preset key point combination and the line of the second preset key point combination, or the angle between the line of the first preset key point combination and the first preset line, in the first preset database to obtain the recognition result of the target.
  • the matching module 1506 includes: a first matching unit configured to, when the feature data to be recognized include picture data, match the obtained angle value of at least one included angle with the angle value of the corresponding included angle type in the first preset database to obtain the recognition result of the picture data.
  • the matching module 1506 further includes: an acquiring unit configured to, in the case that the feature data to be recognized include video data, acquire, for each frame or specified frame, the two-dimensional coordinate information of the key points of at least one target in each corresponding frame of the video data, wherein the specified frames are fixed-time-interval frames and/or key frames; and a second matching unit configured to obtain the angle-time variation curve of at least one specific included angle of the at least one target according to the two-dimensional coordinate information of the key points of the at least one target in each corresponding frame of the video data, and to compare and analyze it with the angle-time variation curve of at least one included angle of at least one standard motion to obtain the recognition result.
  • the second matching unit includes: a first judging subunit configured to compare, for similarity, the angle-time variation curve of at least one specific included angle of the at least one target with the pre-obtained angle-time variation curve of at least one included angle of at least one standard motion; if the similarity falls within a first preset threshold interval, it is determined that the corresponding target in each corresponding frame of the video data is performing the corresponding standard motion type;
  • a comparison subunit configured to, when it is determined that the corresponding target in each corresponding frame of the video data is performing the corresponding standard motion type, further compare the angle-time variation curve of at least one specific included angle of that target with the angle-time variation curve of the corresponding specific included angle of the standard motion;
  • a second judging subunit configured to determine that the joint action corresponding to the specific included angle of the target in each corresponding frame of the video data is standard if the difference between adjacent extreme values on the target's angle-time variation curve and the difference between adjacent extreme values on the corresponding curve of the standard motion fall within a second preset threshold interval, and otherwise that the joint action corresponding to the target's specific included angle in the video data is not standard;
  • a third judging subunit configured to judge whether the distance between adjacent peaks on the target's angle-time variation curve of at least one specific included angle, compared with the distance between adjacent peaks on the corresponding curve of the standard motion, falls within a third, fourth or fifth preset threshold interval, so as to confirm that the intensity of the joint action corresponding to the specific included angle of the target in each corresponding frame of the video data is too low, appropriate or too high.
  • the data recognition device provided in the embodiment of the present application further includes: an evaluation module configured to perform matching in a second preset database according to the recognition result to obtain a posture evaluation result corresponding to the recognition result.
  • the data recognition device provided in the embodiment of the present application further includes: a suggestion module configured to, after the posture evaluation result corresponding to the recognition result is obtained, perform matching in a third preset database according to the posture evaluation result to obtain suggestion information corresponding to the posture evaluation result.
  • a non-volatile storage medium includes a stored program, wherein the device where the non-volatile storage medium is located is controlled to execute the above method when the program is running.
  • a data recognition device, including: a non-volatile storage medium and a processor configured to run a program stored in the non-volatile storage medium, wherein the above method is executed when the program runs.
  • the disclosed technical content can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units may be a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product, and the computer software product is stored in a storage medium.
  • a computer device which can be a personal computer, a server, or a network device, etc.
  • the aforementioned storage media include: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media that can store program code.
  • the solutions provided in the embodiments of the present application can be applied to the image recognition process, for example, can be applied to the image recognition process of the human body posture.
  • the feature data to be recognized is input into the weighted attention neural network model to identify the two-dimensional coordinates of the key points of at least one target in the feature data to be recognized; this improves the accuracy and efficiency of human posture recognition and allows evaluation results to be provided on the basis of the recognized posture, thereby solving the technical problem of low data processing efficiency in the posture recognition process of the related art.
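
As referenced in the calculation-module item above, a minimal sketch of the included-angle computation from key-point two-dimensional coordinates is given below. The key-point names and the horizontal reference line used in the example are illustrative assumptions, not definitions fixed by the embodiments.

```python
import numpy as np

def segment_angle(p1, p2, q1, q2):
    """Angle (degrees) between the line p1->p2 and the line q1->q2."""
    v1 = np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float)
    v2 = np.asarray(q2, dtype=float) - np.asarray(q1, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: angle between the shoulder line and a horizontal reference line,
# which could feed a "high-low shoulders" check against the preset database.
keypoints = {"l_shoulder": (112, 80), "r_shoulder": (188, 92)}
shoulder_angle = segment_angle(keypoints["l_shoulder"], keypoints["r_shoulder"],
                               (0, 0), (1, 0))  # first preset line: horizontal
print(round(shoulder_angle, 1))
```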

Abstract

A data processing method, a data training method, a data identifying method and device, and a storage medium. The data processing method comprises: inputting first feature data having a first number of channels into a first type of convolution layer having a second number of filters, and outputting second feature data having a second number of channels; inputting the second feature data having the second number of channels into a second type of convolution layer having the second number of filters, and generating, according to a learnable mask parameter in the second type of convolution layer, a mask of the weight of each filter in the second type of convolution layer by means of a neural network; determining a connection mode between each filter in the second type of convolution layer and each channel in the second feature data according to the mask; calculating the second feature data according to a mapping relationship obtained by the connection mode to obtain third feature data; and inputting the third feature data having the second number of channels into a third type of convolution layer having the first number of filters, and outputting fourth feature data having the first number of channels.

Description

数据处理、训练、识别方法、装置和存储介质Data processing, training, identification method, device and storage medium 技术领域Technical field
本发明涉及计算机技术应用领域,具体而言,涉及一种数据处理、训练、识别方法、装置和存储介质。The present invention relates to the application field of computer technology, in particular to a data processing, training, and identification method, device and storage medium.
背景技术Background technique
在姿态估计技术中(即,关键点检测技术)目前常用的两种解决方案包括:自顶向下的方法(Two-step framework)和自底向上的方法(Part-based framework);In pose estimation technology (ie, key point detection technology) currently two commonly used solutions include: top-down method (Two-step framework) and bottom-up method (Part-based framework);
其中,自顶向下的方法是先检测图片(2D/3D)中所有人物的位置矩形框(人物完整的被包含在矩形框内),然后分别独立地检测每一个矩形框内人物的骨骼关键点坐标,连接成人物骨架,其特点在于数据处理精度高,其中,姿态估计的准确度高度依赖于人物位置矩形框的检测质量。Among them, the top-down method is to first detect the position of the rectangular frame of all characters in the picture (2D/3D) (the characters are completely contained in the rectangular frame), and then independently detect the bones of the characters in each rectangular frame. Point coordinates, connected to human skeletons, are characterized by high data processing accuracy. Among them, the accuracy of posture estimation is highly dependent on the detection quality of the rectangular frame of the person's position.
自底向上的方法是先检测出图片中所有人物的骨骼关键点坐标,然后处理每个骨骼关键点的分配问题,将每个关键点分配给不同的人,连接成人物骨架,其特点在于数据处理速度快,但是如果出现密集人群或人物之间出现遮挡,那么在分配关键点到个人的阶段容易出现错误的情况。The bottom-up method is to first detect the coordinates of the bone key points of all the characters in the picture, and then deal with the allocation of each bone key point, assign each key point to a different person, and connect the human skeleton. Its characteristic lies in the data. The processing speed is fast, but if there are dense crowds or occlusions between characters, errors are likely to occur in the stage of assigning key points to individuals.
而相关技术中在实现体态识别上主要通过Kinect设备获取人物关键点,但是该设备价格昂贵且不便携带,此外,相关技术中由于采样和计算模型的原因会导致数据源头本身的误差变大,因此相关技术在对人体姿态的动作识别上精度低。In the related technology, the Kinect device is mainly used to obtain the key points of the character in the realization of body recognition, but the device is expensive and not portable. In addition, the related technology will cause the error of the data source itself to become larger due to the sampling and calculation model. Related technologies have low accuracy in recognizing human body gestures.
针对上述由于相关技术在对人体姿态的识别过程中,数据处理效率低的问题,目前尚未提出有效的解决方案。In view of the above-mentioned problem of low data processing efficiency in the process of recognizing human posture due to related technologies, no effective solution has been proposed at present.
发明内容Summary of the invention
本发明实施例提供了一种数据处理、训练、识别方法、装置和存储介质,以至少解决由于相关技术在对人体姿态的识别过程中,数据处理效率低的技术问题。The embodiments of the present invention provide a data processing, training, and recognition method, device, and storage medium to at least solve the technical problem of low data processing efficiency in the process of recognizing human posture due to related technologies.
根据本发明实施例的一个方面,提供了一种数据处理方法,包括:将具备第一数量通道的第一特征数据输入至具备第二数量滤波器的第一类卷积层进行计算,输出具备第二数量通道的第二特征数据,其中,第一数量大于第二数量;将具备第二数量通 道的第二特征数据输入至具备第二数量滤波器的第二类卷积层,并根据第二类卷积层中可学习的掩码参数,通过神经网络生成第二类卷积层中各个滤波器的权重的掩码;依据掩码确定第二类卷积层中的各个滤波器与第二特征数据中的各通道的连接方式;依据连接方式得到的映射关系对第二特征数据进行卷积计算,得到第三特征数据;将具备第二数量通道的第三特征数据输入至具备第一数量滤波器的第三类卷积层进行计算,输出具备第一数量通道的第四特征数据。According to one aspect of the embodiments of the present invention, a data processing method is provided, including: inputting first feature data with a first number of channels into a first type convolutional layer with a second number of filters for calculation, and outputting The second feature data of the second number of channels, where the first number is greater than the second number; the second feature data of the second number of channels is input to the second type of convolutional layer with the second number of filters, and according to the first number The mask parameters that can be learned in the second-type convolutional layer are used to generate the mask of the weight of each filter in the second-type convolutional layer through the neural network; according to the mask, each filter in the second-type convolutional layer and the first 2. The connection mode of each channel in the feature data; the second feature data is convolved to calculate the second feature data according to the mapping relationship obtained by the connection mode to obtain the third feature data; the third feature data with the second number of channels is input to the first feature data The third type convolutional layer of the quantity filter is calculated, and the fourth feature data with the first quantity channel is output.
可选的,数据处理方法应用于人工智能中的深度学习。Optionally, the data processing method is applied to deep learning in artificial intelligence.
可选的,数据处理方法应用于识别图片/视频中的目标的姿态或动作。Optionally, the data processing method is applied to recognize the posture or action of the target in the picture/video.
可选的,根据第二类卷积层中可学习的掩码参数,通过神经网络生成第二类卷积层中各个滤波器的权重的掩码包括:根据第二类卷积层中的全连接层生成第二类卷积层中各个滤波器的权重的掩码。Optionally, according to the learnable mask parameters in the second-type convolutional layer, generating a mask of the weights of each filter in the second-type convolutional layer through a neural network includes: according to all the mask parameters in the second-type convolutional layer The connection layer generates a mask of the weight of each filter in the second type of convolutional layer.
根据本发明实施例的一个方面,提供了一种数据训练方法,包括:获取待训练的权重分类模型,其中,权重分类模型为获取图像数据的图像特征的神经网络模型;对待训练的权重分类模型进行训练,得到权重分类模型;其中,对待训练的权重分类模型进行训练中使用的方法包括上述数据处理方法。According to one aspect of the embodiments of the present invention, a data training method is provided, including: obtaining a weight classification model to be trained, wherein the weight classification model is a neural network model for obtaining image features of image data; and a weight classification model to be trained Training is performed to obtain a weight classification model; wherein, the method used in training the weight classification model to be trained includes the above-mentioned data processing method.
可选的,对待训练的权重分类模型进行训练,得到权重分类模型包括:将第一预设数据集中的数据输入待训练的权重分类模型,得到类别预测结果;依据类别预测结果与第一预测数据集中的数据的标签类别,得到类别预测结果与第一预测数据集中的数据的标签类别的误差;依据误差进行反向传播算法训练待训练的权重分类模型,直至待训练的权重分类模型收敛,得到收敛的权重分类模型。Optionally, training the weight classification model to be trained to obtain the weight classification model includes: inputting the data in the first preset data set into the weight classification model to be trained to obtain the category prediction result; according to the category prediction result and the first prediction data The label category of the concentrated data is obtained, and the error between the category prediction result and the label category of the data in the first prediction data set is obtained; according to the error, the backpropagation algorithm is used to train the weight classification model to be trained until the weight classification model to be trained converges to obtain Convergent weight classification model.
可选的,依据误差进行反向传播算法训练待训练的权重分类模型,直至待训练的权重分类模型收敛包括:通过激励传播和权重更新的反复迭代,直至待训练的权重分类模型收敛。Optionally, training the weight classification model to be trained with the back propagation algorithm based on the error until the weight classification model to be trained converges includes: through repeated iterations of excitation propagation and weight update, until the weight classification model to be trained converges.
可选的,在待训练的权重分类模型包括残差结构,池化结构和全连接结构的情况下,通过激励传播和权重更新的反复迭代,直至待训练的权重分类模型收敛包括:在激励传播阶段,将图像通过待训练的权重分类模型的卷积层获取特征,在待训练的权重分类模型的全连接层获取类别预测结果,再将类别预测结果与第一预测数据集中的数据的标签类别求差,得到隐藏层和输出层的响应误差;在权重更新阶段,将误差与本层响应对前一层响应的函数的导数相乘,获得两层之间权重矩阵的梯度,沿梯度的反方向以设定的学习率调整权重矩阵;将梯度矩阵确定为前一层的误差,并计算前一层的权重矩阵,通过迭代计算对待训练的权重分类模型更新,直至待训练的权重分类模型收敛。Optionally, when the weight classification model to be trained includes a residual structure, a pooling structure, and a fully connected structure, through repeated iterations of incentive propagation and weight update, until the weight classification model to be trained converges, including: In the stage, the image is passed through the convolutional layer of the weight classification model to be trained to obtain features, the category prediction result is obtained in the fully connected layer of the weight classification model to be trained, and then the category prediction result is combined with the label category of the data in the first prediction data set Find the difference to obtain the response error of the hidden layer and the output layer; in the weight update stage, the error is multiplied by the derivative of the response of the current layer to the response of the previous layer to obtain the gradient of the weight matrix between the two layers, along the inverse of the gradient. The direction adjusts the weight matrix with the set learning rate; the gradient matrix is determined as the error of the previous layer, and the weight matrix of the previous layer is calculated, and the weight classification model to be trained is updated through iterative calculation until the weight classification model to be trained converges .
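
As a concrete illustration of this training procedure, a minimal sketch follows. It uses PyTorch-style APIs; the model, data loader, optimizer choice and hyper-parameters are placeholders of this sketch rather than values specified by the embodiment.

```python
import torch
import torch.nn as nn

def train_weight_classifier(model, loader, epochs=10, lr=0.01):
    criterion = nn.CrossEntropyLoss()            # error between prediction and label category
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):                      # in practice: iterate until convergence
        for images, labels in loader:
            logits = model(images)               # excitation propagation (forward pass)
            loss = criterion(logits, labels)     # response error of the output layer
            optimizer.zero_grad()
            loss.backward()                      # gradients of each layer's weight matrix
            optimizer.step()                     # adjust weights along -gradient at the set learning rate
    return model
```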
根据本发明实施例的另一个方面,提供了一种数据训练方法,包括:通过收敛的权重分类模型初始化目标检测模型中的特征提取模块,获得待训练的目标检测模型;其中,该收敛的权重分类模型通过上述数据训练方法训练得到;通过第二预设数据集中的目标位置框标签信息对待训练的目标检测模型进行训练,得到训练后的目标检测模型;依据第三预设数据集中的目标关键点标签信息对待训练的单人姿态估计模型的网络参数进行训练,得到训练后的单人姿态估计模型;依据训练后的目标检测模型和训练后的单人姿态估计模型,得到权重注意力神经网络模型。According to another aspect of the embodiments of the present invention, a data training method is provided, which includes: initializing a feature extraction module in a target detection model through a convergent weight classification model to obtain a target detection model to be trained; wherein the convergent weight The classification model is trained by the above data training method; the target detection model to be trained is trained by the target location frame label information in the second preset data set to obtain the trained target detection model; according to the target key in the third preset data set The point label information trains the network parameters of the single-person pose estimation model to be trained to obtain the trained single-person pose estimation model; according to the trained target detection model and the trained single-person pose estimation model, the weighted attention neural network is obtained model.
可选的,通过第二预设数据集中的目标位置框标签信息对待训练的目标检测模型进行训练,得到训练后的目标检测模型包括:在目标检测模型包括特征提取模块、建议框生成模块和目标分类器与位置框回归预测模块的情况下,分别对特征提取模块和建议框生成模块进行训练,得到特征提取模块第一参数值和建议框生成模块第一参数值;依据特征提取模块第一参数值和建议框生成模块第一参数值训练目标分类器与位置框回归预测模块,得到目标分类器与位置框回归预测模块第一参数值和特征提取模块第二参数值;依据目标分类器与位置框回归预测模块第一参数值和特征提取模块第二参数值训练建议框生成模块,得到建议框生成模块第二参数值;依据建议框生成模块第二参数值和特征提取模块第二参数值训练目标分类器与位置框回归预测模块,得到目标分类器与位置框回归预测模块第二参数值。Optionally, the target detection model to be trained is trained based on the target location frame label information in the second preset data set, and the target detection model obtained after training includes: the target detection model includes a feature extraction module, a suggestion frame generation module, and a target In the case of the classifier and the position box regression prediction module, the feature extraction module and the suggestion box generation module are trained respectively to obtain the first parameter value of the feature extraction module and the first parameter value of the suggestion box generation module; according to the first parameter of the feature extraction module The first parameter value of the value and suggestion box generation module trains the target classifier and the position box regression prediction module to obtain the first parameter value of the target classifier and position box regression prediction module and the second parameter value of the feature extraction module; according to the target classifier and position Box regression prediction module first parameter value and feature extraction module second parameter value training suggestion box generation module to obtain the second parameter value of the suggestion box generation module; training based on the second parameter value of the suggestion box generation module and the second parameter value of the feature extraction module The target classifier and the position box regression prediction module obtain the second parameter value of the target classifier and the position box regression prediction module.
进一步地,可选的,特征提取模块用于提取第二预设数据集中的各个数据的特征;建议框生成模块用于依据第二预设数据集中的各个数据的特征生成各个数据的候选目标框;目标分类器与位置框回归预测模块用于依据第二预设数据集中的各个数据的特征和各个数据的候选目标框获取第二预设数据集中各个数据的目标的检测框及相应检测框的类别;在建议框生成模块包括一个滑窗的卷积层,卷积层后连接两个并行的卷积层,两个并行的卷积层分别为回归层和分类层的情况下,建议框生成模块用于依据第二预设数据集中的各个数据的特征生成各个数据的候选目标框包括:依据第二预设数据集中的各个数据的特征通过回归层,得到第二预设数据集中的各个数据的各个候选目标框的中心锚点的坐标和相应的候选目标框的宽与高;通过分类层判定各个数据的各个候选目标框是前景或背景。Further, optionally, the feature extraction module is used to extract the features of each data in the second preset data set; the suggestion frame generation module is used to generate candidate target frames of each data according to the features of each data in the second preset data set ; The target classifier and position frame regression prediction module is used to obtain the detection frame of each data target in the second preset data set and the corresponding detection frame according to the characteristics of each data in the second preset data set and the candidate target frame of each data Category; when the suggestion frame generation module includes a convolutional layer with a sliding window, two parallel convolutional layers are connected after the convolutional layer, and the two parallel convolutional layers are the regression layer and the classification layer, the suggestion frame is generated The module is used to generate candidate target frames of each data according to the characteristics of each data in the second preset data set, including: obtaining each data in the second preset data set through the regression layer according to the characteristics of each data in the second preset data set The coordinates of the center anchor point of each candidate target frame and the width and height of the corresponding candidate target frame; the classification layer determines whether each candidate target frame of each data is foreground or background.
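
The proposal-box generation module described above can be pictured with the following minimal sketch: a sliding-window 3×3 convolution followed by two parallel 1×1 convolutions, one regression branch (centre anchor x, y and box width/height) and one classification branch (foreground/background). The channel width and the number of anchors per location are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.sliding = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)  # sliding-window conv
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)   # x, y, w, h per candidate box
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)   # foreground / background

    def forward(self, feats):
        x = torch.relu(self.sliding(feats))
        return self.reg(x), self.cls(x)

boxes, scores = ProposalHead()(torch.randn(1, 512, 38, 50))
```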
可选的,在目标分类器与位置框回归预测模块的结构为顺次连接的一个池化层、三个全连接层和并行的两个全连接层的情况下,目标分类器与位置框回归预测模块用于依据第二预设数据集中的各个数据的特征和各个数据的候选目标框获取第二预设数据集中各个数据的各个目标的检测框和相应的检测框的类别包括:通过池化层将特征提取模块输出的不同长度的各个数据的特征转换为固定长度的各个数据的特征;依据固定长度的各个数据的特征,分别通过三个全连接层后再通过并行的两个全连接层, 输出第二预设数据集中各个数据的各个目标的检测框及相应检测框的类别。Optionally, when the structure of the target classifier and the location box regression prediction module is a pooling layer, three fully connected layers and two parallel fully connected layers that are sequentially connected, the target classifier and the location box regression The prediction module is used to obtain the detection frame of each target of each data in the second preset data set and the corresponding detection frame category according to the characteristics of each data in the second preset data set and the candidate target frame of each data, including: through pooling The layer converts the characteristics of each data of different lengths output by the feature extraction module into the characteristics of each data of a fixed length; according to the characteristics of each data of a fixed length, it passes through three fully connected layers and then passes through two parallel fully connected layers. , Output the detection frame of each target of each data in the second preset data set and the category of the corresponding detection frame.
可选的,依据第三预设数据集中的目标关键点标签信息对待训练的单人姿态估计模型的网络参数进行训练,得到训练后的单人姿态估计模型包括:依据第三预设数据集中的目标关键点标签信息对待训练的单人姿态估计模型的网络参数进行训练,通过前向传播和后向传播算法迭代的更新待训练的单人姿态估计模型的网络参数;其中,依据第三预设数据集中的目标关键点标签信息对待训练的单人姿态估计模型的网络参数进行训练,通过前向传播和后向传播算法迭代的更新待训练的单人姿态估计模型的网络参数包括:依据预设宽高比对输入的单人图像的高度或宽度进行扩展,并将单人图像裁剪为预设尺寸。Optionally, training the network parameters of the single-person pose estimation model to be trained based on the target key point label information in the third preset data set, and the single-person pose estimation model obtained after training includes: according to the information in the third preset data set Target key point label information trains the network parameters of the single-person pose estimation model to be trained, and iteratively updates the network parameters of the single-person pose estimation model to be trained through forward propagation and backward propagation algorithms; among them, according to the third preset The target key point label information in the data set is trained on the network parameters of the single-person pose estimation model to be trained, and the network parameters of the single-person pose estimation model to be trained are iteratively updated through the forward propagation and backward propagation algorithms. The network parameters include: according to the preset The aspect ratio expands the height or width of the input single image and crops the single image to a preset size.
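
The aspect-ratio expansion and cropping step can be sketched as follows. The 3:4 ratio and the 192×256 input size are common choices for top-down pose estimators and are assumed here purely for illustration, not fixed by the embodiment.

```python
import cv2

def expand_and_resize(person_crop, target_ratio=3 / 4, out_size=(192, 256)):
    """Pad a single-person crop to the preset aspect ratio, then resize to the preset size."""
    h, w = person_crop.shape[:2]
    if w / h < target_ratio:                       # too narrow: extend the width
        pad = int(h * target_ratio) - w
        person_crop = cv2.copyMakeBorder(person_crop, 0, 0, pad // 2, pad - pad // 2,
                                         cv2.BORDER_CONSTANT, value=0)
    else:                                          # too wide: extend the height
        pad = int(w / target_ratio) - h
        person_crop = cv2.copyMakeBorder(person_crop, pad // 2, pad - pad // 2, 0, 0,
                                         cv2.BORDER_CONSTANT, value=0)
    return cv2.resize(person_crop, out_size)
```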
可选的,对待训练的单人姿态估计模型的网络参数进行训练中使用的方法包括上述的数据处理方法。Optionally, the method used in training the network parameters of the single-person pose estimation model to be trained includes the above-mentioned data processing method.
可选的,该方法还包括:收集训练待训练的目标检测模型和待训练的单人姿态估计模型所需的样本;对样本进行预处理,其中,预处理包括:数据集的划分和预处理操作;对待训练的权重分类模型进行训练,得到收敛的权重分类模型包括:将第一预设数据集中的数据输入待训练的权重分类模型,得到类别预测结果;依据类别预测结果与第一预测数据集中的数据的标签类别,得到类别预测结果与第一预测数据集中的数据的标签类别的误差;依据误差进行反向传播算法训练待训练的权重分类模型,直至待训练的权重分类模型收敛,得到收敛的权重分类模型。Optionally, the method further includes: collecting samples required for training the target detection model to be trained and the single-person pose estimation model to be trained; preprocessing the samples, where the preprocessing includes: data set division and preprocessing Operation; training the weight classification model to be trained to obtain a convergent weight classification model includes: inputting the data in the first preset data set into the weight classification model to be trained to obtain the category prediction result; according to the category prediction result and the first prediction data The label category of the concentrated data is obtained, and the error between the category prediction result and the label category of the data in the first prediction data set is obtained; according to the error, the backpropagation algorithm is used to train the weight classification model to be trained until the weight classification model to be trained converges to obtain Convergent weight classification model.
进一步地,可选的,第一预设数据集包括:第一类图像数据集,第一类图像数据集自定义了训练集和验证集;第二预设数据集包括第二类图像数据集和第三类图像数据集中有位置框信息标注的数据集合;第二类图像数据集自定义了训练集和验证集;第三类图像数据集按照预设比例随机划分为训练集和验证集;第二类图像数据集的训练集和第三类图像数据集的训练集为第二预设数据集中的训练集,第二类图像数据集的验证集和第三类图像数据集的验证集为第二预设数据集中的验证集;第三预设数据集包括第二类图像数据集和第三类图像数据集中有关键点信息标注的数据集合;预处理操作包括:通过随机几何变换对第一预设数据集和第三预设数据集中的数据分别进行处理;通过随机混合操作和/或随机几何变换对第二预设数据集中的数据进行处理。Further, optionally, the first preset data set includes: a first type of image data set, the first type of image data set defines a training set and a validation set; the second preset data set includes a second type of image data set And the third type of image data set has a data set labeled with position box information; the second type of image data set has customized training set and verification set; the third type of image data set is randomly divided into training set and verification set according to the preset ratio; The training set of the second type of image data set and the training set of the third type of image data set are the training set of the second preset data set, the validation set of the second type of image data set and the validation set of the third type of image data set are The verification set in the second preset data set; the third preset data set includes the second type image data set and the third type image data set labeled with key point information; the preprocessing operation includes: The data in one preset data set and the third preset data set are processed separately; the data in the second preset data set is processed through random mixing operation and/or random geometric transformation.
可选的,通过随机几何变换包括随机裁剪、按预设角度进行随机旋转和/或按照预设缩放比例进行随机缩放;随机混合操作包括将至少两个数据按照预设权重进行重合,具体为将不同数据中的预设位置像素值与预设权重的乘积相加。Optionally, the random geometric transformation includes random cropping, random rotation according to a preset angle, and/or random scaling according to a preset zoom ratio; the random mixing operation includes superimposing at least two data according to preset weights, specifically The product of the preset position pixel value in different data and the preset weight is added.
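
The random mixing operation described above (adding the products of pixel values from different images and preset weights) can be sketched as below; the 0.6/0.4 weights are illustrative assumptions.

```python
import numpy as np

def mix_up(img_a, img_b, weight_a=0.6):
    """Blend two equally sized images: weight_a * img_a + (1 - weight_a) * img_b."""
    a = img_a.astype(np.float32)
    b = img_b.astype(np.float32)
    mixed = weight_a * a + (1.0 - weight_a) * b
    return np.clip(mixed, 0, 255).astype(np.uint8)
```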
根据本发明实施例的又一个方面,提供了一种数据识别方法,基于上述方法,包 括:将待识别的特征数据输入权重注意力神经网络模型,识别得到待识别的特征数据中至少一个目标的关键点二维坐标,其中,权重注意力神经网络模型用于通过自顶向下的方式进行至少一人的姿态估计,检测待识别的特征数据中至少一个目标的位置矩形框,并检测位置矩形框内目标的关键点二维坐标;通过目标的关键点二维坐标进行计算,得到第一预设关键点组合的连线与第二预设关键点组合的连线之间的夹角或第一预设关键点组合的连线与第一预设线之间的夹角;将第一预设关键点组合的连线与第二预设关键点组合的连线之间的夹角或第一预设关键点组合的连线与第一预设线之间的夹角在第一预设数据库中进行匹配,得出目标的识别结果。According to another aspect of the embodiments of the present invention, there is provided a data recognition method. Based on the above method, the method includes: inputting feature data to be recognized into a weighted attention neural network model, and identifying at least one target in the feature data to be recognized Two-dimensional coordinates of key points, where the weighted attention neural network model is used to estimate the pose of at least one person in a top-down manner, detect the position rectangle of at least one target in the feature data to be recognized, and detect the position rectangle The two-dimensional coordinates of the key points of the inner target; through the calculation of the two-dimensional coordinates of the key points of the target, the angle between the line of the first preset key point combination and the line of the second preset key point combination or the first The angle between the line of the preset key point combination and the first preset line; the angle between the line of the first preset key point combination and the line of the second preset key point combination or the first The included angle between the line of the preset key point combination and the first preset line is matched in the first preset database to obtain the recognition result of the target.
可选的,将第一预设关键点组合的连线与第二预设关键点组合的连线之间的夹角或第一预设关键点组合的连线与第一预设线之间的夹角在第一预设数据库中进行匹配,得出目标的识别结果包括:在待识别的特征数据包括图片数据的情况下,将得到的至少一个夹角的角度值与第一预设数据库中的相应的夹角类型的角度值进行匹配,得出图片数据的识别结果。Optionally, the angle between the line of the first preset key point combination and the line of the second preset key point combination or the line of the first preset key point combination and the first preset line Matching the included angle of at least one in the first preset database to obtain the recognition result of the target includes: in the case that the feature data to be recognized includes image data, the obtained angle value of at least one included angle is compared with the first preset database Match the angle values of the corresponding included angle types in to obtain the recognition result of the image data.
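
For image data, this matching step can be pictured as a simple lookup of the measured angle against the angle intervals stored per angle type in the first preset database. The table contents and thresholds below are invented purely for illustration.

```python
# Toy stand-in for the first preset database; values are not from the patent.
PRESET_ANGLE_DB = {
    "shoulder_line_vs_horizontal": [
        ((0.0, 3.0), "shoulders level"),
        ((3.0, 90.0), "high-low shoulders"),
    ],
}

def match_angle(angle_type, angle_value, db=PRESET_ANGLE_DB):
    for (low, high), label in db.get(angle_type, []):
        if low <= angle_value < high:
            return label
    return "unknown"

print(match_angle("shoulder_line_vs_horizontal", 5.2))  # -> "high-low shoulders"
```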
可选的,将第一预设关键点组合的连线与第二预设关键点组合的连线之间的夹角或第一预设关键点组合的连线与第一预设线之间的夹角在第一预设数据库中进行匹配,得出目标的识别结果包括:在待识别的特征数据包括视频数据的情况下,针对每一帧或指定帧,获取视频数据各相应帧的中至少一个目标的关键点二维坐标信息,其中,指定帧为固定时间间隔帧和/或关键帧;依据视频数据中各相应帧的至少一个目标的关键点二维坐标信息得到至少一个目标的至少一个特定夹角的角度时间变化曲线,并通过与至少一种标准运动的至少一个夹角的角度时间变化曲线做比较分析,得到识别结果。Optionally, the angle between the line of the first preset key point combination and the line of the second preset key point combination or the line of the first preset key point combination and the first preset line The included angle of is matched in the first preset database, and the recognition result of the target is obtained including: in the case that the feature data to be recognized includes video data, for each frame or specified frame, obtain the center of each corresponding frame of the video data The key point two-dimensional coordinate information of at least one target, wherein the designated frame is a fixed time interval frame and/or key frame; according to the key point two-dimensional coordinate information of at least one target of each corresponding frame in the video data, the at least one target’s two-dimensional coordinate information An angle-time variation curve of a specific included angle is compared and analyzed with at least one angle-time variation curve of at least one standard motion to obtain an identification result.
进一步地,可选的,依据视频数据中各相应帧的至少一个目标的关键点二维坐标信息得到至少一个目标的至少一个特定夹角的角度时间变化曲线,并通过与至少一种标准运动的至少一个夹角的角度时间变化曲线做比较分析,得到识别结果包括:将至少一个目标的至少一个特定夹角的角度时间变化曲线,与预先获得的至少一种标准运动的至少一个夹角的角度时间变化曲线进行相似度比较,若相似度落入第一预设阈值区间,则判定视频数据中各相应帧的相应目标正在进行所对应的标准运动类型;在判定视频数据中各相应帧的相应目标正在进行所对应的标准运动类型的情况下,进一步比较该目标的至少一个特定夹角的角度时间变化曲线与标准运动的相应特定夹角的角度时间变化曲线;若目标的至少一个特定夹角的角度时间变化曲线上相邻最值的差,和标准运动的相应特定夹角的角度时间变化曲线上相邻最值的差落入第二预设阈值区间,则判断视频数据中目标的特定夹角所对应的关节动作规范,否则视频数据中各相应帧的该目标的特定夹角所对应的关节动作不规范;判断目标的至少一个特定夹角的 角度时间变化曲线上相邻峰值之间的距离,和标准运动的相应特定夹角的角度时间变化曲线上相邻峰值的差是否落入第三预设阈值区间、第四预设阈值区间或第五预设阈值区间,进而确认视频数据中各相应帧的目标的特定夹角所对应的关节动作运动强度过低、适当或过高。Further, optionally, according to the two-dimensional coordinate information of the key point of at least one target in each corresponding frame of the video data, the angle-time variation curve of at least one specific included angle of the at least one target is obtained, and the angle-time variation curve of at least one specific included angle is obtained by comparing with at least one standard motion. The comparison and analysis of the angle-time variation curve of at least one included angle to obtain the recognition result includes: comparing the angle-time variation curve of at least one specific included angle of the at least one target with at least one angle of at least one included angle obtained in advance for at least one standard motion The time variation curve is compared for similarity. If the similarity falls within the first preset threshold interval, it is determined that the corresponding target of each corresponding frame in the video data is performing the corresponding standard motion type; the corresponding standard motion type of each corresponding frame in the video data is determined. When the target is performing the corresponding standard exercise type, further compare the angle time change curve of at least one specific included angle of the target with the angle time change curve of the corresponding specific included angle of the standard motion; if the target has at least one specific included angle The difference between the adjacent maximum value on the angle-time variation curve of the standard motion and the adjacent maximum value on the angle-time variation curve of the corresponding specific included angle of the standard motion falls within the second preset threshold interval, and then the specific target in the video data is determined The joint motion specification corresponding to the included angle, otherwise the joint motion corresponding to the specific included angle of the target in each corresponding frame of the video data is not standardized; the angle time variation curve of at least one specific included angle of the target is judged between adjacent peaks Whether the difference between adjacent peaks on the angle-time variation curve of the corresponding specific included angle of the standard motion falls within the third preset threshold interval, the fourth preset threshold interval or the fifth preset threshold interval, and then confirm the video data The motion intensity of the joint action corresponding to the specific included angle of the target in each corresponding frame is too low, appropriate, or too high.
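
A simplified sketch of this curve comparison is given below. It treats similarity as the correlation between equal-length angle-time curves, approximates the extrema difference by the curve range, and uses placeholder threshold values; none of these details are fixed by the embodiment, and periodic curves with at least two peaks are assumed.

```python
import numpy as np

def _peak_indices(curve):
    """Indices of local maxima of a 1-D angle-time curve."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i] > curve[i - 1] and curve[i] > curve[i + 1]]

def compare_motion(target, standard, sim_thresh=0.8, amp_tol=10.0, period_bounds=(0.8, 1.2)):
    t, s = np.asarray(target, float), np.asarray(standard, float)
    if np.corrcoef(t, s)[0, 1] < sim_thresh:                 # first preset threshold interval
        return {"motion_matched": False}
    # second interval: difference between adjacent extreme values (range as a rough proxy)
    amp_ok = abs((t.max() - t.min()) - (s.max() - s.min())) <= amp_tol
    # remaining intervals: distance between adjacent peaks as a rough period / intensity estimate
    t_period = np.mean(np.diff(_peak_indices(t)))
    s_period = np.mean(np.diff(_peak_indices(s)))
    ratio = s_period / t_period
    intensity = ("too low" if ratio < period_bounds[0]
                 else "too high" if ratio > period_bounds[1] else "appropriate")
    return {"motion_matched": True, "joint_action_standard": amp_ok, "intensity": intensity}
```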
可选的,该方法还包括:依据识别结果在第二预设数据库中进行匹配,得到识别结果对应的体态评估结果。Optionally, the method further includes: performing matching in a second preset database according to the recognition result to obtain a posture evaluation result corresponding to the recognition result.
进一步地,可选的,在得到识别结果对应的体态评估结果之后,该方法还包括:依据体态评估结果在第三预设数据库中进行匹配,得到体态评估结果对应的建议信息。Further, optionally, after obtaining the posture evaluation result corresponding to the recognition result, the method further includes: matching in a third preset database according to the posture evaluation result to obtain suggestion information corresponding to the posture evaluation result.
根据本发明实施例的再一个方面,提供了一种数据识别装置,包括:坐标识别模块,设置为将待识别的特征数据输入权重注意力神经网络模型,识别得到待识别的特征数据中至少一个目标的关键点二维坐标,其中,权重注意力神经网络模型设置为通过自顶向下的方式进行至少一人的姿态估计,检测待识别的特征数据中至少一个目标的位置矩形框,并检测位置矩形框内目标的关键点二维坐标;计算模块,设置为通过目标的关键点二维坐标进行计算,得到第一预设关键点组合的连线与第二预设关键点组合的连线之间的夹角或第一预设关键点组合的连线与第一预设线之间的夹角;匹配模块,设置为将第一预设关键点组合的连线与第二预设关键点组合的连线之间的夹角或第一预设关键点组合的连线与第一预设线之间的夹角在第一预设数据库中进行匹配,得出目标的识别结果。According to another aspect of the embodiments of the present invention, there is provided a data recognition device, including: a coordinate recognition module, configured to input feature data to be recognized into a weighted attention neural network model, and identify at least one of the feature data to be recognized Two-dimensional coordinates of key points of the target, where the weighted attention neural network model is set to estimate the pose of at least one person in a top-down manner, detect the position rectangle of at least one target in the feature data to be recognized, and detect the position The two-dimensional coordinates of the key points of the target in the rectangular frame; the calculation module is set to calculate through the two-dimensional coordinates of the key points of the target to obtain the connection line of the first preset key point combination and the line of the second preset key point combination The included angle between the first preset key point combination or the included angle between the first preset line and the first preset line; the matching module is set to combine the first preset key point combination line with the second preset key point The angle between the combined lines or the angle between the line of the first preset key point combination and the first preset line is matched in the first preset database to obtain the recognition result of the target.
可选的,匹配模块包括:第一匹配单元,设置为在待识别的特征数据包括图片数据的情况下,将得到的至少一个夹角的角度值与第一预设数据库中的相应的夹角类型的角度值进行匹配,得出图片数据的识别结果。Optionally, the matching module includes: a first matching unit configured to compare the obtained angle value of at least one included angle with a corresponding included angle in the first preset database when the feature data to be recognized includes image data The angle value of the type is matched, and the recognition result of the image data is obtained.
可选的,匹配模块包括:获取单元,设置为在待识别的特征数据包括视频数据的情况下,针对每一帧或指定帧,获取视频数据中各相应帧的至少一个目标的关键点二维坐标信息,其中,指定帧为固定时间间隔帧和/或关键帧;第二匹配单元,设置为依据视频数据中各相应帧的至少一个目标的关键点二维坐标信息得到至少一个目标的至少一个特定夹角的角度时间变化曲线,并通过与至少一种标准运动的至少一个夹角的角度时间变化曲线做比较分析,得到识别结果。Optionally, the matching module includes: an acquiring unit configured to acquire, for each frame or specified frame, a key point of at least one target of each corresponding frame in the video data when the feature data to be identified includes video data. Coordinate information, wherein the designated frame is a fixed time interval frame and/or a key frame; the second matching unit is set to obtain at least one of the at least one target according to the key point two-dimensional coordinate information of at least one target of each corresponding frame in the video data The angle-time variation curve of a specific included angle is compared and analyzed with the angle-time variation curve of at least one included angle of at least one standard motion to obtain an identification result.
进一步地,可选的,第二匹配单元包括:第一判断子单元,设置为将至少一个目标的至少一个特定夹角的角度时间变化曲线,与预先获得的至少一种标准运动的至少一个夹角的角度时间变化曲线进行相似度比较,若相似度落入第一预设阈值区间,则判定视频数据中各相应帧的相应目标正在进行所对应的标准运动类型;比较子单元, 设置为在判定视频数据中各相应帧的相应目标正在进行所对应的标准运动类型的情况下,进一步比较该目标的至少一个特定夹角的角度时间变化曲线与标准运动的相应特定夹角的角度时间变化曲线;第二判断子单元,设置为若目标的至少一个特定夹角的角度时间变化曲线上相邻最值的差,和标准运动的相应特定夹角的角度时间变化曲线上相邻最值的差落入第二预设阈值区间,则判断视频数据中各相应帧的目标的特定夹角所对应的关节动作规范,否则视频数据中该目标的特定夹角所对应的关节动作不规范;第三判断子单元,设置为判断目标的至少一个特定夹角的角度时间变化曲线上相邻峰值之间的距离,和标准运动的相应特定夹角的角度时间变化曲线上相邻峰值的差是否落入第三预设阈值区间、第四预设阈值区间或第五预设阈值区间,进而确认视频数据中各相应帧的目标的特定夹角所对应的关节动作运动强度过低、适当或过高。Further, optionally, the second matching unit includes: a first judging subunit, configured to clip the angle-time variation curve of at least one specific included angle of the at least one target with at least one pre-obtained at least one standard motion curve. The angle-time variation curve of the angle is compared for similarity. If the similarity falls within the first preset threshold interval, it is determined that the corresponding target of each corresponding frame in the video data is performing the corresponding standard motion type; the comparison subunit is set to be in In the case of determining that the corresponding target of each corresponding frame in the video data is performing the corresponding standard motion type, further compare the angle time change curve of at least one specific angle of the target with the angle time change curve of the corresponding specific angle of the standard motion ; The second judging subunit is set to determine the difference between the adjacent maximum value on the angle-time variation curve of at least one specific included angle of the target and the difference between the adjacent maximum value on the angle-time variation curve of the corresponding specific included angle of the standard motion If it falls within the second preset threshold interval, determine the joint motion specification corresponding to the specific included angle of the target of each corresponding frame in the video data, otherwise the joint motion corresponding to the specific included angle of the target in the video data is not standardized; third; The judging subunit is set to judge whether the distance between adjacent peaks on the angle-time variation curve of at least one specific included angle of the target and the adjacent peaks on the angle-time variation curve of the corresponding specific included angle of the standard motion falls within The third preset threshold interval, the fourth preset threshold interval or the fifth preset threshold interval further confirms that the joint motion intensity corresponding to the specific included angle of the target of each corresponding frame in the video data is too low, appropriate or too high.
可选的,该装置还包括:评估模块,设置为依据识别结果在第二预设数据库中进行匹配,得到识别结果对应的体态评估结果。Optionally, the device further includes: an evaluation module configured to perform matching in a second preset database according to the recognition result to obtain a posture evaluation result corresponding to the recognition result.
进一步地,可选的,该装置还包括:建议模块,设置为在得到识别结果对应的体态评估结果之后,依据体态评估结果在第三预设数据库中进行匹配,得到体态评估结果对应的建议信息。Further, optionally, the device further includes: a suggestion module configured to, after obtaining the posture evaluation result corresponding to the recognition result, perform matching in a third preset database according to the posture evaluation result to obtain the suggestion information corresponding to the posture evaluation result .
根据本发明实施例的一个方面,提供了一种非易失性存储介质,非易失性存储介质包括存储的程序,其中,在程序运行时控制非易失性存储介质所在设备执行上述方法。According to one aspect of the embodiments of the present invention, there is provided a non-volatile storage medium, the non-volatile storage medium includes a stored program, wherein the device where the non-volatile storage medium is located is controlled to execute the above method when the program is running.
根据本发明实施例的一个方面,提供了一种数据识别装置,包括:非易失性存储介质和设置为运行存储于非易失性存储介质中的程序的处理器,程序运行时执行上述方法。According to one aspect of the embodiments of the present invention, there is provided a data recognition device, including: a non-volatile storage medium and a processor configured to run a program stored in the non-volatile storage medium, and the above method is executed when the program is running .
在本发明实施例中,提出了权重注意力机制,其中,通过引入可学习的mask机制,不人为固定网络的分组卷积模式,让网络自身学习卷积分组,并选择对网络有用的滤波器进行卷积运算,提升网络的性能;基于该权重注意力机制对待训练的权重分类模型进行数据训练,得到权重分类模型,通过该权重分类模型对目标检测模型中的特征提取模块进行初始参数的初始化,从而在得到权重注意力神经网络模型的过程中,通过该权重分类模型提高目标检测模型的准确率及加快模型训练的收敛速度;In the embodiment of the present invention, a weighted attention mechanism is proposed, in which, by introducing a learnable mask mechanism, the grouping convolution mode of the network is not artificially fixed, so that the network itself learns the convolution group and selects the filter useful for the network Perform convolution operation to improve the performance of the network; perform data training on the weight classification model to be trained based on the weight attention mechanism to obtain the weight classification model, and initialize the initial parameters of the feature extraction module in the target detection model through the weight classification model , So that in the process of obtaining the weighted attention neural network model, the weight classification model is used to improve the accuracy of the target detection model and accelerate the convergence speed of the model training;
基于上述数据训练方法,还通过采用自顶向下的多人姿态估计的方式,通过将待识别的特征数据输入权重注意力神经网络模型,识别得到待识别的特征数据中至少一个目标的关键点二维坐标,其中,权重注意力神经网络模型用于通过自顶向下的方式进行至少一人的姿态估计,检测待识别的特征数据中至少一个目标的位置矩形框,并 检测位置矩形框内目标的关键点二维坐标;通过目标的关键点二维坐标进行计算,得到第一预设关键点组合的连线与第二预设关键点组合的连线之间的夹角或第一预设关键点组合的连线与第一预设线之间的夹角;将第一预设关键点组合的连线与第二预设关键点组合的连线之间的夹角或第一预设关键点组合的连线与第一预设线之间的夹角在第一预设数据库中进行匹配,得出目标的识别结果,达到了提升对人体姿态的识别精度和效率的目的,从而实现了根据提升精度和效率后的人体姿态提供评估结果的技术效果,进而解决了由于相关技术在对人体姿态的识别过程中,数据处理效率低的技术问题。Based on the above data training method, the top-down multi-person pose estimation method is adopted, and the key points of at least one target in the feature data to be recognized are recognized by inputting the feature data to be recognized into the weighted attention neural network model. Two-dimensional coordinates, where the weighted attention neural network model is used to estimate the pose of at least one person in a top-down manner, detect the position rectangle of at least one target in the feature data to be recognized, and detect the target in the position rectangle The two-dimensional coordinates of the key points; the two-dimensional coordinates of the key points of the target are calculated to obtain the angle between the line of the first preset key point combination and the line of the second preset key point combination or the first preset The angle between the line of the key point combination and the first preset line; the angle between the line of the first preset key point combination and the line of the second preset key point combination or the first preset The angle between the line of the key point combination and the first preset line is matched in the first preset database to obtain the target recognition result, which achieves the purpose of improving the recognition accuracy and efficiency of the human body posture, thereby achieving The technical effect of providing the evaluation result according to the human body posture after the accuracy and efficiency is improved is solved, and the technical problem of low data processing efficiency due to the related technology in the process of recognizing the human body posture is solved.
附图说明Description of the drawings
此处所说明的附图用来提供对本发明的进一步理解,构成本申请的一部分,本发明的示意性实施例及其说明用于解释本发明,并不构成对本发明的不当限定。在附图中:The drawings described here are used to provide a further understanding of the present invention and constitute a part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention, and do not constitute an improper limitation of the present invention. In the attached picture:
图1是根据本发明实施例的数据处理方法的流程示意图;Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present invention;
图2是根据本发明实施例的数据处理方法中权重注意力机制的示意图;2 is a schematic diagram of a weighted attention mechanism in a data processing method according to an embodiment of the present invention;
图3是根据本发明实施例的数据训练方法的流程示意图;FIG. 3 is a schematic flowchart of a data training method according to an embodiment of the present invention;
图4是根据本发明实施例的数据训练方法中权重分类模型的网络结构图;Fig. 4 is a network structure diagram of a weight classification model in a data training method according to an embodiment of the present invention;
图5是根据本发明实施例的数据训练方法的流程示意图;Fig. 5 is a schematic flowchart of a data training method according to an embodiment of the present invention;
图6是根据本发明实施例的数据训练方法中目标检测模型的示意图;Fig. 6 is a schematic diagram of a target detection model in a data training method according to an embodiment of the present invention;
图7是根据本发明实施例的数据训练方法中单人姿态估计模型的示意图;Fig. 7 is a schematic diagram of a single pose estimation model in a data training method according to an embodiment of the present invention;
图8是根据本发明实施例的数据训练方法中关键点位置和骨架连线的示意图;8 is a schematic diagram of key point positions and skeleton connections in a data training method according to an embodiment of the present invention;
图9a是根据本发明实施例的数据训练方法中关键点位置和骨架连线的标注前的效果示意图;Fig. 9a is a schematic diagram of the effect before labeling the key point positions and the skeleton connection in the data training method according to the embodiment of the present invention;
图9b是根据本发明实施例的数据训练方法中关键点位置和骨架连线的标注后的效果示意图;Fig. 9b is a schematic diagram of the effect of labeling key point positions and skeleton connections in the data training method according to an embodiment of the present invention;
图10是根据本发明实施例的数据训练方法中mix-up的效果示意图;10 is a schematic diagram of the effect of mix-up in the data training method according to an embodiment of the present invention;
图11是根据本发明实施例的数据识别方法的流程示意图;FIG. 11 is a schematic flowchart of a data identification method according to an embodiment of the present invention;
图12是根据本发明实施例的数据识别方法中基于深度学习得到的体态风险的评估的流程示意图;FIG. 12 is a schematic flowchart of a posture risk assessment based on deep learning in a data recognition method according to an embodiment of the present invention;
图13a是根据本发明实施例的体态风险的评估方法中正面照的示意图;Fig. 13a is a schematic diagram of a front view in a method for assessing posture risk according to an embodiment of the present invention;
图13b是根据本发明实施例的体态风险的评估方法中侧面照的示意图;FIG. 13b is a schematic diagram of a side view in a method for assessing a posture risk according to an embodiment of the present invention;
图14是根据本发明实施例的数据识别方法中体态风险的评估结果的展示示意图;FIG. 14 is a schematic diagram showing the evaluation result of posture risk in the data recognition method according to an embodiment of the present invention;
图15是根据本发明实施例的数据识别装置的示意图。Fig. 15 is a schematic diagram of a data recognition device according to an embodiment of the present invention.
具体实施方式detailed description
为了使本技术领域的人员更好地理解本发明方案,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分的实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本发明保护的范围。In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only It is a part of the embodiments of the present invention, but not all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
需要说明的是,本发明的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本发明的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。It should be noted that the terms “first” and “second” in the description and claims of the present invention and the above-mentioned drawings are used to distinguish similar objects, and not necessarily used to describe a specific sequence or sequence. It should be understood that the data used in this way can be interchanged under appropriate circumstances so that the embodiments of the present invention described herein can be implemented in a sequence other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations of them are intended to cover non-exclusive inclusions. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those clearly listed. Those steps or units may include other steps or units that are not clearly listed or are inherent to these processes, methods, products, or equipment.
本申请涉及的技术名词:Technical terms involved in this application:
体态评估:通过一定的技术方法对图片中人物的体态状况进行评估,比如是否具有O/X型腿,是否驼背或高低肩等体态上的疾病问题,还可进一步对各种体态状况严重情况进行等级打分;Posture evaluation: Use certain technical methods to evaluate the posture of the characters in the picture, such as whether they have O/X legs, whether they have postural diseases such as hunchback or high and low shoulders, and can further conduct various serious posture conditions Grade scoring
动作识别:通过一定的技术方法识别图片或视频中人物的动作类别,比如行走,举手,鼓掌等姿势名称或动作类别名称;Action recognition: Recognize the action category of the characters in the picture or video through certain technical methods, such as walking, raising hands, applauding and other gesture names or action category names;
关键点检测:通过一定的技术方法识别图片/视频中单个目标或多个目标的关键点坐标,如果目标为人,该关键点坐标为骨骼关键点坐标。Key point detection: Identify the key point coordinates of a single target or multiple targets in the picture/video through a certain technical method. If the target is a person, the key point coordinates are the bone key point coordinates.
实施例一Example one
根据本发明实施例的一个方面,提供了一种数据处理方法,图1是根据本发明实施例的数据处理方法的流程示意图,如图1所示,该方法包括如下步骤:According to one aspect of the embodiment of the present invention, a data processing method is provided. FIG. 1 is a schematic flowchart of a data processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
步骤S102,将具备第一数量通道的第一特征数据输入至具备第二数量滤波器的第一类卷积层进行计算,输出具备第二数量通道的第二特征数据,其中,第一数量大于第二数量;Step S102: Input the first feature data with the first number of channels into the first type convolutional layer with the second number of filters for calculation, and output the second feature data with the second number of channels, where the first number is greater than Second quantity
步骤S104,将具备第二数量通道的第二特征数据输入至具备第二数量滤波器的第二类卷积层,并根据第二类卷积层中可学习的掩码参数,通过神经网络生成第二类卷积层中各个滤波器的权重的掩码;Step S104, input the second feature data with the second number of channels to the second type convolutional layer with the second number of filters, and generate the mask parameters through the neural network according to the learnable mask parameters in the second type convolutional layer The mask of the weight of each filter in the second type of convolutional layer;
步骤S106,依据掩码确定第二类卷积层中的各个滤波器与第二特征数据中的各通道的连接方式;Step S106: Determine the connection mode between each filter in the second type convolutional layer and each channel in the second feature data according to the mask;
步骤S108,依据连接方式得到的映射关系对第二特征数据进行卷积计算,得到第三特征数据;Step S108: Perform convolution calculation on the second feature data according to the mapping relationship obtained by the connection mode to obtain the third feature data;
步骤S110,将具备第二数量通道的第三特征数据输入至具备第一数量滤波器的第三类卷积层进行计算,输出具备第一数量通道的第四特征数据。Step S110: Input the third feature data with the second number of channels into the third type convolutional layer with the first number of filters for calculation, and output the fourth feature data with the first number of channels.
Specifically, with reference to steps S102 to S110, FIG. 2 is a schematic diagram of the weight attention mechanism in the data processing method according to an embodiment of the present invention, and the weight attention mechanism shown in FIG. 2 is taken as an example for description. In the embodiment of the present application, the first feature data with the first number of channels may be feature map data with 256 channels, and the first-type convolutional layer with the second number of filters may be a 1×1 convolutional layer with 128 filters. Therefore, based on FIG. 2, step S102 inputs the feature map data with 256 channels into the 1×1 convolutional layer with 128 filters for calculation and outputs feature map data with 128 channels;
As shown in FIG. 2, after the calculation by the 1×1 convolutional layer with 128 filters, in step S104 the 128-channel feature map data is taken as input and fed into a 3×3 convolutional layer with 128 filters (that is, the second-type convolutional layer with the second number of filters in the embodiment of the present application). In the convolution calculation of the second-type (3×3) convolutional layer, a mask of the weights of each filter is generated by the fully connected layer in the 3×3 convolutional layer; according to this mask, step S106 determines the connection mode between each filter in the 3×3 convolutional layer and each channel of the 128-channel feature map data (see the mask diagram on the left of FIG. 2); according to this connection mode, step S108 performs a convolution calculation on the 128-channel feature map data based on the mapping relationship of the connection mode to obtain the third feature data, that is, feature map data with 128 channels; finally, in step S110, the 128-channel feature map data is taken as input and fed into a 1×1 convolutional layer with 256 filters for calculation, yielding feature map data with 256 channels.
Specifically, the second-type convolutional layer has 128 channels in total, and each 3×3 convolution kernel contains 10 weights, so the total number of weights is 128×10 (see the Convolution Weights diagram at the lower left of FIG. 2); its output also has 128 channels, so the corresponding mask matrix has size 128×128. Therefore, a fully connected layer with 10 input channels, 128 output channels and a sigmoid activation function can be used to generate, from the 10 weights of each convolution kernel, a weight mask corresponding to the 128 output channels.
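As an illustration of the mask-generation step just described, the following is a minimal sketch in PyTorch-style Python; the class name and the use of nn.Linear plus a sigmoid are illustrative assumptions rather than the patented implementation. It maps the 10 weights of each of the 128 convolution kernels to one 128-entry row of the 128×128 weight mask.

```python
import torch
import torch.nn as nn

class WeightMaskGenerator(nn.Module):
    """Sketch: map the 10 weights of each 3x3 kernel to one 128-entry mask row."""

    def __init__(self, weights_per_kernel=10, num_filters=128):
        super().__init__()
        # 10 input channels -> 128 output channels, followed by a sigmoid,
        # as described in the text above.
        self.fc = nn.Linear(weights_per_kernel, num_filters)

    def forward(self, kernel_weights):
        # kernel_weights: (128, 10) -- one row of 10 weights per filter
        # returns a (128, 128) soft mask with entries in (0, 1)
        return torch.sigmoid(self.fc(kernel_weights))
```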
In order to compute with the weight mask efficiently using the matrix multiplication of existing deep learning frameworks, the following calculation may be chosen: repeat the 128 input channels 128 times to obtain a 128×128 matrix, take its Hadamard product with the mask matrix, reshape the result into 128² feature channels, and then obtain a 128-channel output through a grouped convolution with 128 groups and one output channel per group.
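Under the same assumptions, the repeat / Hadamard-product / grouped-convolution computation described above could be expressed with standard framework operations roughly as in the following sketch; the shapes and function name are illustrative only.

```python
import torch
import torch.nn.functional as F

def mask_weighted_conv(x, weight, mask):
    """Illustrative sketch of the computation described above, not the patented code.

    x:      (N, 128, H, W)   -- 128-channel input feature map
    weight: (128, 128, 3, 3) -- 3x3 kernels of the second-type convolutional layer
    mask:   (128, 128)       -- mask[o, i] gates input channel i for output filter o
    """
    n, c_in, h, w = x.shape
    c_out = mask.shape[0]
    # Repeat the 128 input channels 128 times: one copy per output filter.
    x_rep = x.unsqueeze(1).expand(n, c_out, c_in, h, w)
    # Hadamard product with the mask matrix (broadcast over the spatial dims).
    x_masked = x_rep * mask.view(1, c_out, c_in, 1, 1)
    # Reshape into 128*128 feature channels.
    x_flat = x_masked.reshape(n, c_out * c_in, h, w)
    # Grouped convolution: 128 groups, one output channel per group.
    return F.conv2d(x_flat, weight, padding=1, groups=c_out)   # (N, 128, H, W)
```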
The foregoing example is given in the embodiments of the present application only by way of illustration; the implementation of the data processing method provided in the embodiments of the present application shall prevail, and no specific limitation is imposed.
In an implementable example, generating, through a neural network and according to the learnable mask parameters in the second-type convolutional layer, a mask for the weights of each filter in the second-type convolutional layer includes: generating the mask of the weights of each filter in the second-type convolutional layer from the fully connected layer in the second-type convolutional layer.
In the embodiment of the present application, a fully connected layer is used to generate the filter mask, as shown in the mask on the left of FIG. 2. During backward propagation the network learns the mask, and the entries set to 1 in the mask correspond to the filters selected by the network. Since the connection mode of the filters is selected by learning a weight mask, the filter selection and the filter-to-channel convolution calculation used in steps S102 to S110 of the embodiment of the present application constitute a weight attention mechanism. The embodiment of the present application introduces a learnable mask mechanism instead of artificially fixing the grouped convolution pattern of the network (the grouped convolution pattern refers to the lines of the same type between "in" and "out" shown in FIG. 2, indicating that an output convolution is computed only from the connected input channels), allowing the network itself to learn the convolution groups and to select the filters useful to the network for the convolution operation, thereby improving the performance of the network.
It should be noted that generating the filter mask with the fully connected layer in the above embodiment of the present application may also be implemented by the following scheme:
As shown in FIG. 2, after the calculation by the 1×1 convolutional layer with 128 filters, in step S104 the 128-channel feature map data is taken as input and fed into the 3×3 convolutional layer with 128 filters (that is, the second-type convolutional layer with the second number of filters in the embodiment of the present application). In the convolution calculation of the second-type (3×3) convolutional layer, a learnable mask of the weights of each filter in the 3×3 convolutional layer is used; according to this mask, step S106 determines the connection mode between each filter in the 3×3 convolutional layer and each channel of the 128-channel feature map data (see the mask diagram on the left of FIG. 2); according to this connection mode, step S108 performs a convolution calculation on the 128-channel feature map data based on the mapping relationship of the connection mode to obtain the third feature data, that is, feature map data with 128 channels; finally, in step S110, the 128-channel feature map data is fed into the 1×1 convolutional layer with 256 filters for calculation, yielding feature map data with 256 channels. The mask is produced as a 128×128 mask matrix from learnable parameters, a differentiable transformation and a sigmoid activation function, and the mask is multiplied with the filter weights, so that different output filters of the second-type convolutional layer selectively use different input features; at prediction time the mask is binarized to 0 or 1 according to a preset threshold, and optionally a grouped convolution may be performed based on the specific connection mode to optimize computational efficiency.
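A minimal sketch of how the learnable mask might be multiplied with the filter weights and binarized at prediction time is given below; the threshold value of 0.5 and the function name are assumptions, not taken from the original text.

```python
import torch

def apply_weight_mask(weight, soft_mask, training=True, threshold=0.5):
    """Sketch of the mask use described above (threshold value is an assumption).

    weight:    (C_out, C_in, k, k) filter weights of the second-type convolutional layer
    soft_mask: (C_out, C_in)       mask produced from learnable parameters + sigmoid
    """
    if not training:
        # At prediction time the mask is binarized to 0 or 1 by a preset threshold.
        soft_mask = (soft_mask > threshold).float()
    # Multiply the mask with the filter weights so that each output filter
    # selectively uses different input channels.
    return weight * soft_mask[:, :, None, None]
```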
The number of parameters used by the fully connected layer to generate the mask is 10 (inputs) × 128 + 128. In fact, any way of generating a 128×128 mask from trainable parameters can be used. The embodiment of the present application only takes the above example for description; the implementation of the data processing method provided in the embodiments of the present application shall prevail, and no specific limitation is imposed.
In the above scheme, back propagation is used to learn the filter mask. As shown in the mask on the left of FIG. 2, the entries set to 1 in each row of the mask correspond to the input channel features selected by that output filter. Since the connection mode of the filters is selected by learning a weight mask, the filter selection and the filter-to-channel convolution calculation used in steps S102 to S110 of the embodiment of the present application constitute a weight attention mechanism. In the embodiment of the present application, the WeightNet network introduces a learnable mask mechanism instead of artificially fixing the grouped convolution pattern of the network (the grouped convolution pattern refers to the lines of the same type between "in" and "out" shown in FIG. 2, indicating that an output convolution is computed only from the connected input channels), allowing the network itself to learn the convolution groups and to select the filters useful to the network for the convolution operation, thereby improving the performance of the network.
Optionally, the data processing method provided in the embodiment of the present application is applied to deep learning in artificial intelligence.
Specifically, the convolution algorithm based on steps S102 to S110 can apply this weight attention mechanism to artificial intelligence technology, in particular to deep neural network learning, so that the network can, based on its own learning, group the filters and channels by itself and then perform the convolution calculation, thereby improving the data processing capability of deep neural network learning.
Optionally, the data processing method provided in the embodiment of the present application is applied to recognizing the posture or action of a target in a picture/video.
Specifically, in the embodiment of the present application the target may be a person, an animal, etc., that is, a person or animal in a picture or video. As an extension of artificial intelligence (AI) computation and in generally applicable cases, the convolution algorithm based on steps S102 to S110 can apply the weight attention mechanism specifically to recognizing the posture of a target in a picture/video. It can be applied to a security monitoring environment, where, based on targets such as persons, vehicles, animals and insects in the acquired pictures/videos, the behaviors and movement trajectories of those persons, vehicles, animals and insects are predicted;
In the embodiment of the present application, this technique may preferably be applied to medical diagnosis. For example, a person in a picture/video is recognized and taken as the target; the key points constituting the target are obtained from the outline of the target; posture evaluation is performed according to the key points; and the bone health of the target is further evaluated according to the posture. The image calculation process used to recognize the person in the picture/video may be the data processing method described in steps S102 to S110, and the applicable convolution algorithm may be as shown in FIG. 2.
The application of the convolution algorithm shown in FIG. 2 to data model training in AI technology is detailed in the data training method of Embodiment 2, in which a WeightNet network is obtained based on the convolution algorithm shown in FIG. 2; data training based on the WeightNet network is detailed in Embodiment 2.
Embodiment 2
According to one aspect of the embodiments of the present invention, a data training method is provided. FIG. 3 is a schematic flowchart of the data training method according to an embodiment of the present invention. As shown in FIG. 3, the method includes the following steps:
Step S302: obtain a weight classification model to be trained, where the weight classification model is a neural network model for acquiring image features of image data;
Specifically, FIG. 4 is a network structure diagram of the weight classification model in the data training method according to an embodiment of the present invention. In the embodiment of the present application, the WeightNet-50 classification model is taken as an example; based on this classification model, image feature extraction can be performed on the image to be processed. Likewise, a WeightNet-101 classification model is also applicable to the data training method provided in the embodiment of the present application. The embodiment of the present application only takes the WeightNet-50 classification model as an example; the implementation of the data training method provided in the embodiment of the present application shall prevail, and no specific limitation is imposed.
Step S304: train the weight classification model to be trained to obtain the weight classification model, where the method used in training the weight classification model to be trained includes the data processing method in Embodiment 1 above.
Specifically, based on the WeightNet-50 classification model obtained in step S302, training pictures are fed into the WeightNet-50 classification model for convolution training, and the WeightNet classification model (that is, the weight classification model provided in the embodiment of the present application) is finally obtained.
Optionally, training the weight classification model to be trained in step S304 to obtain the weight classification model includes:
Step S3041: input the data in a first preset data set into the weight classification model to be trained to obtain a category prediction result;
Specifically, in the embodiment of the present application, the preset data set is an image data set covering all object categories, where the object categories include natural categories such as person, dog and horse. In step S3041, the picture data of each type in the image data set is input as the first data into the weight classification model to be trained, that is, pictures of each type such as person, dog and horse are input into the WeightNet-50 classification model to be trained, and the category prediction result of each piece of picture data is obtained.
As shown in FIG. 4, taking the WeightNet-50 classification model to be trained as an example, the WeightNet-50 classification model consists of residual structures, pooling structures and fully connected structures:
The residual structure is completed by three convolutional layers: the first convolutional layer has n 1×1 convolution kernels with a stride of 1; the second layer has n 3×3 convolution kernels with a stride of 1; the third layer has 2n 1×1 convolution kernels with a stride of 1;
The specific network parameter configuration is: the first convolution block consists of n=64 7×7 convolution kernels with a stride of 2, followed by 3×3 pooling with a stride of 2; the second convolution block consists of 3 residual structures with n=64; the third convolution block consists of 4 residual structures with n=128; the fourth convolution block consists of 6 residual structures with n=256; the fifth convolution block consists of 3 residual structures with n=512; finally there is a global average pooling layer and a softmax fully connected layer with 1000 outputs.
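For illustration only, the three-layer residual structure described above could be sketched as follows. Batch normalization, the ReLU activations and the 1×1 projection on the skip connection are common ResNet-style assumptions that the text does not spell out, and the weight attention mask of FIG. 2 is omitted for brevity.

```python
import torch.nn as nn

class ResidualStructure(nn.Module):
    """Sketch of the three-layer residual structure: 1x1 (n) -> 3x3 (n) -> 1x1 (2n)."""

    def __init__(self, in_channels, n):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, n, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(n), nn.ReLU(inplace=True),
            nn.Conv2d(n, n, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(n), nn.ReLU(inplace=True),
            nn.Conv2d(n, 2 * n, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(2 * n),
        )
        # 1x1 projection so the skip connection matches the 2n output channels (assumption)
        self.project = nn.Conv2d(in_channels, 2 * n, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + self.project(x))
```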
It should be noted that in the embodiment of the present application, when the data in the first preset data set is input into the weight classification model to be trained, the obtained category prediction result may be the category prediction result of a picture or of an image in a video. The preset data set used in the embodiment of the present application takes a picture data set only as a preferred example; in addition, a video image data set may also be included. The implementation of the data training method provided in the embodiment of the present application shall prevail, and no specific limitation is imposed.
Step S3042: obtain, according to the category prediction result and the label categories of the data in the first preset data set, the error between the category prediction result and the label categories of the data in the first preset data set;
Specifically, the picture data with annotated label categories in the first preset data set is input into the WeightNet-50 classification model to be trained, features are extracted through forward propagation, and the category prediction result is obtained.
The category prediction result is compared with the label categories of the data in the first preset data set to obtain the error between the category prediction result and the label categories of the data in the first preset data set.
Step S3043: train the weight classification model to be trained with a back-propagation algorithm according to the error until the weight classification model to be trained converges, obtaining a converged weight classification model.
Specifically, based on the error obtained in step S3042, the error back-propagation algorithm is used to train the model until the model converges, and the WeightNet-50 classification model is obtained.
It should be noted here that the first preset data set in the embodiment of the present application may be the ImageNet data set. The WeightNet classification model is pre-trained with millions of ImageNet classification samples, and the feature extraction module in the target detection model is initialized with the converged weight classification model, thereby improving the accuracy of the final target detection model and accelerating the convergence of model training. The ImageNet data set is used because ImageNet contains 1.2 million images in 1000 categories, and training with such a large number of samples can meet the requirements of AI technology for deep neural network learning.
Therefore, it should be noted that the first preset data set provided in the embodiment of the present application only takes the ImageNet data set as an example; the implementation of the data training method provided in the embodiment of the present application shall prevail, and no specific limitation is imposed.
Further, optionally, training the weight classification model to be trained with the back-propagation algorithm according to the error in step S3043 until the weight classification model to be trained converges includes:
Step S30441: iterate repeatedly through excitation propagation and weight update until the weight classification model to be trained converges. In the case where the weight classification model to be trained includes residual structures, pooling structures and fully connected structures, iterating repeatedly through excitation propagation and weight update until the weight classification model to be trained converges includes: in the excitation propagation phase, passing the image through the convolutional layers of the weight classification model to be trained to obtain features, obtaining the category prediction result at the fully connected layer of the weight classification model to be trained, and then taking the difference between the category prediction result and the label categories of the data in the first preset data set to obtain the response errors of the hidden layers and the output layer; in the weight update phase, multiplying the error by the derivative of the function of the current layer's response with respect to the previous layer's response to obtain the gradient of the weight matrix between the two layers, and adjusting the weight matrix at the set learning rate along the opposite direction of the gradient; determining the gradient matrix as the error of the previous layer and calculating the weight matrix of the previous layer; and updating the weight classification model to be trained through iterative calculation until the weight classification model to be trained converges.
Specifically, still taking the ImageNet data set as an example, the labeled category data of ImageNet is used to train the network parameters: features are extracted through forward propagation, the error between the category prediction result (one-hot) output by the network and the true label categories is computed, and the error back-propagation algorithm is used to train the model until the model converges, obtaining the WeightNet-50 classification model.
The convolutional neural network model is trained through the error back-propagation algorithm, specifically the repeated iteration of the two phases of excitation propagation and weight update, until the convergence condition is reached;
In the excitation propagation phase, the image is passed through the convolutional layers of the WeightNet-50 classification model to obtain features, the prediction result is obtained at the last fully connected layer of the network, and the difference between the prediction result and the true result is taken, thereby obtaining the response errors of the hidden layers and the output layer;
In the weight update phase, the known error is first multiplied by the derivative of the function of the current layer's response with respect to the previous layer's response to obtain the gradient of the weight matrix between the two layers, and the weight matrix is then adjusted at the set learning rate along the opposite direction of this gradient; subsequently, this gradient matrix is treated as the error of the previous layer to calculate the weight matrix of the previous layer, and so on, completing the update of the entire model;
Here, in the embodiment of the present application, Adam may be used as the optimizer for training the WeightNet-50 classification model. For the parameter settings, the base learning rate may be set to 0.1, divided by 10 at the 32000th and 48000th iterations, with training terminated at the 64000th iteration; the weight decay value is set to 0.0001 and the batch size is set to 128.
It should be noted that training the WeightNet-50 classification model with Adam as the optimizer in the embodiment of the present application is only an example, and the above parameter settings are only preferred examples; the implementation of the data training method provided in the embodiment of the present application shall prevail, and no specific limitation is imposed.
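As an illustration of the training configuration just described (Adam, base learning rate 0.1 divided by 10 at the 32000th and 48000th iterations, termination at 64000 iterations, weight decay 0.0001, batch size 128), a minimal sketch of the loop is shown below; the placeholder model and data loader are stand-ins for WeightNet-50 and the ImageNet data.

```python
import torch
import torch.nn as nn

# Placeholder model and data; in practice these would be WeightNet-50 and ImageNet batches.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))
loader = [(torch.randn(128, 3, 224, 224), torch.randint(0, 1000, (128,)))]

optimizer = torch.optim.Adam(model.parameters(), lr=0.1, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

for step, (images, labels) in enumerate(loader, start=1):
    logits = model(images)              # excitation propagation: forward pass
    loss = criterion(logits, labels)    # error against the label categories
    optimizer.zero_grad()
    loss.backward()                     # back-propagate the response errors
    optimizer.step()                    # weight update along the negative gradient
    if step in (32000, 48000):          # divide the base learning rate of 0.1 by 10
        for group in optimizer.param_groups:
            group["lr"] /= 10
    if step == 64000:                   # terminate training at 64000 iterations
        break
```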
Embodiment 3
According to one aspect of the embodiments of the present invention, a data training method is provided. FIG. 5 is a schematic flowchart of the data training method according to an embodiment of the present invention. As shown in FIG. 5, the method includes:
Step S502: initialize the feature extraction module in a target detection model with the converged weight classification model to obtain the target detection model to be trained, where the converged weight classification model is obtained through training by the method in Embodiment 2;
Specifically, the data training method provided in the embodiment of the present application is suitable for training a weight attention neural network model, where the weight attention neural network model includes a target detection model (Faster-RCNN). The Faster-RCNN is used to extract the position box information of each person in the input image and feed it to a single-person pose estimation model for pose estimation, where the Faster-RCNN includes: a feature extraction module (WeightNet), a proposal box generation module (RPN), and a target classifier and position box regression prediction module (Fast-RCNN);
The feature extraction module in step S502 is the feature extraction module in the Faster-RCNN. Based on the weight classification model obtained in Embodiment 2, the feature extraction module is initialized from this weight classification model, excluding the output layer parameters. Here, initializing, with the weight classification model, the weights of the feature extraction module that acquires the image features of the first preset data set may be as follows:
Still taking the WeightNet-50 classification model as an example, the WeightNet-50 classification model is pre-trained on the classification task with the ImageNet data set, and the finally converged weights are used as the initialization weights of the feature extraction module in the person detection model, thereby improving the accuracy of the final person detection model and accelerating the convergence of model training;
Training and validation here follow the above data preprocessing operations; Adam (Adaptive Moment Estimation, a stochastic optimization method) is used as the optimizer; the base learning rate is set to 0.1, divided by 10 at the 32000th and 48000th iteration steps, and training is terminated at the 64000th iteration step; the weight decay value is 0.0001; the batch size is set to 128.
The preprocessing operation of the images in the first preset data set adopts random horizontal flipping with a preset probability, where the preset probability may be set to 50%; when an image that does not need to be flipped is obtained, no random flipping is required, subject to the actual requirements of the image processing.
Step S504: train the target detection model to be trained with the target position box label information in a second preset data set to obtain the trained target detection model;
Specifically, based on the target detection model to be trained obtained in step S502, FIG. 6 is a schematic diagram of the target detection model in the data training method according to an embodiment of the present invention. As shown in FIG. 6, in combination with the second preset data set: in the embodiment of the present application, the second preset data set may be a data set containing target position box label information, for example a data set composed of the target position box label information in the COCO and Kinetics-14 data sets. Here the target detection model is trained with the data set composed of the target position box label information in the COCO and Kinetics-14 data sets to improve the recognition effect of the final overall architecture in locating persons in similar scenes. In addition, it should be noted that the feature extraction module in the embodiment of the present application is obtained based on the weight classification model trained in Embodiment 2; the difference lies in the structure and function of the weight classification model and the feature extraction module;
The weight classification model is used to pre-train the WeightNet-50 classification model on the classification task with the ImageNet data set, and the finally converged weights are used as the initialization weights of the feature extraction module in the target detection model, thereby improving the accuracy of the final target detection model and accelerating the convergence of model training; the structure of the weight classification model is: weight classification network + classifier;
The feature extraction module is obtained by initializing its weights through the weight classification model; structurally, the feature extraction module is the weight classification model with the classifier removed, that is, the part containing the weight classification network.
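One possible way to perform this initialization, copying the converged classification weights into the feature extraction module while skipping the classifier/output-layer parameters, is sketched below; the parameter-name prefix "fc" for the classifier head is an assumption about the layout.

```python
import torch

def init_backbone_from_classifier(backbone, classifier_state, skip_prefix="fc"):
    """Sketch: load the converged weight classification model's parameters into the
    detector's feature extraction module, excluding the output-layer parameters.
    The 'fc' prefix is an assumed name for the classifier head."""
    filtered = {name: tensor for name, tensor in classifier_state.items()
                if not name.startswith(skip_prefix)}
    # strict=False because the backbone has no classifier head to receive those keys
    backbone.load_state_dict(filtered, strict=False)
    return backbone
```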
Step S506: train the network parameters of the single-person pose estimation model to be trained according to the target key point label information in a third preset data set to obtain the trained single-person pose estimation model;
Specifically, in the embodiment of the present application, the third preset data set may be a data set containing target key point label information, for example a data set composed of the target key point label information in the COCO and Kinetics-14 data sets. Here the single-person pose estimation model is trained with the data set composed of the target key point label information in the COCO and Kinetics-14 data sets to improve the recognition effect of the final overall architecture on the skeletal key points of persons in similar scenes.
In the embodiment of the present application, the single-person pose estimation model may take the HRNet model as an example. FIG. 7 is a schematic diagram of the single-person pose estimation model in the data training method according to an embodiment of the present invention. As shown in FIG. 7, based on the HRNet algorithm and the data set constructed above, a single-person pose model suited to this scenario is retrained. The HRNet model connects high-resolution to low-resolution sub-networks in parallel, which differs from the serial connection in the related art: the HRNet model maintains high resolution instead of recovering the resolution through a low-to-high process. Also, unlike the fusion schemes in the related art, which aggregate low-level and high-level representations, the HRNet model in the embodiment of the present application uses repeated multi-scale fusion, using low-resolution representations of the same depth and similar level to improve the high-resolution representation.
Step S508: obtain a weight attention neural network model according to the trained target detection model and the trained single-person pose estimation model.
Specifically, the trained target detection model obtained in step S504 and the trained single-person pose estimation model obtained in step S506 are combined to obtain the weight attention neural network model, that is, the combination of the Faster-RCNN model and the HRNet model constitutes the weight attention neural network model.
In summary, in the data training method provided in the embodiment of the present application, the first preset data set is used to train the weight classification model, and the converged weight classification model is then used to initialize the feature extraction module in the target detection model; the second preset data set is used to train the target detection model; and the third preset data set is used to train the single-person pose estimation model.
Optionally, training the target detection model to be trained with the target position box label information in the second preset data set in step S504 to obtain the trained target detection model includes:
Step S5041: in the case where the target detection model includes the feature extraction module, the proposal box generation module, and the target classifier and position box regression prediction module, train the feature extraction module and the proposal box generation module respectively to obtain a first parameter value of the feature extraction module and a first parameter value of the proposal box generation module;
Specifically, based on step S502, the Faster-RCNN includes: the feature extraction module, the proposal box generation module (RPN), and the target classifier and position box regression prediction module (Fast-RCNN). The parameters of the feature extraction module and the RPN module are trained as follows: the feature extraction module and the RPN module parameters are trained separately to obtain rpn1 (that is, the first parameter value of the proposal box generation module in the embodiment of the present application) and weightnet1 (that is, the first parameter value of the feature extraction module in the embodiment of the present application).
The proposal box generation module and the target classifier and position box regression prediction module in the target detection model may each be initialized with different data distribution methods (common initialization methods are: 1. initialization to zero; 2. random initialization; 3. Xavier initialization; 4. He initialization; in the embodiment of the present application, 3 or 4 is preferred).
Step S5042: train the target classifier and position box regression prediction module according to the first parameter value of the feature extraction module and the first parameter value of the proposal box generation module to obtain a first parameter value of the target classifier and position box regression prediction module and a second parameter value of the feature extraction module;
Specifically, Fast-RCNN (that is, the target classifier and position box regression prediction module in the embodiment of the present application) is trained according to the first parameter value of the feature extraction module and the first parameter value of the proposal box generation module to obtain fast-rcnn1 (that is, the first parameter value of the target classifier and position box regression prediction module in the embodiment of the present application) and WeightNet2 (that is, the second parameter value of the feature extraction module in the embodiment of the present application).
Step S5043: train the proposal box generation module according to the first parameter value of the target classifier and position box regression prediction module and the second parameter value of the feature extraction module to obtain a second parameter value of the proposal box generation module;
Specifically, the RPN (that is, the proposal box generation module in the embodiment of the present application) is trained in combination with fast-rcnn1 and WeightNet2 to obtain rpn2 (that is, the second parameter value of the proposal box generation module in the embodiment of the present application).
Step S5044: train the target classifier and position box regression prediction module according to the second parameter value of the proposal box generation module and the second parameter value of the feature extraction module to obtain a second parameter value of the target classifier and position box regression prediction module.
Specifically, the Fast-RCNN module is trained according to the second parameter value of the feature extraction module and the second parameter value of the proposal box generation module to obtain fast-rcnn2 (that is, the second parameter value of the target classifier and position box regression prediction module in the embodiment of the present application).
In the process of training the target detection model, the input image preprocessing operations may adopt mix-up and random horizontal flipping (50%), and Adam may be used as the optimizer, for example; the parameters may be set as: base learning rate 0.001, weight decay value 0.0001, batch size 32, and the numbers of iteration steps of the four training stages are 80000, 40000, 80000 and 40000, respectively.
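The four-stage alternating procedure of steps S5041 to S5044 can be outlined as follows; train_stage is an assumed helper standing in for one complete training stage (returning the updated trainable modules), so this is a schematic outline under those assumptions rather than a definitive training implementation.

```python
def train_faster_rcnn_alternating(weightnet, rpn, fast_rcnn, train_stage):
    """Outline of the four-stage alternating training described above.

    train_stage(trainable, frozen, steps) is an assumed helper that runs one
    training stage and returns a tuple with the updated trainable modules.
    """
    # Stage 1 (80000 steps): train the feature extractor and the RPN -> weightnet1, rpn1
    weightnet1, rpn1 = train_stage(trainable=[weightnet, rpn], frozen=[], steps=80000)
    # Stage 2 (40000 steps): train Fast-RCNN (and the backbone) using rpn1's proposals
    fast_rcnn1, weightnet2 = train_stage(trainable=[fast_rcnn, weightnet1],
                                         frozen=[rpn1], steps=40000)
    # Stage 3 (80000 steps): retrain the RPN with fast_rcnn1 and weightnet2 -> rpn2
    (rpn2,) = train_stage(trainable=[rpn1], frozen=[weightnet2, fast_rcnn1], steps=80000)
    # Stage 4 (40000 steps): retrain Fast-RCNN with weightnet2 and rpn2 -> fast_rcnn2
    (fast_rcnn2,) = train_stage(trainable=[fast_rcnn1], frozen=[weightnet2, rpn2], steps=40000)
    return weightnet2, rpn2, fast_rcnn2
```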
Further, optionally, the feature extraction module is used to extract the features of each piece of data in the second preset data set; the proposal box generation module is used to generate candidate target boxes of each piece of data according to the features of each piece of data in the second preset data set; the target classifier and position box regression prediction module is used to obtain, according to the features of each piece of data in the second preset data set and the candidate target boxes of each piece of data, the detection boxes of the targets in each piece of data in the second preset data set and the categories of the corresponding detection boxes. In the case where the proposal box generation module includes a convolutional layer with a sliding window, followed by two parallel convolutional layers that are a regression layer and a classification layer respectively, using the proposal box generation module to generate candidate target boxes of each piece of data according to the features of each piece of data in the second preset data set includes: obtaining, through the regression layer according to the features of each piece of data in the second preset data set, the coordinates of the center anchor point of each candidate target box of each piece of data and the width and height of the corresponding candidate target box; and determining, through the classification layer, whether each candidate target box of each piece of data is foreground or background.
Specifically, based on the foregoing, as shown in FIG. 6, the feature extraction module in the target detection model provided in the embodiment of the present application is used to extract the feature map of the input image;
The input of the proposal box generation module (RPN) is the feature map extracted by the feature extraction module, and its output is a series of coordinates of candidate human target rectangular boxes, used to generate the candidate target boxes of the input image.
The main inputs of the target classifier and position box regression prediction module (Fast-RCNN) are the feature map extracted by the feature extraction module and the candidate boxes generated by the proposal box generation module, used for accurate position regression and category prediction results.
The RPN network structure includes: a convolutional layer using a 3×3 sliding window, followed by two parallel 1×1 convolutional layers, namely a regression layer (reg_layer) and a classification layer (cls-layer). The regression layer (reg_layer) is used to predict, for the center anchor point of the window, the coordinates x, y and the width and height w, h of the candidate box on the original image; the classification layer (cls-layer) is used to determine whether the candidate is foreground or background.
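A sketch of such an RPN head is given below; the input channel count and the number of anchors per location are illustrative assumptions not taken from the text.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN structure described above: a 3x3 sliding-window conv
    followed by two parallel 1x1 convs (regression and classification)."""

    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # reg_layer: x, y, w, h of the candidate box for each anchor
        self.reg_layer = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)
        # cls-layer: foreground / background score for each anchor
        self.cls_layer = nn.Conv2d(in_channels, num_anchors * 2, kernel_size=1)

    def forward(self, feature_map):
        x = self.relu(self.conv(feature_map))
        return self.reg_layer(x), self.cls_layer(x)
```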
Further, optionally, in the case where the structure of the target classifier and position box regression prediction module is one pooling layer, three fully connected layers and two parallel fully connected layers connected in sequence, using the target classifier and position box regression prediction module to obtain, according to the features of each piece of data in the second preset data set and the candidate target boxes of each piece of data, the detection boxes of the targets in each piece of data in the second preset data set and the categories of the corresponding detection boxes includes: converting, through the pooling layer, the features of each piece of data of different lengths output by the feature extraction module into features of each piece of data of fixed length; and, according to the fixed-length features of each piece of data, passing through the three fully connected layers and then through the two parallel fully connected layers, outputting the detection boxes of the targets in each piece of data in the second preset data set and the categories of the corresponding detection boxes.
Specifically, taking person detection as an example, the target classifier and position box regression prediction module includes one ROI pooling layer, three fully connected layers and two parallel fully connected layers. The main function of the ROI pooling layer is to convert inputs of different sizes into outputs of fixed length, and the two parallel fully connected layers are mainly used to predict the category and regress the person detection box.
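A sketch of this head is shown below; the ROI output size, the hidden dimensions, the two-class setting for person detection, and the use of adaptive pooling as a stand-in for ROI pooling are assumptions for illustration.

```python
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Sketch: one ROI pooling layer, three fully connected layers, then two
    parallel fully connected layers for the category and the detection box."""

    def __init__(self, in_channels=256, roi_size=7, hidden=1024, num_classes=2):
        super().__init__()
        # Stand-in for ROI pooling: converts variable-sized ROI features to a fixed size.
        self.roi_pool = nn.AdaptiveMaxPool2d((roi_size, roi_size))
        self.fcs = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * roi_size * roi_size, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
        )
        self.cls_fc = nn.Linear(hidden, num_classes)      # category prediction
        self.box_fc = nn.Linear(hidden, num_classes * 4)  # detection-box regression

    def forward(self, roi_features):
        # roi_features: (num_rois, C, h, w) cropped from the backbone feature map
        x = self.fcs(self.roi_pool(roi_features))
        return self.cls_fc(x), self.box_fc(x)
```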
Optionally, training the network parameters of the single-person pose estimation model to be trained according to the target key point label information in the third preset data set in step S506 to obtain the trained single-person pose estimation model includes: training the network parameters of the single-person pose estimation model to be trained according to the target key point label information in the third preset data set, and iteratively updating the network parameters of the single-person pose estimation model to be trained through forward propagation and backward propagation algorithms; where training the network parameters of the single-person pose estimation model to be trained according to the target key point label information in the third preset data set and iteratively updating the network parameters of the single-person pose estimation model to be trained through forward propagation and backward propagation algorithms includes: expanding the height or width of the input single-person image according to a preset aspect ratio, and cropping the single-person image to a preset size.
Specifically, still taking the HRNet model as an example, the input of the HRNet single-person pose estimation network is a single-person image, and the output is the two-dimensional coordinates of the key points of the human skeleton in the single-person image. The structure of the HRNet single-person pose estimation network is shown in FIG. 7 and has four stages: starting from the second stage, each stage branches off a parallel sub-network whose resolution is halved compared with the previous network and whose width (number of channels C) is doubled; therefore, by the final fourth stage there are four parallel sub-networks. At the same time, each stage (except the first) contains several exchange blocks; each exchange block contains, on each branch, a basic unit (composed of 4 WeightNet residual units, each as shown in FIG. 2) and an exchange unit that spans resolutions. The function of the exchange unit is to fuse, through up-sampling, down-sampling or identity mapping operations, the outputs of the current parallel sub-networks of different resolutions as the next input of each branch, so as to achieve the effect of multi-scale fusion in the model. Specifically, the first stage contains a basic unit and a 3×3 convolutional layer whose main function is to reduce the channels of the feature map output by the basic unit to 32, serving as the subsequent high-resolution branch; the second, third and fourth stages contain 1, 4 and 3 exchange blocks, respectively. It follows that HRNet has 8 exchange blocks in total and performs 8 multi-scale fusions; in the final stage, the numbers of channels of the branches are 32, 64, 128 and 256, respectively.
The target key point label information in the COCO and Kinetics-14 data sets is used to train the HRNet network parameters, and the network parameters are updated iteratively through forward propagation and backward propagation algorithms. The HRNet network expands the height or width of the input single-person image to a fixed aspect ratio (height to width equal to 4:3) and then crops the image to a fixed size of 384×288; data augmentation (preprocessing) includes random rotation (±45 degrees), random scaling (0.65 to 1.35) and/or random horizontal flipping; the Adam optimizer is used during training, the base learning rate is set to 0.001 and the batch size to 16, and the learning rate drops to 0.0001 and 0.00001 at the 170th and 200th epochs, respectively. The total number of training epochs is set to 210.
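The input handling and augmentation just described might look like the following sketch; padding-based expansion to the 4:3 ratio and the torchvision helpers are assumptions, not the patented preprocessing code.

```python
import random
import torchvision.transforms.functional as TF

def preprocess_single_person(image, target_h=384, target_w=288, train=True):
    """Sketch: expand the single-person crop to a 4:3 height:width ratio,
    apply the random augmentations from the text, and resize to 384x288.
    image: (C, H, W) tensor."""
    _, h, w = image.shape
    if h * 3 < w * 4:                               # too wide -> expand the height
        pad_h = (w * 4 + 2) // 3 - h
        image = TF.pad(image, [0, 0, 0, pad_h])     # pad bottom
    else:                                           # too tall -> expand the width
        pad_w = (h * 3 + 3) // 4 - w
        image = TF.pad(image, [0, 0, pad_w, 0])     # pad right
    if train:
        image = TF.rotate(image, random.uniform(-45, 45))       # random rotation ±45°
        scale = random.uniform(0.65, 1.35)                       # random scaling 0.65–1.35
        _, h, w = image.shape
        image = TF.resize(image, [max(1, int(h * scale)), max(1, int(w * scale))])
        if random.random() < 0.5:                                # random horizontal flip
            image = TF.hflip(image)
    return TF.resize(image, [target_h, target_w])                # fixed 384x288 input
```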
In summary, both the target detection model and the single-person pose estimation model provided in the embodiments of the present application use the forward propagation algorithm to obtain the mean square error between the model's predicted output and the true labels, as in formula (1):

$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y'_i\right)^2$  (1)

where $y_i$ is the model's prediction for the i-th data item, $y'_i$ is the true label of the i-th data item, and n is the batch size value;
The model parameters are updated through the back-propagation algorithm, and through a finite number of iterations the mean square error of the training model on the training data set is minimized/converges (when training the model, the model is said to have converged and the error to be minimized when the training accuracy and error no longer change with the training iteration steps and tend to be stable); the optimal model is then selected through the validation set as the detection model for the test phase (during training, the model is tested with the validation set at certain training intervals, and the model with the highest accuracy or the smallest error on the validation set is finally selected).
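For reference, formula (1) corresponds to a loss of the following form; this is a straightforward sketch, and the tensor layout (one prediction per data item along the first dimension) is an assumption.

```python
import torch

def mse_loss(pred, target):
    """Formula (1): mean square error over a batch of n data items.

    pred, target: tensors of shape (n, ...) with one prediction/label per item.
    """
    n = pred.shape[0]                       # batch size value n
    return ((pred - target) ** 2).sum() / n
```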
可选的,对待训练的单人姿态估计模型的网络参数进行训练中使用的方法包括实施例1中的数据处理方法。Optionally, the method used in training the network parameters of the single-person pose estimation model to be trained includes the data processing method in Embodiment 1.
可选的,本申请实施例提供的数据训练方法还包括:收集训练待训练的目标检测模型和待训练的单人姿态估计模型所需的样本;对样本进行预处理,其中,预处理包括:数据集的划分和预处理操作;对待训练的权重分类模型进行训练,得到收敛的权重分类模型包括:将第一预设数据集中的数据输入待训练的权重分类模型,得到类别预测结果;依据类别预测结果与第一预测数据集中的数据的标签类别,得到类别预测结果与第一预测数据集中的数据的标签类别的误差;依据误差进行反向传播算法训练待训练的权重分类模型,直至待训练的权重分类模型收敛,得到收敛的权重分类模型。Optionally, the data training method provided in this embodiment of the application further includes: collecting samples required for training the target detection model to be trained and the single-person pose estimation model to be trained; preprocessing the samples, where the preprocessing includes: Data set division and preprocessing operations; training the weight classification model to be trained to obtain a convergent weight classification model includes: inputting the data in the first preset data set into the weight classification model to be trained to obtain the category prediction result; according to the category The prediction result and the label category of the data in the first prediction data set are obtained, and the error between the category prediction result and the label category of the data in the first prediction data set is obtained; according to the error, the backpropagation algorithm is used to train the weight classification model to be trained until it is to be trained The weight classification model converges to obtain a convergent weight classification model.
具体的,本申请实施例中的样本可以源于开源数据集,例如:Microsoft COCO 2017 Keypoint Detection Dataset(微软COCO 2017关键点检测数据集)、Kinetics-600和ImageNet(Large Scale Visual Recognition Challenge);Specifically, the samples in the embodiments of this application may be derived from open source data sets, such as: Microsoft COCO 2017 Keypoint Detection Dataset (Microsoft COCO 2017 Keypoint Detection Dataset), Kinetics-600 and ImageNet (Large Scale Visual Recognition Challenge);
其中,本申请实施例中的预处理包括的数据集的划分和预处理操作,其中,数据集的划分是数据输入到模型前,对数据进行处理的步骤,其中,对上述三个数据集依据预设方式进行数据划分,以便筛选得到最优的数据模型。Among them, the preprocessing in the embodiment of the present application includes the division of data sets and preprocessing operations, where the division of data sets is a step of processing data before inputting the data into the model, wherein the above three data sets are based on The data is divided in a preset way, so that the optimal data model can be obtained by screening.
预处理操作包括混合操作和随机几何变换,在输入为图片的情况下,通过对不同图片的合成获得新的训练数据,依据该训练数据对图片作几何变换,以使得由于在多人运动的场景中,出现人物遮挡是常见的,通过预处理操作丰富了训练数据的多样性, 使模型更加鲁棒,能够有效降低对抗图像的影响。The preprocessing operations include mixing operations and random geometric transformations. When the input is a picture, new training data is obtained by synthesizing different pictures, and the picture is geometrically transformed according to the training data, so that the Among them, it is common for people to be occluded. The preprocessing operation enriches the diversity of training data, makes the model more robust, and can effectively reduce the impact of confronting images.
Further, optionally, the first preset dataset includes a first-type image dataset, and the first-type image dataset defines its own training set and validation set; the second preset dataset includes the data, from the second-type image dataset and the third-type image dataset, that is annotated with position-box information; the second-type image dataset defines its own training set and validation set, while the third-type image dataset is randomly divided into a training set and a validation set according to a preset ratio; the training sets of the second-type and third-type image datasets form the training set of the second preset dataset, and the validation sets of the second-type and third-type image datasets form the validation set of the second preset dataset; the third preset dataset includes the data, from the second-type and third-type image datasets, that is annotated with keypoint information. The preprocessing operations include: processing the data in the first preset dataset and the third preset dataset separately through random geometric transformations, and processing the data in the second preset dataset through a random mixing operation and/or random geometric transformations.
Specifically, the first preset dataset includes the first-type image dataset; in the embodiments of the present application the first-type image dataset may be illustrated by the ImageNet dataset. The second-type image dataset included in the second preset dataset may be illustrated by the data annotated with position-box information in the Microsoft COCO 2017 Keypoint Detection Dataset (hereinafter referred to as the COCO dataset), and the third-type image dataset included in the second preset dataset may be illustrated by the data annotated with position-box information in Kinetics-14. The data annotated with keypoint information in the second-type and third-type image datasets, which constitute the third preset dataset, may be illustrated by the data annotated with keypoint information in the COCO dataset and the data annotated with keypoint information in Kinetics-14.
The COCO dataset contains more than 200,000 images and a total of 250,000 instances annotated with two-dimensional keypoint information (in this dataset the persons in the pictures are mostly of medium and large scale); the publicly downloadable training and validation sets contain annotations for more than 150,000 persons and 1.7 million annotated keypoints in total. The annotation information is mainly recorded in corresponding .json files, which record the detailed information of each picture, including: the download URL of the picture, the picture name, the picture resolution, the acquisition time of the picture, the picture index (ID), the number of visible skeleton keypoints of each person in the picture (the complete COCO annotation has 17 skeleton keypoints; FIG. 8 is a schematic diagram of keypoint positions and skeleton connections in the data training method according to an embodiment of the present invention; as shown in FIG. 8, with subscripts starting from 0, they are: 0: nose, 1: left eye, 2: right eye, 3: left ear, 4: right ear, 5: left shoulder, 6: right shoulder, 7: left elbow, 8: right elbow, 9: left wrist, 10: right wrist, 11: left hip, 12: right hip, 13: left knee, 14: right knee, 15: left ankle, 16: right ankle, 17: midpoint of the line connecting the left and right shoulders; because some persons in a picture stand sideways or have body parts occluded, this field records only the number of visible skeleton keypoints), the coordinates of the skeleton keypoints (arranged in order; if a skeleton position has no visible keypoint, the corresponding position (x, y) is set to (0, 0)), the rectangular position-box coordinates of each person (upper-left corner and lower-right corner coordinates), the category name (the COCO dataset has about 80 categories, but only persons carry skeleton keypoint annotation information), image segmentation information, and so on.
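A minimal sketch of reading such annotations (assuming the standard COCO keypoint .json layout, in which each annotation stores the keypoints as a flat x, y, visibility list and the box as x, y, width, height; the file name is illustrative):

import json

with open("person_keypoints_val2017.json") as f:
    coco = json.load(f)

for ann in coco["annotations"][:5]:
    kpts = ann["keypoints"]                 # flat list: x1, y1, v1, x2, y2, v2, ...
    visible = sum(1 for v in kpts[2::3] if v > 0)   # number of visible skeleton keypoints
    x, y, w, h = ann["bbox"]                # convert to upper-left / lower-right corners
    print(ann["image_id"], visible, (x, y, x + w, y + h))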
It should be noted that the left part of FIG. 8 is a schematic diagram of the keypoint positions and skeleton connections of the COCO dataset, and the right part of FIG. 8 is a schematic diagram of the keypoint positions and skeleton connections obtained based on the COCO dataset in the data training method provided by the embodiments of the present application.
FIG. 9a and FIG. 9b are schematic diagrams of the effect before and after annotation of the keypoint positions and skeleton connections in the data training method according to an embodiment of the present invention. As shown in FIG. 9a and FIG. 9b, the annotation process is as follows: using an annotation tool, the 17 specific visible points are manually annotated on each picture; the left side is the original picture and the right side is the visualized effect after annotation.
Since existing human detection models and pose estimation models are mainly trained on images of natural scenes, their target detection and pose estimation results in sports scenes are poor. This is because the body postures of persons in sports scenes differ considerably from those in natural scenes, while in most open-source datasets there is relatively little person position annotation and pose estimation annotation data for various sports scenes, so existing target detection models and pose estimation models perform poorly on person detection and pose estimation in sports scenes.
To address this problem, the embodiments of the present application additionally collected 14 sports categories from the Kinetics-600 open-source dataset, including bench press, clean and jerk, rope climbing, deadlift, lunge, boxing, running, sit-ups, rope skipping, squats and leg stretching, for a total of more than 10,000 pictures of sports scenes, and annotated them with the open-source software Visipedia Annotation Toolkit (an image keypoint annotation tool) in the same annotation format as the COCO dataset; in the embodiments of the present application this collection is called Kinetics-14. Based on the target position-box label information and the target keypoint label information in Kinetics-14 (i.e., the third-type image dataset in the embodiments of the present application) and the COCO dataset (i.e., the second-type image dataset in the embodiments of the present application), the target detection model and the single-person pose estimation model are trained respectively, so as to improve the final overall architecture's localization of person positions and recognition of skeleton keypoints in similar scenes. The dataset composed of the target position-box label information in Kinetics-14 and the COCO dataset is the second preset dataset in the embodiments of the present application, and the dataset composed of the target keypoint label information in Kinetics-14 and the COCO dataset is the third preset dataset in the embodiments of the present application.
WeightNet is pre-trained with millions of ImageNet classification images, and the finally converged weights are used as the initialization weights of the feature extraction module in the person detection model, so as to improve the accuracy of the final person detection model and accelerate the convergence of model training.
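A minimal sketch of this initialization step (a PyTorch-style illustration; the class PersonDetector, the layer shapes and the checkpoint file name are assumptions, not the patented architecture):

import torch
import torch.nn as nn

class PersonDetector(nn.Module):           # skeletal stand-in for the person detection model
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(     # feature extraction module to be initialized
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(128, 4, 1)   # illustrative box-regression head

detector = PersonDetector()
state = torch.load("weightnet_imagenet_converged.pth")        # converged WeightNet weights (illustrative file name)
result = detector.backbone.load_state_dict(state, strict=False)
print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)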
Here, the first preset dataset requires random geometric transformation of the data when it is input into the above data model for training; the second preset dataset requires the data mixing operation and random geometric transformation when it is input into the above data model for training; and the third preset dataset requires random geometric transformation of the data when it is input into the above data model for training.
Optionally, the random geometric transformations include random cropping, random rotation by a preset angle and/or random scaling by a preset scaling ratio; the random mixing operation includes superimposing at least two pieces of data according to preset weights, specifically adding the products of the pixel values at preset positions in different pieces of data and the preset weights.
Specifically, the random mixing operation in the embodiments of the present application is denoted as the mix-up operation. FIG. 10 is a schematic diagram of the effect of mix-up in the data recognition method according to an embodiment of the present invention; as shown in FIG. 10, the mix-up operation flow is specifically as follows:
Two input images are merged into one new image according to certain weights, and the merged image is used as new input training data. Since the target detection model is very sensitive to geometric transformations of images, when the resolutions of the two images in the mix-up operation are inconsistent, geometry-preserving alignment is adopted to avoid image distortion; that is, the images are neither cropped nor scaled, and the pixel values at corresponding positions are directly multiplied by certain weights and then added, as expressed by formula (2). Because occlusion between persons is common in multi-person motion scenes, the mix-up operation, as a data augmentation method, enriches the diversity of the training data, makes the model more robust, and effectively reduces the influence of adversarial images. It should be noted that, after this scheme is applied, the image obtained by the mix-up operation further needs to be normalized so that the pixel values of each channel of the final image remain within the range of 0 to 255.
x̂ = α·x_i + β·x_j    (2)
where x_i and x_j denote two different images, x̂ denotes the image synthesized by the mix-up operation, and α and β denote the mix-up weights. The value ranges of α and β are not limited in the embodiments of the present application (for example, in a classification task preferably 0 < α + β < 1 and in a target detection task preferably α and β > 1; as another example, in a classification task 0.2 < α:β < 0.4 and in a target detection task 0.8 < α:β < 1.2); preferably, α = β = 1.5 is set in the embodiments of the present application.
It should be noted that the aforementioned mix-up operation flow may also be replaced by the following method:
Two input images are merged into one new image according to certain weights, and the merged image is used as new input training data. Since the target detection model is very sensitive to geometric transformations of images, when the resolutions of the two images in the mix-up operation are inconsistent, geometry-preserving alignment is adopted to avoid image distortion; that is, the images are neither cropped nor scaled, and the pixel values at corresponding positions are directly multiplied by certain weights and then added, as expressed by formula (3). Because occlusion between persons is common in multi-person motion scenes, the mix-up operation, as a data augmentation method, enriches the diversity of the training data, makes the model more robust, and effectively reduces the influence of adversarial images.
x̂ = λ·x_i + (1 − λ)·x_j    (3)
where x_i and x_j denote two different images, x̂ denotes the image synthesized by the mix-up operation, and λ denotes the mix-up weight; for each x̂, λ is obtained by random sampling from a Beta distribution, as expressed by formula (4). The value range of λ is not limited in the embodiments of the present application (in a classification task, 0 < λ < 1; in a target detection task, λ > 1); preferably, λ = 1.5 is set in the embodiments of the present application.
λ ~ Beta(α, α)    (4)
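The following is a minimal sketch of the mix-up operation described above (a sketch under stated assumptions, not the patented implementation; the Beta parameter value and the zero-padding used for geometry-preserving alignment are illustrative):

import numpy as np

def mix_up(img_i: np.ndarray, img_j: np.ndarray, alpha: float = 1.5) -> np.ndarray:
    lam = np.random.beta(alpha, alpha)                      # formula (4): lambda ~ Beta(alpha, alpha)
    h = max(img_i.shape[0], img_j.shape[0])                 # common canvas: no cropping or scaling
    w = max(img_i.shape[1], img_j.shape[1])
    canvas_i = np.zeros((h, w, 3), dtype=np.float32)
    canvas_j = np.zeros((h, w, 3), dtype=np.float32)
    canvas_i[:img_i.shape[0], :img_i.shape[1]] = img_i      # keep original geometry of each image
    canvas_j[:img_j.shape[0], :img_j.shape[1]] = img_j
    mixed = lam * canvas_i + (1.0 - lam) * canvas_j         # weighted pixel-wise sum, formula (3)
    return np.clip(mixed, 0, 255).astype(np.uint8)          # normalize back into the 0-255 range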
In addition, in the embodiments of the present application the random geometric transformations include random cropping (to 256×256; multiple crop sizes are possible, and considering the training hardware conditions the size is generally set to a power of 2, with the shortest side not smaller than 128 and the longest side not larger than 512), random rotation within the range of (-45°, 45°) (i.e., the preset rotation angle in the embodiments of the present application), random horizontal flipping with a probability of 50%, and random scaling within the range of (0.65, 1.35). Random cropping means randomly cropping the original picture to 256×256 (the crop size adopted in the embodiments of the present application) while keeping the number of channels unchanged; the random rotation operation means randomly rotating the image angle within plus or minus 45 degrees to change the orientation of the image content; the random flipping operation means horizontally flipping the image with a probability of 50%; the random scaling operation means enlarging or shrinking the image within a ratio of 0.65 to 1.35. When training the classification network and the pose estimation network, the random geometric transformations not only increase the amount of data but also serve as a method of weakening data noise and increasing model stability.
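A minimal sketch of such a random geometric transformation pipeline follows (the parameter defaults mirror the ranges above; the use of OpenCV, the padding strategy and the operation order are otherwise assumptions):

import cv2
import numpy as np

def random_geometric_transform(img: np.ndarray, crop: int = 256) -> np.ndarray:
    s = np.random.uniform(0.65, 1.35)                        # random scaling within (0.65, 1.35)
    img = cv2.resize(img, None, fx=s, fy=s)
    h, w = img.shape[:2]
    angle = np.random.uniform(-45, 45)                       # random rotation within (-45, 45) degrees
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, m, (w, h))
    if np.random.rand() < 0.5:                               # horizontal flip with probability 0.5
        img = img[:, ::-1]
    h, w = img.shape[:2]
    if h < crop or w < crop:                                 # pad first if the image is smaller than the crop
        img = cv2.copyMakeBorder(img, 0, max(0, crop - h), 0, max(0, crop - w),
                                 cv2.BORDER_CONSTANT, value=0)
        h, w = img.shape[:2]
    y = np.random.randint(0, h - crop + 1)                   # random 256 x 256 crop
    x = np.random.randint(0, w - crop + 1)
    return img[y:y + crop, x:x + crop]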
It should be noted that, in the embodiments of the present application, the random geometric transformations may include one of, or a combination of at least two of, random cropping, random rotation by a preset angle and/or random scaling by a preset scaling ratio, and the execution order is adjusted according to the actual needs of the pictures. For example, if the size of a picture already meets the data training requirement, random cropping or scaling is not needed; or, if the display angle of a picture already meets the data training requirement, random rotation is not needed. Similarly, the random geometric transformations are applied to a picture according to the actual requirements on that picture.
The preprocessing operations in the embodiments of the present application are performed during model training (in each round): part of the data originally used for training is preprocessed in the above-described manner, and the preprocessed data is then used for training together; the data selected in different rounds, and the data actually used for training after preprocessing, are different, so as to achieve the effect of gradual convergence.
Embodiment 4
According to one aspect of the embodiments of the present invention, a data recognition method is provided based on the method in the above Embodiment 3. FIG. 11 is a schematic flowchart of the data recognition method according to an embodiment of the present invention; as shown in FIG. 11, the method includes:
Step S1102: inputting feature data to be recognized into a weighted attention neural network model, and recognizing two-dimensional coordinates of keypoints of at least one target in the feature data to be recognized, where the weighted attention neural network model is used to perform pose estimation of at least one person in a top-down manner, detect a position rectangle of the at least one target in the feature data to be recognized, and detect the two-dimensional coordinates of the keypoints of the target within the position rectangle;
Step S1104: performing calculation using the two-dimensional coordinates of the keypoints of the target to obtain the included angle between the line of a first preset keypoint combination and the line of a second preset keypoint combination, or the included angle between the line of the first preset keypoint combination and a first preset line;
The first preset line may be a horizontal line, a vertical line or the like; the first preset keypoint combination contains two keypoints, and the second preset keypoint combination contains two keypoints.
Specifically, in the embodiments of the present application, the included angle between the line of the first preset keypoint combination and the line of the second preset keypoint combination, or the included angle between the line of the first preset keypoint combination and the first preset line, is specifically one of the following cases:
Case 1: the included angle between two specific lines among three specific keypoints;
Assuming there are three keypoints in the plane that are not on the same straight line, pairwise combinations are formed, namely the line segment connecting keypoint 1 and keypoint 2 and the line segment connecting keypoint 1 and keypoint 3, and the included angle is formed where they meet at keypoint 1.
Case 2: the included angle between the line connecting two specific keypoints and an environment line (for example a horizontal line or a vertical line, i.e., the first preset line in the embodiments of the present application);
Assuming the two obtained keypoints are the two keypoints located at the shoulders of the human target, line segments need to be drawn in order to form skeleton connections with the other keypoints of the human body; therefore, while ensuring that there are no redundant connections, an included angle is formed by connecting with the horizontal line or the vertical line.
Case 3: the included angle between the line connecting two specific keypoints and the line connecting two other keypoints;
Similar to Case 1, based on the obtained two-dimensional coordinates of the keypoints, two lines each connecting two keypoints are obtained respectively, and the included angle between the two lines is obtained.
Step S1106: matching, in a first preset database, the included angle between the line of the first preset keypoint combination and the line of the second preset keypoint combination, or the included angle between the line of the first preset keypoint combination and the first preset line, to obtain a recognition result of the target.
Specifically, with reference to steps S1102 to S1106, FIG. 12 is a schematic flowchart of the deep-learning-based posture risk assessment in the data recognition method according to an embodiment of the present invention. As shown in FIG. 12, in the embodiments of the present application the feature data may include pictures and/or videos; that is, the input form of the feature data may include: form 1: a picture; form 2: a video; form 3: a picture and a video.
The data recognition method provided by the embodiments of the present application further includes data sample collection and neural network learning before the feature data is input into the end-to-end model. As shown in FIG. 12, the posture risk assessment method provided by the embodiments of the present application is specifically as follows:
Step 1: data collection, performing sample collection according to the acquired datasets;
Step 2: based on the sample collection of Step 1, preprocessing the data in the datasets to obtain a training set and a test set respectively;
Step 3: inputting the feature data into the end-to-end model to obtain the two-dimensional coordinates of the keypoints of the target;
Step 4: according to the data type of the feature data, calculating angles from the two-dimensional coordinates of the keypoints of the target and generating a posture risk assessment result.
In the data recognition method provided by the embodiments of the present application, the image to be assessed is input into the end-to-end model, and the output is the two-dimensional coordinates of the human skeleton keypoints recognized by the model (i.e., the two-dimensional coordinates of the keypoints of the target in the embodiments of the present application). From these skeleton keypoint coordinates, the angle values of a specific number of joints are calculated, and each included angle is matched, through its angle value, with the included angles in the first preset database to obtain the position corresponding to each included angle, so as to generate keypoint combination connections like those in the right part of FIG. 8 and achieve the purpose of recognizing the target in the image. Further, a posture assessment result is obtained by matching the recognition result in a second preset database. In addition, the input may also be a sports video: the continuous variation curve information of each joint angle of each athlete in the video stream (frames) is obtained as described above and compared with a standard motion library, and targeted motion improvement guidance is then given.
In addition, when the input includes both pictures and videos, as shown in FIG. 12, the pictures and videos are processed separately in the multi-person pose estimation module. When a picture is the input, the two-dimensional coordinates of the human skeleton keypoints in the picture are obtained. When a video is the input, the continuous variation curve information of each joint angle of each athlete in every frame of the video is obtained, or frame images are extracted from the video at preset time intervals and the continuous variation curve information of each joint angle of each athlete is obtained from the extracted frame images; extracting frame images at preset time intervals reduces the image recognition load on the computer, reduces the amount of computation, and improves recognition efficiency. The posture risk assessment results for each person in the picture and each person in the video are then obtained from the two-dimensional coordinates of the human skeleton keypoints and from the continuous variation curve information, respectively.
In the embodiments of the present invention, a top-down multi-person pose estimation approach is adopted. The feature data to be recognized is input into the weighted attention neural network model to recognize the two-dimensional coordinates of the keypoints of at least one target in the feature data to be recognized, where the weighted attention neural network model is used to perform pose estimation of at least one person in a top-down manner, detect the position rectangle of at least one target in the feature data to be recognized, and detect the two-dimensional coordinates of the keypoints of the target within the position rectangle; calculation is performed using the two-dimensional coordinates of the keypoints of the target to obtain the included angle between the line of the first preset keypoint combination and the line of the second preset keypoint combination, or the included angle between the line of the first preset keypoint combination and the first preset line; and that included angle is matched in the first preset database to obtain the recognition result of the target. This achieves the purpose of improving the accuracy and efficiency of human posture recognition, thereby achieving the technical effect of providing assessment results based on the more accurately and efficiently recognized human posture, and solving the technical problem of low data processing efficiency in the human posture recognition process of the related art.
Optionally, in step S1106, matching, in the first preset database, the included angle between the line of the first preset keypoint combination and the line of the second preset keypoint combination, or the included angle between the line of the first preset keypoint combination and the first preset line, to obtain the recognition result of the target includes:
in the case where the feature data to be recognized includes picture data, matching the obtained angle value of at least one included angle with the angle value of the corresponding included-angle type in the first preset database to obtain a recognition result of the picture data.
Optionally, the included angles include: the angle between the line connecting the two eyes and a horizontal straight line, the angle between the shoulder line and a horizontal straight line, the angle between the hip line and a horizontal straight line, the angle between the head midline and a vertical straight line, the angle between the trunk midline and a vertical straight line, the joint angle between the upper arm and the forearm, the joint angle between the thigh and the calf, the angle between the ear-shoulder line and a vertical straight line, the joint angle between the trunk midline and the thigh midline, the joint angle between the upper arm and the forearm, and the joint angle between the thigh and the calf.
Specifically, FIG. 13a and FIG. 13b are schematic diagrams of a front view and a side view in the data recognition method according to an embodiment of the present invention. As shown in FIG. 13a and FIG. 13b, the 13 specific joint angles calculated by the angle calculation module are: the angle between the line connecting the two eyes and a horizontal straight line (front view / 1), the angle between the shoulder line and a horizontal straight line (front view / 2), the angle between the hip line and a horizontal straight line (front view / 3), the angle between the head midline and a vertical straight line (front view / 4), the angle between the trunk midline and a vertical straight line (front view / 5), the joint angles between the upper arm and the forearm (front view / left 6, right 7), the joint angles between the thigh and the calf (front view / left 8, right 9), the angle between the ear-shoulder line and a vertical straight line (side view / 10), the joint angle between the trunk midline and the thigh midline (side view / 11), the joint angle between the upper arm and the forearm (side view / 12) and the joint angle between the thigh and the calf (side view / 13). The specific calculation flow is: let A, B and C be three points on the two-dimensional plane (i.e., any three points on the two-dimensional plane where the feature data in the embodiments of the present application is located); to obtain the included angle between line AB and line AC, the slopes of lines AB and AC are first obtained and then converted into the corresponding angles, and the difference between the angles of the two lines is the required included angle; taking the direction of the angle into account, the clockwise angle is defined as positive.
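A minimal sketch of this angle calculation (assuming image coordinates with the y-axis pointing downward; the function name and the example coordinates are illustrative):

import math

def included_angle(a, b, c):
    """a, b, c are (x, y) points; returns the signed angle from line AB to line AC in degrees."""
    ang_ab = math.degrees(math.atan2(b[1] - a[1], b[0] - a[0]))   # direction angle of line AB
    ang_ac = math.degrees(math.atan2(c[1] - a[1], c[0] - a[0]))   # direction angle of line AC
    diff = ang_ab - ang_ac                                        # difference between the two line angles
    while diff <= -180:                                           # wrap into (-180, 180]
        diff += 360
    while diff > 180:
        diff -= 360
    return diff                                                   # positive = clockwise in image coordinates

# e.g. angle at the elbow: A = elbow, B = shoulder, C = wrist keypoint coordinates
print(included_angle((0, 0), (0, -100), (80, 0)))                 # -> -90.0 under these assumptions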
Optionally, in step S1106, matching, in the first preset database, the included angle between the line of the first preset keypoint combination and the line of the second preset keypoint combination, or the included angle between the line of the first preset keypoint combination and the first preset line, to obtain the recognition result of the target includes:
Step S11061: in the case where the feature data to be recognized includes video data, acquiring, for each frame or for specified frames, two-dimensional coordinate information of the keypoints of at least one target in each corresponding frame of the video data, where the specified frames are fixed-time-interval frames and/or key frames;
Each frame or specified frame is implemented as follows:
acquiring the two-dimensional coordinate information of the keypoints of at least one target in every frame of the video data; taking a 10-second real-time video as an example, within the 10 seconds the two-dimensional coordinate information of the keypoints of at least one target in the consecutive frames (every frame) of the video data is acquired;
acquiring the two-dimensional coordinate information of the keypoints of at least one target in specified frames of the video data; since consecutive frames often contain repeated pictures, in order to improve data processing efficiency, the two-dimensional coordinate information of the keypoints of at least one target is collected from frames at preset time intervals (fixed time intervals) or from key frames, which relieves the pressure of performing data processing on every frame.
The key frames can be obtained through the relevant function flags of the software; for example, a frame in which a person or an animal is detected is regarded as a key frame, and/or a frame in which a motion change of a preset amplitude occurs is determined as a key frame. Acquiring the two-dimensional coordinate information of the keypoints of at least one target in specified frames of the video data can be applied to uploaded video data whose shooting has been completed.
In addition, in a distributed system, acquiring the two-dimensional coordinate information of the keypoints of at least one target in specified frames of the video data can be implemented simultaneously on multiple computing devices with data processing capabilities, acquiring the keypoint two-dimensional coordinate information of at least one target from fixed-time-interval frames and key frames.
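A minimal sketch of fixed-time-interval frame extraction (OpenCV is assumed; the interval, the video file name and the estimate_keypoints call are illustrative placeholders for the multi-person pose estimation model):

import cv2

def sample_frames(video_path: str, interval_sec: float = 0.5):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0        # fall back if the FPS metadata is missing
    step = max(1, int(round(fps * interval_sec)))  # number of frames between sampled frames
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                        # keep only fixed-time-interval frames
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

# keypoints_per_frame = [estimate_keypoints(f) for f in sample_frames("workout.mp4")]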
Step S11062: obtaining, from the two-dimensional coordinate information of the keypoints of the at least one target in each corresponding frame of the video data, an angle-time variation curve of at least one specific included angle of the at least one target, and performing comparative analysis with the angle-time variation curve of at least one included angle of at least one standard motion to obtain the recognition result.
Further, optionally, obtaining the angle-time variation curve of at least one specific included angle of the at least one target from the two-dimensional coordinate information of the keypoints of the at least one target in each corresponding frame of the video data, and performing comparative analysis with the angle-time variation curve of at least one included angle of at least one standard motion to obtain the recognition result includes: comparing the similarity between the angle-time variation curve of at least one specific included angle of the at least one target and the angle-time variation curve of at least one included angle of at least one standard motion obtained in advance, and if the similarity falls within a first preset threshold interval, determining that the corresponding target in each corresponding frame of the video data is performing the corresponding standard motion type; in the case where it is determined that the corresponding target in each corresponding frame of the video data is performing the corresponding standard motion type, further comparing the angle-time variation curve of at least one specific included angle of the target with the angle-time variation curve of the corresponding specific included angle of the standard motion; if the difference between adjacent extreme values on the angle-time variation curve of at least one specific included angle of the target and the difference between adjacent extreme values on the angle-time variation curve of the corresponding specific included angle of the standard motion differ by an amount falling within a second preset threshold interval, determining that the joint action corresponding to the specific included angle of the target in the video data is standard, and otherwise that the joint action corresponding to the specific included angle of the target in each corresponding frame of the video data is not standard; and determining whether the difference between the distance between adjacent peaks on the angle-time variation curve of at least one specific included angle of the target and the distance between adjacent peaks on the angle-time variation curve of the corresponding specific included angle of the standard motion falls within a third preset threshold interval, a fourth preset threshold interval or a fifth preset threshold interval, thereby confirming that the motion intensity of the joint action corresponding to the specific included angle of the target in each corresponding frame of the video data is too low, appropriate or too high.
Specifically, for example, the variation curve of the arm of a person exercising in a video is acquired. Since the coordinates of the arm keypoints in the image change when the person lifts a barbell, the connection of each angle value as it changes over time gives an angle-time variation curve, and this curve is compared and analyzed with the angle-time variation curve of at least one included angle of at least one standard motion of the corresponding standard motion type to obtain the recognition result.
Taking the acquisition of a video of a person exercising as an example, suppose the person in the video is lifting a barbell; the angle-time variation curves of the included angles formed by the keypoints of the person's joints and the related connections are acquired as they change over time. The angle-time variation curve may be the curve obtained from the variation of each included angle in every frame of the video, or the curve obtained from the variation of each included angle in frame images extracted at preset time intervals;
the angle-time variation curves of at least one included angle of the various standard motions in the database are acquired;
by comparing the similarity between the angle-time variation curves, if the obtained similarity falls within the first preset threshold interval, it is determined that the motion performed by the person in the video data is barbell lifting;
further, the person's angle-time variation curve is compared with the standard angle-time variation curve of barbell lifting: the differences between adjacent extreme values are obtained on the two angle-time curves respectively, and whether the joint action corresponding to each specific angle of the person is standard is judged from those differences; further, the differences between adjacent peaks are obtained on the two angle-time curves respectively, and it is judged whether they fall within the third, fourth or fifth preset threshold interval, so as to determine whether the person's exercise intensity is too low, appropriate or too high.
In the embodiments of the present application, the first preset threshold interval is used to determine the motion type of the target in the video; the second preset threshold interval is used to determine whether the motion posture of the target in the video is standard; and the third, fourth or fifth preset threshold interval is used to determine the exercise intensity of the target in the video;
It should be noted that the third, fourth or fifth preset threshold interval may also be implemented by setting a single threshold interval, with the corresponding exercise intensity set through sub-intervals of that threshold interval.
It should be added that, in an alternative to this embodiment, action type recognition may be omitted and the motion type of the feature to be recognized (for example an input video or image) may be obtained directly (for example, the corresponding motion type is stated when the video or image is input); then the at least one angle-time variation curve obtained by recognizing the feature to be recognized is directly compared with the corresponding angle-time variation curve of the standard action of the input motion type, and the comparison method may be as described above.
In FIG. 10, the main input of the motion guidance module (dynamic assessment) is a motion video of one or more persons. The two-dimensional coordinate information of the keypoints of each human body in the motion video stream (frames) is obtained through the multi-person pose estimation model, and the two-dimensional coordinates of the video stream (frames) are passed through the angle calculation module to obtain the continuous variation curve values of each specific joint angle of each person in the video stream (frames) (each frame of the video (stream) can be regarded as one time point, and the connection of each angle value at each time point is the angle variation curve of (angle value y / frame x)). Comparison and analysis is then performed with the corresponding standard motion curve, where the standard motion curve is obtained by recognizing the keypoints and the variation values of each joint angle with the model of the present application, and motion correction guidance is given.
The specific implementation is as follows: for each person, each specific angle is recorded as a continuous angle variation curve as the video stream (frames) is input. In the first preset database, the angle-time variation curve of each specific joint angle of each type of standard action (including different stances and orientations of the same action) has already been calculated and stored. After the angle-time variation curve of each specific joint angle of each person in the video stream (frames) is obtained as described above, it is matched and compared with the angle-time variation curve of the corresponding standard action. The difference between adjacent extreme values of the angle-time variation curve (the lowest value and the highest value) can be used to judge whether the motion amplitude of the specific joint to be tested is standard: if the difference between the distance between adjacent maximum and minimum values of the angle-time variation curve of the joint of the person to be tested and the corresponding distance value at the relative position in the standard motion video is greater than the specified threshold (i.e., the second preset threshold interval in the embodiments of the present application), it can be concluded that the motion of this part is not standard. On the other hand, the distance between every two peaks of the angle variation curve (the distance between two adjacent maximum or minimum values) can be used to measure the intensity of the motion at the specific angle: if the difference between the distance between adjacent maximum values of the angle-time variation curve of the specified joint of the person to be tested and the corresponding distance value at the relative position in the standard motion video is greater than a specified threshold and falls within the interval where that threshold lies (i.e., the third preset threshold interval in the embodiments of the present application), it can be concluded that the motion intensity of this joint is too high; if it falls within the interval where another specified threshold lies, it can be concluded that the motion intensity is moderate (i.e., the fourth preset threshold interval in the embodiments of the present application); and if it is less than a specified threshold and the difference falls within the interval where that threshold lies, it can be concluded that the motion intensity is too low (i.e., the fifth preset threshold interval in the embodiments of the present application). A final assessment is obtained by combining the standardness values and intensity values of all joints.
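A minimal sketch of this comparison under simplifying assumptions (the tolerance values, the peak detection and the mapping from peak-to-peak distance to intensity are illustrative, not the patent's preset threshold intervals):

import numpy as np
from scipy.signal import find_peaks

def assess_joint(curve: np.ndarray, standard: np.ndarray,
                 amp_tol: float = 10.0, low: float = -5.0, high: float = 5.0):
    def amplitude(c):                         # difference between adjacent extreme values (max - min)
        return float(c.max() - c.min())
    def peak_gap(c):                          # mean distance between adjacent peaks, in frames
        peaks, _ = find_peaks(c)
        return float(np.mean(np.diff(peaks))) if len(peaks) > 1 else float("inf")

    standard_form = abs(amplitude(curve) - amplitude(standard)) <= amp_tol   # amplitude check: is the action standard?
    gap_diff = peak_gap(curve) - peak_gap(standard)                          # peak spacing check: intensity
    if gap_diff > high:                       # repetitions slower than the standard motion
        intensity = "too low"
    elif gap_diff < low:                      # repetitions faster than the standard motion
        intensity = "too high"
    else:
        intensity = "appropriate"
    return standard_form, intensity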
Optionally, the data recognition method provided by the embodiments of the present application further includes:
Step S1109: matching the recognition result in the second preset database to obtain a posture assessment result corresponding to the recognition result.
Specifically, the angle-value-to-posture knowledge base is the second preset database in the embodiments of the present application. In a specific embodiment, the posture assessment risk of each part is divided into three levels: low risk, potential risk and high risk. The specific matching process is as follows (a minimal code sketch of this matching is given after the list):
(1) head tilt risk assessment (0-4 degrees: low risk; 4-9 degrees: potential risk; above 9 degrees: high risk), mainly matched with angle 1;
(2) uneven shoulders risk assessment (0-2 degrees: low risk; 2-4 degrees: potential risk; above 4 degrees: high risk), mainly matched with angle 2;
(3) spinal misalignment risk assessment (0-2 degrees: low risk; 2-4 degrees: potential risk; above 4 degrees: high risk), mainly matched with angle 5;
(4) pelvic tilt risk assessment (0-2 degrees: low risk; 2-4 degrees: potential risk; above 4 degrees: high risk), mainly matched with angle 6;
(5) abnormal leg shape risk assessment (176-180 degrees: low risk; 173-176 degrees: potential risk; below 173 degrees: high risk), mainly matched with angles 8 and 9;
(6) forward head posture and rounded shoulders risk assessment (0-9 degrees: low risk; 9-14 degrees: potential risk; above 14 degrees: high risk), mainly matched with angle 10;
(7) knee hyperextension risk assessment (179-180 degrees: low risk; 177-179 degrees: potential risk; below 177 degrees: high risk), mainly matched with angle 13.
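A minimal sketch of this matching step (the table mirrors the deviation-type items (1), (2), (3), (4) and (6) above; items (5) and (7) would need the reversed, higher-is-better ranges; the function name and data structure are illustrative):

RISK_TABLE = {
    "head tilt":                      {"angle": 1,  "low": (0, 4), "potential": (4, 9)},
    "uneven shoulders":               {"angle": 2,  "low": (0, 2), "potential": (2, 4)},
    "spinal misalignment":            {"angle": 5,  "low": (0, 2), "potential": (2, 4)},
    "pelvic tilt":                    {"angle": 6,  "low": (0, 2), "potential": (2, 4)},
    "forward head / round shoulders": {"angle": 10, "low": (0, 9), "potential": (9, 14)},
}

def assess(posture: str, value: float) -> str:
    entry = RISK_TABLE[posture]
    lo, hi = entry["low"]
    if lo <= abs(value) < hi:
        return "low risk"
    lo, hi = entry["potential"]
    if lo <= abs(value) < hi:
        return "potential risk"
    return "high risk"                       # anything beyond the potential-risk range

print(assess("head tilt", 5.2))              # -> potential risk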
Based on the above matching process, FIG. 14 is a schematic diagram of the posture risk assessment results in the data recognition method according to an embodiment of the present invention. As shown in FIG. 14, the embodiments of the present application summarize 7 common unhealthy postures, namely head tilt, uneven shoulders, spinal misalignment, pelvic tilt, abnormal leg shape, forward head posture with rounded shoulders, and knee hyperextension.
Further, optionally, after the posture assessment result corresponding to the recognition result is obtained, the data recognition method provided by the embodiments of the present application further includes:
Step S1110: matching the posture assessment result in a third preset database to obtain suggestion information corresponding to the posture assessment result.
Specifically, in the embodiments of the present application the first preset database, the second preset database and the third preset database may be three independent databases, databases located on different servers, databases in three storage spaces on one server, or data modules in one database for storing different types of mapping relationships. Based on the assessment results obtained in step S1109, corresponding suggestion information is provided for each assessment result. The suggestion information includes, but is not limited to, possible hidden disease risks indicated by the corresponding posture, improvement suggestions and the like. For example, in the case where the assessment result includes that the target has a risk of forward head posture and rounded shoulders, the suggestion information corresponding to the assessment result may include: posturally, this will cause cervical vertebra displacement and protrusion; such postural changes will lead to dizziness, nervous headaches and head distension pain; it is suggested to avoid looking down at a mobile phone for long periods or facing a computer, television or books for long periods, and to take part in more physical exercise, especially ball sports;
or, in the case where the assessment result includes pelvic tilt, the suggestion information corresponding to the assessment result may include: posturally, this will cause legs of unequal length and lumbar disc herniation; such postural changes will cause the two legs to differ in length, and the weight borne by the two legs when standing will be unequal; if lumbar disc herniation occurs, it will cause uneven stress on the lumbar vertebrae, with a risk of being bedridden with paralysis; suggestions for unequal leg length: avoid crossing the legs, sitting supported on one leg, and bearing weight on one leg when standing; suggestions for lumbar disc herniation: avoid sitting for long periods, take part in more physical exercise, move the lumbar spine appropriately, and combine this with regular massage and bone-setting manipulation.
In addition, the data recognition method provided by the embodiments of the present application may also be applied to online shopping. Taking buying clothes online as an example, the user uploads a selfie photo or selfie video, recognition is performed through steps S1102 to S1106 to obtain a recognition result, the recognition result is compared with the model wearing product A stored on the server, and shopping suggestions are provided according to the comparison result. For example, the sizes of product A are: S, M, L, XL and XXL. If the recognition through steps S1102 to S1106 shows that the user's body shape is the same as the model's and the size of product A worn by the model is M, the user is advised to buy product A in size M; if the user is slimmer than the model, the user is advised to buy product A in size S; conversely, depending on how much larger the user is than the model, the user is advised to buy product A in size L, XL or XXL.
实施例五Example five
根据本发明实施例的一个方面,提供了一种数据识别装置,图15是根据本发明实施例的数据识别装置的示意图,如图15所示,包括:坐标识别模块1502,设置为将待识别的特征数据输入权重注意力神经网络模型,识别得到待识别的特征数据中至少一个目标的关键点二维坐标,其中,权重注意力神经网络模型设置为通过自顶向下的方式进行至少一人的姿态估计,检测待识别的特征数据中至少一个目标的位置矩形框,并检测位置矩形框内目标的关键点二维坐标;计算模块1504,设置为通过目标的关键点二维坐标进行计算,得到第一预设关键点组合的连线与第二预设关键点组合的连线之间的夹角或第一预设关键点组合的连线与第一预设线之间的夹角;匹配模块1506,设置为将第一预设关键点组合的连线与第二预设关键点组合的连线之间的夹角或第一预设关键点组合的连线与第一预设线之间的夹角在第一预设数据库中进行匹配,得出目标的识别结果。According to one aspect of the embodiment of the present invention, a data recognition device is provided. FIG. 15 is a schematic diagram of the data recognition device according to an embodiment of the present invention. As shown in FIG. Input the weighted attention neural network model of the feature data to identify the two-dimensional coordinates of the key points of at least one target in the feature data to be recognized. The weighted attention neural network model is set to perform top-down processing of at least one person Posture estimation, detecting the position rectangle of at least one target in the feature data to be recognized, and detecting the two-dimensional coordinates of the key points of the target in the position rectangle; the calculation module 1504 is set to calculate through the two-dimensional coordinates of the key points of the target to obtain The angle between the line of the first preset key point combination and the line of the second preset key point combination or the angle between the line of the first preset key point combination and the first preset line; The module 1506 is configured to set the angle between the line of the first preset key point combination and the line of the second preset key point combination or the angle between the line of the first preset key point combination and the first preset line The angle between the two is matched in the first preset database to obtain the recognition result of the target.
Optionally, the matching module 1506 includes: a first matching unit, configured to, in the case where the feature data to be recognized includes picture data, match the obtained angle value of at least one included angle against the angle values of the corresponding angle types in the first preset database to obtain a recognition result of the picture data.
Optionally, the matching module 1506 includes: an acquisition unit, configured to, in the case where the feature data to be recognized includes video data, acquire, for each frame or for specified frames, two-dimensional key-point coordinate information of at least one target in each corresponding frame of the video data, where the specified frames are fixed-time-interval frames and/or key frames; and a second matching unit, configured to obtain an angle-time curve of at least one specific included angle of at least one target according to the two-dimensional key-point coordinate information of the at least one target in each corresponding frame of the video data, and to perform a comparative analysis against the angle-time curve of at least one included angle of at least one standard motion to obtain the recognition result.
Further, optionally, the second matching unit includes: a first judging sub-unit, configured to compare the similarity between the angle-time curve of at least one specific included angle of the at least one target and a pre-obtained angle-time curve of at least one included angle of at least one standard motion, and, if the similarity falls within a first preset threshold interval, determine that the corresponding target in each corresponding frame of the video data is performing the corresponding standard motion type; a comparison sub-unit, configured to, when it is determined that the corresponding target in each corresponding frame of the video data is performing the corresponding standard motion type, further compare the angle-time curve of at least one specific included angle of that target with the angle-time curve of the corresponding specific included angle of the standard motion; a second judging sub-unit, configured to determine that the joint motion corresponding to the specific included angle of the target in each corresponding frame of the video data is standard if the difference between adjacent extrema on the angle-time curve of the at least one specific included angle of the target and the difference between adjacent extrema on the angle-time curve of the corresponding specific included angle of the standard motion fall within a second preset threshold interval, and otherwise determine that the joint motion corresponding to the specific included angle of the target in the video data is not standard; and a third judging sub-unit, configured to judge whether the difference between the distance between adjacent peaks on the angle-time curve of at least one specific included angle of the target and the distance between adjacent peaks on the angle-time curve of the corresponding specific included angle of the standard motion falls within a third preset threshold interval, a fourth preset threshold interval or a fifth preset threshold interval, thereby confirming that the intensity of the joint motion corresponding to the specific included angle of the target in each corresponding frame of the video data is too low, appropriate, or too high.
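A rough sketch of how such a comparison could look in code is given below. The use of normalized correlation as the similarity measure, the range-of-motion and peak-spacing proxies, and all threshold values are assumptions made here for illustration; they are not taken from the application:

```python
import numpy as np

def similarity(curve_a, curve_b):
    """Normalized correlation between two equal-length angle-time curves."""
    a = (curve_a - curve_a.mean()) / (curve_a.std() + 1e-8)
    b = (curve_b - curve_b.mean()) / (curve_b.std() + 1e-8)
    return float(np.mean(a * b))

def extrema_span(curve):
    """Difference between the largest and smallest values, a proxy for range of motion."""
    return float(np.max(curve) - np.min(curve))

def peak_spacing(curve):
    """Mean distance (in frames) between adjacent local maxima, a proxy for tempo."""
    peaks = [i for i in range(1, len(curve) - 1)
             if curve[i] > curve[i - 1] and curve[i] > curve[i + 1]]
    return float(np.mean(np.diff(peaks))) if len(peaks) >= 2 else float("inf")

def assess(target_curve, standard_curve,
           sim_low=0.8, span_tol=10.0, tempo_fast=0.8, tempo_slow=1.2):
    if similarity(target_curve, standard_curve) < sim_low:
        return "not the corresponding standard motion"
    verdict = []
    if abs(extrema_span(target_curve) - extrema_span(standard_curve)) <= span_tol:
        verdict.append("joint action within the standard range")
    else:
        verdict.append("joint action outside the standard range")
    ratio = peak_spacing(target_curve) / peak_spacing(standard_curve)
    if ratio < tempo_fast:
        verdict.append("intensity too high")
    elif ratio > tempo_slow:
        verdict.append("intensity too low")
    else:
        verdict.append("intensity appropriate")
    return "; ".join(verdict)

t = np.linspace(0, 4 * np.pi, 200)
target = 90 + 40 * np.sin(t)      # angle-time curve of the target
standard = 90 + 45 * np.sin(t)    # angle-time curve of the standard motion
print(assess(target, standard))
```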
Optionally, the data recognition device provided in the embodiments of the present application further includes: an evaluation module, configured to perform matching in a second preset database according to the recognition result to obtain a posture evaluation result corresponding to the recognition result.
Further, optionally, the data recognition device provided in the embodiments of the present application further includes: a suggestion module, configured to, after the posture evaluation result corresponding to the recognition result is obtained, perform matching in a third preset database according to the posture evaluation result to obtain suggestion information corresponding to the posture evaluation result.
Embodiment 6
According to one aspect of the embodiments of the present invention, a non-volatile storage medium is provided. The non-volatile storage medium includes a stored program, where, when the program runs, the device on which the non-volatile storage medium is located is controlled to execute the above method.
Embodiment 7
According to one aspect of the embodiments of the present invention, a data recognition device is provided, including: a non-volatile storage medium and a processor configured to run a program stored in the non-volatile storage medium, where the above method is executed when the program runs.
The sequence numbers of the above embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units may be a division of logical functions; in actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred implementations of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Industrial Applicability
The solutions provided in the embodiments of the present application can be applied to image recognition, for example to image recognition of human postures. Based on the solution provided in the embodiments of the present application, a top-down multi-person pose estimation approach is adopted, in which the feature data to be recognized is input into a weighted attention neural network model to recognize the two-dimensional coordinates of the key points of at least one target in the feature data to be recognized. Evaluation results can therefore be provided on the basis of a human posture obtained with improved accuracy and efficiency, which achieves the purpose of improving the accuracy and efficiency of human posture recognition and thereby solves the technical problem of low data processing efficiency in human posture recognition in the related art.

Claims (30)

  1. A data processing method, comprising:
    inputting first feature data having a first number of channels into a first-type convolutional layer having a second number of filters for computation, and outputting second feature data having the second number of channels, wherein the first number is greater than the second number;
    inputting the second feature data having the second number of channels into a second-type convolutional layer having the second number of filters, and generating, through a neural network, a mask for the weights of each filter in the second-type convolutional layer according to learnable mask parameters in the second-type convolutional layer;
    determining, according to the mask, a connection pattern between each filter in the second-type convolutional layer and each channel of the second feature data;
    performing convolution computation on the second feature data according to the mapping relationship obtained from the connection pattern to obtain third feature data;
    inputting the third feature data having the second number of channels into a third-type convolutional layer having the first number of filters for computation, and outputting fourth feature data having the first number of channels.
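For illustration only, the following PyTorch sketch shows one possible reading of this three-layer structure. The channel counts, the pooled-response input to the mask generator, and the sigmoid-plus-threshold conversion of the learnable mask parameters into a hard filter-to-channel connection pattern are all assumptions made here; this is not the claimed implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedBottleneck(nn.Module):
    def __init__(self, first_num=256, second_num=64, kernel_size=3):
        super().__init__()
        # first-type layer: first_num input channels, second_num filters
        self.reduce = nn.Conv2d(first_num, second_num, kernel_size=1)
        # second-type layer: second_num filters over second_num channels, with a learnable mask
        self.conv = nn.Conv2d(second_num, second_num, kernel_size, padding=kernel_size // 2)
        self.mask_fc = nn.Linear(second_num, second_num * second_num)  # mask generator
        # third-type layer: back to first_num channels
        self.expand = nn.Conv2d(second_num, first_num, kernel_size=1)

    def forward(self, x):
        second = self.reduce(x)                                  # second feature data
        # generate a (filters x channels) mask from learnable parameters via a small network
        pooled = second.mean(dim=(0, 2, 3))                      # summary of channel responses
        mask = torch.sigmoid(self.mask_fc(pooled)).view(
            self.conv.out_channels, self.conv.in_channels, 1, 1)
        masked_weight = self.conv.weight * (mask > 0.5).float()  # filter-to-channel connections
        third = F.conv2d(second, masked_weight, self.conv.bias,
                         padding=self.conv.padding)              # third feature data
        return self.expand(third)                                 # fourth feature data

x = torch.randn(1, 256, 32, 32)
print(MaskedBottleneck()(x).shape)   # torch.Size([1, 256, 32, 32])
```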
  2. The method according to claim 1, wherein the data processing method is applied to deep learning in artificial intelligence.
  3. The method according to claim 1, wherein the data processing method is applied to recognizing a posture or action of a target in a picture/video.
  4. The method according to claim 1, wherein generating, through a neural network, the mask for the weights of each filter in the second-type convolutional layer according to the learnable mask parameters in the second-type convolutional layer comprises: generating the mask for the weights of each filter in the second-type convolutional layer according to a fully connected layer in the second-type convolutional layer.
  5. A data training method, comprising:
    acquiring a weight classification model to be trained, wherein the weight classification model is a neural network model for acquiring image features of image data;
    training the weight classification model to be trained to obtain a converged weight classification model;
    wherein the method used in training the weight classification model to be trained comprises the data processing method according to claim 1.
  6. The method according to claim 5, wherein training the weight classification model to be trained to obtain the converged weight classification model comprises:
    inputting data in a first preset data set into the weight classification model to be trained to obtain a category prediction result;
    obtaining an error between the category prediction result and the label categories of the data in the first preset data set according to the category prediction result and the label categories of the data in the first preset data set;
    training the weight classification model to be trained with a back-propagation algorithm according to the error until the weight classification model to be trained converges, to obtain the converged weight classification model.
  7. The method according to claim 6, wherein training the weight classification model to be trained with the back-propagation algorithm according to the error until the weight classification model to be trained converges comprises:
    repeatedly iterating excitation propagation and weight updating until the weight classification model to be trained converges;
    wherein, in the case where the weight classification model to be trained includes a residual structure, a pooling structure and a fully connected structure, repeatedly iterating excitation propagation and weight updating until the weight classification model to be trained converges comprises:
    in the excitation propagation stage, passing an image through the convolutional layers of the weight classification model to be trained to acquire features, acquiring a category prediction result at the fully connected layer of the weight classification model to be trained, and then taking the difference between the category prediction result and the label categories of the data in the first preset data set to obtain the response errors of the hidden layers and the output layer;
    in the weight updating stage, multiplying the error by the derivative of the function of the current layer's response with respect to the previous layer's response to obtain the gradient of the weight matrix between the two layers, and adjusting the weight matrix along the opposite direction of the gradient at a set learning rate; determining the gradient matrix as the error of the previous layer, calculating the weight matrix of the previous layer, and updating the weight classification model to be trained through iterative computation until the weight classification model to be trained converges.
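As a purely numerical illustration of the excitation-propagation / weight-update cycle described above, the sketch below uses a toy two-layer network with sigmoid activations and squared error; these choices are assumptions made only to keep the example short and are not the claimed model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, target = rng.random((4, 1)), rng.random((2, 1))     # toy input and label
W1, W2 = rng.standard_normal((3, 4)), rng.standard_normal((2, 3))
lr = 0.1                                               # set learning rate

for _ in range(100):
    # excitation propagation: responses of the hidden layer and the output layer
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    err_out = y - target                               # response error of the output layer
    # weight update: error times derivative of this layer's response w.r.t. the previous response
    delta_out = err_out * y * (1 - y)
    grad_W2 = delta_out @ h.T                          # gradient of the weight matrix
    err_hidden = W2.T @ delta_out                      # error handed to the previous layer
    delta_hidden = err_hidden * h * (1 - h)
    grad_W1 = delta_hidden @ x.T
    W2 -= lr * grad_W2                                 # step against the gradient
    W1 -= lr * grad_W1

print(float(np.mean((sigmoid(W2 @ sigmoid(W1 @ x)) - target) ** 2)))
```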
  8. A data training method, comprising:
    initializing a feature extraction module in a target detection model with a converged weight classification model to obtain a target detection model to be trained, wherein the converged weight classification model is obtained by training with the method according to claim 5;
    training the target detection model to be trained with target position-box label information in a second preset data set to obtain a trained target detection model;
    training network parameters of a single-person pose estimation model to be trained according to target key-point label information in a third preset data set to obtain a trained single-person pose estimation model;
    obtaining a weighted attention neural network model according to the trained target detection model and the trained single-person pose estimation model.
  9. The method according to claim 8, wherein training the target detection model to be trained with the target position-box label information in the second preset data set to obtain the trained target detection model comprises:
    in the case where the target detection model includes a feature extraction module, a proposal-box generation module, and a target classifier and position-box regression prediction module,
    training the feature extraction module and the proposal-box generation module respectively to obtain first parameter values of the feature extraction module and first parameter values of the proposal-box generation module;
    training the target classifier and position-box regression prediction module according to the first parameter values of the feature extraction module and the first parameter values of the proposal-box generation module to obtain first parameter values of the target classifier and position-box regression prediction module and second parameter values of the feature extraction module;
    training the proposal-box generation module according to the first parameter values of the target classifier and position-box regression prediction module and the second parameter values of the feature extraction module to obtain second parameter values of the proposal-box generation module;
    training the target classifier and position-box regression prediction module according to the second parameter values of the proposal-box generation module and the second parameter values of the feature extraction module to obtain second parameter values of the target classifier and position-box regression prediction module.
  10. The method according to claim 9, wherein the feature extraction module is configured to extract features of each piece of data in the second preset data set; the proposal-box generation module is configured to generate candidate target boxes for each piece of data according to the features of each piece of data in the second preset data set; and the target classifier and position-box regression prediction module is configured to obtain, according to the features of each piece of data in the second preset data set and the candidate target boxes of each piece of data, the detection boxes of the targets of each piece of data in the second preset data set and the categories of the corresponding detection boxes;
    in the case where the proposal-box generation module includes a sliding-window convolutional layer followed by two parallel convolutional layers, the two parallel convolutional layers being a regression layer and a classification layer respectively, the proposal-box generation module being configured to generate the candidate target boxes for each piece of data according to the features of each piece of data in the second preset data set comprises:
    obtaining, through the regression layer and according to the features of each piece of data in the second preset data set, the coordinates of the central anchor point of each candidate target box of each piece of data in the second preset data set and the width and height of the corresponding candidate target box;
    determining, through the classification layer, whether each candidate target box of each piece of data is foreground or background.
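A minimal sketch of such a proposal-box head, with one sliding-window convolution followed by parallel regression and classification layers; the channel width and the number of anchors per location are assumptions made here for illustration, not values from the claims:

```python
import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    """Sliding-window conv followed by two parallel convs: box regression and fg/bg scores."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.sliding = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.regression = nn.Conv2d(512, num_anchors * 4, kernel_size=1)      # centre x, y and w, h
        self.classification = nn.Conv2d(512, num_anchors * 2, kernel_size=1)  # foreground / background

    def forward(self, features):
        shared = torch.relu(self.sliding(features))
        return self.regression(shared), self.classification(shared)

boxes, scores = ProposalHead()(torch.randn(1, 512, 38, 50))
print(boxes.shape, scores.shape)   # torch.Size([1, 36, 38, 50]) torch.Size([1, 18, 38, 50])
```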
  11. The method according to claim 10, wherein, in the case where the structure of the target classifier and position-box regression prediction module is one pooling layer, three fully connected layers and two parallel fully connected layers connected in sequence, the target classifier and position-box regression prediction module being configured to obtain, according to the features of each piece of data in the second preset data set and the candidate target boxes of each piece of data, the detection boxes of each target of each piece of data in the second preset data set and the categories of the corresponding detection boxes comprises:
    converting, through the pooling layer, the features of each piece of data of different lengths output by the feature extraction module into features of each piece of data of a fixed length;
    outputting, according to the fixed-length features of each piece of data, after passing through the three fully connected layers and then through the two parallel fully connected layers, the detection boxes of each target of each piece of data in the second preset data set and the categories of the corresponding detection boxes.
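A simplified stand-in for the pooling step that converts variable-sized candidate regions into fixed-length features: adaptive max pooling is used here in place of a full ROI pooling operator, and the box coordinates are hypothetical.

```python
import torch
import torch.nn.functional as F

def roi_to_fixed_length(feature_map, box, output_size=(7, 7)):
    """Pool the region inside `box` (x1, y1, x2, y2 in feature-map coordinates) to a
    fixed-size grid, so the following fully connected layers always see the same length."""
    x1, y1, x2, y2 = [int(v) for v in box]
    region = feature_map[:, :, y1:y2 + 1, x1:x2 + 1]
    return F.adaptive_max_pool2d(region, output_size)

fm = torch.randn(1, 256, 38, 50)
for box in [(4, 3, 20, 30), (10, 5, 45, 36)]:      # candidate target boxes of different sizes
    pooled = roi_to_fixed_length(fm, box)
    print(pooled.flatten(1).shape)                  # always torch.Size([1, 12544])
```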
  12. The method according to claim 8, wherein training the network parameters of the single-person pose estimation model to be trained according to the target key-point label information in the third preset data set to obtain the trained single-person pose estimation model comprises:
    training the network parameters of the single-person pose estimation model to be trained according to the target key-point label information in the third preset data set, and iteratively updating the network parameters of the single-person pose estimation model to be trained through forward-propagation and back-propagation algorithms;
    wherein training the network parameters of the single-person pose estimation model to be trained according to the target key-point label information in the third preset data set, and iteratively updating the network parameters of the single-person pose estimation model to be trained through forward-propagation and back-propagation algorithms comprises:
    expanding the height or width of an input single-person image according to a preset aspect ratio, and cropping the single-person image to a preset size.
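A small sketch of the aspect-ratio expansion followed by bringing the single-person crop to a preset input size. Zero padding, the 3:4 width-to-height ratio, and a nearest-neighbour resize standing in for the cropping step are assumptions made only for illustration:

```python
import numpy as np

def expand_and_resize(image, target_ratio=0.75, preset_size=(256, 192)):
    """Pad the shorter side so the image reaches a preset width/height ratio,
    then bring it to the preset (height, width) input size."""
    h, w = image.shape[:2]
    if w / h < target_ratio:                       # too narrow: expand the width
        pad = int(round(h * target_ratio)) - w
        image = np.pad(image, ((0, 0), (pad // 2, pad - pad // 2), (0, 0)))
    else:                                          # too flat: expand the height
        pad = int(round(w / target_ratio)) - h
        image = np.pad(image, ((pad // 2, pad - pad // 2), (0, 0), (0, 0)))
    # nearest-neighbour resize, kept dependency-free
    ys = np.linspace(0, image.shape[0] - 1, preset_size[0]).astype(int)
    xs = np.linspace(0, image.shape[1] - 1, preset_size[1]).astype(int)
    return image[ys][:, xs]

print(expand_and_resize(np.zeros((300, 100, 3))).shape)   # (256, 192, 3)
```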
  13. The method according to claim 8, wherein the method used in training the network parameters of the single-person pose estimation model to be trained comprises the data processing method according to claim 1.
  14. The method according to claim 8, wherein the method further comprises:
    collecting samples required for training the target detection model to be trained and the single-person pose estimation model to be trained;
    preprocessing the samples, wherein the preprocessing includes division of data sets and preprocessing operations;
    wherein training the weight classification model to be trained to obtain the converged weight classification model comprises:
    inputting data in a first preset data set into the weight classification model to be trained to obtain a category prediction result;
    obtaining an error between the category prediction result and the label categories of the data in the first preset data set according to the category prediction result and the label categories of the data in the first preset data set;
    training the weight classification model to be trained with a back-propagation algorithm according to the error until the weight classification model to be trained converges, to obtain the converged weight classification model.
  15. The method according to claim 14, wherein the first preset data set includes a first type of image data set, the first type of image data set having a self-defined training set and validation set; the second preset data set includes the data sets annotated with position-box information in a second type of image data set and a third type of image data set; the second type of image data set has a self-defined training set and validation set; the third type of image data set is randomly divided into a training set and a validation set according to a preset ratio; the training set of the second type of image data set and the training set of the third type of image data set form the training set of the second preset data set, and the validation set of the second type of image data set and the validation set of the third type of image data set form the validation set of the second preset data set; and the third preset data set includes the data sets annotated with key-point information in the second type of image data set and the third type of image data set;
    the preprocessing operations include: processing the data in the first preset data set and the third preset data set respectively through random geometric transformation, and processing the data in the second preset data set through a random mixing operation and/or random geometric transformation.
  16. The method according to claim 15, wherein the random geometric transformation includes random cropping, random rotation by a preset angle and/or random scaling by a preset scaling ratio; the random mixing operation includes superimposing at least two pieces of data according to preset weights, specifically, adding the products of the pixel values at preset positions in different pieces of data and the preset weights.
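A sketch of the two preprocessing operations named in this claim. The mixing-weight range and the restriction of rotation to multiples of 90 degrees are choices made here only to keep the example dependency-free; they are not values from the application:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_mix(img_a, img_b, weight=None):
    """Random mixing: overlay two samples by adding pixel values times the preset weights."""
    w = rng.uniform(0.3, 0.7) if weight is None else weight
    return w * img_a + (1.0 - w) * img_b

def random_geometric(img, scale_range=(0.8, 1.2)):
    """Random geometric transformation: rotation (multiples of 90 degrees here)
    followed by random scaling via nearest-neighbour index sampling."""
    img = np.rot90(img, k=int(rng.integers(0, 4)))
    s = rng.uniform(*scale_range)
    h, w = img.shape[:2]
    ys = (np.arange(int(h * s)) / s).astype(int).clip(0, h - 1)
    xs = (np.arange(int(w * s)) / s).astype(int).clip(0, w - 1)
    return img[ys][:, xs]

a, b = rng.random((64, 64, 3)), rng.random((64, 64, 3))
print(random_mix(a, b).shape, random_geometric(a).shape)
```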
  17. A data recognition method based on the method according to any one of claims 8 to 16, comprising:
    inputting feature data to be recognized into a weighted attention neural network model and recognizing two-dimensional coordinates of key points of at least one target in the feature data to be recognized, wherein the weighted attention neural network model is configured to perform pose estimation of at least one person in a top-down manner, detect a position rectangle of at least one target in the feature data to be recognized, and detect the two-dimensional coordinates of the key points of the target within the position rectangle;
    performing computation with the two-dimensional coordinates of the key points of the target to obtain an angle between the line of a first preset key-point combination and the line of a second preset key-point combination, or an angle between the line of the first preset key-point combination and a first preset line;
    matching, in a first preset database, the angle between the line of the first preset key-point combination and the line of the second preset key-point combination, or the angle between the line of the first preset key-point combination and the first preset line, to obtain a recognition result of the target.
  18. The method according to claim 17, wherein matching, in the first preset database, the angle between the line of the first preset key-point combination and the line of the second preset key-point combination, or the angle between the line of the first preset key-point combination and the first preset line, to obtain the recognition result of the target comprises:
    in the case where the feature data to be recognized includes picture data, matching the obtained angle value of at least one included angle against the angle values of the corresponding angle types in the first preset database to obtain a recognition result of the picture data.
  19. The method according to claim 17, wherein matching, in the first preset database, the angle between the line of the first preset key-point combination and the line of the second preset key-point combination, or the angle between the line of the first preset key-point combination and the first preset line, to obtain the recognition result of the target comprises:
    in the case where the feature data to be recognized includes video data, acquiring, for each frame or for specified frames, two-dimensional key-point coordinate information of at least one target in each corresponding frame of the video data, wherein the specified frames are fixed-time-interval frames and/or key frames;
    obtaining an angle-time curve of at least one specific included angle of the at least one target according to the two-dimensional key-point coordinate information of the at least one target in each corresponding frame of the video data, and performing a comparative analysis against the angle-time curve of at least one included angle of at least one standard motion to obtain the recognition result.
  20. The method according to claim 19, wherein obtaining the angle-time curve of the at least one specific included angle of the at least one target according to the two-dimensional key-point coordinate information of the at least one target in each corresponding frame of the video data, and performing the comparative analysis against the angle-time curve of the at least one included angle of the at least one standard motion to obtain the recognition result comprises:
    comparing the similarity between the angle-time curve of the at least one specific included angle of the at least one target and a pre-obtained angle-time curve of at least one included angle of at least one standard motion, and, if the similarity falls within a first preset threshold interval, determining that the corresponding target in each corresponding frame of the video data is performing the corresponding standard motion type;
    in the case where it is determined that the corresponding target in each corresponding frame of the video data is performing the corresponding standard motion type, further comparing the angle-time curve of at least one specific included angle of that target with the angle-time curve of the corresponding specific included angle of the standard motion;
    if the difference between adjacent extrema on the angle-time curve of the at least one specific included angle of the target and the difference between adjacent extrema on the angle-time curve of the corresponding specific included angle of the standard motion fall within a second preset threshold interval, determining that the joint motion corresponding to the specific included angle of the target in each corresponding frame of the video data is standard; otherwise, the joint motion corresponding to the specific included angle of the target in the video data is not standard;
    judging whether the difference between the distance between adjacent peaks on the angle-time curve of the at least one specific included angle of the target and the distance between adjacent peaks on the angle-time curve of the corresponding specific included angle of the standard motion falls within a third preset threshold interval, a fourth preset threshold interval or a fifth preset threshold interval, thereby confirming that the intensity of the joint motion corresponding to the specific included angle of the target in each corresponding frame of the video data is too low, appropriate, or too high.
  21. The method according to claim 17, wherein the method further comprises:
    performing matching in a second preset database according to the recognition result to obtain a posture evaluation result corresponding to the recognition result.
  22. The method according to claim 21, wherein, after the posture evaluation result corresponding to the recognition result is obtained, the method further comprises:
    performing matching in a third preset database according to the posture evaluation result to obtain suggestion information corresponding to the posture evaluation result.
  23. A data recognition device, comprising:
    a coordinate recognition module, configured to input feature data to be recognized into a weighted attention neural network model and recognize two-dimensional coordinates of key points of at least one target in the feature data to be recognized, wherein the weighted attention neural network model is configured to perform pose estimation of at least one person in a top-down manner, detect a position rectangle of at least one target in the feature data to be recognized, and detect the two-dimensional coordinates of the key points of the target within the position rectangle;
    a calculation module, configured to perform computation with the two-dimensional coordinates of the key points of the target to obtain an angle between the line of a first preset key-point combination and the line of a second preset key-point combination, or an angle between the line of the first preset key-point combination and a first preset line;
    a matching module, configured to match, in a first preset database, the angle between the line of the first preset key-point combination and the line of the second preset key-point combination, or the angle between the line of the first preset key-point combination and the first preset line, to obtain a recognition result of the target.
  24. The device according to claim 23, wherein the matching module comprises:
    a first matching unit, configured to, in the case where the feature data to be recognized includes picture data, match the obtained angle value of at least one included angle against the angle values of the corresponding angle types in the first preset database to obtain a recognition result of the picture data.
  25. The device according to claim 23, wherein the matching module comprises:
    an acquisition unit, configured to, in the case where the feature data to be recognized includes video data, acquire, for each frame or for specified frames, two-dimensional key-point coordinate information of at least one target in each corresponding frame of the video data, wherein the specified frames are fixed-time-interval frames and/or key frames;
    a second matching unit, configured to obtain an angle-time curve of at least one specific included angle of at least one target according to the two-dimensional key-point coordinate information of the at least one target in each corresponding frame of the video data, and to perform a comparative analysis against the angle-time curve of at least one included angle of at least one standard motion to obtain a recognition result.
  26. The device according to claim 25, wherein the second matching unit comprises:
    a first judging sub-unit, configured to compare the similarity between the angle-time curve of the at least one specific included angle of the at least one target and a pre-obtained angle-time curve of at least one included angle of at least one standard motion, and, if the similarity falls within a first preset threshold interval, determine that the corresponding target in each corresponding frame of the video data is performing the corresponding standard motion type;
    a comparison sub-unit, configured to, in the case where it is determined that the corresponding target in each corresponding frame of the video data is performing the corresponding standard motion type, further compare the angle-time curve of at least one specific included angle of the target with the angle-time curve of the corresponding specific included angle of the standard motion;
    a second judging sub-unit, configured to determine that the joint motion corresponding to the specific included angle of the target in each corresponding frame of the video data is standard if the difference between adjacent extrema on the angle-time curve of the at least one specific included angle of the target and the difference between adjacent extrema on the angle-time curve of the corresponding specific included angle of the standard motion fall within a second preset threshold interval, and otherwise determine that the joint motion corresponding to the specific included angle of the target in the video data is not standard;
    a third judging sub-unit, configured to judge whether the difference between the distance between adjacent peaks on the angle-time curve of the at least one specific included angle of the target and the distance between adjacent peaks on the angle-time curve of the corresponding specific included angle of the standard motion falls within a third preset threshold interval, a fourth preset threshold interval or a fifth preset threshold interval, thereby confirming that the intensity of the joint motion corresponding to the specific included angle of the target in each corresponding frame of the video data is too low, appropriate, or too high.
  27. The device according to claim 23, wherein the device further comprises:
    an evaluation module, configured to perform matching in a second preset database according to the recognition result to obtain a posture evaluation result corresponding to the recognition result.
  28. The device according to claim 27, wherein the device further comprises:
    a suggestion module, configured to, after the posture evaluation result corresponding to the recognition result is obtained, perform matching in a third preset database according to the posture evaluation result to obtain suggestion information corresponding to the posture evaluation result.
  29. A non-volatile storage medium, the non-volatile storage medium comprising a stored program, wherein, when the program runs, a device on which the non-volatile storage medium is located is controlled to execute the method according to any one of claims 1 to 22.
  30. A data recognition device, comprising: a non-volatile storage medium and a processor configured to run a program stored in the non-volatile storage medium, wherein the program, when running, executes the method according to any one of claims 1 to 22.
PCT/CN2020/117226 2019-09-29 2020-09-23 Data processing method, data training method, data identifying method and device, and storage medium WO2021057810A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910935970.3A CN111881705B (en) 2019-09-29 2019-09-29 Data processing, training and identifying method, device and storage medium
CN201910935970.3 2019-09-29

Publications (1)

Publication Number Publication Date
WO2021057810A1 (en)

Family

ID=73153962

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117226 WO2021057810A1 (en) 2019-09-29 2020-09-23 Data processing method, data training method, data identifying method and device, and storage medium

Country Status (2)

Country Link
CN (1) CN111881705B (en)
WO (1) WO2021057810A1 (en)

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221824A (en) * 2021-05-31 2021-08-06 之江实验室 Human body posture recognition method based on individual model generation
CN113268626A (en) * 2021-05-26 2021-08-17 中国人民武装警察部队特种警察学院 Data processing method and device, electronic equipment and storage medium
CN113283343A (en) * 2021-05-26 2021-08-20 上海商汤智能科技有限公司 Crowd positioning method and device, electronic equipment and storage medium
CN113297913A (en) * 2021-04-26 2021-08-24 云南电网有限责任公司信息中心 Method for identifying dressing specification of distribution network field operating personnel
CN113312969A (en) * 2021-04-23 2021-08-27 浙江省机电设计研究院有限公司 Part identification and positioning method and system based on three-dimensional vision
CN113313129A (en) * 2021-06-22 2021-08-27 中国平安财产保险股份有限公司 Method, device and equipment for training disaster recognition model and storage medium
CN113326779A (en) * 2021-05-31 2021-08-31 中煤科工集团沈阳研究院有限公司 Underground roadway accumulated water detection and identification method
CN113436259A (en) * 2021-06-23 2021-09-24 国网智能科技股份有限公司 Deep learning-based real-time positioning method and system for substation equipment
CN113435267A (en) * 2021-06-09 2021-09-24 江苏第二师范学院 Online education student concentration discrimination method based on improved convolutional neural network
CN113448955A (en) * 2021-08-30 2021-09-28 上海观安信息技术股份有限公司 Data set quality evaluation method and device, computer equipment and storage medium
CN113537325A (en) * 2021-07-05 2021-10-22 北京航空航天大学 Deep learning method for image classification based on logic of extracting high-low-level features
CN113592941A (en) * 2021-08-02 2021-11-02 北京中交兴路信息科技有限公司 Certificate image verification method and device, storage medium and terminal
CN113610215A (en) * 2021-07-09 2021-11-05 北京达佳互联信息技术有限公司 Task processing network generation method, task processing device, electronic equipment and storage medium
CN113657185A (en) * 2021-07-26 2021-11-16 广东科学技术职业学院 Intelligent auxiliary method, device and medium for piano practice
CN113723187A (en) * 2021-07-27 2021-11-30 武汉光庭信息技术股份有限公司 Semi-automatic labeling method and system for gesture key points
CN113808744A (en) * 2021-09-22 2021-12-17 河北工程大学 Diabetes risk prediction method, device, equipment and storage medium
CN113837894A (en) * 2021-08-06 2021-12-24 国网江苏省电力有限公司南京供电分公司 Non-invasive resident user load decomposition method based on residual convolution module
CN113869353A (en) * 2021-08-16 2021-12-31 深延科技(北京)有限公司 Model training method, tiger key point detection method and related device
CN114067359A (en) * 2021-11-03 2022-02-18 天津理工大学 Pedestrian detection method integrating human body key points and attention features of visible parts
CN114118127A (en) * 2021-10-15 2022-03-01 北京工业大学 Visual scene mark detection and identification method and device
CN114283495A (en) * 2021-12-16 2022-04-05 北京航空航天大学 Human body posture estimation method based on binarization neural network
CN114332547A (en) * 2022-03-17 2022-04-12 浙江太美医疗科技股份有限公司 Medical object classification method and apparatus, electronic device, and storage medium
CN114470719A (en) * 2022-03-22 2022-05-13 北京蓝田医疗设备有限公司 Full-automatic posture correction training method and system
CN114595748A (en) * 2022-02-21 2022-06-07 南昌大学 Data segmentation method for fall protection system
CN114723517A (en) * 2022-03-18 2022-07-08 唯品会(广州)软件有限公司 Virtual fitting method, device and storage medium
CN114783065A (en) * 2022-05-12 2022-07-22 大连大学 Parkinson's disease early warning method based on human body posture estimation
CN114818991A (en) * 2022-06-22 2022-07-29 西南石油大学 Running behavior identification method based on convolutional neural network and acceleration sensor
CN115019338A (en) * 2022-04-27 2022-09-06 淮阴工学院 Multi-person posture estimation method and system based on GAMIHR-Net
CN115308247A (en) * 2022-10-11 2022-11-08 江苏昭华精密铸造科技有限公司 Method for detecting deslagging quality of aluminum oxide powder
CN115702993A (en) * 2021-08-12 2023-02-17 荣耀终端有限公司 Rope skipping state detection method and electronic equipment
CN115879586A (en) * 2022-01-11 2023-03-31 北京中关村科金技术有限公司 Complaint prediction optimization method and device based on ablation experiment and storage medium
CN116112932A (en) * 2023-02-20 2023-05-12 南京航空航天大学 Data knowledge dual-drive radio frequency fingerprint identification method and system
CN116106307A (en) * 2023-03-31 2023-05-12 深圳上善智能有限公司 Image recognition-based detection result evaluation method of intelligent cash dispenser
CN116309591A (en) * 2023-05-19 2023-06-23 杭州健培科技有限公司 Medical image 3D key point detection method, model training method and device
CN116665309A (en) * 2023-07-26 2023-08-29 山东睿芯半导体科技有限公司 Method, device, chip and terminal for identifying walking gesture features
CN116678506A (en) * 2023-08-02 2023-09-01 国检测试控股集团南京国材检测有限公司 Wireless transmission heat loss detection device
CN117036661A (en) * 2023-08-06 2023-11-10 苏州三垣航天科技有限公司 On-line real-time performance evaluation method for spatial target gesture recognition neural network
CN117037272A (en) * 2023-08-08 2023-11-10 深圳市震有智联科技有限公司 Method and system for monitoring fall of old people
CN117456612A (en) * 2023-12-26 2024-01-26 西安龙南铭科技有限公司 Cloud computing-based body posture automatic assessment method and system
CN117457193A (en) * 2023-12-22 2024-01-26 之江实验室 Physical health monitoring method and system based on human body key point detection
CN117573655A (en) * 2024-01-15 2024-02-20 中国标准化研究院 Data management optimization method and system based on convolutional neural network
CN117675112A (en) * 2024-02-01 2024-03-08 阳光凯讯(北京)科技股份有限公司 Communication signal processing method, system, equipment and medium based on machine learning
CN117726977A (en) * 2024-02-07 2024-03-19 南京百伦斯智能科技有限公司 Experimental operation key node scoring method and system based on DCNN

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307991A (en) * 2020-11-04 2021-02-02 北京临近空间飞行器系统工程研究所 Image recognition method, device and storage medium
CN112101314B (en) * 2020-11-17 2021-03-09 北京健康有益科技有限公司 Human body posture recognition method and device based on mobile terminal
CN112487964B (en) * 2020-11-27 2023-08-01 深圳市维海德技术股份有限公司 Gesture detection and recognition method, gesture detection and recognition equipment and computer-readable storage medium
CN112989312B (en) * 2020-11-30 2024-04-30 北京金堤科技有限公司 Verification code identification method and device, electronic equipment and storage medium
CN112613490B (en) * 2021-01-08 2022-02-01 云从科技集团股份有限公司 Behavior recognition method and device, machine readable medium and equipment
CN112731161B (en) * 2021-02-08 2021-10-26 中南大学 Nonlinear data feature extraction and classification prediction method based on small amount of data mixed insertion
CN113420604B (en) * 2021-05-28 2023-04-18 沈春华 Multi-person posture estimation method and device and electronic equipment
CN114004709B (en) * 2021-11-11 2024-04-30 重庆邮电大学 Information propagation monitoring method and device and computer readable storage medium
CN114494192B (en) * 2022-01-26 2023-04-25 西南交通大学 Thoracolumbar fracture identification segmentation and detection positioning method based on deep learning
CN114783000B (en) * 2022-06-15 2022-10-18 成都东方天呈智能科技有限公司 Method and device for detecting dressing standard of worker in bright kitchen range scene
CN116563848B (en) * 2023-07-12 2023-11-10 北京大学 Abnormal cell identification method, device, equipment and storage medium
CN117270818B (en) * 2023-10-11 2024-04-09 北京航空航天大学 Method and system for identifying and generating software demand class diagram information in MOM standard

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203623A (en) * 2014-11-27 2016-12-07 三星电子株式会社 The method of method and apparatus and dimensionality reduction for extending neutral net
CN107909145A (en) * 2017-12-05 2018-04-13 苏州天瞳威视电子科技有限公司 A kind of training method of convolutional neural networks model
WO2018217828A1 (en) * 2017-05-23 2018-11-29 Intel Corporation Methods and apparatus for discriminative semantic transfer and physics-inspired optimization of features in deep learning
CN109117897A (en) * 2018-08-09 2019-01-01 百度在线网络技术(北京)有限公司 Image processing method, device and readable storage medium storing program for executing based on convolutional neural networks
CN109214346A (en) * 2018-09-18 2019-01-15 中山大学 Picture human motion recognition method based on hierarchical information transmitting
CN109918975A (en) * 2017-12-13 2019-06-21 腾讯科技(深圳)有限公司 A kind of processing method of augmented reality, the method for Object identifying and terminal

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6763148B1 (en) * 2000-11-13 2004-07-13 Visual Key, Inc. Image recognition methods
US9805305B2 (en) * 2015-08-07 2017-10-31 Yahoo Holdings, Inc. Boosted deep convolutional neural networks (CNNs)
US10831444B2 (en) * 2016-04-04 2020-11-10 Technion Research & Development Foundation Limited Quantized neural network training and inference
KR101688458B1 (en) * 2016-04-27 2016-12-23 디아이티 주식회사 Image inspection apparatus for manufactured articles using deep neural network training method and image inspection method of manufactured articles thereby
US20170344876A1 (en) * 2016-05-31 2017-11-30 Samsung Electronics Co., Ltd. Efficient sparse parallel winograd-based convolution scheme
US20180247227A1 (en) * 2017-02-24 2018-08-30 Xtract Technologies Inc. Machine learning systems and methods for data augmentation
WO2018226014A1 (en) * 2017-06-07 2018-12-13 삼성전자주식회사 Electronic device and method for controlling same
CN109801225B (en) * 2018-12-06 2022-12-27 重庆邮电大学 Human face reticulate pattern stain removing method based on multitask full convolution neural network
CN109816636B (en) * 2018-12-28 2020-11-27 汕头大学 Crack detection method based on intelligent terminal
CN110188635B (en) * 2019-05-16 2021-04-30 南开大学 Plant disease and insect pest identification method based on attention mechanism and multi-level convolution characteristics

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203623A (en) * 2014-11-27 2016-12-07 三星电子株式会社 The method of method and apparatus and dimensionality reduction for extending neutral net
WO2018217828A1 (en) * 2017-05-23 2018-11-29 Intel Corporation Methods and apparatus for discriminative semantic transfer and physics-inspired optimization of features in deep learning
CN107909145A (en) * 2017-12-05 2018-04-13 苏州天瞳威视电子科技有限公司 Training method for a convolutional neural network model
CN109918975A (en) * 2017-12-13 2019-06-21 腾讯科技(深圳)有限公司 Augmented reality processing method, object recognition method, and terminal
CN109117897A (en) * 2018-08-09 2019-01-01 百度在线网络技术(北京)有限公司 Convolutional neural network-based image processing method, device, and readable storage medium
CN109214346A (en) * 2018-09-18 2019-01-15 中山大学 Human motion recognition method for images based on hierarchical information transfer

Cited By (75)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312969A (en) * 2021-04-23 2021-08-27 浙江省机电设计研究院有限公司 Part identification and positioning method and system based on three-dimensional vision
CN113297913A (en) * 2021-04-26 2021-08-24 云南电网有限责任公司信息中心 Method for identifying dressing specification of distribution network field operating personnel
CN113297913B (en) * 2021-04-26 2023-05-26 云南电网有限责任公司信息中心 Identification method for dressing specification of distribution network field operators
CN113268626B (en) * 2021-05-26 2024-04-26 中国人民武装警察部队特种警察学院 Data processing method, device, electronic equipment and storage medium
CN113268626A (en) * 2021-05-26 2021-08-17 中国人民武装警察部队特种警察学院 Data processing method and device, electronic equipment and storage medium
CN113283343A (en) * 2021-05-26 2021-08-20 上海商汤智能科技有限公司 Crowd positioning method and device, electronic equipment and storage medium
WO2022247091A1 (en) * 2021-05-26 2022-12-01 上海商汤智能科技有限公司 Crowd positioning method and apparatus, electronic device, and storage medium
CN113326779A (en) * 2021-05-31 2021-08-31 中煤科工集团沈阳研究院有限公司 Underground roadway accumulated water detection and identification method
CN113221824A (en) * 2021-05-31 2021-08-06 之江实验室 Human body posture recognition method based on individual model generation
CN113326779B (en) * 2021-05-31 2024-03-22 中煤科工集团沈阳研究院有限公司 Underground roadway ponding detection and identification method
CN113435267A (en) * 2021-06-09 2021-09-24 江苏第二师范学院 Online education student concentration discrimination method based on improved convolutional neural network
CN113435267B (en) * 2021-06-09 2023-06-23 江苏第二师范学院 Online education student concentration discriminating method based on improved convolutional neural network
CN113313129A (en) * 2021-06-22 2021-08-27 中国平安财产保险股份有限公司 Method, device and equipment for training disaster recognition model and storage medium
CN113313129B (en) * 2021-06-22 2024-04-05 中国平安财产保险股份有限公司 Training method, device, equipment and storage medium for disaster damage recognition model
CN113436259A (en) * 2021-06-23 2021-09-24 国网智能科技股份有限公司 Deep learning-based real-time positioning method and system for substation equipment
CN113537325A (en) * 2021-07-05 2021-10-22 北京航空航天大学 Deep learning method for image classification based on logic of extracting high-low-level features
CN113537325B (en) * 2021-07-05 2023-07-11 北京航空航天大学 Deep learning method for image classification based on extracted high-low layer feature logic
CN113610215A (en) * 2021-07-09 2021-11-05 北京达佳互联信息技术有限公司 Task processing network generation method, task processing device, electronic equipment and storage medium
CN113657185A (en) * 2021-07-26 2021-11-16 广东科学技术职业学院 Intelligent auxiliary method, device and medium for piano practice
CN113723187A (en) * 2021-07-27 2021-11-30 武汉光庭信息技术股份有限公司 Semi-automatic labeling method and system for gesture key points
CN113592941A (en) * 2021-08-02 2021-11-02 北京中交兴路信息科技有限公司 Certificate image verification method and device, storage medium and terminal
CN113592941B (en) * 2021-08-02 2023-09-12 北京中交兴路信息科技有限公司 Certificate image verification method and device, storage medium and terminal
CN113837894B (en) * 2021-08-06 2023-12-19 国网江苏省电力有限公司南京供电分公司 Non-invasive resident user load decomposition method based on residual convolution module
CN113837894A (en) * 2021-08-06 2021-12-24 国网江苏省电力有限公司南京供电分公司 Non-invasive resident user load decomposition method based on residual convolution module
CN115702993B (en) * 2021-08-12 2023-10-31 荣耀终端有限公司 Rope skipping state detection method and electronic equipment
CN115702993A (en) * 2021-08-12 2023-02-17 荣耀终端有限公司 Rope skipping state detection method and electronic equipment
CN113869353A (en) * 2021-08-16 2021-12-31 深延科技(北京)有限公司 Model training method, tiger key point detection method and related device
CN113448955A (en) * 2021-08-30 2021-09-28 上海观安信息技术股份有限公司 Data set quality evaluation method and device, computer equipment and storage medium
CN113808744A (en) * 2021-09-22 2021-12-17 河北工程大学 Diabetes risk prediction method, device, equipment and storage medium
CN114118127B (en) * 2021-10-15 2024-05-21 北京工业大学 Visual scene sign detection and recognition method and device
CN114118127A (en) * 2021-10-15 2022-03-01 北京工业大学 Visual scene mark detection and identification method and device
CN114067359B (en) * 2021-11-03 2024-05-07 天津理工大学 Pedestrian detection method integrating human body key points and visible part attention characteristics
CN114067359A (en) * 2021-11-03 2022-02-18 天津理工大学 Pedestrian detection method integrating human body key points and attention features of visible parts
CN114283495B (en) * 2021-12-16 2024-05-28 北京航空航天大学 Human body posture estimation method based on binarization neural network
CN114283495A (en) * 2021-12-16 2022-04-05 北京航空航天大学 Human body posture estimation method based on binarization neural network
CN115879586B (en) * 2022-01-11 2024-01-02 北京中关村科金技术有限公司 Complaint prediction optimization method and device based on ablation experiment and storage medium
CN115879586A (en) * 2022-01-11 2023-03-31 北京中关村科金技术有限公司 Complaint prediction optimization method and device based on ablation experiment and storage medium
CN114595748B (en) * 2022-02-21 2024-02-13 南昌大学 Data segmentation method for fall protection system
CN114595748A (en) * 2022-02-21 2022-06-07 南昌大学 Data segmentation method for fall protection system
CN114332547A (en) * 2022-03-17 2022-04-12 浙江太美医疗科技股份有限公司 Medical object classification method and apparatus, electronic device, and storage medium
CN114723517A (en) * 2022-03-18 2022-07-08 唯品会(广州)软件有限公司 Virtual fitting method, device and storage medium
CN114470719A (en) * 2022-03-22 2022-05-13 北京蓝田医疗设备有限公司 Full-automatic posture correction training method and system
CN114470719B (en) * 2022-03-22 2022-12-20 北京蓝田医疗设备有限公司 Full-automatic posture correction training method and system
CN115019338B (en) * 2022-04-27 2023-09-22 淮阴工学院 Multi-person gesture estimation method and system based on GAMHR-Net
CN115019338A (en) * 2022-04-27 2022-09-06 淮阴工学院 Multi-person posture estimation method and system based on GAMIHR-Net
CN114783065A (en) * 2022-05-12 2022-07-22 大连大学 Parkinson's disease early warning method based on human body posture estimation
CN114783065B (en) * 2022-05-12 2024-03-29 大连大学 Parkinsonism early warning method based on human body posture estimation
CN114818991A (en) * 2022-06-22 2022-07-29 西南石油大学 Running behavior identification method based on convolutional neural network and acceleration sensor
CN114818991B (en) * 2022-06-22 2022-09-27 西南石油大学 Running behavior identification method based on convolutional neural network and acceleration sensor
CN115308247B (en) * 2022-10-11 2022-12-16 江苏昭华精密铸造科技有限公司 Method for detecting deslagging quality of aluminum oxide powder
CN115308247A (en) * 2022-10-11 2022-11-08 江苏昭华精密铸造科技有限公司 Method for detecting deslagging quality of aluminum oxide powder
CN116112932B (en) * 2023-02-20 2023-11-10 南京航空航天大学 Data knowledge dual-drive radio frequency fingerprint identification method and system
CN116112932A (en) * 2023-02-20 2023-05-12 南京航空航天大学 Data knowledge dual-drive radio frequency fingerprint identification method and system
CN116106307A (en) * 2023-03-31 2023-05-12 深圳上善智能有限公司 Image recognition-based detection result evaluation method of intelligent cash dispenser
CN116106307B (en) * 2023-03-31 2023-06-30 深圳上善智能有限公司 Image recognition-based detection result evaluation method of intelligent cash dispenser
CN116309591A (en) * 2023-05-19 2023-06-23 杭州健培科技有限公司 Medical image 3D key point detection method, model training method and device
CN116309591B (en) * 2023-05-19 2023-08-25 杭州健培科技有限公司 Medical image 3D key point detection method, model training method and device
CN116665309A (en) * 2023-07-26 2023-08-29 山东睿芯半导体科技有限公司 Method, device, chip and terminal for identifying walking gesture features
CN116665309B (en) * 2023-07-26 2023-11-14 山东睿芯半导体科技有限公司 Method, device, chip and terminal for identifying walking gesture features
CN116678506A (en) * 2023-08-02 2023-09-01 国检测试控股集团南京国材检测有限公司 Wireless transmission heat loss detection device
CN116678506B (en) * 2023-08-02 2023-10-10 国检测试控股集团南京国材检测有限公司 Wireless transmission heat loss detection device
CN117036661A (en) * 2023-08-06 2023-11-10 苏州三垣航天科技有限公司 On-line real-time performance evaluation method for spatial target gesture recognition neural network
CN117036661B (en) * 2023-08-06 2024-04-12 苏州三垣航天科技有限公司 On-line real-time performance evaluation method for spatial target gesture recognition neural network
CN117037272B (en) * 2023-08-08 2024-03-19 深圳市震有智联科技有限公司 Method and system for monitoring fall of old people
CN117037272A (en) * 2023-08-08 2023-11-10 深圳市震有智联科技有限公司 Method and system for monitoring fall of old people
CN117457193A (en) * 2023-12-22 2024-01-26 之江实验室 Physical health monitoring method and system based on human body key point detection
CN117457193B (en) * 2023-12-22 2024-04-02 之江实验室 Physical health monitoring method and system based on human body key point detection
CN117456612B (en) * 2023-12-26 2024-03-12 西安龙南铭科技有限公司 Cloud computing-based body posture automatic assessment method and system
CN117456612A (en) * 2023-12-26 2024-01-26 西安龙南铭科技有限公司 Cloud computing-based body posture automatic assessment method and system
CN117573655B (en) * 2024-01-15 2024-03-12 中国标准化研究院 Data management optimization method and system based on convolutional neural network
CN117573655A (en) * 2024-01-15 2024-02-20 中国标准化研究院 Data management optimization method and system based on convolutional neural network
CN117675112B (en) * 2024-02-01 2024-05-03 阳光凯讯(北京)科技股份有限公司 Communication signal processing method, system, equipment and medium based on machine learning
CN117675112A (en) * 2024-02-01 2024-03-08 阳光凯讯(北京)科技股份有限公司 Communication signal processing method, system, equipment and medium based on machine learning
CN117726977A (en) * 2024-02-07 2024-03-19 南京百伦斯智能科技有限公司 Experimental operation key node scoring method and system based on DCNN
CN117726977B (en) * 2024-02-07 2024-04-12 南京百伦斯智能科技有限公司 Experimental operation key node scoring method and system based on DCNN

Also Published As

Publication number Publication date
CN111881705B (en) 2023-12-12
CN111881705A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
WO2021057810A1 (en) Data processing method, data training method, data identifying method and device, and storage medium
CN110472604B (en) Pedestrian and crowd behavior identification method based on video
CN103127691B (en) Video-generating device and method
JP7160932B2 (en) Generating prescriptive analytics using motion identification and motion information
CN110544301A (en) Three-dimensional human body action reconstruction system, method and action training system
CN110298279A (en) Limb rehabilitation training assistance method and system, medium, and device
CN109214366A (en) Local target re-identification method, apparatus, and system
CN110210284A (en) Intelligent evaluation method for human posture and behavior
Chiu et al. Emotion recognition through gait on mobile devices
Li et al. Abnormal sitting posture recognition based on multi-scale spatiotemporal features of skeleton graph
CN110443150A (en) Fall detection method, device, and storage medium
CN114998983A (en) Limb rehabilitation method based on augmented reality technology and posture recognition technology
Yadav et al. YogNet: A two-stream network for realtime multiperson yoga action recognition and posture correction
Sheu et al. Improvement of human pose estimation and processing with the intensive feature consistency network
Nasir et al. ENGA: Elastic net-based genetic algorithm for human action recognition
CN114565976A (en) Training intelligent test method and device
CN112149602A (en) Action counting method and device, electronic equipment and storage medium
CN116543455A (en) Method, device, and medium for establishing and using a Parkinsonian gait impairment assessment model
CN115006822A (en) Intelligent fitness mirror control system
Ascenso Development of a non-invasive motion capture system for swimming biomechanics
Cubero et al. Multimodal Human Pose feature fusion for Gait recognition
Pinčić Gait recognition using a self-supervised self-attention deep learning model
CN115205983B (en) Cross-perspective gait recognition method, system and equipment based on multi-feature aggregation
KR102670939B1 (en) Sports posture evaluation system and method using sensor data
Chamola et al. Advancements in Yoga Pose Estimation Using Artificial Intelligence: A Survey

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20867912

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20867912

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27.09.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20867912

Country of ref document: EP

Kind code of ref document: A1