WO2021217937A1

WO2021217937A1 - Posture recognition model training method and device, and posture recognition method and device

Info

Publication number: WO2021217937A1
Application number: PCT/CN2020/105898
Authority: WO
Inventors: 姜沛; 曹锋铭
Original assignee: 平安国际智慧城市科技股份有限公司
Priority date: 2020-04-27
Filing date: 2020-07-30
Publication date: 2021-11-04
Also published as: CN111539349A

Abstract

A posture recognition model training method, relating to artificial intelligence, and comprising: obtaining a human body sample image and a corresponding human body sample posture, and respectively inputting the human body sample image into a trained first posture recognition model and a second posture recognition model, wherein the first number of layers of an hourglass network corresponding to the first posture recognition model is greater than the second number of layers of an hourglass network corresponding to the second posture recognition model; training the second posture recognition model according to the output of the first posture recognition model and the output of the second posture recognition model; and when the number of times of training reaches a preset threshold, completing training of the second posture recognition model. In addition, the present invention also relates to blockchain technology, and the human body sample image and the corresponding human body sample posture are stored in the blockchain.

Description

Training method and equipment for posture recognition model, posture recognition method and equipment

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on April 27, 2020, with application number CN202010343546.2 and the invention title "Training Method and Device for Gesture Recognition Model, and Method and Device for Gesture Recognition", the entire content of which is incorporated into this application by reference.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a method and equipment for training a posture recognition model, and a method and equipment for posture recognition.

Background technique

With the continuous development of computer vision technology, multi-person posture recognition technology is increasingly present in people's lives. For example, in elderly care institutions or home care scenarios, multi-person posture recognition technology can recognize dangerous actions of the elderly and issue alerts, and can evaluate the mobility of the elderly so that they can be better cared for.

Multi-person posture recognition technology is measured by two indicators: recognition accuracy and recognition speed. The inventor realized that, in related technologies, recognition accuracy is improved by continuously increasing the structural complexity of the posture recognition model, but this consumes a large amount of system resources and makes the technology relatively expensive to implement. Conversely, simplifying the model to improve recognition speed and reduce cost lowers the recognition accuracy. Therefore, there is an urgent need for a posture recognition model whose recognition accuracy and recognition speed both meet application requirements.

Summary of the invention

A method for training a posture recognition model, including: acquiring a human body sample image and a corresponding human body sample posture; inputting the human body sample image into a first posture recognition model and a second posture recognition model respectively, wherein the first posture recognition model includes a first stacked hourglass network, the first stacked hourglass network includes a first number of layers of hourglass networks, the second posture recognition model includes a second stacked hourglass network, the second stacked hourglass network includes a second number of layers of hourglass networks, and the first number of layers is greater than the second number of layers; training the first posture recognition model according to the output of the first posture recognition model, and training the second posture recognition model according to the output of the first posture recognition model and the output of the second posture recognition model; and when the number of training times reaches a preset threshold, completing the training of the first posture recognition model and the second posture recognition model.

A posture recognition method, including: acquiring a current frame of human body image to be recognized and a human body posture corresponding to the previous frame of human body image, wherein the human body posture includes the positions of human skeleton points; inputting the current frame of human body image into the trained second posture recognition model, wherein the second posture recognition model is generated through the aforementioned training method of the posture recognition model; generating, according to the positions of the human skeleton points corresponding to the previous frame of human body image, the predicted positions of the human skeleton points corresponding to the current frame of human body image; and generating the human body posture corresponding to the current frame of human body image according to the output of the second posture recognition model and the predicted positions of the human skeleton points corresponding to the current frame of human body image.

A computer device includes a memory and a processor, the memory is used to store information including program instructions, the processor is used to control the execution of the program instructions, and the following steps are implemented when the program instructions are loaded and executed by the processor:

Acquiring a human body sample image and a corresponding human body sample posture; inputting the human body sample image into a first posture recognition model and a second posture recognition model respectively, wherein the first posture recognition model includes a first stacked hourglass network, the first stacked hourglass network includes a first number of layers of hourglass networks, the second posture recognition model includes a second stacked hourglass network, the second stacked hourglass network includes a second number of layers of hourglass networks, and the first number of layers is greater than the second number of layers; training the first posture recognition model according to the output of the first posture recognition model, and training the second posture recognition model according to the output of the first posture recognition model and the output of the second posture recognition model; and when the number of training times reaches a preset threshold, completing the training of the first posture recognition model and the second posture recognition model.

Acquiring the current frame of human body image to be recognized and the human body posture corresponding to the previous frame of human body image, wherein the human body posture includes the positions of human skeleton points; inputting the current frame of human body image into the trained second posture recognition model, wherein the second posture recognition model is generated through the aforementioned training method of the posture recognition model; generating, according to the positions of the human skeleton points corresponding to the previous frame of human body image, the predicted positions of the human skeleton points corresponding to the current frame of human body image; and generating the human body posture corresponding to the current frame of human body image according to the output of the second posture recognition model and the predicted positions of the human skeleton points corresponding to the current frame of human body image.

A computer-readable storage medium, the storage medium including a stored program, wherein, when the program is running, the device where the storage medium is located is controlled to perform the following steps:

Description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the following will briefly introduce the drawings needed in the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, without creative labor, other drawings can be obtained from these drawings.

FIG. 1 is a schematic flowchart of a method for training a gesture recognition model provided by an embodiment of the application;

Figure 2 is a schematic diagram of the position distribution of human bone points;

Figure 3 is a schematic diagram of the structure of the hourglass network;

Figure 4 is a schematic diagram of the structure of a stacked hourglass network;

FIG. 5 is a schematic flowchart of a gesture recognition method proposed in an embodiment of this application;

FIG. 6 is a schematic structural diagram of a training device for a gesture recognition model proposed in an embodiment of this application;

FIG. 7 is a schematic structural diagram of a gesture recognition device proposed in an embodiment of this application; and

FIG. 8 is a schematic diagram of a computer device provided by an embodiment of this application.

Detailed description

In order to better understand the technical solutions of the present application, the following describes the embodiments of the present application in detail with reference to the accompanying drawings.

It should be clear that the described embodiments are only a part of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

The terms used in the embodiments of this application are only for the purpose of describing specific embodiments, and are not intended to limit the application. The singular forms "a", "said", and "the" used in the embodiments of the present application and the appended claims are also intended to include plural forms, unless the context clearly indicates otherwise.

It should be understood that the term "and/or" used herein only describes an association relationship between associated objects, indicating that three relationships are possible. For example, A and/or B can mean three cases: A exists alone, A and B exist at the same time, or B exists alone. In addition, the character "/" in this text generally indicates that the associated objects before and after it are in an "or" relationship.

It should be understood that, although the terms first, second, third, etc. may be used in the embodiments of the present application to describe the preset range, etc., these preset ranges should not be limited to these terms. These terms are only used to distinguish the preset ranges from each other. For example, without departing from the scope of the embodiments of the present application, the first preset range may also be referred to as the second preset range, and similarly, the second preset range may also be referred to as the first preset range.

Depending on the context, the word "if" as used herein can be interpreted as "when", "at the time of", "in response to determination", or "in response to detection". Similarly, depending on the context, the phrase "if determined" or "if detected (statement or event)" can be interpreted as "when determined", "in response to determination", "when detected (statement or event)", or "in response to detection (statement or event)".

Based on the foregoing description of the prior art, multi-person posture recognition technology is measured by two indicators: recognition accuracy and recognition speed. In implementation, it consists of two steps: the first step detects the human body targets, and the second step detects the human body posture for each human body target. Of these, the detection of the human body posture takes up approximately five-sixths of the time. Therefore, improving the recognition speed of multi-person posture recognition technology mainly means simplifying the posture recognition model while ensuring that both the recognition accuracy and the recognition speed meet application requirements.

Based on this, the embodiment of the present application provides a method for training a posture recognition model, which uses the output of the first posture recognition model with a larger number of layers to help train the second posture recognition model with a smaller number of layers, so that the accuracy of the trained second posture recognition model is close to that of the first posture recognition model, while its amount of data processing is much smaller than that of the first posture recognition model.
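To illustrate why the posture-detection step is the natural target for optimization, the following minimal sketch computes the overall speedup when only that step is accelerated. The five-sixths share comes from the text above; the 2x speedup of the posture step and the function name `overall_speedup` are illustrative assumptions.

```python
from fractions import Fraction


def overall_speedup(pose_fraction, pose_speedup):
    """Overall pipeline speedup when only the posture step becomes faster.

    pose_fraction: share of total time spent on posture detection.
    pose_speedup:  factor by which the posture step alone is accelerated.
    """
    pose_fraction = Fraction(pose_fraction)
    pose_speedup = Fraction(pose_speedup)
    # Time left after the change, as a fraction of the original time.
    remaining = (1 - pose_fraction) + pose_fraction / pose_speedup
    return 1 / remaining


# Posture detection takes about 5/6 of the time; assume it becomes 2x faster.
two_x = overall_speedup(Fraction(5, 6), 2)
print(two_x, float(two_x))
```

Under these assumptions the whole pipeline becomes 12/7 (about 1.7) times faster, whereas accelerating the human body target detection step by the same factor would give only 12/11.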

FIG. 1 is a schematic flowchart of a method for training a gesture recognition model provided by an embodiment of the application. As shown in Figure 1, the method includes the following steps:

Step S101: Obtain an image of a human body sample and a corresponding posture of the human body sample.

Among them, the human body sample image is an image in which the human body posture has been determined, and the correct recognition result is the corresponding human body sample posture. Therefore, it can be used to train the gesture recognition model. It should be emphasized that, in order to further ensure the privacy and security of the human body sample image and the corresponding human body sample posture, the human body sample image and the corresponding human body sample posture may also be stored in a node of a blockchain.

Specifically, the posture of the human body includes the positions of the human bone points, and FIG. 2 is a schematic diagram of the position distribution of the human bone points. As shown in Figure 2, the various parts of the human body can be determined by the human bone points. Specifically, each human bone point is numbered, and the relative positions between different human bone points are determined according to the coordinates of each human bone point in the image; different relative positions correspond to different human postures.
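As a minimal sketch of this representation, a posture can be held as a mapping from bone point number to image coordinates, from which relative positions are derived. The point numbers, coordinate values, and the helper name `relative_position` are hypothetical and are not taken from Figure 2.

```python
# Hypothetical posture: bone point number -> (x, y) image coordinates.
pose = {
    0: (120, 40),   # assumed numbering, for illustration only
    1: (118, 80),
    2: (90, 85),
}


def relative_position(pose, a, b):
    """Offset (dx, dy) from bone point a to bone point b."""
    ax, ay = pose[a]
    bx, by = pose[b]
    return (bx - ax, by - ay)


offset = relative_position(pose, 0, 1)
print(offset)
```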

Step S102: Input the human body sample image into the trained first posture recognition model and the second posture recognition model respectively.

Wherein, the first posture recognition model includes a first stacked hourglass network, the first stacked hourglass network includes a first number of layers of hourglass networks, the second posture recognition model includes a second stacked hourglass network, the second stacked hourglass network includes a second number of layers of hourglass networks, and the first number of layers is greater than the second number of layers.

Figure 3 is a schematic diagram of the structure of the hourglass network. As shown in Figure 3, the input of a single hourglass network is an image, and the output is an image feature. After the image is input into the hourglass network, the image processing process can be divided into two parts: a convolution path and a step-by-step path. Among them, the convolution path convolves the image through the convolution path residual module, and the output of the last convolution path residual module is used as the input of the first upsampling module.

It should be noted that the size of each block in Figure 3 represents the resolution. The output resolution of the first convolution path residual module is half of its input resolution, and the input of the second convolution path residual module is the output of the first convolution path residual module; that is, the input resolution of the second convolution path residual module is half of the input resolution of the first convolution path residual module.

In addition, the output resolution of each up-sampling module is twice its input resolution, so that the up-sampling modules and the convolution path residual modules have a one-to-one correspondence. For example, the output resolution of the fourth convolution path residual module in Figure 3 is equal to the input resolution of the first up-sampling module, and the input resolution of the fourth convolution path residual module is equal to the output resolution of the first up-sampling module.

Furthermore, the input resolution and output resolution of each step-by-step path residual module are equal. One part of the output of a convolution path residual module is processed by multiple convolution path residual modules and multiple up-sampling modules, the other part is processed by a step-by-step path residual module, and the two results, which have the same resolution, are superimposed. For example, one part of the output of the first residual module is processed by the second, third, fourth, and fifth convolution path residual modules, where the input and output resolutions of the fifth convolution path residual module are equal; after up-sampling by the first, second, and third up-sampling modules, its resolution is the same as the output resolution of the first residual module. The other part of the output of the first residual module is processed by the step-by-step path residual module with its resolution unchanged, so it also matches the output resolution of the first residual module. Therefore, after the two parts of the output of the first residual module are processed differently, their resolutions are the same and they can be superimposed, and the superimposed result is used as the input of the fourth up-sampling module.
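The resolution bookkeeping described above can be checked with a short sketch. It only tracks resolutions and performs no convolution; the 256-pixel input resolution and the function name `trace_hourglass` are assumptions made for illustration.

```python
def trace_hourglass(input_res, depth=4):
    """Trace resolutions through one hourglass network.

    Returns (down, bottleneck, up):
      down       - output resolution of convolution path residual modules 1..depth
      bottleneck - output resolution of the last residual module (unchanged)
      up         - output resolution of up-sampling modules 1..depth
    """
    down = []
    res = input_res
    for _ in range(depth):
        res //= 2          # each of these residual modules halves the resolution
        down.append(res)
    bottleneck = res        # the fifth residual module keeps input == output
    up = []
    for _ in range(depth):
        res *= 2           # each up-sampling module doubles the resolution
        up.append(res)
    return down, bottleneck, up


down, bottleneck, up = trace_hourglass(256)
print(down, bottleneck, up)
# After the third up-sampling module the resolution matches the output of the
# first residual module, so the two branches can be superimposed.
print(up[2] == down[0])
```

With four halving modules and four up-sampling modules, the output of the fourth up-sampling module returns to the input resolution, which is what allows hourglass networks to be cascaded.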

Based on the above analysis of the structure of the hourglass network, the characteristic image output by the hourglass network not only retains the information of all layers, but also can determine the human skeleton points from it.

Figure 4 is a schematic diagram of the structure of a stacked hourglass network. As shown in Figure 4, multiple hourglass networks are cascaded (the output of the previous hourglass network is used as the input of the next hourglass network) to obtain the stacked hourglass network. The next hourglass network in the stacked hourglass network can use the relationships between the human bone points determined by the previous hourglass network, which makes the human bone points determined from the output of the next hourglass network more accurate.

It should be understood that the more layers of the stacked hourglass network, the more accurate the determination of human bone points. Therefore, the accuracy of the first posture recognition model is higher than that of the second posture recognition model, but the data processing amount of the first posture recognition model in use is also greater than that of the second posture recognition model, and the recognition speed is lower.

The embodiments of the present application aim to use the output of the trained first posture recognition model with a larger number of layers to help train the second posture recognition model with a smaller number of layers, so that the accuracy of the trained second posture recognition model is close to that of the first posture recognition model, while its amount of data processing is much smaller than that of the first posture recognition model.

Step S103, training the second gesture recognition model according to the output of the first gesture recognition model and the output of the second gesture recognition model.

It should be noted that, in the embodiment of the present application, the first posture recognition model is trained first. When the recognition accuracy of the first posture recognition model meets the preset condition, the training of the first posture recognition model is completed, and the trained first posture recognition model is then used to train the second posture recognition model. Specifically, the human body sample image is input into the trained first posture recognition model and the second posture recognition model respectively, to obtain the outputs of the first posture recognition model and the second posture recognition model. The second posture recognition model is then trained according to the output of the first posture recognition model and the output of the second posture recognition model.

For example, the first posture recognition model may include an 8-layer stacked hourglass network, and the second posture recognition model may include a 4-layer stacked hourglass network, so that when the second posture recognition model is used, its amount of data processing is much smaller than that of the first posture recognition model, which improves the recognition speed. In addition, the dimension of the feature vector input by the second posture recognition model should also be smaller than that of the first posture recognition model. For example, the dimension of the feature vector input by the first posture recognition model may be 256, and that of the second posture recognition model may be 128, so that the data processing amount of the second posture recognition model is smaller than that of the first posture recognition model.
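For a rough sense of scale, the sketch below compares the two example configurations using a deliberately crude cost proxy. It assumes that the cost of one hourglass network grows with the square of the feature dimension (as for a convolution whose input and output channel counts both equal that dimension) and that all hourglass networks in a stack cost the same. These are assumptions for illustration, not figures from the application, and `relative_cost` is a hypothetical helper.

```python
def relative_cost(stacks, channels):
    """Crude cost proxy: number of hourglass networks times channels squared.

    This is an assumed scaling rule, not a measured quantity.
    """
    return stacks * channels ** 2


first_model = relative_cost(stacks=8, channels=256)
second_model = relative_cost(stacks=4, channels=128)
ratio = first_model / second_model
print(ratio)
```

Under this proxy, halving both the number of layers and the feature dimension reduces the estimated cost by a factor of eight; a real measurement would depend on the exact layer structure.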

Specifically, the first gesture recognition model in the embodiment of the present application is trained through the following steps:

Step S11: Determine the first difference between the output of the first posture recognition model and the posture of the human body sample.

Among them, the output of the first posture recognition model is the human skeleton points, expressed as coordinates. Specifically, let (x, y) be the coordinates of the k-th human skeleton point output by the first posture recognition model, and let $(x_k, y_k)$ be the coordinates of the k-th human skeleton point in the human body sample posture. The distribution of the k-th human skeleton point is calculated according to a first formula, in which $\sigma^2$ is the variance of the Gaussian distribution, and the first difference $L_1$ between the output of the first posture recognition model and the human body sample posture is calculated according to a second formula.

In step S12, the parameters of the first gesture recognition model are optimized according to the first difference.

It should be noted that using the gradient descent method, the parameters of the first posture recognition model can be optimized so that $L_1$ is gradually reduced.
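The formulas for steps S11 and S12 are not reproduced in this text. The sketch below shows one common way to realize such a difference and should be read as an assumption rather than as the formulas of the application: each labelled skeleton point is turned into a Gaussian heatmap centred on its coordinates, and the difference is the squared error between predicted and target heatmaps, averaged over the skeleton points. The function names, the 8x8 map size, and sigma = 1 are illustrative.

```python
import math


def gaussian_heatmap(width, height, xk, yk, sigma=1.0):
    """2-D Gaussian centred on (xk, yk); indexed as heatmap[y][x]."""
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return [
        [norm * math.exp(-((x - xk) ** 2 + (y - yk) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def heatmap_difference(predicted, target):
    """Sum of squared pixel errors per skeleton point, averaged over points."""
    total = 0.0
    for pred_map, target_map in zip(predicted, target):
        total += sum(
            (p - t) ** 2
            for pred_row, target_row in zip(pred_map, target_map)
            for p, t in zip(pred_row, target_row)
        )
    return total / len(target)


# One skeleton point labelled at (3, 4); a prediction shifted to (5, 4).
target = [gaussian_heatmap(8, 8, 3, 4)]
shifted = [gaussian_heatmap(8, 8, 5, 4)]
print(heatmap_difference(target, target))
print(heatmap_difference(shifted, target))
```

A perfect prediction gives a difference of zero, and the value grows as the predicted point moves away from the label, which is the behaviour gradient descent needs in order to reduce the difference.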

Step S103, training the second gesture recognition model according to the output of the first gesture recognition model and the output of the second gesture recognition model, including:

Step S21: Determine the second difference between the output of the second posture recognition model and the posture of the human body sample.

It can be understood that the output of the second posture recognition model is also the human skeleton points. With (x, y) denoting the coordinates of the k-th human skeleton point output by the second posture recognition model and $(x_k, y_k)$ denoting the coordinates of the k-th human skeleton point in the human body sample posture, the second difference $L_2$ between the output of the second posture recognition model and the human body sample posture is calculated according to the corresponding formula.

Step S22: Determine the third difference between the output of the first gesture recognition model and the output of the second gesture recognition model.

Specifically, the third difference $L_3$ between the output of the first posture recognition model and the output of the second posture recognition model is calculated according to the corresponding formula, in which the term for the first model is the output of the trained first posture recognition model.

In step S23, the parameters of the second gesture recognition model are optimized according to the second difference and the third difference.

One possible implementation is to weight the second difference and the third difference to generate a fourth difference, where the weight of the second difference and the weight of the third difference sum to one, and to optimize the parameters of the second posture recognition model according to the fourth difference. Specifically, the fourth difference is calculated according to the formula $L_4 = wL_3 + (1-w)L_2$. Using the gradient descent method, the parameters of the second posture recognition model can be optimized so that $L_4$ is gradually reduced.
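The weighting can be illustrated with a toy example. The sketch below implements the formula for the fourth difference and then runs gradient descent on a single scalar standing in for the output of the second model. The squared-error form of the second and third differences, the weight w = 0.5, the learning rate, and the function names are assumptions chosen to keep the example small.

```python
def fourth_difference(l2, l3, w):
    """L4 = w * L3 + (1 - w) * L2, with the two weights summing to one."""
    return w * l3 + (1.0 - w) * l2


def train_scalar_student(label, teacher_output, w, lr=0.1, steps=200):
    """Toy gradient descent on one scalar 'student output' theta.

    Assumes L2 = (theta - label)**2 and L3 = (theta - teacher_output)**2.
    """
    theta = 0.0
    for _ in range(steps):
        grad_l2 = 2.0 * (theta - label)            # gradient of L2 w.r.t. theta
        grad_l3 = 2.0 * (theta - teacher_output)   # gradient of L3 w.r.t. theta
        theta -= lr * (w * grad_l3 + (1.0 - w) * grad_l2)
    return theta


print(fourth_difference(l2=2.0, l3=4.0, w=0.25))
theta = train_scalar_student(label=1.0, teacher_output=2.0, w=0.5)
print(theta)
```

In this toy setting the student settles at the weighted average of the label and the teacher output, which shows how w controls how strongly the second model is pulled towards the first model rather than towards the labelled posture.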

Step S104: When the number of training times reaches the preset threshold, the training of the second gesture recognition model is completed.

It should be noted that when the gradient descent method is used to optimize the parameters of the second posture recognition model, the learning rate, that is, the step length in the gradient descent process, needs to be adjusted continuously: the learning rate is progressively reduced to narrow the range of parameter optimization. For example, suppose the human body sample image library contains 40k human body sample images, of which 29k are used as training data and the remaining 11k as test data. At the beginning of training, the learning rate is set to 0.01. Training on all 29k training images and then testing on all 11k test images to obtain the corresponding recognition accuracy counts as one round of training. When the number of training times reaches 120, the learning rate is adjusted to 0.001; when it reaches 200, the learning rate is adjusted to 0.0001; and when it reaches 250, the training of the second posture recognition model is completed.

To sum up, the training method of the posture recognition model proposed in the embodiment of the application obtains the human body sample image and the corresponding human body sample posture, and inputs the human body sample image into the trained first posture recognition model and the second posture recognition model respectively, wherein the first number of layers of the hourglass network corresponding to the first posture recognition model is greater than the second number of layers of the hourglass network corresponding to the second posture recognition model. According to the output of the first posture recognition model and the output of the second posture recognition model, the second posture recognition model is trained. When the number of training times reaches the preset threshold, the training of the second posture recognition model is completed. As a result, the output of the trained first posture recognition model with a larger number of layers is used to help train the second posture recognition model with a smaller number of layers, so that the accuracy of the trained second posture recognition model is close to that of the first posture recognition model, while its amount of data processing is much smaller than that of the first posture recognition model.

In order to use the second posture recognition model trained by the training method proposed in the embodiment of the application for posture recognition of human body images, the embodiment of the application also proposes a posture recognition method. FIG. 5 is a schematic flowchart of the posture recognition method proposed in the embodiment of this application. As shown in Figure 5, the method includes the following steps:
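The example schedule can be written as a small step function. The thresholds 120, 200, and 250 and the three learning-rate values come from the example above; counting completed rounds from zero and the names `TOTAL_ROUNDS` and `learning_rate` are assumptions of this sketch.

```python
TOTAL_ROUNDS = 250  # training is complete after this many rounds


def learning_rate(completed_rounds):
    """Learning rate to use for the next round, given rounds already completed."""
    if completed_rounds < 0 or completed_rounds >= TOTAL_ROUNDS:
        raise ValueError("training runs for rounds 0..249 only")
    if completed_rounds < 120:
        return 0.01
    if completed_rounds < 200:
        return 0.001
    return 0.0001


schedule = [learning_rate(r) for r in range(TOTAL_ROUNDS)]
print(schedule[0], schedule[120], schedule[200])
```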

Step S201: Obtain the current frame of human body image to be recognized and the human body posture corresponding to the last frame of human body image.

Among them, the posture of the human body includes the position of the bone point of the human body. It should be emphasized that, in order to further ensure the privacy and security of the human body sample image and the corresponding human body sample posture, the human body sample image and the corresponding human body sample posture can also be stored in a blockchain node.

It should be noted that, since the accuracy of the second posture recognition model trained in the embodiment of this application is not as good as that of the first posture recognition model, the embodiment of this application uses an optical flow algorithm to compensate for the recognition accuracy of the second posture recognition model.

Among them, optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane. The optical flow algorithm is a method that uses the changes of pixels in the image sequence over time and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby calculates the motion information of objects between adjacent frames.

In the embodiment of the present application, the optical flow algorithm can predict the position of the corresponding human skeleton point in the current frame of the human body image by analyzing the position of the human skeleton point in the previous frame of the human body image.

Step S202: Input the current frame of human body image into the second posture recognition model after training.

Wherein, the second gesture recognition model is generated after training by the training method of the aforementioned gesture recognition model.

It should be understood that, compared with the first gesture recognition model, the second gesture model has a smaller amount of data processing, so the recognition speed is faster.

Step S203: According to the position of the human skeleton point corresponding to the last frame of the human body image, the predicted position of the human skeleton point corresponding to the current frame of the human body image is generated.

Step S204, according to the output of the second posture recognition model and the predicted position of the human skeleton point corresponding to the current frame of the human image, generate the human posture corresponding to the current frame of the human image.

Specifically, for each human skeleton point corresponding to the previous frame of human body image, the predicted position of the human skeleton point corresponding to the current frame of human body image is obtained according to the optical flow between the previous frame and the current frame of human body image. The human body posture corresponding to the current frame of human body image is then calculated according to a formula that combines the predicted position of the k-th human skeleton point corresponding to the current frame of human body image with $K_{cur}$, the position of the k-th human skeleton point in the output of the second posture recognition model, to give the position of the k-th human skeleton point corresponding to the current frame of human body image. In this formula, α is the correction coefficient, a constant between 0.25 and 0.3. According to the positions of all the human skeleton points, the human body posture corresponding to the current frame of human body image can be determined.
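The fusion formula itself is not reproduced in this text, so the following is only a sketch under a stated assumption: the final position is taken as a weighted average in which the optical-flow prediction receives weight α and the output of the second model receives weight 1 - α. The helper names, the per-point flow vector, and the sample coordinates are hypothetical.

```python
def predict_position(previous_point, flow):
    """Shift the previous-frame skeleton point by its optical-flow vector."""
    return (previous_point[0] + flow[0], previous_point[1] + flow[1])


def fuse_position(predicted, model_output, alpha=0.25):
    """Assumed fusion: alpha * prediction + (1 - alpha) * model output."""
    return tuple(alpha * p + (1.0 - alpha) * c
                 for p, c in zip(predicted, model_output))


# k-th skeleton point in the previous frame and its assumed flow vector.
predicted = predict_position((10.0, 20.0), (2.0, -1.0))
# Position of the same point in the output of the second model.
fused = fuse_position(predicted, (16.0, 23.0), alpha=0.25)
print(predicted, fused)
```

With α between 0.25 and 0.3, the output of the second model still dominates, and the optical-flow prediction acts as a correction that pulls each point towards a position consistent with the previous frame.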

In summary, the posture recognition method proposed in the embodiment of the present application obtains the current frame of human body image to be recognized and the human body posture corresponding to the previous frame of human body image. The current frame of the human body image is input into the trained second posture recognition model, and the predicted position of the human skeleton point corresponding to the current frame of the human body image is generated according to the position of the human skeleton point corresponding to the previous frame of the human body image. According to the output of the second posture recognition model and the predicted position of the human skeleton point corresponding to the current frame of the human image, the human posture corresponding to the current frame of the human image is generated. As a result, the optical flow algorithm is used to compensate the output of the second posture recognition model, and the accuracy of human posture recognition is improved.

In order to implement the foregoing embodiments, an embodiment of the present application also proposes a training device for a posture recognition model. FIG. 6 is a schematic structural diagram of a training device for a posture recognition model proposed in an embodiment of this application. As shown in FIG. 6, the device includes: a first acquisition module 310, a first input module 320, a training module 330, and a completion module 340.

The first acquisition module 310 is used to acquire a human body sample image and a corresponding human body sample pose.

The first input module 320 is configured to input the human body sample image into the trained first posture recognition model and the second posture recognition model respectively.

The training module 330 is configured to train the second posture recognition model according to the output of the first posture recognition model and the output of the second posture recognition model.

The completion module 340 is configured to complete the training of the second posture recognition model when the number of training times reaches the preset threshold.

Further, in order to optimize the parameters of the first posture recognition model, in a possible implementation the device further includes: a determining module 350, configured to determine the first difference between the output of the first posture recognition model and the human body sample posture; and an optimization module 360, configured to optimize the parameters of the first posture recognition model according to the first difference.

Further, in order to optimize the parameters of the second posture recognition model, in a possible implementation the training module 330 includes: a first determination sub-module 331, configured to determine the second difference between the output of the second posture recognition model and the human body sample posture; a second determination sub-module 332, configured to determine the third difference between the output of the first posture recognition model and the output of the second posture recognition model; and an optimization sub-module 333, configured to optimize the parameters of the second posture recognition model according to the second difference and the third difference.

Further, in order to take both the second difference and the third difference into account when optimizing the parameters of the second posture recognition model, in a possible implementation the optimization sub-module 333 includes: a summation unit 333a, configured to compute a weighted sum of the second difference and the third difference to generate the fourth difference, where the weight corresponding to the second difference and the weight corresponding to the third difference sum to one; and an optimization unit 333b, configured to optimize the parameters of the second posture recognition model according to the fourth difference.
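The weighted sum can be illustrated with a short sketch. The text does not say how a "difference" is measured, so mean squared error is used here purely as an assumption, and the names `mse`, `fourth_difference`, and `w_label` are hypothetical.

```python
import numpy as np


def mse(a, b):
    """Mean squared error, used here as a stand-in for the 'difference'."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))


def fourth_difference(student_out, teacher_out, label, w_label=0.5):
    """Weighted sum of the second and third differences; the weights sum to one."""
    if not 0.0 <= w_label <= 1.0:
        raise ValueError("w_label must lie in [0, 1]")
    second = mse(student_out, label)        # second model output vs. sample posture
    third = mse(teacher_out, student_out)   # first model output vs. second model output
    return w_label * second + (1.0 - w_label) * third


# Toy skeleton points as (x, y) coordinates.
label = [[10.0, 20.0], [30.0, 40.0]]     # human body sample posture
teacher = [[10.0, 20.0], [30.0, 42.0]]   # output of the first (larger) model
student = [[12.0, 20.0], [30.0, 40.0]]   # output of the second (smaller) model
loss = fourth_difference(student, teacher, label, w_label=0.5)
```

Setting `w_label` to 1.0 would reduce the fourth difference to the second difference alone, so the first model's output would no longer influence training.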

It should be noted that the foregoing explanation of the embodiment of the training method of the posture recognition model also applies to the training device of the posture recognition model of this embodiment, and will not be repeated here.

To sum up, when training the posture recognition model, the training device proposed in the embodiment of the present application acquires the human body sample image and the corresponding human body sample posture, and inputs the human body sample image into the trained first posture recognition model and the second posture recognition model respectively. The first number of layers of the hourglass network corresponding to the first posture recognition model is greater than the second number of layers of the hourglass network corresponding to the second posture recognition model. The second posture recognition model is trained according to the output of the first posture recognition model and the output of the second posture recognition model. When the number of training times reaches the preset threshold, the training of the second posture recognition model is completed. As a result, the output of the trained first posture recognition model, which has more layers, is used to help train the second posture recognition model, which has fewer layers, so that the accuracy of the trained second posture recognition model is close to that of the first posture recognition model while its amount of data processing is much smaller.
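As a rough illustration of this training scheme, the sketch below replaces both stacked hourglass networks with simple linear maps, which is an assumption made only to keep the example short. The optimizer (plain gradient descent), the loss (weighted mean squared error), and the names `train_student`, `max_steps`, `w_label`, and `lr` are likewise hypothetical. The point it shows is that the first model stays fixed while the second model is updated until the training count reaches the preset threshold.

```python
import numpy as np


def train_student(x, y, teacher_w, max_steps=200, w_label=0.5, lr=0.1):
    """Fit a linear 'student' against labels and a frozen linear 'teacher'."""
    student_w = np.zeros_like(teacher_w)
    teacher_out = x @ teacher_w              # the teacher is already trained and stays fixed
    steps = 0
    while steps < max_steps:                 # stop once the training count reaches the threshold
        student_out = x @ student_w
        # Gradient of w*MSE(student, label) + (1 - w)*MSE(student, teacher).
        residual = (w_label * (student_out - y)
                    + (1.0 - w_label) * (student_out - teacher_out))
        student_w -= lr * 2.0 * (x.T @ residual) / residual.size
        steps += 1
    return student_w, steps


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))                              # toy input features
true_w = np.array([[1.0, -2.0], [0.5, 0.0], [3.0, 1.0]])
y = x @ true_w                                            # toy sample postures
student_w, steps = train_student(x, y, teacher_w=true_w, max_steps=500)
```

In this toy setting the teacher is perfect, so both terms pull the student toward the same weights; with a real teacher the two terms would generally disagree and the weights would set the balance between them.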

In order to implement the foregoing embodiments, an embodiment of the present application also proposes a posture recognition device. FIG. 7 is a schematic structural diagram of a posture recognition device proposed in an embodiment of the application. As shown in FIG. 7, the device includes: a second acquisition module 410, a second input module 420, a first generation module 430, and a second generation module 440.

The second acquisition module 410 is configured to acquire the current frame of the human body image to be recognized and the human body posture corresponding to the previous frame of the human body image.

Here, the human body posture includes the positions of the human skeleton points. It should be emphasized that, in order to further ensure the privacy and security of the human body sample image and the corresponding human body sample posture, the human body sample image and the corresponding human body sample posture may also be stored in a node of a blockchain.

The second input module 420 is configured to input the current frame of the human body image into the trained second posture recognition model.

The second posture recognition model is generated by training with the aforementioned training device of the posture recognition model.

The first generating module 430 is configured to generate the predicted position of the human skeleton point corresponding to the current frame of the human body image according to the position of the human skeleton point corresponding to the previous frame of the human body image.

The second generating module 440 is configured to generate the human body pose corresponding to the current frame of the human body image according to the output of the second pose recognition model and the predicted position of the human skeleton point corresponding to the current frame of human body image.
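How the four modules might cooperate from frame to frame can be sketched as follows. This is a hypothetical wiring, not the structure of the device itself: the class name `PostureRecognizer`, the injected `model` and `predictor` callables, the pass-through handling of the first frame, and the linear blend with `alpha` are all assumptions made for illustration.

```python
from typing import Callable, Optional

import numpy as np

Points = np.ndarray  # (K, 2) skeleton point positions


class PostureRecognizer:
    """Hypothetical wiring of modules 410-440; model and predictor are injected."""

    def __init__(self, model: Callable[[np.ndarray], Points],
                 predictor: Callable[[Points, np.ndarray, np.ndarray], Points],
                 alpha: float = 0.25):
        self.model = model
        self.predictor = predictor
        self.alpha = alpha
        self.prev_frame: Optional[np.ndarray] = None
        self.prev_pose: Optional[Points] = None

    def step(self, frame: np.ndarray) -> Points:
        model_out = self.model(frame)        # role of the second input module 420
        if self.prev_pose is None:
            pose = model_out                 # first frame: nothing to predict from
        else:
            # Role of the first generating module 430.
            pred = self.predictor(self.prev_pose, self.prev_frame, frame)
            # Role of the second generating module 440; the blend is an assumed form.
            pose = self.alpha * pred + (1.0 - self.alpha) * model_out
        # Stored so the next call can play the role of the second acquisition module 410.
        self.prev_frame, self.prev_pose = frame, pose
        return pose


# Stand-in callables: a "model" that returns the frame mean for two points,
# and a "predictor" that shifts every previous point by one pixel.
rec = PostureRecognizer(model=lambda f: np.full((2, 2), f.mean()),
                        predictor=lambda pose, prev, cur: pose + 1.0,
                        alpha=0.25)
first = rec.step(np.zeros((4, 4)))
second = rec.step(np.full((4, 4), 4.0))
```

Injecting the model and the predictor keeps the frame-to-frame bookkeeping separate from the choice of network and optical flow algorithm.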

It should be noted that the foregoing explanation of the embodiment of the posture recognition method also applies to the posture recognition device of this embodiment, and will not be repeated here.

In summary, when performing posture recognition, the posture recognition device proposed in the embodiment of the present application acquires the current frame of the human body image to be recognized and the human body posture corresponding to the previous frame of the human body image. The current frame of the human body image is input into the trained second posture recognition model, and the predicted position of the human skeleton point corresponding to the current frame of the human body image is generated according to the position of the human skeleton point corresponding to the previous frame of the human body image. According to the output of the second posture recognition model and the predicted position of the human skeleton point corresponding to the current frame of the human body image, the human body posture corresponding to the current frame of the human body image is generated. As a result, the optical flow algorithm is used to compensate the output of the second posture recognition model, and the accuracy of human posture recognition is improved.

In order to implement the above-mentioned embodiments, the embodiments of the present application also propose a computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the steps of the training method of the posture recognition model of the foregoing method embodiment.

In order to implement the above-mentioned embodiments, the embodiments of the present application also propose a computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the steps of the posture recognition method of the foregoing method embodiment.

FIG. 8 is a schematic diagram of a computer device provided by an embodiment of this application. As shown in FIG. 8, the computer device 50 of this embodiment includes: a processor 51, a memory 52, and a computer program 53 stored in the memory 52 and runnable on the processor 51. When the computer program 53 is executed by the processor 51, the training method of the posture recognition model and the posture recognition method in the embodiments are implemented; to avoid repetition, they are not described again here. Alternatively, when the computer program is executed by the processor 51, the function of each module/unit in the training device of the posture recognition model or in the posture recognition device in the embodiments is realized. To avoid repetition, this is not described again here.

The computer device 50 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device may include, but is not limited to, a processor 51 and a memory 52. Those skilled in the art can understand that FIG. 8 is only an example of the computer device 50 and does not constitute a limitation on the computer device 50. It may include more or fewer components than those shown in the figure, a combination of certain components, or different components. For example, the computer device may also include input and output devices, network access devices, buses, and so on.

The so-called processor 51 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.

The memory 52 may be an internal storage unit of the computer device 50, such as a hard disk or memory of the computer device 50. The memory 52 may also be an external storage device of the computer device 50, such as a plug-in hard disk equipped on the computer device 50, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and so on. Further, the memory 52 may also include both an internal storage unit of the computer device 50 and an external storage device. The memory 52 is used to store the computer program and other programs and data required by the computer device. The memory 52 can also be used to temporarily store data that has been output or will be output.

In order to implement the above-mentioned embodiments, the embodiments of the present application also propose a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium stores a computer program. When the computer program is executed by a processor, the steps of the training method of the posture recognition model of the foregoing method embodiment are implemented.

In order to implement the above-mentioned embodiments, the embodiments of the present application also propose a computer-readable storage medium. The computer-readable storage medium stores a computer program. When the computer program is executed by a processor, the steps of the posture recognition method of the foregoing method embodiment are implemented.

Those skilled in the art can clearly understand that, for the convenience and conciseness of the description, the specific working process of the above-described system, device, and unit can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method can be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division of units is only a logical function division, and there may be other division methods in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain, essentially a decentralized database, is a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain can include the underlying blockchain platform, the platform product service layer, and the application service layer.

The above-mentioned integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute part of the steps of the methods in the various embodiments of the present application. The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

The above are only preferred embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall fall within the scope of protection of this application.

Claims

A training method for a posture recognition model, comprising:

Obtaining a human body sample image and a corresponding human body sample posture;

Inputting the human body sample image into a trained first posture recognition model and a second posture recognition model respectively; wherein the first posture recognition model includes a first stacked hourglass network, the first stacked hourglass network includes a first number of layers of hourglass networks, the second posture recognition model includes a second stacked hourglass network, the second stacked hourglass network includes a second number of layers of hourglass networks, and the first number of layers is greater than the second number of layers;

Training the second posture recognition model according to the output of the first posture recognition model and the output of the second posture recognition model; and

Completing the training of the second posture recognition model when the number of training times reaches a preset threshold.
The training method of claim 1, wherein the first posture recognition model is trained through the following steps:

Determining a first difference between the output of the first posture recognition model and the human body sample posture; and

Optimizing the parameters of the first posture recognition model according to the first difference.
The training method of claim 2, wherein the human body sample image and the corresponding human body sample posture are stored in a blockchain, and the training of the second posture recognition model according to the output of the first posture recognition model and the output of the second posture recognition model includes:

Determining a second difference between the output of the second posture recognition model and the human body sample posture;

Determining a third difference between the output of the first posture recognition model and the output of the second posture recognition model; and

Optimizing the parameters of the second posture recognition model according to the second difference and the third difference.
The training method of claim 1, wherein the dimension of the feature vector input to the second posture recognition model is smaller than the dimension of the feature vector input to the first posture recognition model.
The training method according to claim 2, wherein the output of the first posture recognition model is human skeleton points in the form of coordinates.
The training method of claim 1, wherein the first posture recognition model includes an 8-layer stacked hourglass network, and the second posture recognition model includes a 4-layer stacked hourglass network.
A posture recognition method, comprising:

Acquiring the current frame of a human body image to be recognized and the human body posture corresponding to the previous frame of the human body image; wherein the human body posture includes the positions of human skeleton points;

Inputting the current frame of the human body image into a trained second posture recognition model; wherein the second posture recognition model is generated by training with the training method of the posture recognition model according to any one of claims 1-6;

Generating the predicted position of the human skeleton point corresponding to the current frame of the human body image according to the position of the human skeleton point corresponding to the previous frame of the human body image; and

Generating the human body posture corresponding to the current frame of the human body image according to the output of the second posture recognition model and the predicted position of the human skeleton point corresponding to the current frame of the human body image.
The posture recognition method according to claim 7, wherein the formula for generating the human body posture corresponding to the current frame of the human body image according to the output of the second posture recognition model and the predicted position of the human skeleton point corresponding to the current frame of the human body image includes:

[formula not reproduced in the source text]

wherein the formula involves, for the k-th human skeleton point: the predicted position of the point corresponding to the current frame of the human body image; K_cur, which represents the position of the point in the output of the second posture recognition model; and the position of the point corresponding to the current frame of the human body image; and α represents the correction coefficient.
A computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the following steps when executing the computer program, so as to realize training of a posture recognition model:

Obtaining a human body sample image and a corresponding human body sample posture;

Inputting the human body sample image into a trained first posture recognition model and a second posture recognition model respectively; wherein the first posture recognition model includes a first stacked hourglass network, the first stacked hourglass network includes a first number of layers of hourglass networks, the second posture recognition model includes a second stacked hourglass network, the second stacked hourglass network includes a second number of layers of hourglass networks, and the first number of layers is greater than the second number of layers;

Training the second posture recognition model according to the output of the first posture recognition model and the output of the second posture recognition model; and

Completing the training of the second posture recognition model when the number of training times reaches a preset threshold.
The computer device of claim 9, wherein the first posture recognition model is trained through the following steps:

Determining a first difference between the output of the first posture recognition model and the human body sample posture; and

Optimizing the parameters of the first posture recognition model according to the first difference.
The computer device according to claim 10, wherein the human body sample image and the corresponding human body sample posture are stored in a blockchain, and the training of the second posture recognition model according to the output of the first posture recognition model and the output of the second posture recognition model includes:

Determining a second difference between the output of the second posture recognition model and the human body sample posture;

Determining a third difference between the output of the first posture recognition model and the output of the second posture recognition model; and

Optimizing the parameters of the second posture recognition model according to the second difference and the third difference.
The computer device of claim 9, wherein the dimension of the feature vector input to the second posture recognition model is smaller than the dimension of the feature vector input to the first posture recognition model.
A computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the following steps when executing the computer program, so as to realize posture recognition:

Acquiring the current frame of a human body image to be recognized and the human body posture corresponding to the previous frame of the human body image; wherein the human body posture includes the positions of human skeleton points;

Inputting the current frame of the human body image into a trained second posture recognition model; wherein the second posture recognition model is generated by training with the training method of the posture recognition model according to any one of claims 1-6;

Generating the predicted position of the human skeleton point corresponding to the current frame of the human body image according to the position of the human skeleton point corresponding to the previous frame of the human body image; and

Generating the human body posture corresponding to the current frame of the human body image according to the output of the second posture recognition model and the predicted position of the human skeleton point corresponding to the current frame of the human body image.
The computer device according to claim 13, wherein the formula for generating the human body posture corresponding to the current frame of the human body image according to the output of the second posture recognition model and the predicted position of the human skeleton point corresponding to the current frame of the human body image includes:

[formula not reproduced in the source text]

wherein the formula involves, for the k-th human skeleton point: the predicted position of the point corresponding to the current frame of the human body image; K_cur, which represents the position of the point in the output of the second posture recognition model; and the position of the point corresponding to the current frame of the human body image; and α represents the correction coefficient.
A computer-readable storage medium storing a computer program, wherein the following steps are implemented when the computer program is executed by a processor, so as to realize training of a posture recognition model:

Obtaining a human body sample image and a corresponding human body sample posture;

Inputting the human body sample image into a trained first posture recognition model and a second posture recognition model respectively; wherein the first posture recognition model includes a first stacked hourglass network, the first stacked hourglass network includes a first number of layers of hourglass networks, the second posture recognition model includes a second stacked hourglass network, the second stacked hourglass network includes a second number of layers of hourglass networks, and the first number of layers is greater than the second number of layers;

Training the second posture recognition model according to the output of the first posture recognition model and the output of the second posture recognition model; and

Completing the training of the second posture recognition model when the number of training times reaches a preset threshold.
The computer-readable storage medium of claim 15, wherein the first posture recognition model is trained through the following steps:

Determining a first difference between the output of the first posture recognition model and the human body sample posture; and

Optimizing the parameters of the first posture recognition model according to the first difference.
The computer-readable storage medium according to claim 16, wherein the human body sample image and the corresponding human body sample posture are stored in a blockchain, and the training of the second posture recognition model according to the output of the first posture recognition model and the output of the second posture recognition model includes:

Determining a second difference between the output of the second posture recognition model and the human body sample posture;

Determining a third difference between the output of the first posture recognition model and the output of the second posture recognition model; and

Optimizing the parameters of the second posture recognition model according to the second difference and the third difference.
The computer-readable storage medium of claim 15, wherein the dimension of the feature vector input to the second posture recognition model is smaller than the dimension of the feature vector input to the first posture recognition model.
A computer-readable storage medium storing a computer program, wherein the following steps are implemented when the computer program is executed by a processor, so as to realize posture recognition:

Acquiring the current frame of a human body image to be recognized and the human body posture corresponding to the previous frame of the human body image; wherein the human body posture includes the positions of human skeleton points;

Inputting the current frame of the human body image into a trained second posture recognition model; wherein the second posture recognition model is generated by training with the training method of the posture recognition model according to any one of claims 1-6;

Generating the predicted position of the human skeleton point corresponding to the current frame of the human body image according to the position of the human skeleton point corresponding to the previous frame of the human body image; and

Generating the human body posture corresponding to the current frame of the human body image according to the output of the second posture recognition model and the predicted position of the human skeleton point corresponding to the current frame of the human body image.
The computer-readable storage medium of claim 19, wherein the formula for generating the human body posture corresponding to the current frame of the human body image according to the output of the second posture recognition model and the predicted position of the human skeleton point corresponding to the current frame of the human body image includes:

[formula not reproduced in the source text]

wherein the formula involves, for the k-th human skeleton point: the predicted position of the point corresponding to the current frame of the human body image; K_cur, which represents the position of the point in the output of the second posture recognition model; and the position of the point corresponding to the current frame of the human body image; and α represents the correction coefficient.