CN111598037B

CN111598037B - Human body posture predicted value acquisition method, device, server and storage medium

Info

Publication number: CN111598037B
Application number: CN202010442857.4A
Authority: CN
Inventors: 喻冬东; 王长虎
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing ByteDance Network Technology Co Ltd
Priority date: 2020-05-22
Filing date: 2020-05-22
Publication date: 2023-04-25
Anticipated expiration: 2040-05-22
Also published as: CN111598037A

Abstract

The embodiments of the present disclosure disclose a method, a device, a server and a storage medium for acquiring a human body posture predicted value, belonging to the technical field of image processing. Taking into account the differences between the posture features of the individual human body parts, the embodiments first extract the posture features of each human body part from a feature map of a human body posture image and input them into the corresponding local prediction neural networks; the local predicted values of the parts are then input into an overall prediction neural network, so that the prediction accuracy of the human body posture is effectively improved.

Description

Human body posture predicted value acquisition method, device, server and storage medium

Technical Field

The embodiment of the disclosure relates to the technical field of image processing, in particular to a method and a device for acquiring a predicted value of a human body posture, a server and a storage medium.

Background

At present, a method for predicting the human body posture generally comprises first acquiring a human body image, then extracting a feature map of the human body image through a network such as VGG, ResNet or Inception, and inputting the extracted feature map into a neural network to predict the human body posture. However, the postures of the individual parts of the human body differ from one another, and the features of the parts are correlated; the prior-art methods for acquiring a human body posture predicted value do not sufficiently consider these differences and correlations, so their prediction accuracy is low.

Disclosure of Invention

The embodiment of the disclosure provides a method, a device, a server and a storage medium for acquiring a human body posture predicted value, so as to solve the problem of inaccurate human body posture prediction in the prior art.

In a first aspect, an embodiment of the present disclosure provides a method for acquiring a predicted value of a human body posture, including the steps of:

acquiring a human body posture image;

extracting a feature map of the human body posture image;

inputting the feature map into N feature extraction neural networks to obtain posture feature information of N human body parts, wherein each feature extraction neural network is used for obtaining posture feature information of one human body part, and N is a positive integer greater than or equal to 2;

inputting the posture feature information of the N human body parts into M local prediction neural networks to obtain local posture predicted values of M human body parts, wherein each local prediction neural network receives the posture feature information of one or more human body parts, and M is a positive integer greater than or equal to 2;

and inputting the local posture predicted values of the M human body parts into an overall prediction neural network to obtain the human body posture predicted value.

Optionally, the values of N and M are the same.

Optionally, when a local prediction neural network receives the posture feature information of a plurality of human body parts, that information comprises the posture feature information of one main body part and the posture feature information of at least one auxiliary human body part;

when a local prediction neural network receives the posture feature information of a single human body part, that information is input as the posture feature information of the main body part.

Optionally, the N body parts include several of the head, hand, upper arm, lower arm, foot, leg, and torso.

Optionally, the N feature extraction neural networks have the same network structure and different weight parameters.

Optionally, the feature extraction neural network, the local prediction neural network, and the overall prediction neural network are convolutional neural networks, and the convolutional neural network includes at least one input layer, a hidden layer, and an output layer.

Optionally, the hidden layer of the feature extraction neural network includes at least one convolution layer and a pooling layer, and the at least one convolution layer and the pooling layer form at least one convolution group for extracting features layer by layer.

Optionally, the hidden layer further comprises at least one of an activation layer, a fully connected layer and a BN layer.

In a second aspect, an embodiment of the present disclosure provides a device for acquiring a predicted value of a human body posture, including:

an image acquisition unit for acquiring a human body posture image;

a first extraction unit for extracting a feature map of the human body posture image;

a second extraction unit for inputting the feature map into N feature extraction neural networks to obtain posture feature information of N human body parts, wherein each feature extraction neural network is used for obtaining posture feature information of one human body part, and N is a positive integer greater than or equal to 2;

a local prediction unit for inputting the posture feature information of the N human body parts into M local prediction neural networks to obtain local posture predicted values of M human body parts, wherein each local prediction neural network receives the posture feature information of one or more human body parts, and M is a positive integer greater than or equal to 2;

and an overall prediction unit for inputting the local posture predicted values of the M human body parts into an overall prediction neural network to obtain the human body posture predicted value.

In a third aspect, embodiments of the present disclosure provide a server, including:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a method of obtaining a human posture prediction value as in any of the first aspects of the embodiments of the present disclosure.

In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of obtaining a human posture prediction value as in any of the first aspects of the embodiments of the present disclosure.

The embodiments of the present disclosure provide a method and a device for acquiring a human body posture predicted value. Taking into account the differences between the posture features of the individual human body parts, they first extract the posture features of each human body part from a feature map of a human body posture image and input them into the corresponding local prediction neural networks; the local predicted values of the parts are then input into an overall prediction neural network, so that the prediction accuracy of the human body posture is effectively improved.

Drawings

FIG. 1 is a flowchart of a method for obtaining a predicted value of a human body posture according to an embodiment of the present disclosure;

FIG. 2 is a block diagram of a 5-layer fully connected neural network provided by an embodiment of the present disclosure;

FIG. 3 is a schematic workflow diagram of various neural networks provided by embodiments of the present disclosure;

FIG. 4 is a block diagram of a device for obtaining a predicted value of a human body posture according to an embodiment of the present disclosure;

FIG. 5 is a schematic structural diagram of a server according to an embodiment of the present disclosure.

Detailed Description

The present disclosure is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the present disclosure and not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present disclosure are shown in the drawings.

It should be noted that the terms "system" and "network" are often used interchangeably in this disclosure. The term "and/or" in embodiments of the present disclosure is intended to include any and all combinations of one or more of the associated listed items. The terms "first", "second" and the like in the description, the claims and the drawings are used to distinguish between different objects and not to define a particular order.

It should be further noted that, the following embodiments of the present disclosure may be implemented separately or in combination with each other, and the embodiments of the present disclosure are not limited in this regard.

The prior-art method for predicting the human body posture generally comprises first acquiring a human body image, then extracting a feature map of the human body image, and then inputting the extracted feature map into a neural network to predict the human body posture. However, the postures of the individual parts of the human body differ from one another, and the features of the parts are correlated; the prior-art methods do not sufficiently consider these differences and correlations, so their prediction accuracy is low. In the technical scheme provided by the embodiments of the present disclosure, the differences between the posture features of the individual human body parts are taken into account: the posture features of each human body part are first extracted from the feature map of the human body posture image and input into the corresponding local prediction neural network; the local predicted values of the parts are then input into the overall prediction neural network, so that the prediction accuracy of the human body posture is effectively improved.

Specifically, in a first aspect, fig. 1 is a flowchart of a method for obtaining a predicted value of a human body posture according to an embodiment of the present disclosure, including the following steps:

S101, acquiring a human body posture image;

In this step, the human body posture image samples used in the embodiments of the present disclosure may be obtained from the existing ImageNet database or from another database; the embodiments of the present disclosure are not particularly limited in this respect.

S102, extracting a feature map of the human body posture image;

In this step, an existing neural network, for example a ResNet network, may be used to extract the feature map of the human body posture image; other neural networks may also be used for this purpose.

S103, inputting the feature map into N feature extraction neural networks to obtain posture feature information of N human body parts, wherein each feature extraction neural network is used for obtaining posture feature information of one human body part, and N is a positive integer greater than or equal to 2;

In the prior art, the posture feature information of a plurality of human body parts is often extracted by a single neural network, after which the human body posture is predicted. Because the posture feature information of each part differs and the posture feature information of the parts is correlated, using only one neural network to extract features from the feature map without distinguishing between parts leads to unsatisfactory prediction accuracy. The embodiments of the present disclosure instead distinguish the posture feature information of each human body part and use a separate neural network for each part (three in the example of FIG. 3) to extract that part's posture feature information from the feature map. This ensures the accuracy of the posture feature information extracted for the individual parts, and thereby the accuracy of the subsequent human body posture prediction.

S104, inputting the posture feature information of the N human body parts into M local prediction neural networks to obtain local posture predicted values of M human body parts, wherein each local prediction neural network receives the posture feature information of one or more human body parts, and M is a positive integer greater than or equal to 2;

In the prior art, the posture feature information of a plurality of human body parts is often input into a single neural network to predict the human body posture. Because the posture feature information of each part differs and the posture feature information of the parts is correlated, predicting the posture with a single neural network may yield unsatisfactory accuracy. The embodiments of the present disclosure instead input the posture feature information of the human body parts into a plurality of local prediction neural networks, so that the information of different parts is kept distinct and the posture of each part is predicted by its own local prediction neural network, which improves the accuracy of the subsequent overall posture prediction. In addition, when the posture feature information of a specific part is input into the corresponding local prediction neural network, the posture feature information of other parts associated with that part is also input, which further ensures the accuracy of the human body posture prediction.

S105, inputting the local posture predicted values of the M human body parts into the overall predicted neural network to obtain the human body posture predicted values.

In this step, the local posture predicted values of the plurality of human body parts have been obtained through the foregoing steps. To obtain the final overall posture predicted value, the overall posture of the human body still has to be predicted by one overall prediction neural network. Because the foregoing steps distinguish the posture feature information of the human body parts and predict the posture of each part with a different local prediction neural network, the accuracy of the overall human body posture predicted value builds on the improved accuracy of the local posture predicted values.

The embodiments of the present disclosure take into account the differences between the posture features of the individual human body parts: the posture features of each part are first extracted from the feature map of the human body posture image and input into the corresponding local prediction neural network. When the posture features of one part are input into its local prediction neural network, the posture features of other parts are also input into that network, so that the correlation between the posture features of the parts is fully considered and the prediction accuracy for each part is further improved. The local predicted values of the parts are then input into the overall prediction neural network, so that the prediction accuracy of the human body posture is effectively improved.
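The data flow of steps S103 to S105 can be illustrated with a minimal sketch. The networks here are replaced by trivial stand-in functions, and every name (`make_extractor`, `local_predictor`, `overall_predictor`, `predict_pose`, the part names and the numbers) is a hypothetical placeholder rather than anything defined in the disclosure; only the wiring between the stages follows the text.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]

def make_extractor(scale: float) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for one feature extraction network (one per body part)."""
    return lambda feature_map: [scale * v for v in feature_map]

def local_predictor(part_features: Sequence[Vector]) -> float:
    """Hypothetical stand-in for a local prediction network: averages all its inputs."""
    flat = [v for feats in part_features for v in feats]
    return sum(flat) / len(flat)

def overall_predictor(local_values: Sequence[float]) -> Vector:
    """Hypothetical stand-in for the overall prediction network."""
    return list(local_values)

def predict_pose(feature_map: Vector,
                 extractors: Dict[str, Callable[[Vector], Vector]],
                 local_inputs: List[List[str]]) -> Vector:
    # S103: N extractors, each producing the features of one body part.
    part_features = {part: net(feature_map) for part, net in extractors.items()}
    # S104: M local predictors, each fed the features of one or more parts.
    local_values = [local_predictor([part_features[p] for p in parts])
                    for parts in local_inputs]
    # S105: a single overall predictor combines the M local predictions.
    return overall_predictor(local_values)

extractors = {
    "head": make_extractor(1.0),
    "upper_arm": make_extractor(2.0),
    "leg": make_extractor(3.0),
}
# The second local predictor also receives the head features as auxiliary input.
local_inputs = [["head"], ["upper_arm", "head"], ["leg"]]
pose = predict_pose([1.0, 3.0], extractors, local_inputs)
```

In this toy configuration N and M are both 3; dropping or adding an entry in `local_inputs` changes M without touching the extractors.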

In some embodiments, the values of N and M are the same.

In the embodiment of the present disclosure, two implementations of local pose prediction of a human body part are specifically described below:

In the first implementation, the number N of human body parts whose posture feature information is extracted differs from the number M of local prediction neural networks. When predicting the human body posture, the posture feature information of several specific human body parts may be selected according to actual requirements and the required accuracy of the predicted value. Because the posture feature information of a specific human body part is correlated with that of other specific parts, the posture feature information of the associated parts must also be input when the information of that part is input into its local prediction neural network. As a result, N and M take different values.

In the second implementation, the number N of human body parts whose posture feature information is extracted is the same as the number M of local prediction neural networks. Unlike the first implementation, each human body part here has its own local prediction neural network, and the local posture of every part is predicted. Because the posture feature information of a specific human body part is correlated with that of other specific parts, the posture feature information of the associated parts must again be input together with that of the part itself; N and M nevertheless take the same value. Compared with the first implementation, this implementation performs local posture prediction for every human body part, so that a human body posture predicted value of higher accuracy can finally be obtained.

In some embodiments, when a local prediction neural network receives the posture feature information of a plurality of human body parts, that information includes the posture feature information of one main body part and the posture feature information of at least one auxiliary human body part.

In the embodiments of the present disclosure, the posture feature information of human body parts is divided into that of the main body part and that of auxiliary human body parts. When posture feature information is input into a local prediction neural network, two cases arise, described as follows:

In the first case, the posture feature information of a given human body part is associated with the posture feature information of one or more other parts. When predicting the local posture of that specific part, its own posture feature information is the posture feature information of the main body part, and the posture feature information of the other parts is that of the auxiliary human body parts.

In the second case, the posture feature information of a given human body part is not associated with the posture feature information of any other part. When predicting the local posture of that specific part, only its own posture feature information needs to be input, without the posture feature information of other parts; that information is then the posture feature information of the main body part.

It should be noted that, the relevance between multiple human body parts may be set according to the actual requirement and the requirement on the accuracy of the human body posture predicted value, and the posture feature information of one human body part may not be associated with the posture feature information of other parts, may be associated with the posture feature information of only one other part, and may be associated with the posture feature information of multiple other parts.
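One way to picture the configurable associations is as a lookup table from each main body part to its auxiliary parts. The sketch below is only an illustration under that assumption; the function `build_local_inputs`, the part names and the particular associations are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List

def build_local_inputs(parts: List[str],
                       associations: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each main part, list the main part first, followed by its auxiliary parts."""
    inputs: Dict[str, List[str]] = {}
    for main in parts:
        # Keep only auxiliary parts that are actually available and differ from the main part.
        aux = [p for p in associations.get(main, []) if p != main and p in parts]
        inputs[main] = [main] + aux
    return inputs

parts = ["head", "upper_arm", "lower_arm", "hand"]
# Hypothetical associations: none for head and upper_arm, one for lower_arm, two for hand.
associations = {
    "hand": ["lower_arm", "upper_arm"],
    "lower_arm": ["upper_arm"],
}
local_inputs = build_local_inputs(parts, associations)
```

A part with no entry in `associations` corresponds to the second case above (main body part only); a part with one or more entries corresponds to the first case.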

In some embodiments, the N human body parts include several human body parts of the head, hand, upper arm, lower arm, foot, leg, and torso.

In general, a human body part is roughly divided into a head, a hand, an upper arm, a lower arm, a foot, a leg, and a trunk, and when a certain motion is performed by a certain human body part, a coordinated motion of other parts may be accompanied, so that there is a correlation in posture characteristic information between each human body part. Therefore, when predicting the human body posture, if the correlation between each human body part is not considered, the human body posture prediction value may be inaccurate. The human body parts are divided, and the association between each part is considered, so that the accuracy of the human body posture predicted value is remarkably improved.

It should be noted that the division of the human body into parts is not limited to the head, hand, upper arm, lower arm, foot, leg, and trunk; the parts may be further subdivided according to actual requirements and the required accuracy of the human body posture predicted value, and the embodiments of the present disclosure are not particularly limited in this respect.

In some embodiments, the N feature extraction neural networks have the same network structure and different weight parameters.

Because the human body parts differ from one another, the feature extraction neural networks used to extract their posture feature information also differ. The networks can differ in two respects: in their network structure and in the weight parameters they use.

In the embodiments of the present disclosure, the N feature extraction neural networks share the same network structure but use different weight parameter values, which allows them to extract the posture feature information of different human body parts. Sharing one network structure greatly reduces the work of building several neural network models, while the different weight parameters ensure that the posture feature information of different human body parts is extracted.
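As a rough illustration of "same structure, different weights", the sketch below instantiates one hypothetical class several times with different random seeds. The class `TinyExtractor`, its single-layer form, the dimensions and the seeds are all assumptions made for the example; in practice the weights would come from training rather than from a random generator.

```python
import random
from typing import List

class TinyExtractor:
    """Hypothetical one-layer extractor: output = ReLU(W x). Structure is fixed by the class."""

    def __init__(self, in_dim: int, out_dim: int, seed: int) -> None:
        rng = random.Random(seed)
        # Weight values depend on the seed; the shape (out_dim x in_dim) does not.
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
                        for _ in range(out_dim)]

    def __call__(self, x: List[float]) -> List[float]:
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.weights]

parts = ["head", "upper_arm", "leg"]
# One extractor per part: identical structure, different weight parameters.
extractors = {part: TinyExtractor(4, 2, seed=i) for i, part in enumerate(parts)}
feature_map = [0.5, -1.0, 2.0, 0.0]
part_features = {part: net(feature_map) for part, net in extractors.items()}
```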

It should be noted that, the N feature extraction neural networks may also adopt different network structures and the same weight parameters; different network structures and different weight parameters can be adopted, and the setting can be performed according to actual requirements, and the embodiment of the disclosure is not particularly limited.

In addition, the feature extraction neural network may directly adopt an existing feature extractor, and the weight parameter may be obtained directly from an existing database, or may be obtained through pre-training the feature extraction neural network, which is not particularly limited in the embodiments of the present disclosure.

In some embodiments, the feature extraction neural network, the local prediction neural network, and the overall prediction neural network are convolutional neural networks that include at least one input layer, a hidden layer, and an output layer.

In an embodiment of the disclosure, the feature extraction neural network, the local prediction neural network, and the overall prediction neural network are all convolutional neural networks (CNN). Convolutional neural networks are widely used in applications such as image recognition and voice recognition; human body posture prediction in the embodiments of the present disclosure is an application of image recognition and is likewise implemented with convolutional neural networks. The convolutional neural network in the embodiments of the present disclosure includes three parts: an input layer, a hidden layer, and an output layer.

The input layer is composed of a plurality of neurons and receives a large amount of nonlinear input information, such as the posture features of a human body part in the embodiments of the present disclosure.

The output layer also comprises a plurality of neurons; information is transmitted, analyzed and weighed across the neuron links to form the output result at the output layer.

The hidden layer consists of the layers of neurons and links located between the input layer and the output layer; there may be several hidden layers or only one. The number of neurons in the hidden layer is not fixed, but the more neurons there are, the more pronounced the nonlinearity of the convolutional neural network becomes, and hence the more pronounced its robustness (the property of a control system of maintaining certain performance under perturbation of parameters such as structure and size).

It should be noted that, the embodiment of the present disclosure only provides an exemplary structure, and the structure of the convolutional neural network used is not limited, and the convolutional neural network may be set according to actual requirements, or may not include one or more of an input layer, a hidden layer and an output layer, which is not specifically limited in the embodiment of the present disclosure.

In some embodiments, the hidden layer of the feature extraction neural network includes at least one convolution layer and a pooling layer, the at least one convolution layer and the pooling layer forming at least one convolution group for extracting features layer by layer.

In the embodiments of the present disclosure, the convolutional neural network may include one convolution layer or a plurality of convolution layers. In each convolution layer, the convolution kernel of that layer is applied to the input feature map of the layer (also referred to as input feature data or input feature values; in the embodiments of the present disclosure, the feature map of the human body posture image) to perform the convolution operation of the layer, yielding the output feature map of the layer (also referred to as output feature data or output feature values; in the embodiments of the present disclosure, the posture feature information of a human body part). In each layer of the convolutional neural network, the input feature map has a certain width and height and a certain number of channels (also referred to as depth). The convolution kernels may have the same or different widths and heights, each less than or equal to the width and height of the input feature map, and each kernel has a number of channels equal to the number of channels of the input feature map.
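The relationship between kernel size, channel count and output size can be made concrete with a small sketch. It assumes a single kernel, stride 1 and no padding, none of which is specified in the disclosure, and the function name `conv2d_single_kernel` is hypothetical. As is usual in CNN libraries, the kernel is applied without flipping.

```python
from typing import List

Grid = List[List[float]]

def conv2d_single_kernel(inp: List[Grid], kernel: List[Grid]) -> Grid:
    """Apply one kernel [C][kh][kw] to an input [C][H][W]; stride 1, no padding (assumed)."""
    channels = len(inp)
    # The kernel must have as many channels as the input feature map.
    assert len(kernel) == channels
    h, w = len(inp[0]), len(inp[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    out: Grid = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            acc = 0.0
            # Sum over all channels and all kernel positions.
            for c in range(channels):
                for di in range(kh):
                    for dj in range(kw):
                        acc += inp[c][i + di][j + dj] * kernel[c][di][dj]
            row.append(acc)
        out.append(row)
    return out

# Two-channel 4x4 input whose entries are 0..15 in each channel.
inp = [[[float(r * 4 + c) for c in range(4)] for r in range(4)] for _ in range(2)]
# Two-channel 3x3 kernel of ones.
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
out = conv2d_single_kernel(inp, kernel)  # one output channel of size 2x2
```

Each kernel yields one output channel, so a layer with several kernels produces an output feature map with as many channels as it has kernels.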

Each convolution layer is followed by a pooling layer, and one convolution layer together with its pooling layer forms a convolution group. The purpose of the pooling layer is to reduce the amount of data passed on to the next convolution group. For example, when the output size of the convolution layer is 32×32 and the size of the pooling filter is 2×2, the size of the output data after pooling is 16×16, that is, the amount of data is reduced to 1/4 of that before pooling. Using a pooling layer reduces the amount of data to be processed and hence the number of parameters, which helps prevent the convolutional neural network from overfitting the data.
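The 32×32 to 16×16 arithmetic can be checked with a short sketch. It assumes max pooling with a stride equal to the filter size; the disclosure states only the filter size, so both the pooling type and the stride are assumptions, and `max_pool2d` is a hypothetical helper name.

```python
from typing import List

Grid = List[List[float]]

def max_pool2d(feature_map: Grid, size: int = 2) -> Grid:
    """Non-overlapping max pooling: window size x size, stride = size (assumed)."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, w - size + 1, size)]
            for i in range(0, h - size + 1, size)]

# A 32x32 convolution output filled with the values 0..1023.
conv_output = [[float(r * 32 + c) for c in range(32)] for r in range(32)]
pooled = max_pool2d(conv_output, size=2)

# Fraction of the data that remains after pooling.
ratio = (len(pooled) * len(pooled[0])) / (32 * 32)
```

Here `pooled` is 16×16 and `ratio` is 0.25, matching the 1/4 reduction described above.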

In some embodiments, the hidden layer further comprises at least one of an activation layer, a fully connected layer, and a BN layer.

When all neurons of adjacent layers are connected to one another, the layers are said to be fully connected. In the embodiments of the present disclosure, an Affine layer may be used to implement a fully connected layer; for example, a 5-layer fully connected neural network may be implemented by the network structure shown in FIG. 2. An activation layer, such as a ReLU layer or a Sigmoid layer, is typically connected after the Affine layer. As shown in FIG. 2, in the embodiments of the disclosure four "Affine-ReLU" combinations are stacked, the fifth layer is an Affine layer, and finally a Softmax layer outputs the final result.
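The layer stack described for FIG. 2 (four Affine-ReLU pairs, a fifth Affine layer, then Softmax) can be sketched as a plain forward pass. The layer widths in `dims`, the random initialization and the helper names are assumptions for illustration only; the disclosure does not give layer sizes.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)

def affine(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]

def softmax(x: List[float]) -> List[float]:
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng: random.Random, in_dim: int, out_dim: int) -> Layer:
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def forward(x: List[float], layers: List[Layer]) -> List[float]:
    # Layers 1-4: Affine followed by ReLU.
    for weights, bias in layers[:-1]:
        x = relu(affine(x, weights, bias))
    # Layer 5: Affine only, then Softmax produces the final result.
    weights, bias = layers[-1]
    return softmax(affine(x, weights, bias))

rng = random.Random(0)
dims = [8, 16, 16, 16, 16, 3]  # hypothetical widths: input, four hidden layers, output
layers = [init_layer(rng, dims[i], dims[i + 1]) for i in range(5)]
probs = forward([0.1 * i for i in range(8)], layers)
```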

Typically, a neural network is trained by gradient descent to update its parameters. Although gradient descent is simple and efficient, parameters such as the learning rate, the parameter initialization, the weight decay coefficient and the Dropout ratio still have to be chosen manually, and these choices are critical to the training result, so that much training time is spent on parameter tuning. With the BN (Batch Normalization) algorithm, a larger learning rate can be selected, which greatly speeds up training and makes the network converge quickly. In addition, with the BN algorithm there is no need to attend to the choice of the Dropout ratio and the L2 regularization parameter, or these two parameters can be dropped altogether, which effectively shortens the tuning time.

On the other hand, the neural network generally needs to perform normalization processing on the data before training, and the reason for the normalization processing is that the training process of the neural network is also a process of learning data distribution, and if the distributions of the training data and the test data are different, the generalization capability of the neural network is greatly reduced. In addition, if the data distribution of each batch is different, the neural network needs to adapt to different data distribution at each iteration, so that the training speed of the network is greatly reduced, and therefore, normalization preprocessing needs to be performed on the data.

In addition, the parameters of the neural network are updated during training, so the input data distribution of every layer other than the input layer keeps changing. That is, the change of the network parameters during training causes the distribution of the inputs to later layers to change: for example, the input of the second layer is computed from the input data and the parameters of the first layer, so as the first-layer parameters change during training, the distribution of the second-layer input changes as well.

Therefore, the embodiment of the disclosure can effectively solve the problems and improve the training speed of the neural network by adopting the BN layer.
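The normalization that a BN layer applies to its input can be sketched as follows. This is a simplified training-time version under stated assumptions: the scale `gamma` and shift `beta` are fixed scalars rather than learned per-feature parameters, and the running statistics used at inference are omitted. The function name `batch_norm` and the sample values are hypothetical.

```python
import math
from typing import List

def batch_norm(batch: List[List[float]], gamma: float = 1.0,
               beta: float = 0.0, eps: float = 1e-5) -> List[List[float]]:
    """Normalize each feature over the batch to roughly zero mean and unit variance."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(sample[d] for sample in batch) / n for d in range(dims)]
    variances = [sum((sample[d] - means[d]) ** 2 for sample in batch) / n
                 for d in range(dims)]
    return [[gamma * (sample[d] - means[d]) / math.sqrt(variances[d] + eps) + beta
             for d in range(dims)]
            for sample in batch]

# Two features on very different scales within one batch of four samples.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
normalized = batch_norm(batch)

# Per-feature mean and variance after normalization.
col_means = [sum(s[d] for s in normalized) / 4 for d in range(2)]
col_vars = [sum((s[d] - col_means[d]) ** 2 for s in normalized) / 4 for d in range(2)]
```

After normalization both features have a similar distribution regardless of their original scale, which is the property that lets later layers see a stable input distribution from batch to batch.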

It should be noted that the embodiments of the present disclosure only provide an exemplary structure and do not limit the structure of the convolutional neural network used; the convolutional neural network may be configured according to actual requirements and may omit one or more of the activation layer, the fully connected layer and the BN layer, which is not specifically limited in the embodiments of the present disclosure.

For the method for obtaining the human body posture predicted value provided in the first aspect, the embodiment of the present disclosure further provides a schematic workflow of the neural networks used, as shown in fig. 3. The ResNet network is used to extract a feature map of the human body posture image. The ResNet+Deconv network is a feature extraction neural network and is used to extract the posture feature information of a corresponding human body part from the feature map of the human body posture image, such as the posture feature information of the head, the upper arm and the leg of the human body in fig. 3. The LocalNet network is a local prediction neural network and is used to obtain the local posture predicted value of a corresponding part according to the posture feature information of the human body part. The GlobalNet network is an overall prediction neural network and is used to obtain the human body posture predicted value according to the local posture predicted values.
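The data flow of fig. 3 can be sketched as a chain of callables. This is a minimal illustration only: the function predict_pose, the vector representation, and the toy stand-in networks are all hypothetical, and a real implementation would use trained convolutional networks.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]

def predict_pose(
    image: Vector,
    backbone: Callable[[Vector], Vector],
    part_extractors: Dict[str, Callable[[Vector], Vector]],
    local_nets: Dict[str, Callable[[Vector], Vector]],
    local_inputs: Dict[str, Sequence[str]],
    global_net: Callable[[Vector], Vector],
) -> Vector:
    # Backbone stage (ResNet in fig. 3): image -> feature map.
    feature_map = backbone(image)
    # Feature extraction stage (ResNet+Deconv): one network per human body part.
    part_features = {part: net(feature_map) for part, net in part_extractors.items()}
    # Local prediction stage (LocalNet): each network consumes one or more parts.
    local_preds: Vector = []
    for name, net in local_nets.items():
        merged = [v for part in local_inputs[name] for v in part_features[part]]
        local_preds.extend(net(merged))
    # Overall prediction stage (GlobalNet): local predictions -> posture prediction.
    return global_net(local_preds)

# Toy stand-ins so the sketch runs; they are not the networks of the embodiment.
backbone = lambda img: [2.0 * v for v in img]
part_extractors = {
    "head": lambda fm: fm[:1],
    "upper_arm": lambda fm: fm[1:2],
    "leg": lambda fm: fm[2:3],
}
local_nets = {name: (lambda feats: [sum(feats)]) for name in part_extractors}
local_inputs = {"head": ["head"], "upper_arm": ["upper_arm", "head"], "leg": ["leg"]}
global_net = lambda preds: list(preds)

pose = predict_pose(
    [1.0, 2.0, 3.0], backbone, part_extractors, local_nets, local_inputs, global_net
)
```

In this sketch the local network for the upper arm also receives the head features, which illustrates a local prediction neural network that takes the posture feature information of more than one human body part as input.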

It should be noted that, the workflow of each neural network shown in fig. 3 is only used to illustrate an embodiment of the disclosure, and the workflow of each neural network may be reasonably modified within the scope of the protection of the embodiment of the disclosure according to actual requirements, and the embodiment of the disclosure is not particularly limited.

In a second aspect, fig. 4 is a structural block diagram of a device for obtaining a predicted value of a human body posture according to an embodiment of the present disclosure, where the device includes:

an image acquisition unit 100 for acquiring a human body posture image;

a first extraction unit 200 for extracting a feature map of a human body posture image;

the second extraction unit 300 is configured to input the feature map into N feature extraction neural networks 310 to obtain pose feature information of N human body parts, where N is a positive integer greater than or equal to 2, and each feature extraction neural network 310 is configured to obtain pose feature information of one human body part;

the local prediction unit 400 is configured to input pose characteristic information of N human body parts into M local prediction neural networks 410 to obtain local pose predicted values of M human body parts, where each local prediction neural network 410 inputs pose characteristic information of one or more human body parts, and M is a positive integer greater than or equal to 2;

the overall prediction unit 500 is configured to input the local pose predicted values of the M human body parts into the overall prediction neural network 510, to obtain the human body pose predicted values.

In some embodiments, the values of N and M are the same.
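The constraints on N and M stated for the second extraction unit 300 and the local prediction unit 400 can be expressed as a small configuration check. The class PoseDeviceConfig and its field names are hypothetical and serve only to illustrate the constraints.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class PoseDeviceConfig:
    """Hypothetical configuration for the units 300 and 400 described above."""
    parts: List[str]                    # one feature extraction network per part (N entries)
    local_inputs: Dict[str, List[str]]  # local network name -> parts it consumes (M entries)

    def validate(self) -> None:
        n, m = len(self.parts), len(self.local_inputs)
        if n < 2:
            raise ValueError("N must be at least 2")
        if m < 2:
            raise ValueError("M must be at least 2")
        for name, used in self.local_inputs.items():
            # Each local prediction network takes one or more human body parts.
            if not used:
                raise ValueError(f"local network {name!r} needs at least one part")
            unknown = set(used) - set(self.parts)
            if unknown:
                raise ValueError(f"local network {name!r} uses unknown parts {sorted(unknown)}")

# Example in which N and M have the same value (three parts, three local networks).
config = PoseDeviceConfig(
    parts=["head", "upper_arm", "leg"],
    local_inputs={"head": ["head"], "upper_arm": ["upper_arm", "head"], "leg": ["leg"]},
)
config.validate()
```

Keeping the mapping from local networks to parts explicit makes it easy to express both the case where N equals M and the case where several parts feed one local network.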

In a third aspect, fig. 5 is a schematic structural diagram of a server suitable for implementing an embodiment of the present disclosure. Taking an electronic device as an example, the server in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle-mounted terminals (e.g., vehicle-mounted navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the embodiments of the present disclosure.

As shown in fig. 5, the server 600 may include a processor (e.g., a central processing unit, a graphics processor, etc.) 601 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603, for example, implementing the human body posture predicted value acquisition method provided by an embodiment of the present disclosure, where the method includes:

acquiring a human body posture image;

extracting a feature map of the human body posture image;

The RAM 603 also stores various programs and data necessary for the operation of the server 600. The processor 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the server 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 shows a server 600 having various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 609, or installed from the storage device 608 or from the ROM 602. When the computer program is executed by the processor 601, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

The computer readable medium may be contained in the server; or may exist alone without being assembled into the server.

The computer readable medium carries one or more programs which, when executed by the server, cause the server to execute the method for acquiring the human body posture predicted value provided by the embodiment, the method including: acquiring a human body posture image; extracting a feature map of the human body posture image; inputting the feature map into N feature extraction neural networks to obtain posture feature information of N human body parts, wherein each feature extraction neural network is used for obtaining the posture feature information of one human body part, and N is a positive integer greater than or equal to 2; inputting the posture feature information of the N human body parts into M local prediction neural networks to obtain local posture predicted values of M human body parts, wherein each local prediction neural network inputs the posture feature information of one or more human body parts, and M is a positive integer greater than or equal to 2; and inputting the local posture predicted values of the M human body parts into the overall prediction neural network to obtain the human body posture predicted value.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules or units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. Wherein the name of the module or unit does not constitute a limitation of the module itself in some cases, for example, the image acquisition unit may also be described as "a unit for acquiring a human body posture image"; the first extraction unit may also be described as "a unit for extracting a feature map of a human body posture image".

The foregoing description is only of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by mutually substituting the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Claims

1. The method for acquiring the predicted value of the human body posture is characterized by comprising the following steps of:

acquiring a human body posture image;

extracting a feature map of the human body posture image;

inputting the posture feature information of the N human body parts into M local prediction neural networks to obtain local posture predicted values of the M human body parts, wherein each local prediction neural network inputs the posture feature information of one or more human body parts, and M is a positive integer greater than or equal to 2;

and inputting the local posture predicted values of the M human body parts into a global prediction neural network to obtain the human body posture predicted value.

2. The method of claim 1, wherein N and M have the same value.

3. The method of claim 1, wherein when the local prediction neural network inputs the posture feature information of a plurality of human body parts, the posture feature information of the plurality of human body parts includes posture feature information of one main human body part and posture feature information of at least one auxiliary human body part;

when the local prediction neural network inputs the posture feature information of one human body part, the posture feature information of the human body part is input.

4. The method of claim 1, wherein the N human body parts comprise several human body parts of a head, a hand, an upper arm, a lower arm, a foot, a leg, and a torso.

5. The method of claim 1, wherein the N feature extraction neural networks have the same network structure and different weight parameters.

6. The method of claim 1, wherein the feature extraction neural network, the local prediction neural network, and the global prediction neural network are convolutional neural networks comprising at least one input layer, a hidden layer, and an output layer.

7. The method of claim 6, wherein the hidden layer of the feature extraction neural network comprises at least one convolution layer and a pooling layer that form at least one convolution group for extracting features layer by layer.

8. The method of claim 7, wherein the hidden layer further comprises at least one of an activation layer, a fully connected layer, and a BN layer.

9. A human body posture prediction value acquisition apparatus, characterized by comprising:

an image acquisition unit for acquiring a human body posture image;

a local prediction unit for inputting the posture feature information of the N human body parts into M local prediction neural networks to obtain local posture predicted values of the M human body parts, wherein each local prediction neural network inputs the posture feature information of one or more human body parts, and M is a positive integer greater than or equal to 2;

and an overall prediction unit for inputting the local posture predicted values of the M human body parts into an overall prediction neural network to obtain the human body posture predicted value.

10. A server, comprising:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of obtaining a human posture prediction value as recited in any one of claims 1-8.

11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements a method of acquiring a human body posture prediction value according to any one of claims 1-8.