CN112287730A - Gesture recognition method, device, system, storage medium and equipment - Google Patents

Gesture recognition method, device, system, storage medium and equipment Download PDF

Info

Publication number
CN112287730A
Authority
CN
China
Prior art keywords
image data
gesture recognition
target object
recognition result
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910673693.3A
Other languages
Chinese (zh)
Inventor
黄海安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gaoyida Technology Shenzhen Co ltd
Robotics Robotics Shenzhen Ltd
Original Assignee
Gaoyida Technology Shenzhen Co ltd
Robotics Robotics Shenzhen Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gaoyida Technology Shenzhen Co ltd, Robotics Robotics Shenzhen Ltd filed Critical Gaoyida Technology Shenzhen Co ltd
Priority to CN201910673693.3A priority Critical patent/CN112287730A/en
Publication of CN112287730A publication Critical patent/CN112287730A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a gesture recognition method, device, system, storage medium, and apparatus. The gesture recognition method comprises the following steps: acquiring image data of a target object; acquiring a gesture recognition model; and inputting the image data into the gesture recognition model and outputting a recognition result. By performing gesture recognition with an artificial-intelligence-based method, the technical scheme of the invention improves gesture recognition accuracy and, while maintaining that accuracy, improves the generalization capability of gesture recognition.

Description

Gesture recognition method, device, system, storage medium and equipment
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a method, an apparatus, a system, a storage medium, and a device for gesture recognition.
Background
As technology advances, society as a whole is moving toward greater intelligence and automation.
Image-based gesture recognition of target objects is key to the fields of augmented reality, virtual reality, and robotics.
However, conventional gesture recognition methods suffer from limited accuracy and similar defects when the background environment, the target object, and other conditions change.
Disclosure of Invention
In view of the above, the invention provides a gesture recognition method, device, system, storage medium, and apparatus.
A first aspect of the present invention provides a gesture recognition method, including:
acquiring image data of a target object;
acquiring a gesture recognition model;
and inputting the image data into the gesture recognition model, and outputting a recognition result.
Preferably, the recognition result is a gesture recognition result of the target object.
Preferably, the recognition result is a recognition result of feature information associated with the target object in the image data.
Preferably, the feature information is key points and/or key lines.
Preferably, after outputting the recognition result, the method further includes:
acquiring the identification result;
and generating a posture recognition result of the target object according to the recognition result.
Preferably, the recognition result is a first partial gesture recognition result and a preprocessing recognition result of the target object.
Preferably, the method further includes, after outputting the gesture recognition result:
acquiring the preprocessing identification result;
determining a second part of gesture recognition results of the target object according to the preprocessing recognition results;
and taking the first part of gesture recognition result and the second part of gesture recognition result as the gesture recognition result of the target object.
Preferably, the gesture recognition method further includes:
and optimizing the posture recognition result of the target object to obtain an optimized result.
Preferably, the image data of the target object includes only the target object, or the target object and a single background;
the acquiring of the image data of the target object is preceded by:
acquiring initial image data; wherein the initial image data comprises the target object and a complex background;
and extracting the target object from the initial image data, and generating image data of the target object that includes only the target object or the target object against a single background.
The second aspect of the present invention provides a gesture recognition training method, including:
acquiring a training sample set;
acquiring an initial model of the gesture recognition model;
training the initial model based on the training sample set to obtain the gesture recognition model; the gesture recognition model is used for outputting a recognition result to the input image data of the target object.
Preferably, the identification result is: a gesture recognition result of the target object; the identification result of the feature information associated with the target object in the image data; or the first part gesture recognition result and the preprocessing recognition result of the target object.
A third aspect of the present invention provides a gesture recognition apparatus, comprising:
the target image acquisition module is used for acquiring image data of a target object;
the recognition model acquisition module is used for acquiring a gesture recognition model;
and the recognition result output module is used for inputting the image data into the gesture recognition model and outputting a recognition result.
Preferably, the recognition result is a gesture recognition result of the target object.
Preferably, the recognition result is a recognition result of feature information associated with the target object in the image data;
the gesture recognition apparatus further includes:
the identification result acquisition module is used for acquiring the identification result;
and the target result generation module is used for generating a posture recognition result of the target object according to the recognition result.
Preferably, the recognition result is a first part gesture recognition result and a preprocessing recognition result of the target object;
the gesture recognition apparatus further includes:
the preprocessing result acquisition module is used for acquiring the preprocessing identification result;
the recognition result determining module is used for determining a second part gesture recognition result of the target object according to the preprocessing recognition result;
and the target result obtaining module is used for taking the first part of gesture recognition results and the second part of gesture recognition results as gesture recognition results of the target object.
Preferably, the gesture recognition apparatus further includes:
and the target result optimization module is used for optimizing the posture recognition result of the target object to obtain an optimization result.
Preferably, the image data of the target object includes only the target object, or the target object and a single background;
the gesture recognition apparatus further includes:
the initial image acquisition module, used for acquiring initial image data; wherein the initial image data comprises the target object and a complex background;
and the target image generation module, used for extracting the target object from the initial image data and generating image data of the target object that includes only the target object or the target object against a single background.
A fourth aspect of the present invention provides a gesture recognition training apparatus, including:
the training sample acquisition module is used for acquiring a training sample set;
the initial model acquisition module is used for acquiring an initial model of the gesture recognition model;
the model training module is used for training the initial model based on the training sample set to obtain the gesture recognition model; the gesture recognition model is used for outputting a recognition result to the input image data of the target object.
A fifth aspect of the present invention provides a gesture recognition system including an image sensor and a control device;
the image sensor is used for acquiring image data of a target object;
the control device is used for acquiring the image data of the target object; acquiring a gesture recognition model; inputting the image data of the target object into the gesture recognition model, and outputting a recognition result; or
The image sensor is used for acquiring initial image data comprising the target object and a complex background;
the control device is configured to acquire the initial image data: extracting a target object in the initial image data, and generating image data of the target object only including the target object or a single background; acquiring image data of the target object; acquiring a gesture recognition model; and inputting the image data of the target object into a gesture recognition model, and outputting a recognition result.
A sixth aspect of the present invention provides a computer apparatus comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the gesture recognition method described in any one of the above when executing the computer program; and/or the gesture recognition training method described above.
A seventh aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the gesture recognition method of any one of the above; and/or the gesture recognition training method described above.
An eighth aspect of the present invention provides a method for generating the training sample set used in the gesture recognition training method of the second aspect, where the method for generating the training sample set includes:
establishing a correspondence between a plurality of view-angle categories and postures of the target object;
acquiring an image data set corresponding to each view angle, recording the correspondence among the image data set, the view-angle category, and the posture, and generating labels for the image data set;
pasting the image data set onto background images to generate an updated image data set;
taking the updated image data set and the labels as the training sample set; or
establishing a correspondence between a plurality of view-angle categories and postures of the target object;
acquiring an image data set corresponding to each view angle, recording the correspondence among the image data set, the view-angle category, and the posture, and generating labels for the image data set;
taking the image data set and the labels as the training sample set; or
acquiring an image data set of the target object; wherein each image data in the image data set corresponds to a posture of the target object;
acquiring a 3D model of the target object including 3D feature information;
projecting the 3D feature information onto the corresponding image data according to the posture corresponding to each image data, to generate labels of 2D feature information;
taking the image data set and the labels as the training sample set; or
acquiring an image data set of the target object; wherein each image data in the image data set corresponds to a posture of the target object;
acquiring a 3D model of the target object including 3D feature information;
projecting the 3D feature information onto the corresponding image data according to the posture corresponding to each image data, to generate 2D feature information;
generating labels of prediction graphs according to the 2D feature information;
and taking the image data set and the labels as the training sample set.
By performing gesture recognition with an artificial-intelligence-based method, the technical scheme of the invention improves gesture recognition accuracy and, while maintaining that accuracy, improves the generalization capability of gesture recognition.
Drawings
FIG. 1 is a first flowchart of a gesture recognition method in one embodiment;
FIG. 2 is a diagram illustrating a second process of a gesture recognition method according to an embodiment;
FIG. 3 is a diagram illustrating a third process of a gesture recognition method according to an embodiment;
FIG. 4 is a fourth flowchart illustrating a gesture recognition method according to an embodiment;
FIG. 5 is a fifth flowchart illustrating a method for gesture recognition according to an embodiment;
FIG. 6 is a sixth flowchart illustrating a gesture recognition method according to an embodiment;
FIG. 7 is a first flowchart of a training sample set generation method according to an embodiment;
FIG. 8 is a second flowchart of a training sample set generation method in one embodiment;
FIG. 9 is a third flowchart of a training sample set generation method in one embodiment;
FIG. 10 is a first flowchart of a gesture recognition training method in one embodiment;
FIG. 11 is a first block diagram of a gesture recognition apparatus in one embodiment;
FIG. 12 is a block diagram showing a second configuration of a posture identifying apparatus in one embodiment;
FIG. 13 is a block diagram of a third configuration of a gesture recognition apparatus in one embodiment;
FIG. 14 is a fourth structural block diagram of a posture identifying apparatus in one embodiment;
FIG. 15 is a block diagram showing a fifth configuration of a posture identifying apparatus in one embodiment;
FIG. 16 is a block diagram showing a sixth configuration of a posture identifying apparatus in one embodiment;
FIG. 17 is a first block diagram of a gesture recognition training apparatus in accordance with one embodiment;
FIG. 18 is a first block diagram of a gesture recognition system in one embodiment;
FIG. 19 is a diagram showing a first configuration of an environment in which a gesture recognition method is applied in one embodiment;
FIG. 20 is a first block diagram of a computer device in one embodiment;
FIG. 21 is a first diagram of pose classification in one embodiment;
FIG. 22 is a first diagram illustrating keypoint switching in one embodiment;
FIG. 23A is a first prediction graph in one embodiment; FIG. 23B is a diagram illustrating a deformation of the prediction graph in one embodiment.
FIG. 24 is a second prediction graph in one embodiment;
FIG. 25 is a third prediction graph in one embodiment;
FIG. 26 is a schematic illustration of a key line for one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The gesture recognition method provided by the present application may be applied to an application environment as shown in fig. 19, where the application environment may include the terminal 600 and/or the server 700, and the terminal 600 communicates with the server 700 through a network. The method can be applied to both the terminal 600 and the server 700. The terminal 600 may be, but is not limited to, various industrial computers, personal computers, notebook computers, smart phones, and tablet computers. Server 700 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In one embodiment, as shown in fig. 1, a gesture recognition method is provided. Taking its application to the terminal in fig. 19 as an example, the method includes the following steps:
step S101, acquiring image data of a target object;
specifically, the image data collected by the image sensor and transmitted in real time may be acquired, or the image data may be acquired from a memory or a server.
In one embodiment, the image data of the target object may be image data including the target object and a complex background. In another embodiment, the image data may include only the target object, or the target object and a single background (i.e., a single background in addition to the target object); in that case, as shown in fig. 2, the following method steps may precede step S101:
step S104, acquiring initial image data; wherein the initial image data comprises the target object and a complex background;
step S105, extracting the target object from the initial image data, and generating image data that includes only the target object or the target object against a single background.
Specifically, the single background means that the background adopts a single pattern or color.
Specifically, the image data may be various types of 2D image data (e.g., RGB image, grayscale or black-and-white image). Image sensors may include, but are not limited to: cameras, video cameras, scanners or other devices with associated functions (cell phones, computers), etc.
Specifically, the extraction may include, but is not limited to, the following methods:
in one embodiment, the foreground can be cut out of the initial image along its external contour; for example, the foreground portion of the initial image is identified with conventional image processing methods (such as binarization, edge detection, or connected-domain analysis) and then extracted; or
further, in an embodiment, the extracted foreground may also be mapped onto a single-background rectangular image of preset size, so as to generate a single-background target image; or
in one embodiment, the initial image of a given size is cropped so that the cropped image is the smallest rectangle enclosing the outer bounding box of the target object (for example, if the initial image is 100 × 200 and the cropped image is 50 × 80, the cropped image can be regarded as target-object image data containing only the foreground); further, in an embodiment, the cropped image may also be mapped onto a single-background rectangular image of preset size, so as to generate a single-background target image; or
in one embodiment, the initial image is processed so that the processed image includes only the foreground and a single background, for example with conventional vision-based methods or artificial-intelligence-based methods.
Step S102, acquiring a posture recognition model;
the previously trained gesture recognition model is acquired from a memory, a server, or the like.
Step S103 inputs the image data into the gesture recognition model, and outputs the recognition result.
Because an artificial-intelligence-based method is used for the recognition related to the posture of the target object, recognition accuracy is improved; in addition, the generalization capability of the recognition can be improved while that accuracy is maintained.
Specifically, in step S103, different recognition results may be output according to different design of the gesture recognition model.
In one embodiment, the recognition result may be, but is not limited to: a gesture recognition result of the target object; identifying the characteristic information associated with the target object in the image data; or a combination of the first part of the gesture recognition result of the target object and the pre-processing recognition result.
In one embodiment, the recognition result output by the model may include, in addition to the result related to gesture recognition, the category, the mask, and the like of the target object. For example, the category indicates what the object is, and the mask indicates which region of the image belongs to the object of interest.
A gesture recognition result of the target object means that the three-dimensional posture information of the target object is obtained directly from the model. A recognition result of the feature information associated with the target object requires further subsequent processing (for example, combining it with a 3D model of the target object) to obtain the final gesture recognition result of the target object from the two-dimensional feature information in the image. A combination of the first partial gesture recognition result and the preprocessing recognition result of the target object means that part of the gesture recognition result of the target object is output directly by the model, while the other part is a preprocessing recognition result that needs further processing to obtain the final overall gesture recognition result of the target object. The three recognition results and the corresponding recognition methods are described in further detail below.
In one embodiment, the recognition result is the gesture recognition result of the target object, that is, the gesture recognition result of the target object is output directly by the model; in other words, the recognition result may directly be the three-dimensional posture information of the target object.
Specifically, the three-dimensional posture information may be the 3d coordinates of the target object in a preset coordinate system. The motion of a rigid body in 3-dimensional space can be described by such 3d coordinates (6 degrees of freedom in total), divided into rotation and translation with 3 degrees of freedom each. The translation of the rigid body in 3-dimensional space is an ordinary linear transformation, and a 3x1 vector can be used to describe the translational position; the rotational posture is commonly described by, but not limited to: a rotation matrix, a rotation vector, a quaternion, Euler angles, or a Lie algebra element.
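As an illustrative sketch of these equivalent rotation descriptions (Python with SciPy; the particular angle values and the "xyz" axis convention are assumptions for the example):

    import numpy as np
    from scipy.spatial.transform import Rotation

    # A 6-DoF pose: a 3x1 translation vector plus a rotation (here built from Euler angles).
    t = np.array([0.10, -0.05, 0.80])                  # translation, e.g. metres in the camera frame
    R = Rotation.from_euler("xyz", [30, 15, -5], degrees=True)

    print(R.as_matrix())                               # 3x3 rotation matrix
    print(R.as_rotvec())                               # rotation vector (axis * angle)
    print(R.as_quat())                                 # quaternion (x, y, z, w)
    print(R.as_euler("xyz", degrees=True))             # back to Euler angles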
In one embodiment, as shown in fig. 10, a method for training a gesture recognition model is provided. Again taking the application of the method to the terminal in fig. 19 as an example, the method includes the following steps:
step S401, acquiring a training sample set;
step S402, obtaining an initial model of the gesture recognition model;
step S403 trains the initial model based on the training sample set to obtain the gesture recognition model.
In one embodiment, the gesture recognition model is used for outputting a recognition result to the input image data of the target object.
Specifically, the network model may include, but is not limited to, a Convolutional Neural Network (CNN), and common CNN models may include, but are not limited to: LeNet, AlexNet, ZFNET, VGG, GoogLeNet, Residual Net, DenseNet, R-CNN, SPP-NET, Fast-RCNN, YOLO, SSD, BB8, YOLO-6D, Deep-6dPose, PoseCNN, Hourglass, CPN and other now known or later developed network model structures.
Specifically, the training method may be supervised learning, semi-supervised learning, or unsupervised learning, or another training method developed now or in the future. Taking supervised learning as an example, the image data set is used as the input and the three-dimensional posture information as the labels, and the initial model of the gesture recognition model is trained to obtain the gesture recognition model.
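A minimal supervised-learning sketch along these lines (Python with PyTorch; the ResNet-18 backbone, the 7-dimensional output interpreted as quaternion plus translation, and the mean-squared-error loss are illustrative assumptions, not a prescribed network structure or training method):

    import torch
    import torch.nn as nn
    import torchvision

    # Assumed dataset: each sample is (image tensor, 7-d pose label = quaternion + translation).
    model = torchvision.models.resnet18(num_classes=7)   # CNN regressing the pose directly
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def train(loader, epochs=10):
        model.train()
        for _ in range(epochs):
            for images, pose_labels in loader:            # images: (B, 3, H, W); labels: (B, 7)
                pred = model(images)
                loss = loss_fn(pred, pose_labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()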
In one embodiment, the recognition result is a recognition result of feature information associated with the object in the image data.
In particular, the feature information may be, but is not limited to, a key point and/or a key line (wherein a key line may be regarded as a combination of a plurality of consecutive key points).
Specifically, the recognition result of the feature information may be, but is not limited to: the 2d coordinates of the key points and/or key lines, where the 2d coordinates of a key line are formed by combining the 2d coordinates of a plurality of continuous key points; the image data with the key points and/or key lines superimposed as annotations; or a prediction graph from which the above key points and/or key lines are extracted.
Specifically, a key point may be a key point on the target object itself, or a key point of a bounding box that encloses the target object. The two cases are described in detail below:
in one embodiment, the keypoint recognition result concerns key points of a bounding box that encloses the target object. Specifically, the key points may be defined as the 2d coordinates of the projections, on the 2d image, of the 8 vertices of a 3d bounding box surrounding the target object; or the model can directly output the image data with the projection-point annotations superimposed (as shown in the right diagram in fig. 22). In one embodiment, the center point of the target object may be added in addition to the above 8 vertices (i.e., 9 key points in total).
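For illustration, the 2d projections of the 8 vertices (plus the center point) of such a 3d bounding box can be computed with the pinhole model as follows (Python; the box dimensions, camera intrinsics K, and pose R, t are assumed inputs):

    import numpy as np

    def project_bounding_box(dims, R, t, K):
        """Project the 8 corners and center of an object-centred 3d box into the image."""
        w, h, d = dims                                     # box width, height, depth
        corners = np.array([[sx * w / 2, sy * h / 2, sz * d / 2]
                            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
        points_3d = np.vstack([corners, [[0.0, 0.0, 0.0]]])   # 8 vertices + center = 9 key points
        cam = points_3d @ R.T + t                          # object frame -> camera frame
        uv = cam @ K.T                                     # pinhole projection
        return uv[:, :2] / uv[:, 2:3]                      # 9x2 array of 2d pixel coordinates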
In one embodiment, the keypoint recognition result concerns key points on the target object itself, as described in further detail below.
Specifically, the network model may include, but is not limited to, a Convolutional Neural Network (CNN), and common CNN models may include, but are not limited to: LeNet, AlexNet, ZFNET, VGG, GoogLeNet, Residual Net, DenseNet, R-CNN, SPP-NET, Fast-RCNN, YOLO, SSD, BB8, YOLO-6D and other now known or later developed network model structures.
Specifically, regarding the training method of the model, refer to the steps of the gesture recognition training method in the embodiment shown in fig. 10, and are not described herein again.
Specifically, the training method may be supervised learning, semi-supervised learning, or unsupervised learning, currently available or developed in the future. Taking supervised learning as an example, the image data set is used as the input, and the 2d coordinates of the projection points on the 2d image (or, alternatively, the image data with the projection-point annotations superimposed) are used as the labels; the initial model of the gesture recognition model is trained on these samples, thereby obtaining the gesture recognition model.
In one embodiment, as shown in fig. 8, a method for generating a sample set including keypoint labels in a training sample set of keypoints belonging to a target object is provided:
step S301, acquiring an image data set of a target object; wherein each image data in the image data set corresponds to a pose of a target object;
step S302, acquiring a 3D model of a target object including 3D characteristic information;
in particular, the 3D feature information (key points and/or key lines) may be predefined according to input instructions, or may be generated automatically from the 3D model (e.g., a CAD model) by a feature extraction algorithm.
Step S303, projecting the 3D feature information on the 3D model onto the image data of the corresponding posture, according to the posture corresponding to each image data, to generate the labels of the 2D feature information;
step S304 takes the image data set and the corresponding label as a training sample set.
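A sketch of this label-generation step (Python with OpenCV; the per-sample pose format, the zero-distortion assumption, and the variable names are assumptions for illustration):

    import cv2
    import numpy as np

    def make_keypoint_labels(samples, model_keypoints_3d, K):
        """For each (image, rvec, tvec) sample, project the 3d model key points to 2d labels."""
        pts_3d = np.asarray(model_keypoints_3d, np.float64)   # Nx3 key points on the 3D model
        dist = np.zeros(5)                                    # assume no lens distortion
        labels = []
        for image, rvec, tvec in samples:                     # rvec/tvec: pose of the target object
            pts_2d, _ = cv2.projectPoints(pts_3d, rvec, tvec, K, dist)
            labels.append(pts_2d.reshape(-1, 2))              # 2d feature-information label
        return labels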
Further, in one embodiment, the recognition result of the feature information is a prediction graph: each pixel of the image data is predicted, the model outputs the prediction graph (as shown in fig. 23A), and the key points and/or key lines are then extracted from the prediction graph.
Specifically, different meanings can be represented according to different characteristic information such as color, brightness and the like of the prediction graph, for example: the probability of the key points and the positions of the key points; alternatively, the prediction map may be a contour map or any other image representation.
In one embodiment, each pixel predicts the direction of a keypoint relative to the pixel itself, and the difference in color shown in the prediction map (e.g., FIG. 23A) represents the different direction.
In another example, each pixel predicts the likelihood/probability that the current pixel is a keypoint. The higher the probability, the higher the predicted value (e.g., the higher the brightness of the image (as shown in fig. 25), or the greater the height of the contour (as shown in fig. 24)). It should be noted that the predicted value of the pixel points near the key point is usually high.
Specifically, according to the prediction graph, the method for determining the key points and/or the key lines may include, but is not limited to, the following methods:
After the prediction graph is obtained, it needs to be converted into key points and/or key lines. The conversion method depends on the meaning of the prediction graph. When the prediction graph predicts a direction, a voting scheme can be used, taking as the key point the position toward which the largest number of pixels point (as shown in fig. 23B). When the prediction graph predicts the key-point likelihood, the pixel with the highest predicted value can be taken as the key point, or a weighted average can be computed over the region with high predicted values. Specifically, the model can output a plurality of prediction graphs, one key point being predicted from each prediction graph; or the model can output a single prediction graph on which a plurality of key points are predicted.
Because target object posture recognition is performed with this prediction-graph-based method, it generally achieves higher precision than methods that directly output key points and/or key lines, and the training difficulty is reduced.
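For example, extracting a key point from a likelihood-type prediction graph can be sketched as follows, either taking the pixel with the highest predicted value or taking a weighted average over the high-value region (Python; the threshold ratio is an assumption):

    import numpy as np

    def keypoint_from_prediction_graph(pred: np.ndarray, thresh_ratio: float = 0.8):
        """pred: HxW array where higher values mean 'more likely to be the key point'."""
        # Option 1: the single pixel with the highest predicted value.
        v, u = np.unravel_index(np.argmax(pred), pred.shape)
        # Option 2: weighted average over the region of high predicted values.
        mask = pred >= thresh_ratio * pred.max()
        vs, us = np.nonzero(mask)
        w = pred[vs, us]
        u_avg, v_avg = (us * w).sum() / w.sum(), (vs * w).sum() / w.sum()
        return (u, v), (u_avg, v_avg)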
In one embodiment, as shown in fig. 9, a method for generating a model training sample set is provided:
step S401, acquiring an image data set of a target object; wherein each image data in the image data set corresponds to a pose of a target object;
step S402, acquiring a 3D model (such as a CAD model) of the target object including 3D feature information;
step S403, projecting the 3D feature information on the CAD model onto the image data of the corresponding posture, according to the posture corresponding to each image data, to generate the 2D feature information;
step S404, generating the label of the prediction graph according to the 2d characteristic information;
namely, the prediction graph is used as the label of the image data;
step S405 takes the image data set and the label as a training sample set.
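By way of example only, one common way (an assumption here, not mandated by the embodiments) to turn the 2d feature information into prediction-graph labels is to render a small Gaussian around each 2d key point (Python):

    import numpy as np

    def prediction_graph_label(keypoints_2d, height, width, sigma=3.0):
        """Render one Gaussian prediction graph per 2d key point; values in [0, 1]."""
        vs, us = np.mgrid[0:height, 0:width]
        maps = []
        for (u, v) in keypoints_2d:
            d2 = (us - u) ** 2 + (vs - v) ** 2
            maps.append(np.exp(-d2 / (2.0 * sigma ** 2)))
        return np.stack(maps)                      # shape: (num_keypoints, H, W)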
Specifically, regarding the training method of the model, refer to the steps of the gesture recognition training method in the embodiment shown in fig. 10, and are not described herein again.
Specifically, the training method may be supervised learning, semi-supervised learning, or unsupervised learning, or a training method developed now or in the future. Taking supervised learning as an example, the gesture recognition model is trained by taking the image data set as the input and the prediction graphs as the labels.
Specifically, the network model may include, but is not limited to, a Convolutional Neural Network (CNN), and common CNN models may include, but are not limited to: LeNet, AlexNet, ZFNET, VGG, GoogLeNet, Residual Net, DenseNet, Mask-RCNN, Hourglass, CPN and other now known or later developed network model structures.
In one embodiment, the recognition result is based on key lines, or on a combination of key lines and key points.
Specifically, the output of the model may be key lines (e.g., curves or straight lines) on the two-dimensional image. For example, the model can directly output the 2d coordinates, on the 2d image, of the projections of the continuous key points that form each key line; or the model may directly output image data with the key-line annotations superimposed.
In particular, these lines need to be predefined, for example as some prominent edge lines of the target object set according to a model of the object such as a CAD model. Combinations of key lines can even be defined; as shown in fig. 26, the output of the model can be the combination of line segments AB, AC and AD, or even some geometry formed by combining multiple key lines (the drawings are omitted).
In general, when matching against the 3D model, the two end points of a line segment carry the same information as all the points on the segment. In practice, however, detection is noisy, and matching the 3D model against the noisy points of a whole line segment is more accurate than matching against only its two end points (which correspond to two corner points connected by an edge of the object in space); matching the 3D model with more points therefore reduces the final matching error.
In addition, earlier keypoint-based algorithms may fail when some key-point corners are not observed, whereas key-line detection still has a chance to produce a match.
It should be noted that, because a key line may be regarded as a combination of a plurality of continuous key points, the descriptions of the network model types, the model training methods, the sample generation methods, and the like for key-line-based gesture recognition follow the introduction given for key points, and are not repeated here.
In one embodiment, as shown in fig. 3, when the recognition result is a recognition result of feature information associated with the target object in the image data, step S103 is further followed, after outputting the recognition result, by:
step S106, obtaining a recognition result;
step S107 generates a posture recognition result of the target object based on the recognition result.
Specifically, in step S107, after the key points recognized on the 2d image are obtained, the gesture recognition result of the target object may be generated with a suitable algorithm.
Further, in one embodiment, step S107 may include the following method:
step S1073, acquiring a 3D model including 3D key point information;
step S1074, respectively acquiring 2d key point information associated with the target object in the image data;
step S1075 generates a posture recognition result of the target object based on a linear transformation method according to the 3d key point information and the corresponding 2d key point information.
Specifically, based on a direct linear transformation method, the key points obtained from the image data and the corresponding key points on the 3D model can be used to obtain the gesture recognition result of the target object in the image sensor coordinate system.
The direct linear transformation method uses the projection relation between the 3D key point information on the 3D model of the target object and the 2D key point information to set up a system of equations; solving this system of equations yields the three-dimensional gesture recognition result of the target object.
The biggest disadvantage of the direct linear transformation method is that the rotation matrix it solves for is a general matrix, whereas a true rotation matrix is a unit orthogonal matrix; the direct linear transformation method therefore usually yields an approximate solution whose precision may not be very high.
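In practice this 3d-to-2d correspondence problem is the classic PnP problem; a minimal sketch using OpenCV (one possible solver chosen here for illustration, not the only admissible one, and assuming zero lens distortion):

    import cv2
    import numpy as np

    def pose_from_correspondences(keypoints_3d, keypoints_2d, K):
        """Solve for the target object's pose in the image-sensor coordinate system."""
        ok, rvec, tvec = cv2.solvePnP(
            np.asarray(keypoints_3d, np.float64),     # 3d key points on the 3D model
            np.asarray(keypoints_2d, np.float64),     # matching 2d key points in the image
            K, None,                                  # camera intrinsics, no distortion assumed
            flags=cv2.SOLVEPNP_ITERATIVE)
        R, _ = cv2.Rodrigues(rvec)                    # rotation vector -> rotation matrix
        return ok, R, tvec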
Therefore, in one embodiment, after obtaining the gesture recognition result of the target object generated in step S1075, the following method steps may be further included:
and S1076, optimizing the posture recognition result of the target object to obtain an optimized recognition result.
Specifically, a nonlinear optimization method may be used for optimization. In one embodiment, the optimization method S1076 includes the following method steps:
step S601, acquiring a posture recognition result of a target object;
step S602, according to the gesture recognition result of the target object, calculating the projection of the 3D key point on the 3D model in the image data;
step S603 compares the projection with the position of a key point in the image data to obtain a re-projection error;
step S604, updating the posture recognition result of the target object by taking the minimized reprojection error as a target to obtain a current updating result;
in one embodiment, the problem may be solved with a nonlinear optimization algorithm whose goal is to minimize the reprojection error. Nonlinear optimization algorithms include, but are not limited to, the Newton, Gauss-Newton, and Levenberg-Marquardt methods.
Step S605, replacing the posture recognition result of the target object with the current update result, and repeating steps S602-S604 until the reprojection error is smaller than a preset threshold or the update has been performed a preset number of times, thereby obtaining the optimized posture recognition result of the target object.
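A minimal sketch of this reprojection-error minimization (Python with SciPy's Levenberg-Marquardt implementation; the parameterization as rotation vector plus translation, the zero-distortion assumption, and the iteration limit are assumptions):

    import cv2
    import numpy as np
    from scipy.optimize import least_squares

    def refine_pose(rvec0, tvec0, keypoints_3d, keypoints_2d, K):
        """Refine an initial pose by minimizing the reprojection error (cf. steps S601-S605)."""
        pts_3d = np.asarray(keypoints_3d, np.float64)
        pts_2d = np.asarray(keypoints_2d, np.float64)

        def residuals(x):
            rvec, tvec = x[:3], x[3:]
            proj, _ = cv2.projectPoints(pts_3d, rvec, tvec, K, None)
            return (proj.reshape(-1, 2) - pts_2d).ravel()      # reprojection error

        x0 = np.hstack([np.ravel(rvec0), np.ravel(tvec0)])
        result = least_squares(residuals, x0, method="lm", max_nfev=100)
        return result.x[:3], result.x[3:]                      # optimized rotation vector, translation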
Specifically, as shown in fig. 4, when the recognition result of the feature information associated with the target object is a prediction graph, step S107 may further include the following steps:
step S1071, determining the key points according to the prediction graph;
step S1072, determining the posture recognition result of the target object based on the key points.
Specifically, the posture recognition result of the target object may be determined by a method of calculating the posture recognition result of the target object according to the key point with reference to the above-mentioned steps S1073 to S1075 or steps S1073 to S1076.
In one embodiment, the recognition result is a combination of the first partial gesture recognition result of the target object and the preprocessing recognition result. That is, the translational position information and the rotational posture information within the three-dimensional posture information of the target object can be acquired separately.
Further, in an embodiment, as described in the above embodiment, as shown in fig. 5, after the step S103 outputs the recognition result, the following method steps may be further included:
step S108, acquiring a preprocessing identification result;
For example: the model directly outputs the rotational posture information together with a preprocessing recognition result for the translational position information, and the preprocessing recognition result is then further processed to obtain the translational position information; or the model directly outputs the translational position information together with a preprocessing recognition result for the rotational posture information.
Step S109, determining a second part gesture recognition result of the target object according to the preprocessing recognition result;
step S110, taking the first partial gesture recognition result and the second partial gesture recognition result as the gesture recognition result of the target object. Finally, the rotational posture information and the translational position information together form the overall posture recognition result of the target object.
In one embodiment, the model may directly output the rotational posture information, while the translational position information is determined from the preprocessing recognition result output by the model. For example: when the model outputs a prediction graph, the position information of a reference point of the target object (such as its center point) can be obtained from the prediction graph by a voting method; alternatively, when the model outputs a 3d bounding box surrounding the target object, the position information of the target object is determined by computing a reference point of the bounding box (such as its center point) and combining it, according to the imaging principle, with the depth of that point relative to the optical center of the image sensor (the depth information can be output directly by the model or obtained by some image processing method).
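As an illustrative sketch of this imaging-principle step (Python; a pinhole model with intrinsics fx, fy, cx, cy is assumed):

    import numpy as np

    def backproject_center(u, v, depth, K):
        """Recover the translation of a reference point (e.g. the center point) from its
        pixel coordinates and its depth relative to the optical center of the image sensor."""
        fx, fy = K[0, 0], K[1, 1]
        cx, cy = K[0, 2], K[1, 2]
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        return np.array([x, y, depth])               # 3d position in the camera frame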
The network model may include, but is not limited to, Convolutional Neural Networks (CNNs), and common CNN models may include, but are not limited to: LeNet, AlexNet, ZFNET, VGG, GoogLeNet, Residual Net, DenseNet, R-CNN, SPP-NET, Fast-RCNN, FCN, Mask-RCNN, YOLO, YOLOv2, YOLOv3, SSD, Deep-6dPose, PoseCNN and other now known or later developed network model structures.
Specifically, the training method may be supervised learning, semi-supervised learning, or unsupervised learning, or a training method developed now or in the future.
Taking supervised learning as an example, a training sample set of a target object may be used as an input of the model, a preprocessing recognition result and a first part of gesture recognition results are used as labels of the training sample set, and an initial model is trained, so as to obtain a gesture recognition model.
Further, in one embodiment, the first partial gesture recognition result may be a posture classification result of the target object. Specifically, in one embodiment, the posture classification result may be the view-angle category of the target object: the image data is input into the trained gesture recognition model, the view-angle category of the target object is output, and the range of the rotational posture corresponding to the target object can be obtained by looking up the view-angle-category-to-posture table.
Specifically, the goal of posture classification of the target object is to obtain the posture of the object relative to the camera. Imagine a sphere of arbitrary radius centered on the object; the camera moves on this sphere and photographs the object, so the posture of the object is determined by the position of the image sensor on the sphere.
This sphere is discretized as shown in fig. 21. Each point in the figure is a view angle, and each view angle corresponds to a posture. Through this discretization, the originally continuous posture estimation problem is converted into a classification problem: it is only necessary to estimate to which view angle the posture of the object belongs.
The accuracy of recognition with this method depends on the degree of discretization; the finer the sphere is divided, the higher the accuracy.
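One possible way to discretize the sphere and to look up the view-angle category (a Fibonacci sampling, which is an assumption for illustration; the embodiments do not prescribe a particular discretization scheme) can be sketched as follows (Python):

    import numpy as np

    def discretize_viewpoints(n: int) -> np.ndarray:
        """Return n roughly uniform unit view directions on the sphere (Fibonacci sampling)."""
        k = np.arange(n)
        phi = np.arccos(1.0 - 2.0 * (k + 0.5) / n)         # polar angle
        theta = np.pi * (1.0 + 5 ** 0.5) * k               # azimuth (golden-angle increments)
        return np.stack([np.sin(phi) * np.cos(theta),
                         np.sin(phi) * np.sin(theta),
                         np.cos(phi)], axis=1)

    viewpoints = discretize_viewpoints(162)                 # view-angle-category -> direction table

    def view_class(direction):
        """Classify a camera direction into the nearest discretized view-angle category."""
        d = np.asarray(direction) / np.linalg.norm(direction)
        return int(np.argmax(viewpoints @ d))               # index into the category-posture table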
Specifically, regarding the training method of the model, refer to the steps of the gesture recognition training method in the embodiment shown in fig. 10, and are not described herein again.
Specifically, the gesture recognition model may include, but is not limited to, a Convolutional Neural Network (CNN), and common CNN models may include, but are not limited to: LeNet, AlexNet, ZFNET, VGG, GoogLeNet, Residual Net, DenseNet, SSD-6D and other network model structures now known or developed in the future.
Specifically, the training method may be supervised learning, semi-supervised learning, or unsupervised learning, or a training method developed now or in the future. Taking supervised learning as an example, the model may be trained by using images acquired from each view angle of the target object as a training sample set, and using a view angle type ID corresponding to each sample and a bounding box representing a position of the object in the image as labels.
Further, in one embodiment, as shown in fig. 7, there is provided a training sample set generating method:
step S201, establishing a correspondence between a plurality of view-angle categories and postures of the target object;
the posture space is discretized into N equal view angles, the object posture corresponding to each view angle is computed, and a one-to-one view-angle-category-to-posture lookup table is established.
Step S202, acquiring an image data set corresponding to each view angle, recording the correspondence among the image data set, the view-angle category, and the posture, and generating the annotations of the image data set;
specifically, real sample image data may be acquired by photographing the target object from each view angle, or virtual sample image data may be generated by rendering from a CAD model; the correspondence between the photograph, the view angle, and the posture is recorded.
Step S203, pasting the image data set onto background images to generate an updated image data set;
it should be noted that, when the image data of the target object input into the model includes the target object and a complex real background, step S203 needs to be executed in order to complete the construction of the training sample set.
The photograph of a single object may be randomly pasted onto a background photograph, or a plurality of different objects may be randomly pasted onto one background photograph, as sketched after the following step.
Step S204, taking the updated image data set and the labels as the training sample set;
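A sketch of the pasting operation in step S203 (Python with Pillow; the random placement and the use of an alpha mask are assumptions):

    import random
    from PIL import Image

    def paste_on_background(object_image: Image.Image, background: Image.Image) -> Image.Image:
        """Paste one object photograph at a random position on a background photograph."""
        bg = background.copy()
        max_x = max(bg.width - object_image.width, 0)
        max_y = max(bg.height - object_image.height, 0)
        pos = (random.randint(0, max_x), random.randint(0, max_y))
        mask = object_image.split()[-1] if object_image.mode == "RGBA" else None
        bg.paste(object_image, pos, mask)            # the mask keeps only the foreground pixels
        return bg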
In another embodiment, when the image data of the target object input into the model includes only the target object or the target object and a single background, the generation method of the sample set may include only:
step S201, establishing a correspondence between a plurality of view-angle categories and postures of the target object;
step S202, acquiring an image data set corresponding to each view angle, recording the correspondence among the image data set, the view-angle category, and the posture, and generating the labels of the image data set;
step S204, taking the image data set and the labels as the training sample set.
In one embodiment, as shown in fig. 6, when the recognition result output by the model is the gesture recognition result of the target object, step S103 may be followed by step S111 of optimizing the gesture recognition result of the target object to obtain an optimized posture result. In one embodiment, step S111 of optimizing the posture result of the target object to obtain an optimized posture result may likewise follow the determination of the posture recognition result of the target object in step S107 (the drawing is omitted); and in one embodiment, step S111 may follow the posture recognition result of the target object obtained in step S110 (the drawing is omitted).
In one embodiment, this step S111 may comprise the following method steps:
step S601, acquiring a posture recognition result of a target object;
step S602, according to the gesture recognition result of the target object, calculating the projection of the 3D key point on the 3D model in the image data;
step S603 compares the projection with the position of a key point in the image data to obtain a re-projection error;
step S604, updating the posture recognition result of the target object by taking the minimized reprojection error as a target to obtain a current updating result;
in one embodiment, the problem may be solved with a nonlinear optimization algorithm whose goal is to minimize the reprojection error. Nonlinear optimization algorithms include, but are not limited to, the Newton, Gauss-Newton, and Levenberg-Marquardt methods.
Step S605, replacing the posture recognition result of the target object with the current update result, and repeating steps S602-S604 until the reprojection error is smaller than a preset threshold or the update has been performed a preset number of times, thereby obtaining the optimized posture recognition result of the target object.
In one embodiment, the gesture recognition method further comprises:
step S112, sending the recognition result to a display; or
step S113, sending the gesture recognition result of the target object described in the above embodiments to the display; or
step S114, sending the optimized posture result described in the above embodiments to the display.
It should be understood that although the steps in the flowcharts of figs. 1-10 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in figs. 1-10 may include multiple sub-steps or stages that are not necessarily completed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 11, there is provided a gesture recognition apparatus including:
a target image acquisition module 101, configured to acquire image data of a target object;
a recognition model obtaining module 102, configured to obtain a gesture recognition model;
and the recognition result output module 103 is used for inputting the image data into the gesture recognition model and outputting a recognition result.
In one embodiment, the recognition result is the gesture recognition result of the target object.
In one embodiment, the recognition result is a recognition result of feature information associated with the target object in the image data; further, in one embodiment, as shown in fig. 13, the gesture recognition apparatus further includes:
an identification result obtaining module 106, configured to obtain an identification result;
and the target result generating module 107 is used for determining the gesture recognition result of the target object according to the recognition result.
Further, in an embodiment, as shown in fig. 14, when the recognition result is a prediction graph, the target result generating module 107 may include:
a characteristic information determination unit 1071 for determining characteristic information from the prediction map;
a target result generation unit 1072 for generating a posture recognition result of the target object based on the feature information.
In one embodiment, the recognition result is a first partial gesture recognition result and a preprocessing recognition result of the target object; further, in one embodiment, as shown in fig. 15, the gesture recognition apparatus further includes:
a preprocessing result obtaining module 108, configured to obtain the preprocessing identification result;
the recognition result determining module 109 is configured to determine a second part gesture recognition result of the target object according to the pre-processing recognition result;
and a target result obtaining module 110, configured to use the first part of the gesture recognition result and the second part of the gesture recognition result as a gesture recognition result of the target object.
In one embodiment, as shown in fig. 16, when the gesture recognition result output by the model is a gesture recognition result of the target object, the gesture recognition apparatus further includes: the target result optimization module 111 is used for optimizing the posture recognition result of the target object to obtain an optimization result; in addition, in an embodiment, when the gesture recognition apparatus includes the target result generation module 107, the apparatus may further include a target result optimization module 111 (omitted from the drawings) for optimizing the gesture recognition result of the target object to obtain an optimized result; in addition, in an embodiment, when the gesture recognition apparatus includes the target result obtaining module 110, the apparatus may further include a target result optimizing module 111 (omitted from the drawings) for optimizing the gesture recognition result of the target object to obtain an optimized result.
In one embodiment, as shown in fig. 12, when the image data of the target object includes only the target object or the target object and a single background, the gesture recognition apparatus further includes:
an initial image acquisition module 104, configured to acquire initial image data; wherein the initial image data comprises the target object and a complex background;
and a target image generation module 105, configured to extract the target object from the initial image data and generate image data of the target object that includes only the target object or the target object against a single background.
In one embodiment, the gesture recognition apparatus further includes:
a display module 112 (the drawing is omitted), configured to send the recognition result described in the above embodiments to the display; or
to send the gesture recognition result of the target object to the display; or
to send the optimized posture result to the display.
In one embodiment, as shown in fig. 17, there is provided a posture-recognition training apparatus including:
a training sample obtaining module 201, configured to obtain a training sample set;
an initial model obtaining module 202, configured to obtain an initial model of the gesture recognition model;
The model training module 203 is configured to train the initial model based on the training sample set to obtain the gesture recognition model; the gesture recognition model is used for outputting a recognition result to the input image data of the target object.
For the specific limitations of the gesture recognition devices and the gesture recognition training devices, reference may be made to the limitations of the gesture recognition methods and the gesture recognition training methods, which are not described herein again. All or part of each module in each gesture recognition device and the gesture recognition training device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, as shown in fig. 18, a gesture recognition system is provided that includes a gesture recognition control device 400 and an image sensor 500.
An image sensor 500 for acquiring image data of a target object;
a control device 400 for acquiring image data of a target object; acquiring a gesture recognition model; inputting the image data into a gesture recognition model, and outputting a recognition result; or
In another embodiment, an image sensor 500 for acquiring initial image data including an object and a complex background;
control means 400 for acquiring initial image data: extracting a target object in the initial image data to obtain image data of the target object; acquiring image data of a target object; acquiring a gesture recognition model; and inputting the image data into the gesture recognition model, and outputting a recognition result.
For other relevant descriptions of the control device, reference is made to the above embodiments, and the description is not repeated here.
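For illustration only, the interaction between the image sensor 500 and the control device 400 could be sketched as below; the sensor and model interfaces used here (a capture method, torch.load of a whole serialized model) are assumptions made for this sketch rather than interfaces defined in this application.

import torch

class ControlDevice:
    def __init__(self, model_path, image_sensor):
        # Acquire the gesture recognition model (assumes a whole serialized model was saved).
        self.model = torch.load(model_path)
        self.model.eval()
        self.image_sensor = image_sensor

    def recognize(self):
        # Acquire image data of the target object from the image sensor.
        image = self.image_sensor.capture()   # assumed to return a CHW tensor
        with torch.no_grad():
            recognition_result = self.model(image.unsqueeze(0))
        return recognition_result.squeeze(0)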
In one embodiment, the gesture recognition system may further include a display 600 (not shown in the drawings).
The control device 400 is further configured to send the recognition result, the gesture recognition result of the target object, or the optimized gesture result to the display 600;
and the display 600 is configured to display the recognition result, the gesture recognition result of the target object, or the optimized gesture result.
The control device 400 may be a programmable logic controller (PLC), a field-programmable gate array (FPGA), a personal computer (PC), an industrial personal computer (IPC), a server, or the like. The control device generates the corresponding program instructions according to a pre-stored program, in combination with manually input information or parameters and/or the data collected by the external image sensor.
For the specific limitations of the above control devices, reference may be made to the limitations of the gesture recognition method, the model training method, and the sample generation method, which are not described herein again.
In one embodiment, as shown in fig. 20, a computer device is provided, which includes a memory and a processor; the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the above gesture recognition method or the gesture recognition training method.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the above gesture recognition method or the gesture recognition training method are implemented.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, the program can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that the above control devices and/or sensors may be real control devices and/or sensors in a real environment, or virtual control devices and/or sensors on a simulation platform that reproduce, through the simulation environment, the effect of being connected to real control devices and/or sensors. A control device that has completed behavior training in the virtual environment can then be transplanted to the real environment to control, or be retrained with, the real control devices and/or sensors, which saves the resources and time of the training process.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction among these combinations of technical features, they should be considered within the scope of this specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The terms "first," "second," "third," "S101," "S102," "S103," and the like in the claims and in the description and drawings above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover non-exclusive inclusions. For example: a process, method, system, article, or robot that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but includes other steps or modules not explicitly listed or inherent to such process, method, system, article, or robot.
It should be noted that the embodiments described in the specification are preferred embodiments, and the structures and modules involved are not necessarily essential to the invention, as will be understood by those skilled in the art.
The above embodiments express only several implementations of the present invention, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (22)

1. A gesture recognition method, characterized in that the gesture recognition method comprises:
acquiring image data of a target object;
acquiring a gesture recognition model;
and inputting the image data into the gesture recognition model, and outputting a recognition result.
2. The gesture recognition method according to claim 1, wherein the recognition result is a gesture recognition result of the target object.
3. The gesture recognition method according to claim 1, wherein the recognition result is a recognition result of feature information associated with an object in the image data.
4. The gesture recognition method according to claim 3, wherein the feature information is: keypoints and/or keypoint lines.
5. The gesture recognition method according to claim 3, further comprising, after outputting the recognition result:
acquiring the identification result;
and generating a posture recognition result of the target object according to the recognition result.
6. The gesture recognition method according to claim 1, wherein the recognition result is a first partial gesture recognition result and a pre-processing recognition result of the target object.
7. The gesture recognition method according to claim 6, further comprising, after outputting the recognition result:
acquiring the preprocessing identification result;
determining a second part of gesture recognition results of the target object according to the preprocessing recognition results;
and taking the first part of gesture recognition result and the second part of gesture recognition result as the gesture recognition result of the target object.
8. The gesture recognition method according to claim 2, 5 or 7, further comprising:
and optimizing the posture recognition result of the target object to obtain an optimized result.
9. The gesture recognition method according to any one of claims 1 to 7, characterized in that the image data of the target object includes only the target object or a single background;
the acquiring of the image data of the target object further comprises:
acquiring initial image data; wherein the initial image data comprises the object and a complex background;
and extracting the target object in the initial image data, and generating the image data of the target object only comprising the target object or a single background.
10. A gesture recognition training method, characterized by comprising the following steps:
acquiring a training sample set;
acquiring an initial model of the gesture recognition model;
training the initial model based on the training sample set to obtain the gesture recognition model; the gesture recognition model is used for outputting a recognition result for the input image data of the target object.
11. The gesture recognition training method according to claim 10, wherein the recognition result is: a gesture recognition result of the target object; the identification result of the feature information associated with the target object in the image data; or the first part gesture recognition result and the preprocessing recognition result of the target object.
12. A gesture recognition apparatus, characterized in that the gesture recognition apparatus comprises:
the target image acquisition module is used for acquiring image data of a target object;
the recognition model acquisition module is used for acquiring a gesture recognition model;
and the recognition result output module is used for inputting the image data into the gesture recognition model and outputting a recognition result.
13. The gesture recognition apparatus according to claim 12, characterized in that the recognition result is a gesture recognition result of the target object.
14. The gesture recognition apparatus according to claim 12, wherein the recognition result is a recognition result of feature information associated with the object in the image data;
the gesture recognition apparatus further includes:
the identification result acquisition module is used for acquiring the identification result;
and the target result generation module is used for generating a posture recognition result of the target object according to the recognition result.
15. The gesture recognition apparatus according to claim 12, wherein the recognition result is a first partial gesture recognition result and a pre-processing recognition result of the target object;
the gesture recognition apparatus further includes:
the preprocessing result acquisition module is used for acquiring the preprocessing identification result;
the recognition result determining module is used for determining a second part gesture recognition result of the target object according to the preprocessing recognition result;
and the target result obtaining module is used for taking the first part of gesture recognition results and the second part of gesture recognition results as gesture recognition results of the target object.
16. The gesture recognition apparatus according to any one of claims 13 to 15, further comprising:
and the target result optimization module is used for optimizing the posture recognition result of the target object to obtain an optimization result.
17. The gesture recognition apparatus according to any one of claims 12 to 15, wherein the image data of the target object includes only the target object or a single background;
the gesture recognition apparatus further includes:
the initial image acquisition module is used for acquiring initial image data; wherein the initial image data comprises the target object and a complex background;
and the target image generation module is used for extracting the target object from the initial image data and generating image data of the target object only comprising the target object or a single background.
18. A gesture recognition training apparatus, characterized by comprising:
the training sample acquisition module is used for acquiring a training sample set;
the initial model acquisition module is used for acquiring an initial model of the gesture recognition model;
the model training module is used for training the initial model based on the training sample set to obtain the gesture recognition model; the gesture recognition model is used for outputting a recognition result for the input image data of the target object.
19. A gesture recognition system, characterized in that the gesture recognition system comprises an image sensor and a control device;
the image sensor is used for acquiring image data of a target object;
the control device is used for acquiring the image data of the target object; acquiring a gesture recognition model; inputting the image data of the target object into the gesture recognition model, and outputting a recognition result; or
The image sensor is used for acquiring initial image data comprising the target object and a complex background;
the control device is configured to acquire the initial image data; extract the target object from the initial image data, and generate image data of the target object only including the target object or a single background; acquire the image data of the target object; acquire a gesture recognition model; and input the image data of the target object into the gesture recognition model and output a recognition result.
20. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the gesture recognition method of any one of claims 1-9 when executing the computer program; and/or the gesture recognition training method of claim 10 or 11.
21. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the gesture recognition method according to any one of claims 1 to 9; and/or the gesture recognition training method of claim 10 or 11.
22. A method for generating a training sample set according to claim 10, wherein the method for generating the training sample set comprises:
establishing a corresponding relation between a plurality of visual angle categories and the postures of the target object;
acquiring an image data set corresponding to each visual angle, recording the corresponding relation among the image data set, the visual angle category and the posture, and generating a label of the image data set;
pasting the image data set to a background image to generate an updated image data set;
taking the updated image data set and the label as the training sample set; or
Establishing a corresponding relation between a plurality of visual angle categories and the postures of the target object;
acquiring an image data set corresponding to each visual angle, recording the corresponding relation among the image data set, the visual angle category and the posture, and generating a label of the image data set;
taking the image data set and the label as the training sample set; or
Acquiring an image data set of a target object; wherein each image data in the image data set corresponds to a pose of a target object;
acquiring a 3D model of a target object including 3D feature information;
according to the gesture corresponding to each image data, projecting the 3D feature information to the corresponding image data to generate a label of the 2D feature information;
taking the image data set and the label as the training sample set; or
Acquiring an image data set of a target object; wherein each image data in the image data set corresponds to a target object pose;
acquiring a 3D model of a target object including 3D feature information;
projecting the 3D feature information to the corresponding image data according to the posture corresponding to each image data to generate 2D feature information;
generating a label of the prediction map according to the 2D feature information;
and taking the image data set and the label as the training sample set.
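For illustration only (this sketch is not part of the claims), the 3D-to-2D projection used in the sample-generation method of claim 22 to produce 2D feature labels from a 3D model and per-image poses could look as follows; the pinhole intrinsic matrix K and the (R, t) pose convention are assumptions made for this sketch.

import numpy as np

def generate_2d_labels(keypoints_3d, poses, K):
    # Project the 3D feature points into each image using that image's pose (R, t).
    labels = []
    for R, t in poses:                        # one pose per image in the image data set
        cam = keypoints_3d @ R.T + t          # transform to the camera frame
        uv = cam @ K.T                        # apply the camera intrinsics
        labels.append(uv[:, :2] / uv[:, 2:3])   # 2D feature label for this image
    return labels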