CN111368594B - Method and device for detecting key points

Info

Publication number: CN111368594B
Application number: CN201811598853.4A
Authority: CN
Inventors: 丁圣勇; 樊勇兵; 陈楠; 黄志兰
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2018-12-26
Filing date: 2018-12-26
Publication date: 2023-07-18
Anticipated expiration: 2038-12-26
Also published as: CN111368594A

Abstract

The disclosure provides a method and a device for detecting key points, and relates to the technical field of artificial intelligence. The method comprises: inputting a picture containing at least one target object into a convolutional network, wherein each target object has a plurality of predefined key points and a set of connection relations among the key points, and no connection relation exists between key points of different target objects; outputting a plurality of first feature maps and a plurality of second feature maps by using the convolutional network, wherein each first feature map corresponds to one type of key point of the at least one target object, and each second feature map corresponds to one type of connection relation of the at least one target object; and grouping the key points in the picture using the plurality of first feature maps and the plurality of second feature maps such that, for each target object, each key point of the target object is combined with the connected next key point having the largest response value. The present disclosure enables detection of the key points of a target object.

Description

Method and device for detecting key points

Technical Field

The present disclosure relates to the field of artificial intelligence, and in particular, to a method and apparatus for detecting a keypoint.

Background

Key point detection means detecting the positions of key parts of an object in a picture, for example human skeletal points. These skeletal key points include the head, neck, shoulders, hips, knees, elbows, ankles, and the like. Fig. 1 is a schematic diagram illustrating detection of human skeletal key points, according to some embodiments. As shown in Fig. 1, the box represents the detected position of the whole human body, the solid points represent the corresponding skeletal points of the human body, and the solid lines connect the skeletal points to represent the human body posture.

Key point (e.g., human skeletal key point) detection is typically accomplished with a top-down approach: the bounding box of the entire object is detected first, and the individual parts are then located within that box. When a picture contains several people who are close to one another and whose postures vary greatly, this approach cannot guarantee that the box contains all skeletal key points of one person, and the detection box may also contain skeletal key points of other people, as shown in Fig. 1.

Disclosure of Invention

One technical problem solved by the present disclosure is: a method for detecting keypoints is provided to detect keypoints of different target objects.

According to one aspect of an embodiment of the present disclosure, there is provided a method for detecting a keypoint, comprising: inputting a picture containing at least one target object into a convolutional network, wherein each target object has a plurality of predefined key points and a group of connection relations among the key points, and the connection relations among the key points of different target objects do not exist; outputting a plurality of first feature maps and a plurality of second feature maps by using the convolution network, wherein each first feature map corresponds to a type of key point of the at least one target object, and each second feature map corresponds to a type of connection relation of the at least one target object; and grouping keypoints in the picture by using the plurality of first feature maps and the plurality of second feature maps, so that each keypoint of the target object is combined with the next keypoint with the largest response value connected with the keypoint for each target object to detect all the keypoints of the target object.

In some embodiments, each first feature map is represented by a first matrix and each second feature map is represented by a second matrix.

In some embodiments, grouping the keypoints in the picture such that, for each target object, each keypoint of the target object is combined with the connected next keypoint having the largest response value comprises: for a target object in the picture, determining a key point of the target object as a base key point; obtaining, based on the connection relation type between the base key point and the next key point, the first feature map containing all the next key points, and finding all the next key points in that first feature map; obtaining, in the second feature map containing the connection relation between the base key point and the next key point, the connecting lines between the base key point and each key point found; calculating, on the second matrix corresponding to the second feature map, the sum of all element values corresponding to each connecting line; determining, according to the sum of element values corresponding to each connecting line, the connecting line that belongs to the same target object as the base key point; and obtaining, according to the determined connecting line, the next key point with the largest response value connected to the base key point.

In some embodiments, in the second matrix corresponding to a second feature map, all element values corresponding to the connecting lines that have the connection relation of that second feature map are non-zero values, and all other element values of the second matrix are 0. In this case, the step of determining, according to the sum of the element values corresponding to each connecting line, the connecting line that belongs to the same target object as the base key point includes: determining the connecting line with the largest sum of element values as the connecting line that belongs to the same target object as the base key point; and, in the case that the non-zero value is a positive value, the response value is the sum of all element values corresponding to the respective connecting line.

In some embodiments, in the second matrix corresponding to a second feature map, all element values corresponding to the connecting lines that have the connection relation of that second feature map are 0, and all other element values of the second matrix are non-zero values. In this case, the step of determining, according to the sum of the element values corresponding to each connecting line, the connecting line that belongs to the same target object as the base key point includes: determining the connecting line with the smallest sum of element values as the connecting line that belongs to the same target object as the base key point; and, in the case that the non-zero value is a positive value, the response value is the negative of the sum of all element values corresponding to the respective connecting line.

In some embodiments, in a first matrix corresponding to a first feature map, element values corresponding to key points of the first feature map are non-zero values, and other element values of the first matrix are all 0; or in the first matrix corresponding to the first feature map, the element value corresponding to the key point of the first feature map is 0, and the other element values of the first matrix are all non-zero values.

In some embodiments, each keypoint is represented in the picture by its coordinates and a state, wherein the state comprises: visible state, invisible state, and partially visible state.

In some embodiments, prior to inputting the picture containing the at least one target object into the convolutional network, the method further comprises: a picture containing a target object with a known keypoint is input into the convolutional network to train the convolutional network.

According to another aspect of an embodiment of the present disclosure, there is provided an apparatus for detecting a keypoint, including: an input unit for inputting a picture containing at least one target object into a convolutional network, wherein each target object has a plurality of predefined key points and a set of connection relations among the plurality of key points, and the connection relations among the key points of different target objects do not exist; the output unit is used for outputting a plurality of first feature maps and a plurality of second feature maps by utilizing the convolution network, wherein each first feature map corresponds to a key point of the at least one target object, and each second feature map corresponds to a connection relation of the at least one target object; and a grouping unit configured to group the keypoints in the picture using the plurality of first feature maps and the plurality of second feature maps, so that, for each target object, each keypoint of the target object is combined with a next keypoint having a largest response value connected to the keypoint to detect all the keypoints of the target object.

In some embodiments, the grouping unit is configured to determine, for a target object in the picture, a key point of the target object as a base key point, obtain a first feature map including all next key points based on a connection type between the base key point and the next key point, find all the next key points in the first feature map, obtain, in a second feature map including a connection relationship between the base key point and the next key point, a connection line between the base key point and each found key point, respectively, calculate, on a second matrix corresponding to the second feature map, a sum of all element values corresponding to each connection line, determine, according to the sum of element values corresponding to each connection line, a connection line that belongs to the same target object as the base key point, and obtain, according to the determined connection line, a next key point with a maximum response value that is connected to the base key point.

In some embodiments, in the second matrix corresponding to the second feature map, all element values corresponding to the connecting lines that have the connection relation of that second feature map are non-zero values and all other element values of the second matrix are 0, and the grouping unit is configured to determine the connecting line with the largest sum of element values as the connecting line that belongs to the same target object as the base key point; in the case that the non-zero value is a positive value, the response value is the sum of all element values corresponding to the respective connecting line.

In some embodiments, in the second matrix corresponding to the second feature map, all element values corresponding to the connecting lines that have the connection relation of that second feature map are 0 and all other element values of the second matrix are non-zero values, and the grouping unit is configured to determine the connecting line with the smallest sum of element values as the connecting line that belongs to the same target object as the base key point; in the case that the non-zero value is a positive value, the response value is the negative of the sum of all element values corresponding to the respective connecting line.

In some embodiments, the input unit is further for inputting a picture containing a target object with a known keypoint into the convolutional network to train the convolutional network.

According to another aspect of an embodiment of the present disclosure, there is provided an apparatus for detecting a keypoint, including: a memory; and a processor coupled to the memory, the processor configured to perform the method as described above based on instructions stored in the memory.

According to another aspect of the disclosed embodiments, there is provided a computer readable storage medium having stored thereon computer program instructions which when executed by a processor implement the steps of the method as previously described.

According to the method, the connection relation among the key points is predicted through the convolution network, and correct grouping of the key points of different target objects is achieved by utilizing the connection relation, so that the key points of different target objects are detected. Moreover, the method can ensure that each target object correctly contains all the key points of the target object, and the situation that the detection frame of one target object contains the key points of other target objects in the prior art can not occur.

Other features of the present disclosure and its advantages will become apparent from the following detailed description of exemplary embodiments of the disclosure, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.

The disclosure may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a schematic diagram illustrating detection of human skeletal key points, according to some embodiments;

FIG. 2 is a flow chart illustrating a method for detecting keypoints according to some embodiments of the disclosure;

FIG. 3 is a schematic diagram illustrating human skeletal key points, according to some embodiments of the present disclosure;

FIG. 4 is a schematic diagram illustrating outputting a second feature map according to some embodiments of the present disclosure;

FIG. 5 is a flow chart illustrating a method of grouping keypoints in a picture according to some embodiments of the disclosure;

FIGS. 6A-6D are schematic diagrams illustrating human skeletal keypoints at several stages in the process of grouping keypoints in pictures, according to some embodiments of the present disclosure;

FIG. 7 is a schematic diagram illustrating a structure of an apparatus for detecting keypoints according to some embodiments of the disclosure;

FIG. 8 is a schematic diagram illustrating a structure of an apparatus for detecting keypoints according to further embodiments of the disclosure;

FIG. 9 is a schematic diagram illustrating a structure of an apparatus for detecting keypoints according to other embodiments of the present disclosure.

Detailed Description

Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.

Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.

Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.

Fig. 2 is a flow chart illustrating a method for detecting keypoints according to some embodiments of the disclosure. As shown in Fig. 2, the method may include steps S202 to S206.

In step S202, a picture including at least one target object is input into a convolutional network (also referred to as a convolutional neural network, abbreviated as CNN), wherein each target object has a predefined plurality of key points and a set of connection relations between the plurality of key points, and no connection relation exists between the key points of different target objects.

In some embodiments, a plurality of keypoints and a set of connection relationships between the plurality of keypoints may be predefined for a target object. There is no connection relationship between key points of different target objects. The key points are a plurality of key points to be detected.

Skeletal key points are used as an example below.

K skeletal keypoints may be defined, and a set of connection relationships is defined over these skeletal keypoints. A connection relationship exists only within the same target object; no connection relationship exists between key points of different target objects. For example, there is a connection between the right shoulder and the right elbow of the same person (right shoulder→right elbow), but there is no connection between the right shoulder of one person and the right elbow of another person.

For example, skeletal keypoints defined by AIChallenge (artificial intelligence challenge) may be used, for a total of 14. Table 1 shows key points of human bones and corresponding serial numbers.

TABLE 1 human skeleton key points and corresponding sequence numbers

1/right shoulder	2/right elbow	3/right wrist	4/left shoulder	5/left elbow
6/left wrist	7/right hip	8/right knee	9/right ankle	10/left hip
11/left knee	12/left ankle	13/crown of the head	14/neck
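As an illustration only, the numbering in Table 1 could be held in an enumeration such as the one below. The names `SkeletonKeypoint` and `CONNECTIONS` are hypothetical, and only the connection relation that the text names at this point (right shoulder→right elbow) is listed; the full connection set is not reproduced here.

```python
from enum import IntEnum


class SkeletonKeypoint(IntEnum):
    """Sequence numbers of the 14 skeletal key points listed in Table 1."""
    RIGHT_SHOULDER = 1
    RIGHT_ELBOW = 2
    RIGHT_WRIST = 3
    LEFT_SHOULDER = 4
    LEFT_ELBOW = 5
    LEFT_WRIST = 6
    RIGHT_HIP = 7
    RIGHT_KNEE = 8
    RIGHT_ANKLE = 9
    LEFT_HIP = 10
    LEFT_KNEE = 11
    LEFT_ANKLE = 12
    CROWN_OF_HEAD = 13
    NECK = 14


# Connection relations as directed pairs (from key point, to key point).
# Only the relation named in the text is listed; the rest is left open.
CONNECTIONS = [
    (SkeletonKeypoint.RIGHT_SHOULDER, SkeletonKeypoint.RIGHT_ELBOW),
]
```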

In some embodiments, each keypoint may be represented in a picture using its coordinates and state.

For example, each person is represented as $p = [(x_1, y_1, v_1), (x_2, y_2, v_2), \ldots, (x_n, y_n, v_n)]$, where $(x_i, y_i)$ represents the coordinate position of the $i$-th human skeletal key point in the picture, $v_i$ represents the state of that key point in the picture, $1 \le i \le n$, and both $i$ and $n$ are positive integers. For these key points, a specific set of connection relationships is also defined, such as (right shoulder → right elbow).

In some embodiments, the state may include: visible state, invisible state, and partially visible state. Here, the visible state indicates that the current keypoint is visible, the invisible state indicates that the current keypoint is occluded and cannot be seen, and the partially visible state indicates that the current keypoint is partially occluded and only partly visible. Recording the state of each keypoint prevents invisible or partially visible keypoints from being missed.
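A minimal sketch of this representation, assuming a person is stored as a list of (x, y, v) records. The class names `State` and `Keypoint`, the helper `make_person`, and the sample coordinates are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class State(Enum):
    """The three key point states named in the text."""
    VISIBLE = "visible"
    INVISIBLE = "invisible"
    PARTIALLY_VISIBLE = "partially visible"


@dataclass(frozen=True)
class Keypoint:
    x: float  # horizontal coordinate in the picture
    y: float  # vertical coordinate in the picture
    v: State  # state of the key point


def make_person(triples) -> List[Keypoint]:
    """Build p = [(x_1, y_1, v_1), ..., (x_n, y_n, v_n)] from raw triples."""
    return [Keypoint(x, y, v) for x, y, v in triples]


# Made-up sample values, for illustration only.
person = make_person([
    (120.0, 80.0, State.VISIBLE),
    (135.0, 140.0, State.PARTIALLY_VISIBLE),
    (150.0, 190.0, State.INVISIBLE),
])

# Occluded key points stay in the annotation instead of being dropped.
occluded = [k for k in person if k.v is not State.VISIBLE]
```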

In step S204, a plurality of first feature maps and a plurality of second feature maps are output by using the convolutional network, where each first feature map corresponds to a type of key point of the at least one target object, and each second feature map corresponds to a type of connection relationship of the at least one target object.

For example, the convolutional network may output k first feature maps, k being a positive integer. Each first feature map corresponds to a class of keypoints of the same type of the at least one target object (e.g. a plurality of target objects).

Taking skeletal key points as an example, each first feature map corresponds to the same type of skeletal key point of the multiple persons in the picture. Combining all the feature maps completes the detection of all skeletal key points of the multiple persons in one picture.

For example, based on the 14 human skeletal keypoints described above, 14 first feature maps may be generated by the convolutional network, each first feature map representing one type of skeletal keypoint for a plurality of persons. For example, one first feature map represents the right shoulder keypoints of a plurality of persons, another first feature map represents the right elbow keypoints of a plurality of persons, and so on.

In some embodiments, each first feature map is represented using a first matrix.

In some embodiments, in the first matrix corresponding to the first feature map, the element values corresponding to the keypoints of the first feature map are non-zero values, and the other element values of the first matrix (i.e., the element values corresponding to the positions other than the keypoints) are all 0.

For example, if a skeletal key point of type m occurs at position $(x_i, y_i)$ in the picture, then the element of the first matrix at the corresponding position $(x_i', y_i')$ is a non-zero value (e.g., 1), and the elements at the other positions of the first matrix are 0. The position correspondence can be derived by simple scaling. All feature maps are combined to form one output map showing all skeletal key points. Fig. 3 is a schematic diagram illustrating human skeletal key points, according to some embodiments of the present disclosure. Fig. 3 shows all skeletal key points of two persons generated by a convolutional network.
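The encoding of a first matrix can be sketched as follows. This is a minimal illustration that assumes the "simple scaling" is a proportional mapping from picture coordinates to matrix indices; the function name `make_first_matrix`, the sizes, and the coordinates are hypothetical.

```python
def make_first_matrix(points, image_size, map_size, on_value=1.0):
    """Mark the key points of one type in a matrix that is otherwise all 0.

    points:     (x, y) picture coordinates of all key points of one type
    image_size: (width, height) of the picture
    map_size:   (width, height) of the first feature map
    """
    img_w, img_h = image_size
    map_w, map_h = map_size
    matrix = [[0.0] * map_w for _ in range(map_h)]
    for x, y in points:
        # Assumed scaling: proportional mapping, clamped to the last cell.
        col = min(int(x * map_w / img_w), map_w - 1)
        row = min(int(y * map_h / img_h), map_h - 1)
        matrix[row][col] = on_value
    return matrix


# Right-shoulder positions of two persons in a 400x200 picture (made up).
right_shoulders = make_first_matrix([(100, 50), (300, 60)], (400, 200), (8, 4))
```

The alternative encoding described in the text (0 at key points, non-zero elsewhere) would swap the two values.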

In other embodiments, in the first matrix corresponding to the first feature map, the element value corresponding to the keypoint of the first feature map is 0, and the other element values of the first matrix (i.e., the element values corresponding to the positions other than the keypoint) are all non-zero values.

In some embodiments, each second feature map is represented using a second matrix.

In some embodiments, in the second matrix corresponding to a second feature map, all element values corresponding to the connecting lines that have the connection relation of that second feature map are non-zero values (the non-zero values may be positive, for example 1); that is, every pixel of a connecting line corresponds to an element of the second matrix, and all of those elements are non-zero. All other element values of the second matrix (i.e., those at positions off the connecting lines) are 0. Such a second matrix may be referred to as a first type of second matrix. In this embodiment, the convolutional network may generate a plurality of predefined second feature maps, each corresponding to one connection relation, so that a larger output is produced on the connecting lines (also called edges) reflecting that connection relation and the output at other positions is zero. Such a second feature map may also be referred to as an edge map.

For example, for each connection relation, taking "right shoulder→right elbow" as an example, the convolutional network outputs one second feature map. The response is large on the line between the right shoulder and the right elbow of the same target object, and small elsewhere.

Fig. 4 is a schematic diagram illustrating outputting a second feature map according to some embodiments of the present disclosure. Fig. 4 shows the second feature map generated for the "right shoulder→right elbow" relation of two persons after the picture passes through the convolutional network. In the second feature map, the two white lines correspond to the "right shoulder→right elbow" connection relations of the two persons, respectively. For example, all the element values of the second matrix (not shown in the figure) that correspond to the pixels of the two white lines are non-zero values, and the element values at the other positions of the second matrix are all 0 (corresponding to the black region of the second feature map).

In other embodiments, in the second matrix corresponding to a second feature map, all element values corresponding to the connecting lines that have the connection relation of that second feature map are 0 (i.e., every pixel of a connecting line corresponds to an element of the second matrix, and all of those elements are 0), and all other element values of the second matrix are non-zero values (the non-zero values may be positive, such as 1). Such a second matrix may be referred to as a second type of second matrix.
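The two encodings of the second matrix can be sketched as follows. The disclosure does not say how the pixels of a connecting line are enumerated, so the evenly spaced sampling in `line_pixels` is an assumption; both function names and all coordinates are hypothetical.

```python
def line_pixels(start, end):
    """Return the (row, col) cells of a straight line from start to end.

    Assumed rasterization: one sample per step along the longer axis.
    """
    (r0, c0), (r1, c1) = start, end
    steps = max(abs(r1 - r0), abs(c1 - c0))
    if steps == 0:
        return [(r0, c0)]
    return [
        (round(r0 + (r1 - r0) * i / steps), round(c0 + (c1 - c0) * i / steps))
        for i in range(steps + 1)
    ]


def make_second_matrix(lines, map_size, line_value, background_value):
    """Write line_value on every connecting line, background_value elsewhere."""
    width, height = map_size
    matrix = [[background_value] * width for _ in range(height)]
    for start, end in lines:
        for row, col in line_pixels(start, end):
            matrix[row][col] = line_value
    return matrix


# "Right shoulder -> right elbow" lines of two persons (made-up coordinates).
lines = [((1, 1), (4, 1)), ((1, 5), (3, 7))]
first_type = make_second_matrix(lines, (8, 6), line_value=1, background_value=0)
second_type = make_second_matrix(lines, (8, 6), line_value=0, background_value=1)
```

In `first_type` the lines play the role of the white lines in Fig. 4; `second_type` is its complement.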

Returning to fig. 2, in step S206, the keypoints in the picture are grouped using the plurality of first feature maps and the plurality of second feature maps, so that, for each target object, each keypoint of the target object is combined with the next keypoint having the largest response value connected to the keypoint to detect all keypoints of the target object.

In some embodiments, with the plurality of first feature maps and the plurality of second feature maps, the different keypoints may be grouped using a greedy algorithm, i.e., a keypoint is combined with the next keypoint to which it is most likely connected. This likelihood is given by a response value. For example, the response value may be the sum of all the element values corresponding to the respective connecting line, or the negative of that sum. It should be noted that the response value may be regarded as the response value of a connecting line, or as the response value of the connection between the key points at the two ends of the connecting line (i.e., of the connection action or connection state between the two key points).

For example, the connecting lines between the initial key point and the key points that may have a connection relation with it may be determined first, and the connecting line with the largest response value is then selected from these candidates; the key point at the other end of that connecting line is the next key point having a connection relation with the initial key point. For example, in the picture, the head skeletal key point of a person is taken as the initial key point and the neck skeletal key point is preset as the second key point. The neck skeletal key points of all persons are found from the first feature map, each of them is connected to the head skeletal key point of that person in the second feature map, and the key point corresponding to the connecting line with the largest response value is selected as the second key point of that person.
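The greedy selection can be sketched as below. This is a simplified illustration, not the procedure of the disclosure: it assumes the key point types are visited along a single chain, and it takes the response value as a caller-supplied function. The name `group_one_object`, the chain "neck → right_shoulder", and the toy response table are all hypothetical.

```python
def group_one_object(base, chain, candidates, response):
    """Greedily collect the key points of one target object.

    base:       position of the initial key point
    chain:      key point types to visit after the base, in order
    candidates: dict mapping a type to all detected positions of that type
    response:   callable(current, candidate, next_type) -> response value
    """
    group = [base]
    current = base
    for next_type in chain:
        # Pick the candidate whose connection has the largest response.
        best = max(
            candidates[next_type],
            key=lambda point: response(current, point, next_type),
        )
        group.append(best)
        current = best
    return group


# Two persons in the picture; positions and response values are made up.
candidates = {
    "neck": [(4, 2), (4, 9)],
    "right_shoulder": [(5, 1), (5, 8)],
}
toy_responses = {
    ((2, 2), (4, 2)): 3.0, ((2, 2), (4, 9)): 0.0,
    ((4, 2), (5, 1)): 2.0, ((4, 2), (5, 8)): 0.0,
}


def toy_response(current, candidate, next_type):
    return toy_responses[(current, candidate)]


group = group_one_object((2, 2), ["neck", "right_shoulder"], candidates, toy_response)
```

In practice `response` would be computed from the second feature map of the corresponding connection relation rather than looked up in a table.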

For example, for the first type of second matrix, in the case where the non-zero value is a positive value, the response value is the sum of all the element values corresponding to the respective connecting line, i.e., response value = sum of the element values. In other embodiments, for the first type of second matrix, in the case where the non-zero value is a negative value, the response value is the negative of the sum of all the element values corresponding to the respective connecting line.

For another example, for the second type of second matrix, in the case where the non-zero value is a positive value, the response value is the negative of the sum of all the element values corresponding to the respective connecting line, i.e., response value = -(sum of the element values). In other embodiments, for the second type of second matrix, in the case where the non-zero value is a negative value, the response value is the sum of all the element values corresponding to the respective connecting line.

In the above step, by grouping the different keypoints, all the keypoints of each target object form a group of their own. Thus, the key points are divided into as many groups as there are target objects.
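The four cases above reduce to a single sign rule, sketched here. The function name `response_value` and its parameter names are hypothetical.

```python
def response_value(element_sum, line_is_nonzero, nonzero_is_positive=True):
    """Convert the element sum along a connecting line into a response value.

    line_is_nonzero=True:  first type of second matrix (lines non-zero, rest 0)
    line_is_nonzero=False: second type of second matrix (lines 0, rest non-zero)
    nonzero_is_positive:   sign of the non-zero values in the matrix
    """
    if line_is_nonzero == nonzero_is_positive:
        # First type with positive values, or second type with negative values.
        return element_sum
    # First type with negative values, or second type with positive values.
    return -element_sum
```

With this rule a larger response value always indicates a more likely connection, whichever encoding is used.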

Thus far, methods for detecting keypoints according to some embodiments of the present disclosure have been described. In the method, a picture containing at least one target object is input into a convolutional network. Each target object has a predefined plurality of keypoints and a set of connection relationships between the plurality of keypoints. There is no connection relationship between key points of different target objects. The convolutional network is used to output a plurality of first feature maps and a plurality of second feature maps. Each first feature map corresponds to a class of key points of the at least one target object, and each second feature map corresponds to a class of connection relationships of the at least one target object. The key points in the picture are grouped by using the first feature maps and the second feature maps, so that, for each target object, each key point of the target object is combined with the connected next key point having the largest response value, thereby detecting all the key points of the target object. In this method, the connection relations among the key points are predicted by the convolutional network, and the key points of different target objects are correctly grouped by using these connection relations, so that the key points of different target objects are detected (or located). In addition, the method is more robust to deformation. Here, robustness to deformation means robustness to changes in posture, i.e., the ability to handle different postures.

The method can ensure that each target object correctly contains all key points of the target object, and the situation that the detection frame of one target object contains the key points of other target objects as in the prior art does not occur.

In some embodiments, prior to step S202, the method may further comprise: a picture containing a target object with a known keypoint is input into the convolutional network to train the convolutional network. That is, the convolutional network may be trained by inputting pictures (for which the key points and connection relationships of the target object are known) into the convolutional network.

Fig. 5 is a flow chart illustrating a method of grouping keypoints in a picture according to some embodiments of the disclosure. The method describes a specific implementation of step S206 in Fig. 2. As shown in Fig. 5, the method may include steps S502 to S512. The method is described taking the detection of the key points of one target object as an example.

In step S502, for a given target object in the picture, a key point of the target object is determined as a base key point. For example, the base keypoint may be an initial keypoint.

For example, for a person in a picture, the head skeletal key point of the person is determined as the base key point.

In step S504, a first feature map including all the next key points is obtained based on the connection relationship type between the basic key point and the next key point, and all the next key points are found in the first feature map.

It should be noted that, although no connection relation exists between the key points of different target objects, a line drawn between key points of different target objects may still be of the same connection relation type. For example, in a picture containing a plurality of persons, the head skeletal key point of one person and the neck skeletal key point of the same person form a connection relation, whereas the head skeletal key point of that person and the neck skeletal key point of another person do not; nevertheless, a line from the head skeletal key point of that person to the neck skeletal key point of the other person is of the connection relation type "head skeletal key point→neck skeletal key point".

In this way, in the case that a plurality of target objects (for example, a plurality of persons) exist in the picture, a first feature map containing all the next key points can be obtained based on the connection relationship type between the basic key point and the next key point, and all the next key points can be found in the first feature map.
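As a minimal sketch of the lookup in step S504, assuming the embodiment in which key point positions carry non-zero (positive) values in the first matrix and every other element is 0, the candidate next key points could be collected by scanning that matrix. The function name `find_keypoints`, the `threshold` parameter, and the sample values are illustrative assumptions and are not part of the disclosure.

```python
def find_keypoints(first_matrix, threshold=0.0):
    """Return (row, col) of every element whose value exceeds `threshold`.

    Assumes the encoding in which key points are positive and all other
    elements are 0. `threshold` is a hypothetical knob for noisy network
    outputs; 0.0 reproduces the plain "non-zero" rule for positive values.
    """
    return [
        (r, c)
        for r, row in enumerate(first_matrix)
        for c, value in enumerate(row)
        if value > threshold
    ]


# Hypothetical first matrix of the "neck" key point type for a picture
# containing two persons: one neck at (1, 1), another at (2, 3).
neck_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8],
]
neck_candidates = find_keypoints(neck_map)
print(neck_candidates)  # [(1, 1), (2, 3)]
```

Because one first feature map covers a whole key point type, the scan returns the neck key points of every person in the picture, which is why the later steps are needed to decide which candidate belongs to the same target object.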

In step S506, in the second feature map including the connection relationship between the basic key point and the next key point, the connection between the basic key point and each of the found key points is obtained.

That is, a second feature map containing the connection relationship between the basic key point and the next key point is found, and in this second feature map the basic key point is connected with each of the found key points, yielding a plurality of connecting lines. For example, connecting lines between the head bone key point of a person and the neck bone key points of all persons can be obtained.

In step S508, the sum of all the element values corresponding to each connection line is calculated on the second matrix corresponding to the second feature map.

For example, the plurality of connection lines obtained above respectively correspond to some element values of the second matrix corresponding to the second feature map, and the sum of all the element values corresponding to each connection line is calculated respectively.
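Steps S506 to S508 can be sketched as follows. The disclosure does not state how a connecting line is mapped onto matrix elements, so this sketch assumes integer (row, column) coordinates and uses Bresenham rasterization as one possible choice. The names `line_cells`, `line_sums`, and `head_neck_map` are hypothetical.

```python
def line_cells(start, end):
    """Grid cells (row, col) on the segment start -> end (Bresenham)."""
    (r0, c0), (r1, c1) = start, end
    dr, dc = abs(r1 - r0), abs(c1 - c0)
    step_r = 1 if r1 >= r0 else -1
    step_c = 1 if c1 >= c0 else -1
    err = dr - dc
    r, c = r0, c0
    cells = []
    while True:
        cells.append((r, c))
        if (r, c) == (r1, c1):
            return cells
        e2 = 2 * err
        if e2 > -dc:
            err -= dc
            r += step_r
        if e2 < dr:
            err += dr
            c += step_c


def line_sums(second_matrix, base, candidates):
    """Sum the second-matrix elements under each line base -> candidate."""
    return {
        cand: sum(second_matrix[r][c] for r, c in line_cells(base, cand))
        for cand in candidates
    }


# Hypothetical "head -> neck" second matrix for two persons standing side
# by side: each person's head-neck connection is a vertical run of 1s.
head_neck_map = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
sums = line_sums(head_neck_map, base=(0, 0), candidates=[(3, 0), (3, 4)])
print(sums)  # {(3, 0): 4, (3, 4): 2}
```

In this toy picture the line to the person's own neck covers four non-zero elements, while the diagonal line to the other person's neck touches non-zero elements only at its two end points.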

In step S510, a connection line belonging to the same target object as the base key point is determined according to the sum of the element values corresponding to each connection line.

In some embodiments, in the second matrix corresponding to the second feature map, all the element values corresponding to connecting lines that have the connection relationship of the second feature map are non-zero values (for example, positive values), and the other element values of the second matrix are all 0. In such a case, step S510 may include: determining the connecting line with the largest sum of element values as the connecting line belonging to the same target object as the basic key point. In addition, in the case where the non-zero values are positive values, the response value described later is the sum of all element values corresponding to the corresponding connecting line.

For example, among the lines connecting the head bone key point of a person with the neck bone key points of all persons, all element values corresponding to the line between the head bone key point of the person and the neck bone key point of the same person are non-zero values, while at most two of the element values corresponding to the line between the head bone key point of the person and the neck bone key point of another person may be non-zero values. The sum of the element values corresponding to the line between the head bone key point of the person and the neck bone key point of the same person is therefore the largest. Thus, the line with the largest sum of element values can be determined as the line belonging to the same person as the head bone key point of the person. Therefore, when the element values of the second matrix take the values described above, the line with the largest sum of element values is determined as the line belonging to the same target object as the basic key point.

In other embodiments, in the second matrix corresponding to the second feature map, all the element values corresponding to connecting lines that have the connection relationship of the second feature map are 0, and the other element values of the second matrix are all non-zero values (for example, positive values). In such a case, step S510 may include: determining the connecting line with the smallest sum of element values as the connecting line belonging to the same target object as the basic key point. In addition, in the case where the non-zero values are positive values, the response value described later is the inverse number of the sum of all element values corresponding to the corresponding connecting line.

For example, among the lines connecting the head bone key point of a person with the neck bone key points of all persons, all element values corresponding to the line between the head bone key point of the person and the neck bone key point of the same person are 0, while at most two of the element values corresponding to the line between the head bone key point of the person and the neck bone key point of another person may be 0. The sum of the element values corresponding to the line between the head bone key point of the person and the neck bone key point of the same person is therefore the smallest. Thus, the line with the smallest sum of element values can be determined as the line belonging to the same person as the head bone key point of the person. Therefore, when the element values of the second matrix take the values described above, the line with the smallest sum of element values is determined as the line belonging to the same target object as the basic key point.

In step S512, the next key point with the largest response value connected to the base key point is obtained according to the determined connection line.

Since the connecting line belonging to the same target object as the basic key point has been determined above, the key point at the other end of that connecting line (i.e., the end that is not the basic key point) is the next key point with the largest response value.
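Steps S510 and S512 can be sketched for both embodiments as follows, taking per-line sums as input. The function name `select_next_keypoint` and its flag are hypothetical. One labeled assumption: the "inverse number" of the sum is read here as its negation, since a reciprocal would be undefined when the sum is 0.

```python
def select_next_keypoint(sums, connection_is_nonzero=True):
    """Pick the candidate on the same target object as the base key point.

    sums: dict mapping each candidate key point to the sum of the
          second-matrix element values along its connecting line.
    connection_is_nonzero: True for the embodiment in which connection
          elements are positive and the rest are 0 (largest sum wins);
          False for the embodiment in which connection elements are 0 and
          the rest are positive (smallest sum wins).
    Returns (next_keypoint, response_value).
    """
    if connection_is_nonzero:
        best = max(sums, key=sums.get)
        return best, sums[best]
    best = min(sums, key=sums.get)
    # Assumption: "inverse number" is interpreted as the negated sum.
    return best, -sums[best]


# Hypothetical sums for the lines from one head to two candidate necks.
print(select_next_keypoint({(3, 0): 4, (3, 4): 2}))
# ((3, 0), 4)
print(select_next_keypoint({(3, 0): 0, (3, 4): 3}, connection_is_nonzero=False))
# ((3, 0), 0)
```

Under this reading, the selected candidate has the largest response value in both embodiments, which matches the wording "the next key point with the largest response value".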

Thus far, methods of grouping keypoints in pictures according to some embodiments of the present disclosure are provided.

In some embodiments, after the above-mentioned next key point is determined, the steps in fig. 5 are repeated with that key point as the new basic key point, so that the next key point with the largest response value connected to it can be obtained.

For example, the first basic key point may be an initial key point, the initial key point is used as a basic key point, the second key point is obtained after the above steps, then the second key point is used as a basic key point, the third key point is obtained after the above steps, and so on, so that all the key points of the target object can be obtained.
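The repetition described above can be sketched as a loop that re-bases on each newly found key point. This is a simplified illustration under stated assumptions: the connection relationships form a linear chain (head, neck, hip), the "connections are non-zero" embodiment is used, and lines are rasterized with Bresenham's algorithm. All names and the toy 5x5 picture are hypothetical.

```python
def line_cells(start, end):
    """Grid cells (row, col) on the segment start -> end (Bresenham)."""
    (r0, c0), (r1, c1) = start, end
    dr, dc = abs(r1 - r0), abs(c1 - c0)
    step_r = 1 if r1 >= r0 else -1
    step_c = 1 if c1 >= c0 else -1
    err = dr - dc
    r, c = r0, c0
    cells = []
    while True:
        cells.append((r, c))
        if (r, c) == (r1, c1):
            return cells
        e2 = 2 * err
        if e2 > -dc:
            err -= dc
            r += step_r
        if e2 < dr:
            err += dr
            c += step_c


def candidates_in(first_matrix):
    """Key point positions: the non-zero elements of a first matrix."""
    return [(r, c) for r, row in enumerate(first_matrix)
            for c, v in enumerate(row) if v != 0]


def trace_object(start, chain, first_maps, second_maps):
    """Follow `chain` from `start`, re-basing on each newly found key point."""
    found = {chain[0]: start}
    base = start
    for prev_type, next_type in zip(chain, chain[1:]):
        link = second_maps[(prev_type, next_type)]

        def line_sum(cand, base=base, link=link):
            return sum(link[r][c] for r, c in line_cells(base, cand))

        # Largest-sum rule of the "connections are non-zero" embodiment.
        base = max(candidates_in(first_maps[next_type]), key=line_sum)
        found[next_type] = base
    return found


# Hypothetical 5x5 picture with two upright persons in columns 0 and 4.
chain = ["head", "neck", "hip"]
row_of = {"head": 0, "neck": 2, "hip": 4}
columns = (0, 4)

first_maps = {}
for kind, r in row_of.items():
    first_maps[kind] = [[0] * 5 for _ in range(5)]
    for c in columns:
        first_maps[kind][r][c] = 1

second_maps = {}
for a, b in zip(chain, chain[1:]):
    second_maps[(a, b)] = [[0] * 5 for _ in range(5)]
    for c in columns:
        for r in range(row_of[a], row_of[b] + 1):
            second_maps[(a, b)][r][c] = 1

person = trace_object((0, 4), chain, first_maps, second_maps)
print(person)  # {'head': (0, 4), 'neck': (2, 4), 'hip': (4, 4)}
```

A real skeleton branches (for example, from the neck to both shoulders), so a fuller version would walk a tree of connection types rather than a single chain.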

The method of the embodiment of the disclosure can enable each target object to correctly contain all key points of the target object, and the situation that the detection frame of one target object contains key points of other target objects as in the prior art does not occur.

Fig. 6A-6D are schematic diagrams illustrating human skeletal keypoints at several stages in the process of grouping keypoints in pictures according to some embodiments of the present disclosure.

First, as shown in fig. 6A, a skeletal key point of a head of a person is found as a basic key point (or an initial key point).

Next, as shown in fig. 6B, from the head bone key points of the person, the neck bone key points corresponding to the same person are found by the method shown in fig. 5.

Next, as shown in fig. 6C, starting from the neck bone key point of the person, the next bone key point is found by means of the connection relationship, until all the bone key points of the same person are determined, which together constitute one instance of a person.

Next, as shown in fig. 6D, one head bone key point (for example, the key point having the largest response value) is selected from the remaining head bone key points, and the above-described process is repeated. Finally, the respective bone key points of all the persons are determined, which completes the detection of the key points of multiple persons.
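The outer loop of Figs. 6A to 6D can be sketched as below. The per-person tracing of fig. 5 is passed in as a callable so that the sketch stays short; `group_all`, `trace_person`, and the lookup table used as a stand-in tracer are hypothetical. Picking the first head by largest response value as well is an assumption, since the text only gives that rule as an example for the remaining heads.

```python
def group_all(head_responses, trace_person):
    """Group key points person by person, strongest remaining head first.

    head_responses: dict mapping each head key point to its response value
                    in the first feature map of the head type.
    trace_person:   callable standing in for the fig. 5 procedure; it takes
                    a head key point and returns that person's key points.
    """
    remaining = dict(head_responses)
    persons = []
    while remaining:
        head = max(remaining, key=remaining.get)
        del remaining[head]
        persons.append(trace_person(head))
    return persons


# Stand-in for the fig. 5 procedure: a precomputed lookup table.
skeletons = {
    (0, 0): {"head": (0, 0), "neck": (2, 0)},
    (0, 4): {"head": (0, 4), "neck": (2, 4)},
}
persons = group_all({(0, 0): 0.7, (0, 4): 0.9}, skeletons.get)
print(persons)
# [{'head': (0, 4), 'neck': (2, 4)}, {'head': (0, 0), 'neck': (2, 0)}]
```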

Therefore, the detection of the key points of multiple persons in the picture is realized. The method can enable one person in the picture to correctly contain all the bone key points of the person, and the situation that the detection frame of one person contains the bone key points of other people as in the prior art does not occur.

Fig. 7 is a schematic diagram illustrating a structure of an apparatus for detecting keypoints according to some embodiments of the disclosure. As shown in fig. 7, the apparatus may include an input unit 702, an output unit 704, and a grouping unit 706.

The input unit 702 may be used to input a picture containing at least one target object into a convolutional network. Each target object has a predefined plurality of keypoints and a set of connection relationships between the plurality of keypoints. There is no connection relationship between key points of different target objects.

The output unit 704 may be configured to output the plurality of first feature maps and the plurality of second feature maps using a convolutional network. Each first feature map corresponds to a class of keypoints of the at least one target object. Each second feature map corresponds to a class of connection relationships of the at least one target object.

The grouping unit 706 may be configured to group keypoints in the pictures using the plurality of first feature maps and the plurality of second feature maps, so that, for each target object, each keypoint of the target object is combined with a next keypoint having a largest response value connected to the keypoint to detect all keypoints of the target object.

Thus far, an apparatus for detecting keypoints according to some embodiments of the present disclosure is provided. The apparatus predicts the connection relationships between key points through the convolutional network, and uses the connection relationships to correctly group the key points of different target objects, thereby detecting (or locating) the key points of different target objects. The apparatus enables each target object to correctly contain all of its own key points, and the situation in the prior art in which the detection frame of one target object contains the key points of other target objects does not occur.

In some embodiments, the grouping unit 706 may be configured to determine, for a target object in the picture, a key point of the target object as a base key point, obtain a first feature map including all the next key points based on a connection relationship type between the base key point and the next key point, find all the next key points in the first feature map, obtain, in a second feature map including a connection relationship between the base key point and the next key point, a connection line between the base key point and each of the found key points, calculate, on a second matrix corresponding to the second feature map, a sum of all element values corresponding to each connection line, determine, according to the sum of element values corresponding to each connection line, a connection line belonging to the same target object as the base key point, and obtain, according to the determined connection line, a next key point with a maximum response value connected to the base key point.

In some embodiments, in the second matrix corresponding to the second feature map, all element values corresponding to the connection lines with the corresponding connection relationship of the second feature map are non-zero values, and other element values of the second matrix are all 0. The grouping unit 706 may be configured to determine a connection line with the largest sum of element values as a connection line belonging to the same target object as the base keypoint. In the case where the non-zero value is a positive value, the response value is the sum of all element values corresponding to the corresponding connection line.

In other embodiments, in the second matrix corresponding to the second feature map, all the element values corresponding to the connection lines with the corresponding connection relationship of the second feature map are 0, and the other element values of the second matrix are all non-zero values. The grouping unit 706 may be configured to determine a connection line with the smallest sum of element values as a connection line belonging to the same target object as the base keypoint. In the case where the non-zero value is a positive value, the response value is the inverse number of the sum of all element values corresponding to the corresponding connection line.

In some embodiments, in the first matrix corresponding to the first feature map, the element values corresponding to the key points of the first feature map are non-zero values, and the other element values of the first matrix are all 0.

In other embodiments, in the first matrix corresponding to the first feature map, the element values corresponding to the key points of the first feature map are 0, and the other element values of the first matrix are all non-zero values.
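The two first-matrix encodings described above can be illustrated with a small helper that builds a matrix from known key point positions, for example as a training target. This is a sketch only; the helper name `make_first_matrix`, its parameters, and the choice of 1.0 as the non-zero value are assumptions.

```python
def make_first_matrix(rows, cols, keypoints, keypoints_nonzero=True, value=1.0):
    """Build a first matrix for one key point type.

    keypoints_nonzero=True:  key point elements are `value`, the rest are 0.
    keypoints_nonzero=False: key point elements are 0, the rest are `value`.
    """
    hit, miss = (value, 0.0) if keypoints_nonzero else (0.0, value)
    matrix = [[miss] * cols for _ in range(rows)]
    for r, c in keypoints:
        matrix[r][c] = hit
    return matrix


# One hypothetical key point at row 0, column 1 of a 2x3 picture.
print(make_first_matrix(2, 3, [(0, 1)]))
# [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
print(make_first_matrix(2, 3, [(0, 1)], keypoints_nonzero=False))
# [[1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
```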

In some embodiments, each keypoint is represented in the picture by its coordinates and state. The state may include: visible state, invisible state, and partially visible state.
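One possible in-memory representation of such a key point is sketched below. The class names, field names, and string codes for the three states are hypothetical; the disclosure only specifies that a key point has coordinates and one of the three states.

```python
from dataclasses import dataclass
from enum import Enum


class KeypointState(Enum):
    """The three states named in the description (codes are assumptions)."""
    VISIBLE = "visible"
    INVISIBLE = "invisible"
    PARTIALLY_VISIBLE = "partially_visible"


@dataclass(frozen=True)
class Keypoint:
    kind: str             # key point type, e.g. "head" (hypothetical label)
    x: int                # horizontal coordinate in the picture
    y: int                # vertical coordinate in the picture
    state: KeypointState  # visibility of the key point


head = Keypoint(kind="head", x=12, y=40, state=KeypointState.VISIBLE)
print(head.state.value)  # visible
```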

In some embodiments, the input unit 702 may also be used to input pictures containing target objects with known keypoints into a convolutional network to train the convolutional network.

Fig. 8 is a schematic diagram illustrating a structure of an apparatus for detecting keypoints according to other embodiments of the present disclosure. The apparatus includes a memory 810 and a processor 820. Wherein:

Memory 810 may be a magnetic disk, flash memory, or any other non-volatile storage medium. The memory is used to store instructions in the corresponding embodiments of fig. 2 and/or 5.

Processor 820 is coupled to memory 810 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 820 is configured to execute instructions stored in the memory to achieve a proper grouping of keypoints of different target objects, thereby detecting keypoints of different target objects.

In some embodiments, as also shown in FIG. 9, the apparatus 900 includes a memory 910 and a processor 920. Processor 920 is coupled to memory 910 through BUS 930. The device 900 may also be coupled to external storage 950 via a storage interface 940 for invoking external data, and may also be coupled to a network or another computer system (not shown) via a network interface 960, which is not described in detail herein.

In this embodiment, data and instructions are stored in the memory, and the processor executes the instructions, so that the key points of different target objects can be correctly grouped and thereby detected.

In other embodiments, the present disclosure also provides a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the methods of the corresponding embodiments of fig. 2 and/or 5. It will be apparent to those skilled in the art that embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Thus far, the present disclosure has been described in detail. In order to avoid obscuring the concepts of the present disclosure, some details known in the art are not described. How to implement the solutions disclosed herein will be fully apparent to those skilled in the art from the above description.

The methods and systems of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.

Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the disclosure. The scope of the present disclosure is defined by the appended claims.

Claims

1. A method for detecting keypoints, comprising:

inputting a picture containing at least one target object into a convolutional network, wherein each target object has a plurality of predefined key points and a group of connection relations among the key points, and the connection relations among the key points of different target objects do not exist;

outputting a plurality of first feature maps and a plurality of second feature maps by using the convolution network, wherein each first feature map corresponds to a type of key point of the at least one target object, and each second feature map corresponds to a type of connection relation of the at least one target object; and

grouping keypoints in the picture by using the plurality of first feature maps and the plurality of second feature maps, so that for each target object, each keypoint of the target object is combined with the next keypoint with the largest response value connected with the keypoint to detect all the keypoints of the target object;

wherein each first feature map is represented by a first matrix and each second feature map is represented by a second matrix;

grouping keypoints in the picture such that for each target object, each keypoint of the target object is combined with the next keypoint having the largest response value connected to the keypoint comprises:

for a target object in the picture, determining a key point of the target object as a basic key point;

obtaining a first feature map containing all the next key points based on the connection relation type of the basic key points and the next key points, and searching for all the next key points in the first feature map;

obtaining connection lines between the basic key points and each searched key point in a second feature diagram containing the connection relation between the basic key points and the next key point;

respectively calculating the sum of all element values corresponding to each connecting line on a second matrix corresponding to the second feature diagram;

determining a connecting line belonging to the same target object with the basic key point according to the sum of element values corresponding to each connecting line; and

obtaining the next key point with the largest response value connected with the basic key point according to the determined connecting line.

2. The method of claim 1, wherein,

in the second matrix corresponding to the second feature diagram, all the element values corresponding to the connecting lines with corresponding connection relation of the second feature diagram are non-zero values, the other element values of the second matrix are 0,

the step of determining the connection line which belongs to the same target object with the basic key point according to the sum of the element values corresponding to each connection line comprises the following steps: determining the connecting line with the maximum sum of the element values as the connecting line belonging to the same target object with the basic key point;

and under the condition that the non-zero value is a positive value, the response value is the sum of all element values corresponding to the corresponding connecting line.

3. The method of claim 1, wherein,

in the second matrix corresponding to the second feature diagram, all the element values corresponding to the connecting lines with corresponding connection relation of the second feature diagram are 0, the other element values of the second matrix are non-zero values,

the step of determining the connection line which belongs to the same target object with the basic key point according to the sum of the element values corresponding to each connection line comprises the following steps: determining a connecting line with the minimum sum of element values as a connecting line belonging to the same target object with the basic key point;

and under the condition that the non-zero value is a positive value, the response value is the inverse number of the sum of all element values corresponding to the corresponding connecting line.

4. The method of claim 1, wherein,

in a first matrix corresponding to a first feature map, element values corresponding to key points of the first feature map are non-zero values, and other element values of the first matrix are all 0; or

in a first matrix corresponding to the first feature map, the element values corresponding to the key points of the first feature map are 0, and the other element values of the first matrix are all non-zero values.

5. The method of claim 1, wherein,

in the picture, each key point is represented by coordinates and a state of the key point, wherein the state comprises: visible state, invisible state, and partially visible state.

6. The method of claim 1, wherein prior to inputting the picture containing the at least one target object into the convolutional network, the method further comprises:

a picture containing a target object with a known keypoint is input into the convolutional network to train the convolutional network.

7. An apparatus for detecting keypoints, comprising:

an input unit for inputting a picture containing at least one target object into a convolutional network, wherein each target object has a plurality of predefined key points and a set of connection relations among the plurality of key points, and the connection relations among the key points of different target objects do not exist;

an output unit for outputting a plurality of first feature maps and a plurality of second feature maps by using the convolutional network, wherein each first feature map corresponds to a type of key point of the at least one target object, and each second feature map corresponds to a type of connection relation of the at least one target object; and

a grouping unit, configured to group the keypoints in the picture by using the plurality of first feature maps and the plurality of second feature maps, so that, for each target object, each keypoint of the target object is combined with a next keypoint with a largest response value connected to the keypoint to detect all keypoints of the target object;

the grouping unit is used for determining a key point of a target object in the picture as a basic key point, obtaining a first feature map containing all next key points based on the connection relation type of the basic key point and the next key point, searching the first feature map to obtain all the next key points, obtaining connection lines between the basic key point and each searched key point in a second feature map containing the connection relation between the basic key point and the next key point, respectively calculating the sum of all element values corresponding to each connection line on a second matrix corresponding to the second feature map, determining the connection line which belongs to the same target object with the basic key point according to the sum of element values corresponding to each connection line, and obtaining the next key point with the largest response value connected with the basic key point according to the determined connection line.

8. The apparatus of claim 7, wherein,

the grouping unit is used for determining the connecting line with the maximum sum of element values as the connecting line belonging to the same target object with the basic key point;

9. The apparatus of claim 7, wherein,

the grouping unit is used for determining a connecting line with the minimum sum of element values as a connecting line belonging to the same target object as the basic key point;

10. The apparatus of claim 7, wherein,

11. The apparatus of claim 7, wherein,

12. The apparatus of claim 7, wherein,

the input unit is also used for inputting pictures containing target objects with known key points into the convolutional network to train the convolutional network.

13. An apparatus for detecting keypoints, comprising:

a memory; and

a processor coupled to the memory, the processor configured to perform the method of any of claims 1-6 based on instructions stored in the memory.

14. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 6.