CN109145802B - Kinect-based multi-person gesture man-machine interaction method and device - Google Patents

Kinect-based multi-person gesture man-machine interaction method and device

Info

Publication number
CN109145802B
CN109145802B (application CN201810921343.XA)
Authority
CN
China
Prior art keywords
coordinate system
camera
screen
camera coordinate
human
Prior art date
Legal status
Active
Application number
CN201810921343.XA
Other languages
Chinese (zh)
Other versions
CN109145802A
Inventor
陶彦博
阮松波
梁斌
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201810921343.XA
Publication of CN109145802A
Application granted
Publication of CN109145802B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a Kinect-based multi-person gesture human-computer interaction method and device, wherein the method comprises the following steps: acquiring a color image and a depth image of a scene through a Kinect camera, and acquiring the human body joint positions and facial feature parameters of each operator in the scene; obtaining the cursor positions corresponding to the operator's two hands from the human joint positions and the facial feature parameters according to the principle that light travels in straight lines; and segmenting the palm region image of the operator from the color image according to the human joint positions, and recognizing and classifying the operation instruction with a pre-trained gesture classification model. The method is natural and quick to operate, needs no additional external equipment, adopts a non-contact interaction mode, can cope with special scenes such as hospitals and scientific research, can be used by multiple persons at the same time, and meets more complex interaction requirements in the future.

Description

Kinect-based multi-person gesture man-machine interaction method and device
Technical Field
The invention relates to the technical field of human-computer interaction, in particular to a Kinect-based multi-person gesture human-computer interaction method and device.
Background
Human-computer interaction is the exchange of information between a person and a computer. Earlier human-computer interaction was mainly computer-centered, for example command-line and graphical-user-interface interaction, in which people could interact with the computer only through a keyboard, a mouse, a touch screen and the like. Nowadays, novel interaction forms that conform to natural human interaction are becoming more mature; these new technologies bring humans and computers closer together and improve interaction efficiency. Human-computer interaction has thus undergone a transition from command lines and graphical interfaces to natural user interfaces. A natural user interface means that the user interacts with the computer in a natural way, such as speech or gestures.
The gesture is a novel interaction mode in human-computer interaction; it is contact-free and better conforms to natural human interaction behavior. Currently, gesture recognition research is mainly based on color image streams, detecting and tracking the target according to the color, texture, gray-scale or motion features of the image. However, since a color image contains only two-dimensional coordinate information and is easily affected by background, illumination and environment, the detection and tracking of the target are poor, and targets in complex environments are difficult to detect and track. The Kinect camera is a special camera that can acquire a color image and a depth image of a scene at the same time, and thus obtain three-dimensional information of the scene. From the acquired three-dimensional information, the human joints and gestures in the scene can be accurately extracted.
Most existing gesture interaction systems are combined with a graphical interface, but they mostly rely on static gesture recognition and can only replace a remote controller or a keyboard to interact with the graphical interface through menus. To interact with more complex graphical interfaces, which are normally operated with the simplest and most common combination of a keyboard and a mouse, the function of cursor movement needs to be realized through gestures. The cursor position is the position the operator is most concerned with in the graphical interface, and it can be reflected by the operator's line of sight and the position pointed to by the finger.
Existing interactive systems are designed for a single user, while in some special scenarios, such as design and creation, multiple persons may need to interact with the computer at the same time. A gesture-based interaction mode allows interaction anywhere within the field of view of the camera without other interaction equipment, and therefore naturally supports several people interacting with the computer at the same time.
However, the human-computer interaction modes of the related art are realized through a mouse, a keyboard, a handle and other devices, all of which are connected in a wired or wireless manner. In the wired mode the control distance is short; in the wireless mode the operation is somewhat sluggish, and neither mode provides natural control. Another interaction method is touch-screen manipulation, which is currently most widely applied in small-sized mobile phones and tablet computers. For large touch screens of 60 or even 100 inches, some basic operations become awkward because the operator is too close to the screen to see all of its content.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the invention aims to provide a Kinect-based multi-person gesture human-computer interaction method which is natural and quick to operate, needs no additional external equipment, adopts a non-contact interaction mode, can cope with special scenes such as hospitals and scientific research, can be used by multiple persons at the same time, and meets more complex interaction requirements in the future.
The invention also aims to provide a Kinect-based multi-person gesture human-computer interaction device.
In order to achieve the above object, an embodiment of the invention provides a human-computer interaction method based on Kinect for multi-person gestures, which includes the following steps: acquiring a color image and a depth image of a scene through a Kinect camera, and acquiring the human joint position and the face characteristic parameters of an operator in the scene; obtaining cursor positions corresponding to the two hands of the operator according to the human joint position and the facial characteristic parameters by a light linear propagation principle; and segmenting the palm area image of the operator in the color image according to the human body joint position, and identifying and classifying the operation instruction by a pre-trained gesture classification model.
According to the Kinect-based multi-person gesture human-computer interaction method of the embodiment of the invention, the joint and face information of the human body is obtained through the Kinect camera, the position of the operation cursor is determined from the positions of the operator's eyes and hands, and the operation instruction is determined from the acquired gesture image of the operator, thereby realizing interaction with the graphical interface of the equipment. The operation is natural and quick, no additional external equipment is needed, and a non-contact interaction mode is adopted, so the method can cope with special scenes such as hospitals and scientific research, can be used by multiple persons at the same time, and meets more complex interaction requirements in the future.
In addition, the Kinect-based multi-person gesture human-computer interaction method according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the method further includes: and acquiring a camera coordinate system, a screen coordinate system and a human face coordinate system, and detecting the fingertip and the cursor point.
Further, in an embodiment of the present invention, the method further includes: detecting a position of a screen origin in the camera coordinate system and a rotation matrix of an orientation of the position in the camera coordinate system, and acquiring a transformation matrix from the camera coordinate system to the screen coordinate system to obtain a screen position; and acquiring a position of a human head joint under the camera coordinate system and a rotation matrix of the orientation of the human head joint under the camera coordinate system, and acquiring a transformation matrix from the camera coordinate system to the human face coordinate system to obtain the human face direction and the visual field.
Further, in an embodiment of the present invention, the obtaining of the cursor positions corresponding to the two hands of the operator by using a principle of light propagation along a straight line according to the human joint position and the facial feature parameter further includes: detecting the positions of the fingertips and the noseroots of the person under the camera coordinate system, and judging whether the fingertips are in the visual field range or not according to a transformation matrix from the camera coordinate system to a face coordinate system and the coordinates of the fingertips and the noseroots of the person under the face coordinate system; and obtaining a transformation matrix from the camera coordinate system to the screen coordinate system through three-dimensional calibration to obtain the position of the fingertip under the camera coordinate system and the position of the nasion under the camera coordinate system, obtaining the representation of a straight line passing through the fingertip and the nasion under the camera coordinate system, obtaining the intersection point of the straight line passing through the fingertip and the nasion and a screen plane, and obtaining the position of a cursor in the screen coordinate system through a conversion matrix from the camera to the screen coordinate system to judge whether a cursor point is in a screen range.
Further, in an embodiment of the present invention, the segmenting the palm region image of the operator from the color image according to the positions of the human joints, and performing recognition and classification on the operation instructions by a pre-trained gesture classification model, further includes: acquiring the pixel position of the palm of the operator in the color image, and segmenting by taking the palm of the operator as a midpoint to generate a gesture image with a preset size; and classifying and identifying the gesture image by the artificial neural network, and determining an instruction corresponding to the gesture.
In order to achieve the above object, an embodiment of the present invention provides a multi-person gesture human-computer interaction device based on Kinect, including: the system comprises an acquisition module, a display module and a control module, wherein the acquisition module is used for acquiring a color image and a depth image of a scene through a Kinect camera and acquiring the human body joint position and the face characteristic parameters of an operator in the scene; the processing module is used for obtaining cursor positions corresponding to the two hands of the operator according to the human joint position and the facial characteristic parameters through a light linear propagation principle; and the recognition and classification module is used for segmenting the palm area image of the operator in the color image according to the human body joint position and recognizing and classifying the operation instruction by a pre-trained gesture classification model.
According to the Kinect-based multi-person gesture human-computer interaction device of the embodiment of the invention, the joint and face information of the human body is obtained through the Kinect camera, the position of the operation cursor is determined from the positions of the operator's eyes and hands, and the operation instruction is determined from the acquired gesture image of the operator, thereby realizing interaction with the graphical interface of the equipment. The operation is natural and quick, no additional external equipment is needed, and a non-contact interaction mode is adopted, so the device can cope with special scenes such as hospitals and scientific research, can be used by multiple persons at the same time, and meets more complex interaction requirements in the future.
In addition, the Kinect-based multi-person gesture human-computer interaction device according to the above embodiment of the invention may further have the following additional technical features:
further, in an embodiment of the present invention, the method further includes: the first acquisition module is used for acquiring a camera coordinate system, a screen coordinate system and a human face coordinate system and detecting fingertips and cursor points.
Further, in an embodiment of the present invention, the method further includes: the detection module is used for detecting the position of the screen origin position in the camera coordinate system and the rotation matrix of the orientation of the screen origin position in the camera coordinate system, and acquiring a transformation matrix from the camera coordinate system to the screen coordinate system to obtain a screen position; and the second acquisition module is used for acquiring a rotation matrix of the position of the human head joint under the camera coordinate system and the direction of the human head joint in the camera coordinate system, and acquiring a transformation matrix from the camera coordinate system to the human face coordinate system so as to obtain the human face direction and the visual field.
Further, in one embodiment of the present invention, the processing module includes: the detection unit is used for detecting the positions of the fingertips and the noseroots under the camera coordinate system and judging whether the fingertips are in the visual field range or not according to a transformation matrix from the camera coordinate system to a human face coordinate system and the coordinates of the fingertips and the noseroots under the human face coordinate system; the acquisition unit is used for acquiring a transformation matrix from the camera coordinate system to the screen coordinate system through three-dimensional calibration so as to obtain the position of the fingertip under the camera coordinate system and the position of the nasion under the camera coordinate system, acquiring the representation of a straight line passing through the fingertip and the nasion under the camera coordinate system, acquiring the intersection point of the straight line passing through the fingertip and the nasion and a screen plane, and acquiring the position of a cursor in the screen coordinate system through the transformation matrix from the camera to the screen coordinate system so as to judge whether a cursor point is in a screen range.
Further, in an embodiment of the present invention, the recognition and classification module is further configured to acquire a pixel position of the palm of the operator in the color image, segment and generate a gesture image with a preset size by using the palm of the operator as a midpoint, perform classification and recognition on the gesture image by using the artificial neural network, and determine an instruction corresponding to the gesture.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a Kinect-based multi-person gesture human-machine interaction method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a human joint obtained by Kinect for Windows SDK V2 according to one embodiment of the present invention;
FIG. 3 is a diagram of facial feature effects extracted by Kinect for Windows SDK V2 according to one embodiment of the present invention;
FIG. 4 is a diagram of a cursor position calculation coordinate system according to one embodiment of the present invention;
FIG. 5 is a flowchart of a Kinect-based multi-person gesture human-machine interaction method according to an embodiment of the present invention;
FIG. 6 is a schematic view of a human field of view according to one embodiment of the invention;
FIG. 7 is a schematic structural diagram of a Kinect-based multi-person gesture human-computer interaction device according to an embodiment of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The multi-person gesture human-computer interaction method and device based on the Kinect provided by the embodiment of the invention are described below with reference to the attached drawings, and firstly, the multi-person gesture human-computer interaction method based on the Kinect provided by the embodiment of the invention is described with reference to the attached drawings.
FIG. 1 is a flow chart of a Kinect-based multi-person gesture human-computer interaction method according to an embodiment of the invention.
As shown in fig. 1, the Kinect-based multi-person gesture human-computer interaction method includes the following steps:
in step S101, a color image and a depth image of a scene are acquired by a Kinect camera, and a human joint position and facial feature parameters of an operator in the scene are acquired.
It can be understood that, in the embodiment of the invention, the human body joint and facial feature information of the operators in the scene is extracted from the Kinect camera data. Specifically, the embodiment of the present invention acquires a color image and a depth image of the scene through the Kinect camera; the human joint positions and facial feature parameters of the operators in the scene, such as the positions of the eyebrows, eyes, nose and mouth and the facial contour, can be obtained through the Microsoft Kinect for Windows SDK 2.0 development kit. The acquired human body joints are shown in fig. 2, and the human facial features are shown in fig. 3.
In step S102, cursor positions corresponding to both hands of the operator are obtained according to the human joint position and the facial feature parameter by the principle of light propagation along a straight line.
It can be understood that after the human joint positions and the facial feature parameters are acquired, the cursor position operated by each operator is determined. That is, from the operator joint and face model parameters obtained in the previous step, the embodiment of the present invention calculates the cursor positions corresponding to the two hands of each operator according to the principle that light travels in straight lines, i.e. the operator's eye, the fingertip of each hand and the corresponding cursor are collinear.
Further, in an embodiment of the present invention, the method of an embodiment of the present invention further includes: and acquiring a camera coordinate system, a screen coordinate system and a human face coordinate system, and detecting the fingertip and the cursor point.
Specifically, as shown in fig. 4, the embodiment of the present invention uses three coordinate systems in total, namely a camera coordinate system, a screen coordinate system and a face coordinate system, and two key points, namely the fingertip and the cursor point. The camera coordinate system is used as the world coordinate system; its origin $O_c$ corresponds to the imaging center of the Kinect camera, the $Z_c$ axis coincides with the optical axis, and the $Y_c$ and $X_c$ axes are parallel to the imaging plane. The origin of the screen coordinate system is $O_i$, corresponding to the upper left corner of the screen; its $Y_i$ and $X_i$ axes are parallel to the camera imaging plane and, in order to coincide with the usual image coordinates, its $Z_i$ axis is opposite to that of the camera coordinate system. The origin of the face coordinate system is $O_f$, corresponding to the position of the head joint, and the $Z_f$ axis points in the direction the face is facing. Conversion between the coordinate systems is realized by homogeneous transformation; the transformation matrix from the camera coordinate system to the screen coordinate system is ${}^{i}_{c}T$ and the transformation matrix from the camera coordinate system to the face coordinate system is ${}^{f}_{c}T$. Here ${}^{i}_{c}T$ is determined by the relative position of the camera and the screen and is a fixed parameter of the system, whereas ${}^{f}_{c}T$ is obtained from the head pose supplied by the functions provided by Kinect for Windows SDK V2 and its value changes continuously.
For an arbitrary point $p$, its descriptions in the two coordinate systems $\{A\}$ and $\{B\}$ are ${}^{A}p$ and ${}^{B}p$, and the following homogeneous transformation relation holds:

$${}^{A}p = {}^{A}_{B}T\,{}^{B}p \qquad (1)$$

The homogeneous transformation matrix ${}^{A}_{B}T$ is a 4×4 square matrix having the form:

$${}^{A}_{B}T = \begin{bmatrix} {}^{A}_{B}R & {}^{A}p_{B\,org} \\ \mathbf{0}^{T} & 1 \end{bmatrix} \qquad (2)$$

which combines a translation transformation, given by ${}^{A}p_{B\,org}$, and a rotation transformation, given by ${}^{A}_{B}R$. Here ${}^{A}p_{B\,org}$ is the translation vector of the coordinate system $\{B\}$ relative to the coordinate system $\{A\}$, describing the position of its origin, and ${}^{A}_{B}R$ describes the orientation of the coordinate system $\{B\}$ relative to the coordinate system $\{A\}$. The SDK gives the quaternion representation $q = [q_0, q_1, q_2, q_3]^{T}$, which represents the rotation operation from the coordinate system $\{A\}$ to the coordinate system $\{B\}$; its relationship to the rotation matrix is:

$${}^{A}_{B}R = \begin{bmatrix} 1-2(q_2^2+q_3^2) & 2(q_1q_2-q_0q_3) & 2(q_1q_3+q_0q_2) \\ 2(q_1q_2+q_0q_3) & 1-2(q_1^2+q_3^2) & 2(q_2q_3-q_0q_1) \\ 2(q_1q_3-q_0q_2) & 2(q_2q_3+q_0q_1) & 1-2(q_1^2+q_2^2) \end{bmatrix} \qquad (3)$$

Knowing the description ${}^{A}_{B}T$ of the coordinate system $\{B\}$ relative to the coordinate system $\{A\}$, the description ${}^{B}_{A}T$ of $\{A\}$ relative to $\{B\}$ is obtained by homogeneous transformation inversion, specifically:

$${}^{B}_{A}T = \left({}^{A}_{B}T\right)^{-1} = \begin{bmatrix} {}^{A}_{B}R^{T} & -{}^{A}_{B}R^{T}\,{}^{A}p_{B\,org} \\ \mathbf{0}^{T} & 1 \end{bmatrix} \qquad (4)$$
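For illustration, the following Python sketch (a minimal example assuming NumPy; the function names and the example pose are illustrative and not part of the patent) builds the homogeneous transformation of equation (2) from a quaternion and a translation, applies equation (1), and inverts it as in equation (4).

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a unit quaternion q = [q0, q1, q2, q3] (scalar first), as in equation (3)."""
    q0, q1, q2, q3 = q
    return np.array([
        [1 - 2*(q2**2 + q3**2), 2*(q1*q2 - q0*q3),     2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3),     1 - 2*(q1**2 + q3**2), 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2),     2*(q2*q3 + q0*q1),     1 - 2*(q1**2 + q2**2)],
    ])

def homogeneous(R, p):
    """Build the 4x4 matrix of equation (2) from a rotation R and a translation p."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def invert(T):
    """Homogeneous transformation inversion, equation (4)."""
    R, p = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ p
    return Ti

# Example: pose of frame {B} in frame {A}, then map a point expressed in {B} into {A}.
T_A_B = homogeneous(quat_to_rot([1.0, 0.0, 0.0, 0.0]), np.array([0.1, 0.0, 1.5]))
p_B = np.array([0.2, 0.3, 0.5, 1.0])   # homogeneous point in {B}
p_A = T_A_B @ p_B                      # equation (1)
T_B_A = invert(T_A_B)                  # equation (4)
```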
In the three-dimensional homogeneous coordinate system, points are dual to planes, so a plane is represented in the same form as a point, namely by a four-dimensional vector $\pi = (\pi_1, \pi_2, \pi_3, \pi_4)^{T}$. A point $x = (x_1, x_2, x_3, x_4)^{T}$ lies on the plane if:

$$\pi_1 x_1 + \pi_2 x_2 + \pi_3 x_3 + \pi_4 x_4 = 0 \qquad (5)$$

A plane is defined by three non-collinear points $x_1$, $x_2$, $x_3$, namely:

$$\begin{bmatrix} x_1^{T} \\ x_2^{T} \\ x_3^{T} \end{bmatrix}\pi = 0 \qquad (6)$$

When the point $x$ is on the plane, the matrix $M = [x,\,x_1,\,x_2,\,x_3]$ has a determinant of 0, i.e.:

$$\det M = x_1 D_{234} - x_2 D_{134} + x_3 D_{124} - x_4 D_{123} = 0 \qquad (7)$$

Thus the plane can be expressed as:

$$\pi = (D_{234},\,-D_{134},\,D_{124},\,-D_{123}) \qquad (8)$$

where $D_{jkl}$ denotes the determinant of the matrix formed by rows $j$, $k$, $l$ of the 4×3 matrix $[x_1,\,x_2,\,x_3]$. In the three-dimensional homogeneous coordinate system a straight line is represented by a 4×4 matrix; the straight line $L$ determined by two points $x_1$ and $x_2$ is:

$$L = x_1 x_2^{T} - x_2 x_1^{T} \qquad (9)$$

and the intersection of the plane $\pi$ with the line $L$ is denoted $x$:

$$x = L\pi \qquad (10)$$
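As a concrete instance of equations (5) through (10), the following Python sketch (assuming NumPy; helper names are illustrative) builds a plane from three points via the cofactor formula (8), a line from two points via the Plücker matrix (9), and their intersection via (10).

```python
import numpy as np

def plane_from_points(x1, x2, x3):
    """Plane pi = (D234, -D134, D124, -D123), equation (8); inputs are homogeneous 4-vectors."""
    M = np.stack([x1, x2, x3], axis=1)           # 4x3 matrix [x1, x2, x3]
    def D(rows):
        return np.linalg.det(M[list(rows), :])   # determinant of the selected three rows
    return np.array([D((1, 2, 3)), -D((0, 2, 3)), D((0, 1, 3)), -D((0, 1, 2))])

def line_from_points(x1, x2):
    """Pluecker matrix L = x1 x2^T - x2 x1^T, equation (9)."""
    return np.outer(x1, x2) - np.outer(x2, x1)

# Intersection of a line and a plane, equation (10): x = L @ pi.
pi = plane_from_points(np.array([0, 0, 0, 1.0]),
                       np.array([1, 0, 0, 1.0]),
                       np.array([0, 1, 0, 1.0]))     # this is the plane z = 0
L = line_from_points(np.array([0, 0, 1, 1.0]),
                     np.array([0, 0, -1, 1.0]))      # vertical line through the origin
x = L @ pi                                           # homogeneous intersection point
x = x / x[3]                                         # normalize, giving (0, 0, 0, 1)
```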
Further, in an embodiment of the present invention, the method of the embodiment of the present invention further includes: detecting the position of the screen origin in the camera coordinate system and a rotation matrix describing its orientation in the camera coordinate system, and acquiring the transformation matrix from the camera coordinate system to the screen coordinate system to obtain the screen position; and acquiring the position of the human head joint in the camera coordinate system and a rotation matrix of its orientation, and acquiring the transformation matrix from the camera coordinate system to the face coordinate system to obtain the face direction and field of view.
Specifically, (1) screen position, as shown in fig. 5

The screen origin position $O_i$ in the camera coordinate system, ${}^{c}p_{O_i}$, and the rotation matrix ${}^{c}_{i}R$ representing its orientation in the camera coordinate system are obtained by measurement. According to equation (4), the transformation matrices ${}^{i}_{c}T$ and ${}^{c}_{i}T$ between the camera coordinate system and the screen coordinate system are obtained. The screen plane, i.e. the xy plane of the screen coordinate system ${}^{i}\pi_{s}$, is then expressed in the camera coordinate system as ${}^{c}\pi_{s}$.
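A minimal Python sketch of this step follows, assuming NumPy; the numeric screen pose and the variable names are placeholder assumptions for illustration, not values from the patent.

```python
import numpy as np

# Illustrative measured values: screen origin O_i and screen orientation expressed in the camera frame.
p_c_Oi = np.array([-0.40, 0.30, 1.20])   # screen origin in camera coordinates (meters), assumed
R_c_i = np.diag([1.0, -1.0, -1.0])       # screen axes in camera coordinates; Z_i opposite to camera Z

T_c_i = np.eye(4)                        # screen frame described in the camera frame, equation (2)
T_c_i[:3, :3], T_c_i[:3, 3] = R_c_i, p_c_Oi
T_i_c = np.linalg.inv(T_c_i)             # camera -> screen transform, equation (4)

# Screen plane: the xy plane of the screen frame, pi_i = (0, 0, 1, 0)^T,
# expressed in camera coordinates by the plane transformation rule pi_c = T_i_c^T @ pi_i.
pi_i = np.array([0.0, 0.0, 1.0, 0.0])
pi_c = T_i_c.T @ pi_i
```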
(2) Face direction and field of view estimation

The human head joint position $O_f$ in the camera coordinate system, ${}^{c}p_{O_f}$, and the rotation matrix ${}^{c}_{f}R$ representing its orientation in the camera coordinate system are obtained through Kinect for Windows SDK V2. According to equation (4), the transformation matrix ${}^{f}_{c}T$ from the camera coordinate system to the face coordinate system is obtained.
Further, in an embodiment of the present invention, the obtaining of the cursor positions corresponding to the two hands of the operator according to the human joint position and the facial feature parameter by the principle of light propagating along a straight line further includes: detecting the positions of the fingertips and the noseroots under a camera coordinate system, and judging whether the fingertips are in a visual field range or not according to a transformation matrix from the camera coordinate system to a human face coordinate system and the coordinates of the fingertips and the noseroots under the human face coordinate system; obtaining a transformation matrix from a camera coordinate system to a screen coordinate system through three-dimensional calibration to obtain the position of a fingertip under the camera coordinate system and the position of a nasion under the camera coordinate system, obtaining the representation of a straight line passing through the fingertip and the nasion under the camera coordinate system, obtaining the intersection point of the straight line passing through the fingertip and the nasion and a screen plane, and obtaining the position of a cursor in the screen coordinate system through the transformation matrix from the camera to the screen coordinate system to judge whether a cursor point is in a screen range.
Specifically, (1) fingertip position estimation, as shown in fig. 5

In the embodiment of the invention, the positions of the human fingertip M and the human nasion N in the camera coordinate system, ${}^{c}p_{M}$ and ${}^{c}p_{N}$, are obtained through Kinect for Windows SDK V2. Using the transformation matrix ${}^{f}_{c}T$ from the camera coordinate system to the face coordinate system obtained above, combined with equation (1), the coordinates of the fingertip and the nasion in the face coordinate system, ${}^{f}p_{M}$ and ${}^{f}p_{N}$, are obtained, from which it is judged whether the fingertip is within the field of view of the human eyes. Specifically, the angles between the x axis and the projections of the line connecting the fingertip M and the nasion N onto the xy plane and the xz plane are obtained; if these angles are smaller than the 62° horizontal visual angle and the 50° vertical visual angle of the human eye respectively, the fingertip is within the human field of view, otherwise it is not. The field of view of the human eye is shown in fig. 6.
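The sketch below illustrates one way such a field-of-view test could look in Python, assuming the fingertip and nasion have already been transformed into the face coordinate system and assuming z is the forward direction of the face; the axis convention, the numeric inputs and the exact angle definition are assumptions for illustration rather than the patent's precise formulation.

```python
import numpy as np

# Illustrative inputs: fingertip M and nasion N expressed in the face frame (meters).
# Assumed convention: x sideways, y up, z pointing forward out of the face.
f_p_M = np.array([0.15, -0.10, 0.45])
f_p_N = np.array([0.0, 0.0, 0.0])        # the nasion lies close to the face-frame origin

H_FOV_DEG = 62.0                         # horizontal visual angle used by the method
V_FOV_DEG = 50.0                         # vertical visual angle used by the method

v = f_p_M - f_p_N                        # direction from the nasion to the fingertip
horizontal = np.degrees(np.arctan2(abs(v[0]), v[2]))   # sideways deviation from straight ahead
vertical = np.degrees(np.arctan2(abs(v[1]), v[2]))     # vertical deviation from straight ahead

in_view = (v[2] > 0) and (horizontal < H_FOV_DEG) and (vertical < V_FOV_DEG)
```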
(2) Cursor position estimation

After the system is built, the transformation matrix ${}^{i}_{c}T$ from the camera coordinate system to the screen coordinate system is obtained through three-dimensional calibration. Combined with the position of the human fingertip M in the camera coordinate system, ${}^{c}p_{M}$, and the position of the nasion N in the camera coordinate system, ${}^{c}p_{N}$, obtained through Kinect for Windows SDK V2, the representation of the straight line through the fingertip and the nasion in the camera coordinate system, ${}^{c}L_{MN}$, is obtained from equation (9). Using equation (10), the intersection of this line with the screen plane ${}^{c}\pi_{s}$ gives the homogeneous coordinate representation of the cursor in the camera coordinate system, ${}^{c}p_{I}$. Finally, the transformation matrix ${}^{i}_{c}T$ from the camera to the screen coordinate system, combined with equation (1), gives the position of the cursor in the screen coordinate system, ${}^{i}p_{I}$. It is then judged whether the cursor point is within the screen range; if not, the cursor point is considered invalid.
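A short Python sketch chaining equations (9), (10) and (1) from the fingertip and nasion positions to a screen-range check follows; the calibration matrix, the point coordinates and the screen size are placeholder assumptions, not values from the patent.

```python
import numpy as np

def to_h(p):                                     # 3-vector -> homogeneous 4-vector
    return np.append(p, 1.0)

# Assumed calibration result and illustrative measurements.
T_i_c = np.eye(4)                                # camera -> screen transform from 3D calibration
c_p_M = to_h(np.array([0.25, -0.05, 1.10]))      # fingertip in camera coordinates
c_p_N = to_h(np.array([0.20, 0.10, 1.60]))       # nasion in camera coordinates
c_pi_s = T_i_c.T @ np.array([0.0, 0.0, 1.0, 0.0])    # screen plane in camera coordinates

L = np.outer(c_p_M, c_p_N) - np.outer(c_p_N, c_p_M)  # line through fingertip and nasion, equation (9)
c_p_I = L @ c_pi_s                                   # intersection with the screen plane, equation (10)
i_p_I = T_i_c @ c_p_I                                # cursor in screen coordinates, equation (1)
i_p_I = i_p_I / i_p_I[3]                             # normalize the homogeneous coordinates

SCREEN_W, SCREEN_H = 1.60, 0.90                      # assumed screen size in meters
valid = (0.0 <= i_p_I[0] <= SCREEN_W) and (0.0 <= i_p_I[1] <= SCREEN_H)
```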
In step S103, the palm region image of the operator is segmented in the color image according to the joint position of the human body, and the operation command is recognized and classified by the pre-trained gesture classification model.
It is understood that after the above steps are performed, the embodiment of the present invention determines the operation instruction according to the gesture type of the operator. That is, according to the human joint position extracted in step S101, the embodiment of the present invention segments the image of the palm region of the operator from the color image, and recognizes and classifies the operation command by the gesture classification model trained in advance.
Further, in an embodiment of the present invention, the method includes segmenting a palm region image of an operator from a color image according to positions of joints of a human body, and performing recognition and classification on an operation instruction by using a pre-trained gesture classification model, and further includes: acquiring pixel positions of the palm of the operator in the color image, and segmenting by taking the palm of the operator as a midpoint to generate a gesture image with a preset size; and classifying and identifying the gesture images by an artificial neural network, and determining an instruction corresponding to the gesture.
Specifically, (1) gesture image segmentation, as shown in FIG. 5
The embodiment of the invention obtains the pixel position of the operator's palm in the color image through Kinect for Windows SDK V2 and crops an image of size 128×128 centered on the palm as the gesture image.
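A minimal Python sketch of such a palm-centered crop follows, assuming NumPy; the helper name, the frame size and the palm position are illustrative assumptions.

```python
import numpy as np

def crop_gesture(color_image, palm_xy, size=128):
    """Crop a size x size patch centered on the palm pixel position; pads with zeros at the border."""
    h, w = color_image.shape[:2]
    cx, cy = int(palm_xy[0]), int(palm_xy[1])
    half = size // 2
    patch = np.zeros((size, size, color_image.shape[2]), dtype=color_image.dtype)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    patch[y0 - (cy - half):y1 - (cy - half), x0 - (cx - half):x1 - (cx - half)] = color_image[y0:y1, x0:x1]
    return patch

# Example with a synthetic 1080p color frame and an assumed palm position.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
gesture = crop_gesture(frame, palm_xy=(960, 540))   # 128 x 128 x 3 patch
```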
(2) Gesture recognition
The segmented gesture image is classified and recognized by a pre-trained artificial neural network to determine the instruction corresponding to the gesture. A convolutional neural network is used as the training model; images of eight different gestures, collected from volunteers by the Kinect under different conditions, form the data set, and the model is obtained by training after data augmentation. The final command is determined by a single gesture or by a combination of multiple gestures.
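The patent does not disclose the exact network structure; the following PyTorch sketch is only one plausible small convolutional classifier for 128×128 gesture crops with eight classes, with all layer sizes assumed for illustration.

```python
import torch
import torch.nn as nn

class GestureNet(nn.Module):
    """A small CNN for 128x128 RGB gesture crops with 8 gesture classes (architecture assumed)."""
    def __init__(self, num_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 128 -> 64
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 64 -> 32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Inference on one 128x128 gesture crop (values in [0, 1], NCHW layout).
model = GestureNet().eval()
with torch.no_grad():
    logits = model(torch.rand(1, 3, 128, 128))
    gesture_id = int(logits.argmax(dim=1))       # index of the predicted gesture / command
```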
In summary, the method of the embodiment of the invention is suitable for equipment that has a large display and needs to be operated contact-free or by several persons, and comprises three steps: acquiring the human joints, calculating the cursor position, and recognizing the operation gesture. The embodiment of the invention aims to provide a novel human-computer interaction solution that controls the user interfaces of computers, projectors, game machines and the like in a non-contact manner and can handle multi-user control scenarios at the same time. The embodiment of the invention allows interaction with the machine through gestures without other equipment; gestures are natural, intuitive and easy to understand, and better match daily human communication habits. Moreover, control is possible as long as the operator is within the Kinect field of view, so the operator can control a device with a large physical display from 1-4 meters away. Finally, the system designed according to the embodiment of the invention supports multi-person operation: Kinect supports real-time tracking of the joint positions of up to 6 operators, so the method of the embodiment of the invention can be applied to more complex human-computer interaction scenarios in the future.
According to the Kinect-based multi-person gesture human-computer interaction method provided by the embodiment of the invention, the joint and face information of the human body is obtained through the Kinect camera, the position of the operation cursor is determined from the positions of the operator's eyes and hands, and the operation instruction is determined from the acquired gesture image of the operator, thereby realizing interaction with the graphical interface of the equipment. The operation is natural and quick, no additional external equipment is needed, and a non-contact interaction mode is adopted, so the method can cope with special scenes such as hospitals and scientific research, can be used by multiple persons at the same time, and meets more complex interaction requirements in the future.
Next, a multi-person gesture human-computer interaction device based on Kinect according to an embodiment of the present invention will be described with reference to the drawings.
FIG. 7 is a schematic structural diagram of a Kinect-based multi-person gesture human-computer interaction device according to an embodiment of the invention.
As shown in fig. 7, the Kinect-based multi-person gesture human-computer interaction device 10 includes: an acquisition module 100, a processing module 200, and an identification classification module 300.
The acquisition module 100 is configured to acquire a color image and a depth image of a scene through a Kinect camera, and to acquire the human joint positions and facial feature parameters of the operators in the scene. The processing module 200 is configured to obtain the cursor positions corresponding to the operator's two hands from the human joint positions and the facial feature parameters according to the principle that light travels in straight lines. The recognition and classification module 300 is configured to segment the palm region image of the operator from the color image according to the human joint positions, and to recognize and classify the operation instruction with a pre-trained gesture classification model. The device 10 of the embodiment of the invention is natural and quick to operate, needs no additional external equipment, adopts a non-contact interaction mode, can cope with special scenes such as hospitals and scientific research, can be used by multiple persons at the same time, and meets more complex interaction requirements in the future.
Further, in one embodiment of the present invention, the apparatus 10 of the embodiment of the present invention further comprises: a first obtaining module. The first acquisition module is used for acquiring a camera coordinate system, a screen coordinate system and a human face coordinate system and detecting fingertips and cursor points.
Further, in one embodiment of the present invention, the apparatus 10 of the embodiment of the present invention further comprises: the device comprises a detection module and a second acquisition module.
The detection module is used for detecting the position of the screen origin position in the camera coordinate system and the rotation matrix of the orientation of the screen origin position in the camera coordinate system, and acquiring a transformation matrix from the camera coordinate system to the screen coordinate system to obtain the screen position. The second acquisition module acquires the position of the human head joint under the camera coordinate system and the rotation matrix of the orientation of the human head joint in the camera coordinate system, and acquires the transformation matrix from the camera coordinate system to the human face coordinate system so as to obtain the human face direction and the visual field.
Further, in one embodiment of the present invention, the processing module comprises: a detection unit and an acquisition unit.
The detection unit is used for detecting the positions of the finger tips and the nose roots of the people in a camera coordinate system and judging whether the finger tips are in the visual field range or not according to a transformation matrix from the camera coordinate system to a human face coordinate system and the coordinates of the finger tips and the nose roots of the people in the human face coordinate system. The acquisition unit is used for acquiring a transformation matrix from a camera coordinate system to a screen coordinate system through three-dimensional calibration so as to acquire the position of a fingertip under the camera coordinate system and the position of a nasion under the camera coordinate system, acquiring the representation of a straight line passing through the fingertip and the nasion under the camera coordinate system, acquiring the intersection point of the straight line passing through the fingertip and the nasion and a screen plane, and acquiring the position of a cursor in the screen coordinate system through the transformation matrix from the camera to the screen coordinate system so as to judge whether a cursor point is in a screen range.
Further, in an embodiment of the present invention, the recognition and classification module 300 is further configured to obtain a pixel position of the palm of the operator in the color image, segment the palm of the operator as a midpoint to generate a gesture image with a preset size, perform classification and recognition on the gesture image by using an artificial neural network, and determine an instruction corresponding to the gesture.
It should be noted that the explanation of the embodiment of the multi-person gesture human-computer interaction method based on the Kinect is also applicable to the multi-person gesture human-computer interaction device based on the Kinect of the embodiment, and is not repeated herein.
According to the Kinect-based multi-person gesture human-computer interaction device provided by the embodiment of the invention, the information of joints and faces of a human body is obtained through the Kinect camera, the position of an operation cursor is determined according to the positions of eyes and hands of an operator, and an operation instruction is determined according to the acquired gesture image of the operator, so that the multi-person gesture human-computer interaction device interacts with a graphical interface of equipment.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (4)

1. A Kinect-based multi-person gesture man-machine interaction method is characterized by comprising the following steps:
acquiring a color image and a depth image of a scene through a Kinect camera, and acquiring the human joint position and the face characteristic parameters of an operator in the scene;
acquiring a camera coordinate system, a screen coordinate system and a human face coordinate system, and detecting fingertip and cursor points, specifically: the camera coordinate system is used as the world coordinate system, its origin $O_c$ corresponds to the imaging center of the Kinect camera, the $Z_c$ axis coincides with the optical axis, and the $Y_c$ and $X_c$ axes are parallel to the imaging plane; the origin of the screen coordinate system is $O_i$, corresponding to the upper left corner of the screen, its $Y_i$ and $X_i$ axes are parallel to the camera imaging plane, and its $Z_i$ axis is opposite to that of the camera coordinate system; the origin of the face coordinate system is $O_f$, corresponding to the position of the head joint, and the $Z_f$ axis points in the direction the face is facing; conversion between the coordinate systems is realized by homogeneous transformation, the transformation matrix from the camera coordinate system to the screen coordinate system being ${}^{i}_{c}T$ and the transformation matrix from the camera coordinate system to the face coordinate system being ${}^{f}_{c}T$, wherein ${}^{i}_{c}T$ is determined by the relative position of the camera and the screen as a fixed parameter of the system, and ${}^{f}_{c}T$ is obtained from the head pose provided by the functions of Kinect for Windows SDK V2, its value changing continuously; the descriptions of an arbitrary point $p$ in the two coordinate systems $\{A\}$ and $\{B\}$ are ${}^{A}p$ and ${}^{B}p$, with the homogeneous transformation relation

$${}^{A}p = {}^{A}_{B}T\,{}^{B}p \qquad (1)$$

the homogeneous transformation matrix ${}^{A}_{B}T$ is a 4×4 square matrix:

$${}^{A}_{B}T = \begin{bmatrix} {}^{A}_{B}R & {}^{A}p_{B\,org} \\ \mathbf{0}^{T} & 1 \end{bmatrix} \qquad (2)$$

which combines a translation transformation, given by ${}^{A}p_{B\,org}$, and a rotation transformation, given by ${}^{A}_{B}R$, wherein ${}^{A}p_{B\,org}$ is the translation vector of the coordinate system $\{B\}$ relative to the coordinate system $\{A\}$ and ${}^{A}_{B}R$ describes the orientation of the coordinate system $\{B\}$ relative to the coordinate system $\{A\}$; the SDK gives the quaternion representation $q=[q_0,q_1,q_2,q_3]^{T}$, which represents the rotation operation from the coordinate system $\{A\}$ to the coordinate system $\{B\}$, its relationship to the rotation matrix being

$${}^{A}_{B}R = \begin{bmatrix} 1-2(q_2^2+q_3^2) & 2(q_1q_2-q_0q_3) & 2(q_1q_3+q_0q_2) \\ 2(q_1q_2+q_0q_3) & 1-2(q_1^2+q_3^2) & 2(q_2q_3-q_0q_1) \\ 2(q_1q_3-q_0q_2) & 2(q_2q_3+q_0q_1) & 1-2(q_1^2+q_2^2) \end{bmatrix} \qquad (3)$$

knowing the description ${}^{A}_{B}T$ of the coordinate system $\{B\}$ relative to the coordinate system $\{A\}$, the description ${}^{B}_{A}T$ of $\{A\}$ relative to $\{B\}$ is determined by homogeneous transformation inversion, specifically:

$${}^{B}_{A}T = \left({}^{A}_{B}T\right)^{-1} = \begin{bmatrix} {}^{A}_{B}R^{T} & -{}^{A}_{B}R^{T}\,{}^{A}p_{B\,org} \\ \mathbf{0}^{T} & 1 \end{bmatrix} \qquad (4)$$

in the three-dimensional homogeneous coordinate system a point is dual to a plane, so the representation form of a plane is consistent with that of a point: the plane is represented by a four-dimensional vector $\pi=(\pi_1,\pi_2,\pi_3,\pi_4)^{T}$, and a point $x=(x_1,x_2,x_3,x_4)^{T}$ lies on the plane if $\pi_1x_1+\pi_2x_2+\pi_3x_3+\pi_4x_4=0$ (5); a plane is defined by three non-collinear points $x_1$, $x_2$, $x_3$:

$$\begin{bmatrix} x_1^{T} \\ x_2^{T} \\ x_3^{T} \end{bmatrix}\pi = 0 \qquad (6)$$

when the point $x$ is on the plane, the matrix $M=[x,\,x_1,\,x_2,\,x_3]$ has a determinant of 0: $\det M = x_1D_{234}-x_2D_{134}+x_3D_{124}-x_4D_{123}=0$ (7); the plane is expressed as $\pi=(D_{234},\,-D_{134},\,D_{124},\,-D_{123})$ (8), wherein $D_{jkl}$ denotes the determinant of the matrix formed by rows $j$, $k$, $l$ of the 4×3 matrix $[x_1,\,x_2,\,x_3]$; in the three-dimensional homogeneous coordinate system a straight line is represented by a 4×4 matrix, the straight line $L$ determined by two points $x_1$ and $x_2$ being

$$L = x_1x_2^{T} - x_2x_1^{T} \qquad (9)$$

and the intersection of the plane $\pi$ with the line $L$ is denoted $x$: $x = L\pi$ (10);
detecting the position of the screen origin in the camera coordinate system and a rotation matrix describing its orientation in the camera coordinate system, and acquiring the transformation matrix from the camera coordinate system to the screen coordinate system to obtain the screen position, specifically: the screen origin position $O_i$ in the camera coordinate system, ${}^{c}p_{O_i}$, and the rotation matrix ${}^{c}_{i}R$ representing its orientation in the camera coordinate system are obtained by measurement; according to equation (4), the transformation matrices ${}^{i}_{c}T$ and ${}^{c}_{i}T$ between the camera coordinate system and the screen coordinate system are obtained; the screen plane, i.e. the xy plane of the screen coordinate system ${}^{i}\pi_{s}$, is then expressed in the camera coordinate system as ${}^{c}\pi_{s}$;
Acquiring a position of a human head joint under the camera coordinate system and a rotation matrix of the orientation of the human head joint under the camera coordinate system, and acquiring a transformation matrix from the camera coordinate system to the human face coordinate system to obtain a human face direction and a visual field; specifically, the method comprises the following steps: human head joint position O is obtained through Kinect for Windows SDK V2fPosition in camera coordinate system
Figure FDA0002970086620000028
And rotation matrix representation of its orientation in the camera coordinate system
Figure FDA0002970086620000029
According to
Figure FDA00029700866200000210
Obtaining transformation matrix from camera coordinate system to human face coordinate system
Figure FDA00029700866200000211
Obtaining cursor positions corresponding to the two hands of the operator according to the human joint position and the facial characteristic parameters by a light linear propagation principle; the obtaining of the cursor positions corresponding to the two hands of the operator according to the human joint position and the facial feature parameters through a light linear propagation principle further comprises: detecting the positions of the fingertips and the noseroots of the person under the camera coordinate system, and judging whether the fingertips are in the visual field range or not according to a transformation matrix from the camera coordinate system to a face coordinate system and the coordinates of the fingertips and the noseroots of the person under the face coordinate system; obtaining a transformation matrix from the camera coordinate system to the screen coordinate system through three-dimensional calibration to obtain the position of the fingertip under the camera coordinate system and the position of the nasion under the camera coordinate system, obtaining the representation of a straight line passing through the fingertip and the nasion under the camera coordinate system, obtaining the intersection point of the straight line passing through the fingertip and the nasion and a screen plane, and obtaining the position of a cursor in the screen coordinate system through a conversion matrix from the camera to the screen coordinate system to judge whether a cursor point is in a screen range; and
and segmenting the palm area image of the operator in the color image according to the human body joint position, and identifying and classifying the operation instruction by a pre-trained gesture classification model.
2. The Kinect-based multi-person gesture human-computer interaction method as claimed in claim 1, wherein the step of segmenting the palm area image of the operator from the color image according to the human body joint position and performing recognition and classification on the operation instruction by a pre-trained gesture classification model further comprises:
acquiring the pixel position of the palm of the operator in the color image, and segmenting by taking the palm of the operator as a midpoint to generate a gesture image with a preset size;
and classifying and identifying the gesture image by an artificial neural network, and determining an instruction corresponding to the gesture.
3. The utility model provides a many people gesture human-computer interaction device based on Kinect which characterized in that includes:
the system comprises an acquisition module, a display module and a control module, wherein the acquisition module is used for acquiring a color image and a depth image of a scene through a Kinect camera and acquiring the human body joint position and the face characteristic parameters of an operator in the scene;
the first acquisition module is used for acquiring a camera coordinate system, a screen coordinate system and a human face coordinate system and detecting fingertips and cursor points, and specifically comprises the following steps: the camera coordinate system is used as a world coordinate system and the origin thereof is OcCorresponding to the imaging center of the Kinect camera, ZcThe axis coinciding with the optical axis, YcAxis and XcThe axis is parallel to the imaging plane; the origin of the screen coordinate system is OiCorresponding to the upper left corner of the screen, its YiAxis and XiThe axis being parallel to the camera imaging plane, ZiShaft andthe camera coordinate systems are opposite; the origin of the face coordinate system is OfCorresponding to the position of the head joint, ZfThe shaft faces the connecting surface part; the conversion between the coordinate systems is realized by homogeneous transformation, and the transformation matrix from the camera coordinate system to the screen coordinate system is
Figure FDA0002970086620000031
The transformation matrix from the camera coordinate system to the face coordinate system is
Figure FDA0002970086620000032
Wherein the content of the first and second substances,
Figure FDA0002970086620000033
the relative position of the camera and the screen is determined as a fixed parameter of the system;
Figure FDA0002970086620000034
then the head pose provided by the function provided by the Kinect for Windows SDK V2 is obtained, and the value of the head pose is changed continuously; the description in the two coordinate systems { A } and { B } for an arbitrary point p isAp andBp, and the following homogeneous transformation relations:
Figure FDA0002970086620000035
the homogeneous transformation matrix ${}^{A}_{B}T$ is a 4x4 square matrix

$${}^{A}_{B}T = \begin{bmatrix} {}^{A}_{B}R & {}^{A}p_{B,org} \\ \mathbf{0}^{T} & 1 \end{bmatrix},$$

which comprises a translation transformation ${}^{A}p_{B,org}$ and a rotation transformation ${}^{A}_{B}R$, wherein ${}^{A}p_{B,org}$ is the translation vector of the coordinate system {B} relative to the coordinate system {A} and ${}^{A}_{B}R$ describes the orientation of the coordinate system {B} relative to the coordinate system {A}; the SDK gives the quaternion representation $q = [q_0, q_1, q_2, q_3]^{T}$, which represents the rotation operation from the coordinate system {A} to the coordinate system {B} and is related to the rotation matrix by

$${}^{A}_{B}R = \begin{bmatrix} q_0^2+q_1^2-q_2^2-q_3^2 & 2(q_1 q_2 - q_0 q_3) & 2(q_1 q_3 + q_0 q_2) \\ 2(q_1 q_2 + q_0 q_3) & q_0^2-q_1^2+q_2^2-q_3^2 & 2(q_2 q_3 - q_0 q_1) \\ 2(q_1 q_3 - q_0 q_2) & 2(q_2 q_3 + q_0 q_1) & q_0^2-q_1^2-q_2^2+q_3^2 \end{bmatrix};$$

from the description ${}^{A}_{B}T$ of the coordinate system {B} relative to the coordinate system {A}, the description ${}^{B}_{A}T$ of {A} relative to {B} is determined, specifically

$${}^{B}_{A}T = \left({}^{A}_{B}T\right)^{-1} = \begin{bmatrix} {}^{A}_{B}R^{T} & -{}^{A}_{B}R^{T}\,{}^{A}p_{B,org} \\ \mathbf{0}^{T} & 1 \end{bmatrix};$$
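A self-contained numpy sketch of the transform algebra above: the quaternion-to-rotation conversion, the assembly of a 4x4 homogeneous matrix from R and p, and its closed-form inverse. The function names are illustrative, not taken from the patent or the Kinect SDK.

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a unit quaternion q = [q0, q1, q2, q3] (scalar first)."""
    q0, q1, q2, q3 = q
    return np.array([
        [q0*q0 + q1*q1 - q2*q2 - q3*q3, 2*(q1*q2 - q0*q3),             2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3),             q0*q0 - q1*q1 + q2*q2 - q3*q3, 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2),             2*(q2*q3 + q0*q1),             q0*q0 - q1*q1 - q2*q2 + q3*q3],
    ])

def make_T(R, p):
    """Homogeneous transform with rotation block R (3x3) and translation p (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def invert_T(T):
    """Closed-form inverse [R^T, -R^T p; 0 0 0 1] of a rigid transform."""
    R, p = T[:3, :3], T[:3, 3]
    return make_T(R.T, -R.T @ p)

# Example: given frame {B}'s pose (quaternion + position) expressed in frame {A},
# map an {A}-frame point p_A into {B} coordinates.
# T_A_B = make_T(quat_to_rot(q_B_in_A), p_B_in_A)     # {B} relative to {A}
# p_B = (invert_T(T_A_B) @ np.append(p_A, 1.0))[:3]
```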
in the three-dimensional homogeneous coordinate system a point and a plane are dual, and a plane is represented in the same form as a point, namely by a four-dimensional vector $\pi = (\pi_1, \pi_2, \pi_3, \pi_4)^{T}$; a point $x = (x_1, x_2, x_3, x_4)^{T}$ lies on the plane if $\pi_1 x_1 + \pi_2 x_2 + \pi_3 x_3 + \pi_4 x_4 = 0$; a plane is determined by three non-collinear points $x_1, x_2, x_3$:

$$\det\begin{bmatrix} x & x_1 & x_2 & x_3 \end{bmatrix} = 0;$$

when the point $x$ lies on the plane, the determinant of the matrix $M = [x, x_1, x_2, x_3]$ is 0: $\det M = x_1 D_{234} - x_2 D_{134} + x_3 D_{124} - x_4 D_{123} = 0$, so the plane is represented as $\pi = (D_{234}, -D_{134}, D_{124}, -D_{123})^{T}$, where $D_{jkl}$ denotes the determinant of the matrix formed by rows $j, k, l$ of the 4x3 matrix $[x_1, x_2, x_3]$; in the three-dimensional homogeneous coordinate system a straight line is represented by a 4x4 matrix, the straight line L determined by two points $x_1$ and $x_2$ being

$$L = x_1 x_2^{T} - x_2 x_1^{T};$$

the intersection point of the plane $\pi$ with the line L is denoted x: $x = L\pi$;
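A self-contained numpy sketch of the homogeneous point/plane/line relations above (plane through three points, Plücker matrix of a line through two points, and the line-plane intersection x = Lπ); the helper names are illustrative.

```python
import numpy as np

def plane_from_points(x1, x2, x3):
    """Plane through three non-collinear homogeneous points (4-vectors):
    pi = (D234, -D134, D124, -D123), with D_jkl the determinant of rows j, k, l
    of the 4x3 matrix [x1 x2 x3]."""
    A = np.column_stack([x1, x2, x3])                    # 4x3
    d = lambda rows: np.linalg.det(A[list(rows), :])
    return np.array([d((1, 2, 3)), -d((0, 2, 3)), d((0, 1, 3)), -d((0, 1, 2))])

def line_from_points(x1, x2):
    """Pluecker matrix of the line through two homogeneous points: L = x1 x2^T - x2 x1^T."""
    return np.outer(x1, x2) - np.outer(x2, x1)

def line_plane_intersection(L, pi):
    """Homogeneous intersection point x = L @ pi (divide by x[3] for Euclidean coords)."""
    return L @ pi

# Example: the pointing ray through fingertip and nasion meets the screen plane at
#   x = line_plane_intersection(line_from_points(fingertip_h, nasion_h), screen_plane_h)
```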
a detection module, configured to detect the position of the screen origin in the camera coordinate system and the rotation matrix of its orientation in the camera coordinate system, and to acquire the transformation matrix from the camera coordinate system to the screen coordinate system so as to obtain the screen position; specifically: the position ${}^{c}p_{i,org}$ of the screen origin O_i in the camera coordinate system and the rotation matrix representation ${}^{c}_{i}R$ of its orientation in the camera coordinate system are obtained by measurement; according to

$${}^{c}_{i}T = \begin{bmatrix} {}^{c}_{i}R & {}^{c}p_{i,org} \\ \mathbf{0}^{T} & 1 \end{bmatrix},$$

the transformation matrices between the camera coordinate system and the screen coordinate system, ${}^{c}_{i}T$ and ${}^{i}_{c}T = \left({}^{c}_{i}T\right)^{-1}$, are obtained; the representation ${}^{c}\pi_{i}$ of the screen plane, i.e. the x-y plane of the screen coordinate system, in the camera coordinate system is then determined;
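A self-contained sketch of this calibration step under the notation above, assuming the screen pose has been measured as a rotation matrix and a translation in the camera frame; the function name and the use of the z = 0 plane as the screen surface follow the coordinate definitions in the claim, while the plane transform via the inverse transpose is a standard identity.

```python
import numpy as np

def screen_calibration(R_cam_screen, p_cam_screen):
    """Build the camera<->screen homogeneous transforms from the measured pose of
    the screen origin in the camera frame, and express the screen plane (z = 0 in
    the screen frame) in camera coordinates."""
    T_cam_screen = np.eye(4)                       # screen frame -> camera frame
    T_cam_screen[:3, :3] = R_cam_screen
    T_cam_screen[:3, 3] = p_cam_screen
    T_screen_cam = np.linalg.inv(T_cam_screen)     # camera frame -> screen frame

    # Planes transform with the inverse transpose of the point transform:
    # pi_cam^T x_cam = pi_screen^T x_screen for all points on the plane.
    pi_screen = np.array([0.0, 0.0, 1.0, 0.0])
    pi_cam = T_screen_cam.T @ pi_screen
    return T_cam_screen, T_screen_cam, pi_cam
```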
a second acquisition module, configured to acquire the position of the human head joint in the camera coordinate system and the rotation matrix of its orientation in the camera coordinate system, and to acquire the transformation matrix from the camera coordinate system to the face coordinate system so as to obtain the face direction and field of view; specifically: the position ${}^{c}p_{f,org}$ of the head joint O_f in the camera coordinate system and the rotation matrix representation ${}^{c}_{f}R$ of its orientation in the camera coordinate system are obtained through the Kinect for Windows SDK V2; according to

$${}^{c}_{f}T = \begin{bmatrix} {}^{c}_{f}R & {}^{c}p_{f,org} \\ \mathbf{0}^{T} & 1 \end{bmatrix},$$

the transformation matrix from the camera coordinate system to the face coordinate system, ${}^{f}_{c}T = \left({}^{c}_{f}T\right)^{-1}$, is obtained;
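A self-contained sketch of the per-frame face transform and a simple field-of-view test, assuming the SDK head orientation has already been converted to a rotation matrix (for example with the quaternion conversion sketched earlier); the half-angle threshold and the exact form of the visibility check are assumptions, not values from the patent.

```python
import numpy as np

def face_transform(R_cam_face, p_cam_face):
    """Camera-frame -> face-frame homogeneous transform from the head pose."""
    T_cam_face = np.eye(4)                 # face frame -> camera frame
    T_cam_face[:3, :3] = R_cam_face
    T_cam_face[:3, 3] = p_cam_face
    return np.linalg.inv(T_cam_face)       # camera frame -> face frame

def fingertip_in_view(T_face_cam, fingertip_c, half_angle_deg=60.0):
    """Express the fingertip in the face frame and check the angle between it and
    the face z-axis (the facing direction) against an assumed half-angle."""
    f = (T_face_cam @ np.append(fingertip_c, 1.0))[:3]
    n = np.linalg.norm(f)
    return bool(n > 1e-9 and f[2] / n > np.cos(np.radians(half_angle_deg)))
```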
a processing module, configured to obtain the cursor positions corresponding to the two hands of the operator from the human joint positions and the facial feature parameters according to the principle of rectilinear light propagation; the processing module comprises: a detection unit, configured to detect the positions of the fingertips and the nasion in the camera coordinate system and to judge whether the fingertips are within the field of view according to the transformation matrix from the camera coordinate system to the face coordinate system and the coordinates of the fingertips and the nasion in the face coordinate system; an acquisition unit, configured to acquire the transformation matrix from the camera coordinate system to the screen coordinate system through three-dimensional calibration so as to obtain the positions of the fingertip and the nasion in the camera coordinate system, to derive the representation of the straight line through the fingertip and the nasion in the camera coordinate system, to compute the intersection point of this straight line with the screen plane, and to convert the intersection point by the camera-to-screen transformation matrix to obtain the cursor position in the screen coordinate system and judge whether the cursor point lies within the screen range; and
and a recognition and classification module, configured to segment the palm area image of the operator from the color image according to the human body joint positions and to recognize and classify the operation instruction with a pre-trained gesture classification model.
4. The Kinect-based multi-person gesture human-computer interaction device as claimed in claim 3, wherein the recognition and classification module is further configured to acquire the pixel position of the palm of the operator in the color image, segment with the palm of the operator as the center to generate a gesture image of a preset size, classify and recognize the gesture image through an artificial neural network, and determine the instruction corresponding to the gesture.
CN201810921343.XA 2018-08-14 2018-08-14 Kinect-based multi-person gesture man-machine interaction method and device Active CN109145802B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810921343.XA CN109145802B (en) 2018-08-14 2018-08-14 Kinect-based multi-person gesture man-machine interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810921343.XA CN109145802B (en) 2018-08-14 2018-08-14 Kinect-based multi-person gesture man-machine interaction method and device

Publications (2)

Publication Number Publication Date
CN109145802A CN109145802A (en) 2019-01-04
CN109145802B true CN109145802B (en) 2021-05-14

Family

ID=64793321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810921343.XA Active CN109145802B (en) 2018-08-14 2018-08-14 Kinect-based multi-person gesture man-machine interaction method and device

Country Status (1)

Country Link
CN (1) CN109145802B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766822B (en) * 2019-01-07 2021-02-05 山东大学 Gesture recognition method and system based on neural network
CN110443154B (en) * 2019-07-15 2022-06-03 北京达佳互联信息技术有限公司 Three-dimensional coordinate positioning method and device of key point, electronic equipment and storage medium
CN110458494A (en) * 2019-07-19 2019-11-15 暨南大学 A kind of unmanned plane logistics delivery method and system
CN112706158B (en) * 2019-10-25 2022-05-06 中国科学院沈阳自动化研究所 Industrial man-machine interaction system and method based on vision and inertial navigation positioning
CN113448427B (en) 2020-03-24 2023-09-12 华为技术有限公司 Equipment control method, device and system
CN111913577A (en) * 2020-07-31 2020-11-10 武汉木子弓数字科技有限公司 Three-dimensional space interaction method based on Kinect
CN116974369B (en) * 2023-06-21 2024-05-17 广东工业大学 Method, system, equipment and storage medium for operating medical image in operation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184021A (en) * 2011-05-27 2011-09-14 华南理工大学 Television man-machine interaction method based on handwriting input and fingertip mouse
CN103370678A (en) * 2011-02-18 2013-10-23 维塔驰有限公司 Virtual touch device without pointer
US20150000026A1 (en) * 2012-06-27 2015-01-01 sigmund lindsay clements Touch Free User Recognition Assembly For Activating A User's Smart Toilet's Devices
CN106909216A (en) * 2017-01-05 2017-06-30 华南理工大学 A kind of Apery manipulator control method based on Kinect sensor

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103370678A (en) * 2011-02-18 2013-10-23 维塔驰有限公司 Virtual touch device without pointer
CN102184021A (en) * 2011-05-27 2011-09-14 华南理工大学 Television man-machine interaction method based on handwriting input and fingertip mouse
US20150000026A1 (en) * 2012-06-27 2015-01-01 sigmund lindsay clements Touch Free User Recognition Assembly For Activating A User's Smart Toilet's Devices
CN106909216A (en) * 2017-01-05 2017-06-30 华南理工大学 A kind of Apery manipulator control method based on Kinect sensor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Real-Time Hand Posture Recognition System Using Deep Neural Networks; Ao Tang et al.; ACM Transactions on Intelligent Systems and Technology; 2015-03-31; Vol. 6, No. 2; pp. 1-23 *
Human-Computer Interaction Based on Gaze Tracking and Gesture Recognition; Xiao Zhiyong et al.; Computer Engineering; 2009-08-31; Vol. 35, No. 15; pp. 198-200 *

Also Published As

Publication number Publication date
CN109145802A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109145802B (en) Kinect-based multi-person gesture man-machine interaction method and device
US10394334B2 (en) Gesture-based control system
Wang et al. Real-time hand-tracking with a color glove
CN107665042B (en) Enhanced virtual touchpad and touchscreen
EP2904472B1 (en) Wearable sensor for tracking articulated body-parts
Murthy et al. A review of vision based hand gestures recognition
CN106598227B (en) Gesture identification method based on Leap Motion and Kinect
CN107632699B (en) Natural human-machine interaction system based on the fusion of more perception datas
CN108845668B (en) Man-machine interaction system and method
US20130335318A1 (en) Method and apparatus for doing hand and face gesture recognition using 3d sensors and hardware non-linear classifiers
CN107357427A (en) A kind of gesture identification control method for virtual reality device
CN103135753A (en) Gesture input method and system
Jaemin et al. A robust gesture recognition based on depth data
Dan et al. Survey on hand gesture recognition approaches
O'Hagan et al. Visual gesture interfaces for virtual environments
Roy et al. Real time hand gesture based user friendly human computer interaction system
KR20160141023A (en) The method of dynamic and static gesture recognition using depth camera and interface of immersive media contents
Abdallah et al. An overview of gesture recognition
Xu et al. Bare hand gesture recognition with a single color camera
Chaudhary Finger-stylus for non touch-enable systems
Thomas et al. A comprehensive review on vision based hand gesture recognition technology
Jain et al. Human computer interaction–Hand gesture recognition
Vančo et al. Gesture identification for system navigation in 3D scene
CN114296543A (en) Fingertip force detection and gesture recognition intelligent interaction system and intelligent ring
Varga et al. Survey and investigation of hand motion processing technologies for compliance with shape conceptualization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant