CN114779932A - User gesture recognition method, system, device and storage medium - Google Patents

User gesture recognition method, system, device and storage medium

Info

Publication number
CN114779932A
CN114779932A
Authority
CN
China
Prior art keywords
user
depth image
sensing data
equipment
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210386221.1A
Other languages
Chinese (zh)
Inventor
谢娟
杜承阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202210386221.1A
Publication of CN114779932A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides a user gesture recognition method, system, device and storage medium. The method includes: acquiring, in real time, a depth image captured by a depth acquisition device, and acquiring, in real time, motion sensing data captured by a sensing device held by a target user; calculating a posture change estimate according to the depth image of the target user; and correcting the posture change estimate online according to the motion sensing data. Because the posture change estimate is corrected online with the motion sensing data captured by the sensing device the user holds, accurate posture estimation is possible even when the depth detection accuracy of the depth acquisition device is limited, more users can be accommodated in a larger detection area for human-computer interaction, the impact of occlusion, dropped camera frames and similar real-world problems on posture estimation is effectively mitigated, and the scheme is easier to deploy in real commercial scenarios. The behavior of multiple users can also be tracked simultaneously while preventing their interaction behaviors from interfering with one another.

Description

User gesture recognition method, system, device and storage medium
Technical Field
The application belongs to the field of computer technology, and in particular relates to a user gesture recognition method, system, device and storage medium.
Background
With the development of technology, a new generation of smart televisions represented by smart screens has become popular. Usage scenarios for smart screens often involve intelligent interaction based on human body posture, particularly hand gestures.
In the related art, user video is captured only by an ordinary camera, and the user's gesture is computed from that video. Human body images captured by an ordinary camera include the user's face, which creates security risks in terms of data compliance and user privacy. Moreover, when the user's gesture is analyzed only from ordinary camera video, the kinds of interaction that can be carried out are limited.
Disclosure of Invention
The application provides a user posture recognition method, system, device and storage medium in which the posture change estimate is corrected online using motion sensing data captured by a sensing device held in the user's hand. Accurate posture estimation is therefore possible even when the depth detection accuracy of the depth acquisition device is limited, more users can be accommodated in a larger detection area for more kinds of human-computer interaction, and because the depth image captured by the depth acquisition device does not reveal the user's facial information, security is higher.
An embodiment of a first aspect of the present application provides a user gesture recognition method, including:
acquiring, in real time, a depth image captured by a depth acquisition device, the depth image including a depth image of a target user, and acquiring, in real time, motion sensing data captured by a sensing device held by the target user;
calculating a posture change estimate of the target user according to the depth image of the target user;
and correcting the posture change estimate online according to the motion sensing data to obtain a corrected posture change estimate.
An embodiment of a second aspect of the present application provides a user gesture recognition system, including: the device comprises depth acquisition equipment, sensing equipment and display equipment;
the depth acquisition equipment is used for acquiring a depth image in real time, wherein the depth image comprises a depth image of at least one user;
the sensing equipment is used for detecting motion sensing data corresponding to a user holding the sensing equipment;
the display device is used for receiving the depth image transmitted by the depth acquisition device and the motion sensing data transmitted by the sensing device; calculating a posture change estimate of a target user according to the depth image of the target user; and correcting the posture change estimate online according to the motion sensing data corresponding to the target user to obtain a corrected posture change estimate.
Embodiments of the third aspect of the present application provide an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method of the first aspect.
An embodiment of a fourth aspect of the present application provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method of the first aspect.
The technical solutions provided in the embodiments of the application have at least the following technical effects or advantages:
In the embodiments of the application, a depth acquisition device captures the user's depth image, and a sensing device held by the user captures the user's motion sensing data. The user's posture change estimate is predicted from the depth image and then corrected online with the motion sensing data, so accurate posture estimation is possible even when the depth detection accuracy of the depth acquisition device is limited, and more users can be accommodated in a larger detection area for intelligent human-computer interaction. Based on linkage and sensor fusion among multiple terminal devices such as the depth acquisition device, the sensing device and the display device, the system accuracy of posture estimation from the depth acquisition device is effectively improved, the impact of occlusion and dropped camera frames in real scenes on posture estimation is effectively mitigated, and the system becomes easier to deploy in real commercial scenarios. The behavior of multiple users can be tracked simultaneously, a user image number is assigned to each user, and the users' interaction behaviors are kept from interfering with one another.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings.
In the drawings:
FIG. 1 is a schematic diagram illustrating a network architecture on which a user gesture recognition method according to an embodiment of the present application is based;
FIG. 2 shows a flow diagram of a user preparation phase provided by an embodiment of the present application;
FIG. 3 is a flow chart illustrating a method for user gesture recognition provided by an embodiment of the present application;
FIG. 4 is another schematic diagram illustrating a network architecture on which a user gesture recognition method provided by an embodiment of the present application is based;
FIG. 5 is a schematic structural diagram of a user gesture recognition apparatus provided in an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an electronic device according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a storage medium according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
A user gesture recognition method, a system, a device and a storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.
The embodiment of the application provides a user gesture recognition method in which a depth acquisition device captures distance information between the human body and the depth acquisition device, this is combined with motion sensing data captured by a sensing device held by the user, and real-time acquisition and dynamic analysis of the human body posture is realized through sensor fusion, so that the user's interaction intention and interaction strength are finally recognized in real time.
The depth acquisition device may include a depth camera, a distance sensor, or any other device capable of capturing the distance information between the human body and the depth acquisition device. The sensing device may include one or more of an acceleration sensor, a gravimeter, an angular velocity sensor, a gyroscope, or any other device capable of measuring motion parameters such as acceleration, angular velocity and displacement. The sensing device may be a standalone device, may be integrated into a terminal such as the user's mobile phone or tablet computer, or may be integrated into a wearable device such as a watch, helmet, glasses, knee pad or shoe. Accordingly, the limb part with which the user holds the sensing device may be the user's hand, leg, foot or head.
Fig. 1 shows a schematic diagram of a network architecture on which an embodiment of the present application is based. As shown in fig. 1, the network architecture includes a depth acquisition device, a sensing device and a display device. The display device is a screen device that can carry out human-computer interaction with the user based on human body posture; for example, it may be a television or a smart screen, where a smart screen is a display device with intelligent functions beyond display, such as posture-based interaction with the user and support for video calls. The depth acquisition device and the sensing device are communicatively connected to the display device; the connection may be wired or wireless, and a wireless connection may use WiFi, Bluetooth or another wireless communication mode. Communication between the depth acquisition device or the sensing device and the display device may use a protocol such as MQTT (Message Queuing Telemetry Transport) for data transmission, or may use sockets. Other protocols or transmission methods may also be adopted; the embodiment of the present application is not specifically limited in this respect.
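As a concrete illustration of the data path just described, the following minimal sketch (in Python) shows how a sensing device might publish its motion sensing data toward the display device over MQTT. The broker address, topic name, payload layout and the paho-mqtt client library are all assumptions for illustration; the embodiment itself leaves the protocol choice open.

```python
# Minimal sketch: a sensing device publishing motion sensing samples over MQTT.
# Assumptions: paho-mqtt 1.x client API, a reachable broker, made-up topic/payload names.
import json
import time

import paho.mqtt.client as mqtt

BROKER_HOST = "192.168.1.10"            # hypothetical broker on the display device side
TOPIC = "gesture/sensor/device-001"     # hypothetical topic, one per sensing device


def read_imu():
    # Placeholder for the real IMU driver: acceleration (m/s^2) and angular velocity (rad/s).
    return {"acc": [0.0, 0.0, 9.8], "gyro": [0.0, 0.0, 0.0]}


client = mqtt.Client(client_id="device-001")
client.connect(BROKER_HOST, 1883, keepalive=60)
client.loop_start()

for _ in range(500):                     # publish for a short while at roughly 50 Hz
    sample = {"device_id": "device-001", "ts": time.time(), **read_imu()}
    client.publish(TOPIC, json.dumps(sample), qos=0)
    time.sleep(0.02)

client.loop_stop()
client.disconnect()
```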
Based on the network architecture shown in fig. 1, before the user's gesture is recognized by the method provided in the embodiment of the present application and human-computer interaction between the user and the display device begins, the user first needs to log in to the display device, and on the display device side the depth image of the user captured by the depth acquisition device is bound with the motion sensing data captured by the sensing device held by the user.
As shown in fig. 2, the following steps S1-S9 are performed to log the user in to the display device and complete the binding operation:
s1: the display device displays graphic codes for human-computer interaction.
The graphic code includes information of an application program capable of human-computer interaction with the display device, and the information of the application program may include one or more of information of an IP address, a port, a program name, and the like of the application program. The application may be an applet that can be called by the code scanner, or another application separate from the code scanner. The code scanning program is an application program which can call the camera device to scan the graphic code. The display device may display the graphic code at a preset position of the screen, such as at the upper left corner or the lower right corner of the screen. The graphic code can be a bar code or a two-dimensional code, etc.
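For illustration, the sketch below generates such a graphic code as a QR image whose payload carries the application information listed above. The `qrcode` package, the JSON field names and the example values are assumptions, not part of the application.

```python
# Minimal sketch: encode the interaction program's info into a QR code.
# The qrcode package and all field names/values here are illustrative assumptions.
import json

import qrcode

app_info = {
    "ip": "192.168.1.10",        # hypothetical address of the interaction service
    "port": 9000,                # hypothetical port
    "program": "gesture-login",  # hypothetical program name
}

img = qrcode.make(json.dumps(app_info))
img.save("login_qr.png")  # the display device could render this image in a screen corner
```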
Instead of being displayed by the display device, the graphic code may also be posted in printed form on a wall or on a non-display area of the display device.
S2: and the sensing equipment held by the user scans the graphic code, identifies the information of the application program contained in the graphic code, and calls the corresponding man-machine interaction program to send a login request to the display equipment according to the information of the application program, wherein the login request comprises the equipment identifier of the sensing equipment.
The sensing equipment held by the user can be mobile phones, tablet computers, watches, helmets, glasses and other equipment of the user, and sensors capable of collecting motion sensing data of the user are integrated in the equipment.
S3: the display device receives a login request sent by the sensing device held by the user based on the graphic code, judges whether the number of people who have logged in currently is smaller than a preset number, if so, executes step S5, and if not, executes step S4.
Because the number of users that the acquisition range of the depth acquisition device can hold is limited, and because the display device interacts with users in real time and every user should get a smooth interactive experience, the number of users interacting simultaneously cannot be too large. Therefore, the maximum number of users who can interact simultaneously is preset in the display device; the preset number may be 10, 20, 30 or the like, which the embodiment of the present application does not limit.
When the display device receives a user's login request, it judges whether the number of currently logged-in users is smaller than the preset number. If so, the subsequent steps are performed to log the current user in; if not, step S4 is performed to deny the current user's login.
S4: the display device sends prompt information to the sensing device, the prompt information being used to prompt the user that the number of logged-in users is currently full.
S5: the display device stores the device identification of the sensing device and sends interactive guidance information to the sensing device.
The interactive guidance information is used for guiding the user to realize the interactive actions required by different interactive functions, namely the interactive guidance information comprises the mapping relation between the interactive function description information and the corresponding interactive action explanation information.
And after the sensing equipment held by the user receives the interaction guide information, displaying the interaction guide information. The interaction guidance information comprises preset actions required to be made by binding the motion sensing data acquired by the sensing equipment with the depth image of the user acquired by the depth acquisition equipment. The preset action may be swinging an arm downwards, swinging an arm upwards, or the like, and the preset action is not particularly limited in the embodiment of the present application.
After the sensing equipment held by the user displays the interaction guidance information, the user can face the display equipment, stand in the acquisition range of the depth acquisition equipment, and make the preset action according to the explanation information of the preset action recorded in the interaction guidance information.
S6: the sensing equipment held by the user collects the motion sensing data during the preset action of the user and transmits the motion sensing data to the display equipment.
The motion sensing data comprises motion parameters such as acceleration, angular velocity, speed, displacement and the like of the limb part of the sensing device held by the user at each moment.
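A minimal sketch of how one such motion sensing sample might be represented follows; the field names and units are assumptions for illustration.

```python
# Minimal sketch of a single motion sensing sample; field names and units are assumptions.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MotionSample:
    device_id: str                                   # identifier of the sensing device
    timestamp: float                                 # acquisition time in seconds
    acceleration: Tuple[float, float, float]         # m/s^2
    angular_velocity: Tuple[float, float, float]     # rad/s
    velocity: Tuple[float, float, float] = (0.0, 0.0, 0.0)      # optional integrated value
    displacement: Tuple[float, float, float] = (0.0, 0.0, 0.0)  # optional integrated value
```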
S7: the display device receives motion sensing data collected by the sensing device in the process that the user makes a preset action according to the interaction guidance information, and extracts a depth image of each user included in a depth image collected by the depth collection device.
The sensing device of each user transmits motion sensing data to the display device and also transmits a respective device identification to the display device. Therefore, the display device can determine the motion sensing data sent by the sensing device held by the user who just logs in from the data transmitted by the plurality of sensing devices.
The acquisition range of the depth acquisition device can be positioned in front of the display device, and the depth acquisition device acquires a depth image corresponding to the acquisition range in real time, wherein the depth image comprises a depth image of each user standing in the acquisition range. The display equipment receives each frame of depth image transmitted by the depth acquisition equipment, and extracts each frame of depth image corresponding to each user from each frame of depth image.
S8: the display device determines whether a depth image matching the motion sensing data corresponding to the preset motion exists in the depth image of each user.
Because the process of determining whether a user's depth image matches the motion sensing data of the user who has just logged in is the same for every user's depth image, the embodiment of the present application describes the determination in detail using the depth image of a first user as an example; the determination process for the depth images of other users can refer to that of the first user. The first user is any one of the users mentioned above.
In one implementation, the display device first determines a time period for a user to make a preset action based on received motion sensing data during the user's making of the preset action. And then extracting each frame of depth image positioned in the time period from each frame of depth image of the first user. And respectively extracting each frame of human skeleton image of the first user through a preset human posture estimation algorithm according to each frame of depth image of the first user in the time period, and then calculating the human skeleton inter-frame difference value of the first user according to each frame of human skeleton image of the first user. And predicting the limb movement information of the first user in the time period according to the human body skeleton inter-frame difference value of the first user. And determining whether the depth image of the first user is matched with the motion sensing data corresponding to the preset action or not according to the limb motion information and the motion sensing data in the time period.
The preset human body posture estimation algorithm may include OpenPose, Hourglass with Associative Embedding, HigherHRNet, MSPN (a multi-stage human pose estimation network), HRNet, and the like; the embodiment of the present application does not specifically limit which human body posture estimation algorithm is used.
The human body skeleton inter-frame difference value can be used for expressing the difference of coordinates of the same pixel points between two adjacent human body skeleton images. The limb movement information comprises the movement parameters of acceleration, angular velocity, speed, displacement and the like of the limb part movement of the first user predicted by a preset human body posture estimation algorithm.
As an example, the display device calculates a first difference between an acceleration included in the limb movement information and an acceleration included in the motion sensing data for the period of time, and calculates a second difference between an angular velocity included in the limb movement information and an angular velocity included in the motion sensing data. And if the first difference value is smaller than a first preset threshold value and the second difference value is smaller than a second preset threshold value, determining that the depth image of the first user is matched with the motion sensing data corresponding to the preset action. Otherwise, it is determined that the depth image of the first user does not match the motion sensing data corresponding to the preset action.
In other embodiments, a difference between the displacement or the velocity included in the limb movement information and the displacement or the velocity included in the movement sensing data may also be calculated, and when one or more of the first difference, the second difference, the difference corresponding to the displacement, and the difference corresponding to the velocity is smaller than a certain threshold, it is determined that the depth image of the first user matches the movement sensing data corresponding to the preset action.
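The following sketch illustrates the first matching strategy described above under simplified assumptions: wrist keypoints extracted from the skeleton frames stand in for the limb movement information, and the comparison uses the first and second differences with made-up threshold values.

```python
# Minimal sketch of the first matching strategy: compare limb motion predicted from the
# skeleton frames with the sensed IMU data over the action window. Thresholds, frame rate
# and the differencing scheme are illustrative assumptions.
import numpy as np

ACC_THRESHOLD = 1.5    # "first preset threshold" (m/s^2), an assumed value
GYRO_THRESHOLD = 0.8   # "second preset threshold" (rad/s), an assumed value


def limb_motion_from_skeleton(wrist_positions: np.ndarray, dt: float):
    """Estimate wrist acceleration and an angular-velocity proxy from per-frame keypoints.

    wrist_positions: (T, 3) wrist coordinates, one row per depth frame in the time period.
    """
    velocity = np.diff(wrist_positions, axis=0) / dt                 # (T-1, 3)
    acceleration = np.diff(velocity, axis=0) / dt                    # (T-2, 3)
    directions = wrist_positions / np.linalg.norm(wrist_positions, axis=1, keepdims=True)
    angular_velocity = np.linalg.norm(np.diff(directions, axis=0), axis=1) / dt  # (T-1,)
    return acceleration, angular_velocity


def matches(wrist_positions, imu_acc, imu_gyro, dt=1 / 30):
    wrist_positions = np.asarray(wrist_positions, dtype=float)
    imu_acc = np.asarray(imu_acc, dtype=float)        # (N, 3) sensed accelerations
    imu_gyro = np.asarray(imu_gyro, dtype=float)      # (N, 3) sensed angular velocities
    est_acc, est_gyro = limb_motion_from_skeleton(wrist_positions, dt)
    first_diff = abs(np.linalg.norm(est_acc, axis=1).mean()
                     - np.linalg.norm(imu_acc, axis=1).mean())
    second_diff = abs(est_gyro.mean() - np.linalg.norm(imu_gyro, axis=1).mean())
    return first_diff < ACC_THRESHOLD and second_diff < GYRO_THRESHOLD
```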
In another implementation manner, the display device calculates the first motion change matrix according to the acceleration and the angular velocity included in the motion sensing data corresponding to the preset action made by the current user. In other embodiments, the first motion change matrix may be calculated according to one or more motion parameters, such as acceleration, angular velocity, and displacement, included in the motion sensing data. The first motion change matrix comprises a displacement of each point of the limb part of the user holding the sensing device in motion in space.
The display equipment determines a time period of the preset action of the user according to the motion sensing data corresponding to the preset action, and extracts each frame of depth image in the time period from the depth image of the first user. And respectively extracting each frame of human body skeleton image of the first user according to each frame of depth image of the first user in the time period. And calculating a second motion change matrix of the first user according to each frame of human skeleton image of the first user. The second motion change matrix includes a displacement of motion in space of each point on the limb portion over the period of time by the first user.
And if the difference value between the first motion change matrix and the second motion change matrix is smaller than a third preset threshold value, determining that the depth image of the first user is matched with the motion sensing data corresponding to the preset action. Otherwise, it is determined that the depth image of the first user does not match the motion sensing data corresponding to the preset action.
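A corresponding sketch of the second matching strategy follows: a first motion change matrix is integrated from the sensed acceleration, a second one is formed from the skeleton keypoint displacements, and the two are compared against an assumed third threshold. The integration scheme and the comparison rule are illustrative assumptions.

```python
# Minimal sketch of the second matching strategy: compare a motion change matrix derived
# from the IMU data with one derived from the skeleton frames. All numeric choices and the
# comparison rule are illustrative assumptions.
import numpy as np

MATRIX_THRESHOLD = 0.25   # "third preset threshold", an assumed value


def displacement_from_imu(acc: np.ndarray, dt: float) -> np.ndarray:
    """Double-integrate acceleration samples (N, 3) into per-step displacements (N, 3)."""
    velocity = np.cumsum(acc * dt, axis=0)
    return velocity * dt


def displacement_from_skeleton(keypoints: np.ndarray) -> np.ndarray:
    """Per-frame displacement of each tracked point; keypoints has shape (T, K, 3)."""
    return np.diff(keypoints, axis=0).reshape(keypoints.shape[0] - 1, -1)


def matrices_match(imu_acc, keypoints, dt=1 / 30) -> bool:
    first = displacement_from_imu(np.asarray(imu_acc, dtype=float), dt)
    second = displacement_from_skeleton(np.asarray(keypoints, dtype=float))
    # The two matrices generally differ in shape and sampling rate, so this sketch compares
    # their overall motion magnitudes rather than individual entries.
    return abs(np.linalg.norm(first) - np.linalg.norm(second)) < MATRIX_THRESHOLD
```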
And for the depth image of each user in the depth image acquired by the depth acquisition equipment, sequentially determining whether the depth image of each user is matched with the motion sensing data of the current user according to any one implementation mode, and if the depth image of a certain user is judged to be matched, stopping the judgment operation on the depth images of the rest users. And if the depth image matched with the motion sensing data of the user is not determined after the depth images of all the users are judged, sending prompt information for prompting the user to make a preset action again.
S9: and if the depth image exists, distributing a user image number for the matched depth image, and storing the mapping relation between the user image number and the equipment identifier of the sensing equipment.
If it is determined in step S8 that there is a depth image that matches the motion sensing data of the current user, a user image number is assigned to the depth image. The user image number is used to uniquely identify the user, and may be a character sequence formed by a plurality of characters. The display device stores the mapping relation between the image number of the user and the device identification of the sensing device held by the current user.
The display device may further label the user image number at the depth image corresponding to the user in each frame of depth image acquired by the depth acquisition device. Therefore, when a new user logs in subsequently, whether the depth image matched with the motion sensing data of the logged-in user exists in the depth image which is not marked with the user image number can be determined, so that the matching process can be accelerated, and the calculation amount is reduced.
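The bookkeeping described in steps S3, S9 and here can be pictured with the small registry sketch below; the class and method names, and the way image numbers are generated, are assumptions for illustration.

```python
# Minimal sketch of the binding registry: user image numbers keyed by sensing-device
# identifier. Names and the numbering scheme are illustrative assumptions.
import uuid


class UserRegistry:
    def __init__(self, max_users: int = 10):
        self.max_users = max_users
        self.image_number_by_device = {}      # device identifier -> user image number

    def can_login(self) -> bool:
        # Corresponds to the preset-number check in step S3.
        return len(self.image_number_by_device) < self.max_users

    def bind(self, device_id: str) -> str:
        # Assign a user image number to the matched depth image and store the mapping (step S9).
        image_number = uuid.uuid4().hex[:8]
        self.image_number_by_device[device_id] = image_number
        return image_number

    def unbind(self, device_id: str) -> None:
        self.image_number_by_device.pop(device_id, None)
```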
The user preparation is completed by the operations of the above steps S1-S9, and the mapping relationship between the user image number and the device identification of the sensing device held by the user is stored in the display device. The user may then have real-time human-machine interaction with the display device.
Fig. 3 shows a flow of recognizing a user gesture in the network architecture of fig. 1. As shown in fig. 3, the method specifically includes the following steps:
step 101: the display device obtains a depth image which is collected by the depth collection device in real time, wherein the depth image comprises a depth image of a target user, and obtains motion sensing data which is collected by a sensing device held by the target user in real time.
The process of posture tracking and recognition of each user by the display device is the same for each user who completes the user preparation phase operation. Therefore, the embodiment of the present application takes a target user as an example to describe the recognition process of the user gesture in detail, and the target user may be any user in each user who currently completes the user preparation phase.
The depth acquisition equipment acquires a depth image corresponding to the acquisition range in real time and transmits each acquired frame of depth image to the display equipment. Each frame of depth image comprises one or more depth images of users. And the sensing equipment held by each logged-in user collects the motion sensing data generated by the motion of the user in real time and transmits the collected motion sensing data to the display equipment.
The display device receives each frame of depth image transmitted by the depth acquisition device and receives motion sensing data transmitted by each sensing device.
Step 102: the display device calculates a posture change estimate of the target user according to the depth image of the target user.
For a target user, the display device has stored, in the user preparation stage, the target user's user image number and the device identifier of the sensing device held by the target user, and labels the user image number at the target user's depth image within the depth image captured by the depth acquisition device.
According to the target user's user image number, the display device extracts each frame of the target user's depth image from each frame of depth image captured by the depth acquisition device. It then extracts each frame of human skeleton image from each frame of the target user's depth image using a preset human posture estimation algorithm, calculates the human skeleton inter-frame difference values corresponding to the target user from those skeleton images, and determines the target user's posture change estimate according to the human skeleton inter-frame difference values.
That is, the display device can predict the target user's posture change estimate in real time from the depth image of the target user captured in real time by the depth acquisition device.
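A minimal sketch of this step is shown below: the per-frame skeleton keypoints are differenced between frames and summarized into a posture change estimate. The keypoint layout and the way the differences are reduced to one estimate are assumptions for illustration.

```python
# Minimal sketch: derive a posture change estimate from inter-frame skeleton differences.
# The keypoint layout (T frames x K joints x 3 coordinates) and the averaging used to
# summarize the differences are illustrative assumptions.
import numpy as np


def posture_change_estimate(skeletons, dt: float = 1 / 30) -> np.ndarray:
    """Return per-joint velocity estimates of shape (K, 3) over the recent frames."""
    skeletons = np.asarray(skeletons, dtype=float)   # shape (T, K, 3)
    frame_diffs = np.diff(skeletons, axis=0)         # human skeleton inter-frame differences
    return frame_diffs.mean(axis=0) / dt             # mean joint velocity over the window
```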
Step 103: the display device corrects the posture change estimate online according to the motion sensing data corresponding to the target user to obtain a corrected posture change estimate.
According to the device identifier of the sensing device held by the target user, the display device determines, from the motion sensing data sent by the multiple sensing devices, the motion sensing data corresponding to the target user. Based on a preset correction algorithm, it then uses the target user's motion sensing data to dynamically correct, online, the posture change estimate predicted from the target user's depth image captured by the depth acquisition device.
The preset correction algorithm may include a Kalman filtering algorithm, an LSTM (long short-term memory) network, and the like; the embodiment of the present application does not limit which preset correction algorithm is adopted. As an example, dynamic compensation of the posture change estimate by the motion sensing data is implemented based on a Kalman filtering algorithm.
Correcting online the posture change estimate predicted from the target user's depth image with the motion sensing data captured in real time by the sensing device held by the user effectively reduces the impact of distortion, occlusion, dropped frames and similar problems on online posture estimation, and improves the accuracy of posture estimation.
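One possible realization of the Kalman-filter style compensation mentioned above is sketched below for a single joint coordinate: the depth-image-based position estimate is treated as the measurement and the held device's acceleration as the control input. The state layout, noise parameters and this particular fusion scheme are assumptions, not the application's prescribed formulation.

```python
# Minimal sketch of Kalman-filter style compensation for one joint coordinate: the
# depth-image-based position estimate is the measurement and the held device's acceleration
# is the control input. State layout, noise values and this fusion scheme are assumptions.
import numpy as np


class JointAxisKalman:
    def __init__(self, dt: float = 1 / 30, q: float = 1e-2, r: float = 5e-2):
        self.dt = dt
        self.x = np.zeros(2)                           # state: [position, velocity]
        self.P = np.eye(2)                             # state covariance
        self.F = np.array([[1.0, dt], [0.0, 1.0]])     # constant-velocity transition
        self.B = np.array([0.5 * dt * dt, dt])         # control input: IMU acceleration
        self.H = np.array([[1.0, 0.0]])                # we measure position only
        self.Q = q * np.eye(2)                         # process noise
        self.R = np.array([[r]])                       # measurement noise

    def step(self, depth_position: float, imu_acceleration: float) -> float:
        # Predict with the IMU acceleration, then correct with the depth-based position.
        self.x = self.F @ self.x + self.B * imu_acceleration
        self.P = self.F @ self.P @ self.F.T + self.Q
        y = np.array([depth_position]) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return float(self.x[0])                        # corrected position estimate
```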
After obtaining the corrected posture change estimation value corresponding to the target user in the above manner, the display device further identifies the behavior intention and the intention strength of the user based on the corrected posture change estimation value. Specifically, the display device identifies the behavior intention category and the intention strength of the target user through a pre-trained behavior intention classification model according to the corrected posture change estimation value.
In the embodiment of the present application, the behavior intention may be divided into various categories, such as sliding a page up, sliding a page down, enlarging a page, reducing a page, dragging a target, and the like. The behavioral intent categories relate to actions taken by the user, and the intent strengths relate to the magnitude or speed of the actions taken by the user. For example, the intention strength may be the speed of the user swinging an arm, the arm displacement caused by swinging an arm, or the like. The embodiment of the application does not specially limit the specific division of the behavior intention category and the intention strength.
The network structure of the behavior intention classification model can be any neural network model capable of realizing classification, such as ResNet, CNN (convolutional neural network) network and the like. The embodiments of the present application do not limit the specific network structure adopted by the behavior intention classification model.
The embodiment of the present application acquires a training set in advance; the training set comprises multiple groups of posture change values, and each group is labeled with its corresponding behavior intention category and intention strength. The behavior intention classification model is trained on this training set to obtain the trained behavior intention classification model.
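A minimal sketch of such a model and its training step is given below, using a small multilayer perceptron with a classification head for the behavior intention category and a regression head for the intention strength. Input dimensionality, network width and the joint loss are assumptions; the application itself leaves the network structure open. In use, the corrected posture change estimates over a window would be flattened into the input vector.

```python
# Minimal sketch of a behavior intention classifier with an intention-strength regression
# head. Input size, widths and the loss weighting are illustrative assumptions.
import torch
import torch.nn as nn


class IntentModel(nn.Module):
    def __init__(self, in_dim: int = 96, n_classes: int = 5):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 64), nn.ReLU())
        self.cls_head = nn.Linear(64, n_classes)   # behavior intention category
        self.reg_head = nn.Linear(64, 1)           # intention strength

    def forward(self, x):
        h = self.trunk(x)
        return self.cls_head(h), self.reg_head(h).squeeze(-1)


def train_step(model, optimizer, pose_change, intent_label, strength_label):
    logits, strength = model(pose_change)
    loss = nn.functional.cross_entropy(logits, intent_label) \
        + nn.functional.mse_loss(strength, strength_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example usage with random data standing in for labeled posture change values.
model = IntentModel()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 96)                      # a batch of flattened posture change values
y_cls = torch.randint(0, 5, (32,))           # behavior intention category labels
y_str = torch.rand(32)                       # intention strength labels
train_step(model, optim, x, y_cls, y_str)
```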
The corrected posture change estimate obtained in step 103 is input into the trained behavior intention classification model, which outputs the behavior intention category and intention strength corresponding to the action performed by the target user. The display device determines the control instruction corresponding to that behavior intention category and intention strength and executes the corresponding control operation. This completes one round of human-computer interaction in which the target user acts and the display device executes the control operation corresponding to the action. Every other logged-in user interacts with the display device according to the same process.
In the embodiment of the present application, the user's behavior intentions are classified and the user's intention strength is regressed by the neural network model, so the behavior intention category is identified and the intention strength is judged; the user's intention can therefore be understood more accurately, and human-computer interaction of multiple categories and multiple amplitudes is supported. In particular, for operations such as page sliding and target dragging, a higher intention strength can make the sliding or dragging faster, and a lower intention strength can make it slower.
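For illustration, the sketch below maps a recognized intention category and strength to a control operation in which the scroll or drag speed scales with the strength; the category names and scaling constants are assumptions.

```python
# Minimal sketch: map a recognized behavior intention category and intention strength to a
# control operation. Category names and scaling constants are illustrative assumptions.
def dispatch(intent: str, strength: float) -> dict:
    if intent in ("slide_up", "slide_down"):
        # Higher intention strength -> faster page scrolling, as described above.
        return {"op": "scroll", "direction": intent.split("_")[1], "speed": 200 * strength}
    if intent in ("zoom_in", "zoom_out"):
        factor = 1.0 + 0.2 * strength if intent == "zoom_in" else 1.0 - 0.2 * strength
        return {"op": "zoom", "factor": factor}
    if intent == "drag_target":
        return {"op": "drag", "speed": 150 * strength}
    return {"op": "noop"}


# Example: a fast upward swipe scrolls the page up quickly.
print(dispatch("slide_up", strength=0.9))
```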
In other embodiments of the present application, after the display device stores the mapping relationship between the user image number corresponding to the user and the device identifier of the sensor device, it may also be periodically determined whether the motion sensing data acquired by the sensor device is matched with the depth image corresponding to the user image number. The specific process of determining whether to match is the same as the way of determining whether to match in step S8, and is not described herein again.
If the display device determines that the motion sensing data captured by the sensing device held by a user has remained unmatched with the depth image corresponding to that user's image number for a first preset duration, it deletes the mapping relation between the device identifier of the sensing device held by the user and the user image number corresponding to the user.
In other embodiments, if the user needs to log out, the user may send a log-out command to the display device through the sensing device he or she holds. Further, the user may call the human-computer interaction program in step S2 through the sensing device to send an exit instruction to the display device. The exit instruction includes a device identification of the sensing device held by the user. And the display equipment receives an exit instruction sent by the sensing equipment and deletes the mapping relation between the equipment identifier of the sensing equipment and the user image number of the user.
In some embodiments, if the display device detects that the sensing device of a certain user does not update the motion sensing data acquired by the sensing device of the certain user within a second preset time period, the user may not act for a long time, and therefore it is determined that the user does not need to perform human-computer interaction with the display device, and the mapping relationship between the device identifier of the sensing device and the user image number of the user is deleted.
After the mapping relation between the device identifier of the sensing device and the corresponding user image number is deleted through any implementation mode, the display device logs out the login state of the user, the gesture change of the user is not tracked and recognized any more, and the interaction with the user is stopped.
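The timeout-based logout described above can be pictured with the following sketch, which removes a device's mapping when no motion sensing data has arrived within the second preset duration; the duration value and the bookkeeping structures are assumptions.

```python
# Minimal sketch of the timeout-based logout: remove a device's mapping when it has not
# updated its motion sensing data within the second preset duration. The duration value
# and the bookkeeping structures are illustrative assumptions.
import time

SECOND_PRESET_DURATION = 60.0        # seconds, an assumed value

last_update = {}                     # device identifier -> time of the latest sample
image_number_by_device = {}          # device identifier -> user image number


def on_sample(device_id: str) -> None:
    last_update[device_id] = time.monotonic()


def expire_idle_users() -> None:
    now = time.monotonic()
    for device_id in list(last_update):
        if now - last_update[device_id] > SECOND_PRESET_DURATION:
            image_number_by_device.pop(device_id, None)  # delete the mapping relation
            last_update.pop(device_id, None)             # the user is logged out
```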
The depth image of the user is captured by the depth acquisition device; because it contains only the distance information between the user and the depth acquisition device and no facial or similar information, the user's facial information cannot be leaked and security is higher. On top of the depth acquisition device, motion sensing data is captured by the sensing device held by the user. Data registration and online fusion of the sensing data are realized on the local end side (for example, on the display device), which effectively avoids system anomalies caused by occlusion, unsuitable detection distance and similar conditions. By introducing the sensing device held by the user as an auxiliary device and relying on linkage and sensor fusion among multiple terminal devices such as the depth acquisition device, the sensing device and the display device, the system accuracy of posture estimation from the depth acquisition device is effectively improved. Multiple users can interact at the same time: the behaviors of multiple users can be tracked simultaneously, a user image number is assigned to each user, and the users' interaction behaviors are kept from interfering with one another.
Moreover, the user's behavior intention category and intention strength are identified based on the online-corrected posture change estimate; richer human-computer interaction can be realized based on the behavior intention category and intention strength, and richer interaction functions can be realized through more kinds of limb movements.
In the embodiment of the present application, the posture estimation based on the depth acquisition device is corrected online using the motion sensing data captured by the sensing device held by the user, so posture estimation remains accurate even when the depth detection accuracy of the depth acquisition device is limited, and more users can be accommodated in a larger detection area for intelligent human-computer interaction. The impact of occlusion and dropped camera frames in real scenes on posture estimation is effectively mitigated, and the system becomes easier to deploy in real commercial scenarios.
In addition to the network architecture shown in fig. 1, the network architecture according to the embodiment of the present application may also be the network architecture shown in fig. 4, as shown in fig. 4, the network architecture includes a depth acquisition device, a sensing device, a display device, and a server. The server can be a public cloud, a private cloud, or a common server device. The depth acquisition equipment, the sensing equipment, the display equipment and the server are connected in a wired or wireless mode. Under the condition of wireless connection, the depth acquisition equipment, the sensing equipment and the display equipment can be connected in a wireless communication mode such as WiFi or Bluetooth. The depth acquisition device, the sensing device and the display device can be in communication connection with the server through WiFi. In another implementation, the depth acquisition device and the sensing device may not be directly connected to the server, but indirectly connected to the server through the display device, as shown in fig. 4.
In the network architecture shown in fig. 4, when the user needs to perform human-computer interaction with the display device, the user preparation phase may be completed according to the operations of steps S1-S9. The depth acquisition device or the display device can also transmit each frame of acquired depth image to the server, and the sensing device or the display device held by the user can transmit the motion sensing data corresponding to the user to the server. And then the server determines whether a depth image matched with the motion sensing data corresponding to the preset action made by the user exists in the depth image of each user, and after the matched depth image is determined, the server allocates a user image number to the depth image, and stores the mapping relation between the user image number and the equipment identification of the sensing equipment held by the user on the server.
And then, in the interaction process with the user, the depth acquisition equipment or the display equipment transmits each frame of depth image acquired in real time to the server, and the sensing equipment or the display equipment held by the user transmits the motion sensing data corresponding to the user to the server. And calculating the posture change estimated value after the online correction by the server according to the depth image and the motion sensing data corresponding to the user, and identifying the behavior intention type and the intention strength of the user based on the corrected posture change estimated value. And the server determines a corresponding control instruction according to the behavior intention type and the intention strength of the user, and issues the control instruction to the display equipment. The display equipment receives and executes the control instruction, so that the man-machine interaction process with the user is realized.
In the embodiment in which the server participates, the specific calculation process of gesture tracking and recognition is realized by the server, so that the calculation amount of the end side is reduced, the calculation resources of the display device are saved, and more calculation resources can be provided to support human-computer interaction with more users at the same time.
In the embodiment of the present application, a depth acquisition device captures the user's depth image, and a sensing device held by the user captures the user's motion sensing data. The user's posture change estimate is predicted from the depth image and then corrected online with the motion sensing data, so accurate posture estimation is possible even when the depth detection accuracy of the depth acquisition device is limited, and more users can be accommodated in a larger detection area for intelligent human-computer interaction. Based on linkage and sensor fusion among multiple terminal devices such as the depth acquisition device, the sensing device and the display device, the system accuracy of posture estimation from the depth acquisition device is effectively improved, the impact of occlusion and dropped camera frames in real scenes on posture estimation is effectively mitigated, and the system becomes easier to deploy in real commercial scenarios. The behaviors of multiple users can be tracked simultaneously, a user image number is assigned to each user, and the users' interaction behaviors are kept from interfering with one another.
The embodiment of the present application further provides a user gesture recognition system, where the system is configured to execute the user gesture recognition method provided in any one of the above embodiments, and as shown in fig. 1, the system includes a depth acquisition device, a sensing device, and a display device;
the depth acquisition equipment is used for acquiring a depth image in real time, wherein the depth image comprises a depth image of at least one user;
the sensing equipment is used for detecting motion sensing data corresponding to a user holding the sensing equipment;
the display device is used for receiving the depth image transmitted by the depth acquisition device and the motion sensing data transmitted by the sensing device; calculating a posture change estimate of the target user according to the depth image of the target user; and correcting the posture change estimate online according to the motion sensing data corresponding to the target user to obtain a corrected posture change estimate.
As shown in fig. 4, the system may further include a server configured to receive the depth image sent by the depth acquisition device or the display device and to receive the motion sensing data sent by the display device or by the sensing device held by the user; to obtain a posture change estimate of a target user according to the depth image of the target user and the motion sensing data corresponding to the target user; to identify the behavior intention category and intention strength of the target user through a pre-trained behavior intention classification model according to the obtained posture change estimate; and to issue a control instruction corresponding to the behavior intention category and intention strength to the display device.
The user gesture recognition system provided by the above embodiment of the present application and the user gesture recognition method provided by the embodiment of the present application have the same inventive concept and have the same beneficial effects as the method adopted, operated or implemented by the application program stored in the user gesture recognition system.
The embodiment of the application also provides a user gesture recognition device, which is used for executing the user gesture recognition method provided by any one of the embodiments. As shown in fig. 5, the apparatus includes:
the data acquisition module 501 is configured to acquire a depth image acquired by a depth acquisition device in real time, where the depth image includes a depth image of a target user, and acquire motion sensing data acquired by a sensing device held by the target user in real time;
the posture estimation module 502 is used for calculating a posture change estimate of the target user according to the depth image of the target user;
the online correction module 503 is configured to perform online correction on the posture change estimation value according to the motion sensing data, so as to obtain a corrected posture change estimation value.
The posture estimation module 502 is specifically configured to extract each frame of human skeleton image according to each frame of depth image of the target user; calculating a human body skeleton inter-frame difference value corresponding to a target user according to each frame of human body skeleton image; and determining the posture change estimation value of the target user according to the human body skeleton interframe difference value.
The device also includes: the user binding module is used for displaying a graphic code for man-machine interaction; receiving a login request sent by sensing equipment held by a user based on a graphic code, wherein the login request comprises an equipment identifier of the sensing equipment; and if the number of the current logged persons is less than the preset number, storing the equipment identification of the sensing equipment, and sending interaction guide information to the sensing equipment.
The user binding module is also used for receiving motion sensing data collected by the sensing equipment in the process that the user makes a preset action according to the interaction guidance information; extracting a depth image of each user in a depth image acquired by a depth acquisition device; determining whether a depth image matched with motion sensing data corresponding to a preset action exists in the depth image of each user; and if the depth image exists, distributing a user image number for the matched depth image, and storing the mapping relation between the user image number and the equipment identifier of the sensing equipment.
The user binding module is used for determining a time period for the user to make the preset action according to the motion sensing data corresponding to the preset action; respectively extracting each frame of human skeleton image of a first user in a time period according to each frame of depth image of the first user, wherein the first user is any one user in each user; calculating a human body skeleton interframe difference value of a first user according to each frame of human body skeleton image of the first user in a time period; predicting limb movement information of a first user in a time period according to the human body skeleton inter-frame difference value of the first user; and determining whether the depth image of the first user is matched with the motion sensing data corresponding to the preset action or not according to the limb motion information and the motion sensing data in the time period.
The user binding module is used for calculating a first difference value between the acceleration included by the limb movement information and the acceleration included by the movement sensing data in a time period; calculating a second difference between the angular velocity included in the limb movement information and the angular velocity included in the movement sensory data; and if the first difference is smaller than a first preset threshold value and the second difference is smaller than a second preset threshold value, determining that the depth image of the first user is matched with the motion sensing data corresponding to the preset action.
The user binding module is used for calculating a first motion change matrix according to the acceleration and the angular velocity included in the motion sensing data corresponding to the preset action; determining a time period for a user to make a preset action according to motion sensing data corresponding to the preset action; respectively extracting each frame of human body skeleton image of a first user according to each frame of depth image of the first user in a time period, wherein the first user is any one of the users; calculating a second motion change matrix of the first user according to each frame of human body skeleton image of the first user; and if the difference value between the first motion change matrix and the second motion change matrix is smaller than a third preset threshold value, determining that the depth image of the first user is matched with the motion sensing data corresponding to the preset action.
The device also includes: the unbinding module is used for periodically determining whether the motion sensing data acquired by the sensing equipment is matched with the depth image corresponding to the user image number; and if the unmatched duration of the motion sensing data acquired by the sensing equipment and the depth image corresponding to the user image number reaches a first preset duration, deleting the mapping relation corresponding to the sensing equipment.
The unbinding module is also used for receiving an exit instruction sent by the sensing equipment and deleting the mapping relation corresponding to the sensing equipment; or, if the motion sensing data collected by the sensing equipment is not updated within the second preset time, deleting the mapping relation corresponding to the sensing equipment.
The device also includes: the intention recognition module is used for recognizing the behavior intention type and intention strength of the target user through a pre-trained behavior intention classification model according to the corrected posture change estimated value; and determining the control instruction corresponding to the behavior intention type and the intention strength, and executing the control operation corresponding to the control instruction.
The user gesture recognition device provided by the embodiment of the application and the user gesture recognition method provided by the embodiment of the application have the same inventive concept and have the same beneficial effects as methods adopted, operated or realized by application programs stored in the device.
The embodiment of the application further provides electronic equipment for executing the user gesture recognition method. Please refer to fig. 6, which illustrates a schematic diagram of an electronic device according to some embodiments of the present application. As shown in fig. 6, the electronic apparatus 4 includes: a processor 400, a memory 401, a bus 402 and a communication interface 403, wherein the processor 400, the communication interface 403 and the memory 401 are connected through the bus 402; the memory 401 stores a computer program that can be executed on the processor 400, and the processor 400 executes the computer program to execute the user gesture recognition method provided in any of the foregoing embodiments of the present application.
The Memory 401 may include a high-speed Random Access Memory (RAM) and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the apparatus and at least one other network element is implemented through at least one communication interface 403 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like may be used.
Bus 402 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 401 is configured to store a program, and the processor 400 executes the program after receiving an execution instruction, where the user gesture recognition method disclosed in any of the foregoing embodiments of the present application may be applied to the processor 400, or implemented by the processor 400.
Processor 400 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by hardware integrated logic circuits in the processor 400 or by instructions in the form of software. The Processor 400 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or registers. The storage medium is located in the memory 401, and the processor 400 reads the information in the memory 401 and completes the steps of the method in combination with its hardware.
The electronic equipment provided by the embodiment of the application is based on the same inventive concept as the user gesture recognition method provided by the embodiment of the application, and has the same beneficial effects as the method adopted, run or implemented by the electronic equipment.
The embodiment of the application further provides a computer-readable storage medium corresponding to the user gesture recognition method. Referring to fig. 7, the computer-readable storage medium is shown as an optical disc 30, on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the user gesture recognition method provided in any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above embodiment of the present application is based on the same inventive concept as the user gesture recognition method provided by the embodiments of the present application, and has the same beneficial effects as the method adopted, run or implemented by the application program stored therein.
It should be noted that:
in the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The above description covers only preferred embodiments of the present application, but the protection scope of the present application is not limited thereto; any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method for user gesture recognition, the method comprising:
the method comprises the steps of obtaining a depth image acquired by depth acquisition equipment in real time, wherein the depth image comprises a depth image of a target user, and obtaining motion sensing data collected by sensing equipment held by the target user in real time;
calculating a posture change estimation value of the target user according to the depth image of the target user;
and according to the motion sensing data, carrying out online correction on the posture change estimation value to obtain a corrected posture change estimation value.
2. The method of claim 1, wherein the calculating an estimate of the change in pose of the target user from the depth image of the target user comprises:
respectively extracting each frame of human body skeleton image according to each frame of depth image of the target user;
calculating a human body skeleton frame difference value corresponding to the target user according to each frame of human body skeleton image;
and determining the posture change estimation value of the target user according to the human body skeleton interframe difference value.
3. The method of claim 1, further comprising:
displaying a graphic code for human-computer interaction;
receiving a login request sent by sensing equipment held by a user based on the graphic code, wherein the login request comprises an equipment identifier of the sensing equipment;
and if the number of the registered people is smaller than a preset number, storing the equipment identifier of the sensing equipment, and sending interactive guidance information to the sensing equipment.
4. The method of claim 3, wherein after sending the interaction guidance information to the sensing device, further comprising:
receiving motion sensing data collected by the sensing equipment in the process that the user makes a preset action according to the interaction guidance information;
extracting a depth image of each user included in the depth image acquired by the depth acquisition equipment;
determining whether a depth image matched with the motion sensing data corresponding to the preset action exists in the depth image of each user;
and if the depth image exists, distributing a user image number for the matched depth image, and storing the mapping relation between the user image number and the equipment identifier of the sensing equipment.
5. The method of claim 4, wherein the determining whether a depth image matching the motion sensing data corresponding to the preset action exists in the depth images of each user comprises:
determining a time period for the user to make the preset action according to the motion sensing data corresponding to the preset action;
respectively extracting each frame of human body skeleton image of a first user in the time period according to each frame of depth image of the first user, wherein the first user is any one of the users;
calculating a human body skeleton inter-frame difference value of the first user according to each frame of human body skeleton image of the first user in the time period;
predicting limb movement information of the first user in the time period according to the human body skeleton inter-frame difference value of the first user;
and determining whether the depth image of the first user is matched with the motion sensing data corresponding to the preset action or not according to the limb motion information and the motion sensing data in the time period.
6. The method of claim 5, wherein the determining whether the depth image of the first user matches the motion sensing data corresponding to the preset motion according to the limb motion information and the motion sensing data in the time period comprises:
calculating a first difference between an acceleration included in the limb motion information and an acceleration included in the motion sensing data over the time period;
calculating a second difference between the angular velocity included in the limb movement information and the angular velocity included in the motion sensing data;
and if the first difference is smaller than a first preset threshold value and the second difference is smaller than a second preset threshold value, determining that the depth image of the first user is matched with the motion sensing data corresponding to the preset action.
7. The method of claim 4, wherein the determining whether a depth image matching the motion sensing data corresponding to the preset action exists in the depth images of each user comprises:
calculating a first motion change matrix according to the acceleration and the angular velocity included in the motion sensing data corresponding to the preset action;
determining a time period for the user to make the preset action according to the motion sensing data corresponding to the preset action;
respectively extracting each frame of human skeleton image of a first user according to each frame of depth image of the first user in the time period, wherein the first user is any one user in each user;
calculating a second motion change matrix of the first user according to each frame of human skeleton image of the first user;
and if the difference value between the first motion change matrix and the second motion change matrix is smaller than a third preset threshold value, determining that the depth image of the first user is matched with the motion sensing data corresponding to the preset action.
8. The method according to any one of claims 4 to 7, wherein after storing the mapping relation between the user image number and the equipment identifier of the sensing equipment, the method further comprises:
periodically determining whether the motion sensing data collected by the sensing equipment is matched with the depth image corresponding to the user image number;
and if the duration of the unmatched motion sensing data acquired by the sensing equipment and the depth image corresponding to the user image number reaches a first preset duration, deleting the mapping relation corresponding to the sensing equipment.
9. The method according to any one of claims 4 to 7, wherein after storing the mapping relation between the user image number and the equipment identifier of the sensing equipment, the method further comprises:
receiving an exit instruction sent by the sensing equipment, and deleting the mapping relation corresponding to the sensing equipment; or,
and if the motion sensing data collected by the sensing equipment is not updated within a second preset time, deleting the mapping relation corresponding to the sensing equipment.
10. The method according to any one of claims 1-7, wherein after obtaining the corrected posture change estimation value, the method further comprises:
according to the corrected posture change estimation value, recognizing the behavior intention type and intention strength of the target user through a pre-trained behavior intention classification model;
and determining a control instruction corresponding to the behavior intention type and the intention strength, and executing the control operation corresponding to the control instruction.
11. A user gesture recognition system is characterized by comprising a depth acquisition device, a sensing device and a display device;
the depth acquisition device is used for acquiring a depth image in real time, wherein the depth image comprises a depth image of at least one user;
the sensing device is used for detecting motion sensing data corresponding to a user holding the sensing device;
the display device is used for receiving the depth image transmitted by the depth acquisition device and the motion sensing data transmitted by the sensing device; calculating a posture change estimation value of a target user according to the depth image of the target user; and according to the motion sensing data corresponding to the target user, performing online correction on the posture change estimation value to obtain a corrected posture change estimation value.
12. The system of claim 11, further comprising a server;
the server is used for receiving the depth image sent by the depth acquisition equipment or the display equipment and receiving the motion sensing data sent by the display equipment or sensing equipment held by a user; acquiring a posture change estimation value of a target user according to a depth image of the target user and motion sensing data corresponding to the target user; and identifying the behavior intention type and intention strength of the target user through a pre-trained behavior intention classification model according to the obtained attitude change estimated value, and issuing a control instruction corresponding to the behavior intention type and the intention strength to the display equipment.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method of any of claims 1-10.
14. A computer-readable storage medium, on which a computer program is stored, characterized in that the program is executed by a processor to implement the method of any of claims 1-10.
CN202210386221.1A 2022-04-13 2022-04-13 User gesture recognition method, system, device and storage medium Pending CN114779932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210386221.1A CN114779932A (en) 2022-04-13 2022-04-13 User gesture recognition method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210386221.1A CN114779932A (en) 2022-04-13 2022-04-13 User gesture recognition method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN114779932A true CN114779932A (en) 2022-07-22

Family

ID=82428771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210386221.1A Pending CN114779932A (en) 2022-04-13 2022-04-13 User gesture recognition method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN114779932A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115309962A (en) * 2022-08-05 2022-11-08 重庆大学 Human body posture action data processing method and system
CN115309962B (en) * 2022-08-05 2024-04-26 重庆大学 Human body posture action data processing method and system
CN115661668A (en) * 2022-12-13 2023-01-31 山东大学 Method, device, medium and equipment for identifying flowers to be pollinated of pepper flowers

Similar Documents

Publication Publication Date Title
CN108255304B (en) Video data processing method and device based on augmented reality and storage medium
KR101872367B1 (en) Guided fingerprint enrolment based on center of attention point
CN114779932A (en) User gesture recognition method, system, device and storage medium
EP3086275A1 (en) Numerical value transfer method, terminal, cloud server, computer program and recording medium
CN107390863B (en) Device control method and device, electronic device and storage medium
US10678342B2 (en) Method of virtual user interface interaction based on gesture recognition and related device
CN112613475B (en) Code scanning interface display method and device, mobile terminal and storage medium
US10168790B2 (en) Method and device for enabling virtual reality interaction with gesture control
CN105229582A (en) Based on the gestures detection of Proximity Sensor and imageing sensor
JP2014535100A (en) Authentication-type gesture recognition
CN110888532A (en) Man-machine interaction method and device, mobile terminal and computer readable storage medium
WO2022174594A1 (en) Multi-camera-based bare hand tracking and display method and system, and apparatus
CN111324275B (en) Broadcasting method and device for elements in display picture
CN110456905A (en) Positioning and tracing method, device, system and electronic equipment
CN109839827B (en) Gesture recognition intelligent household control system based on full-space position information
RU2667720C1 (en) Method of imitation modeling and controlling virtual sphere in mobile device
CN106975218B (en) Method and device for controlling somatosensory game based on multiple wearable devices
CN112083801A (en) Gesture recognition system and method based on VR virtual office
CN114138121A (en) User gesture recognition method, device and system, storage medium and computing equipment
KR102094953B1 (en) Method for eye-tracking and terminal for executing the same
CN112121406A (en) Object control method and device, storage medium and electronic device
CN112748798A (en) Eyeball tracking calibration method and related equipment
CN114513694A (en) Scoring determination method and device, electronic equipment and storage medium
CN112818733B (en) Information processing method, device, storage medium and terminal
US20170026617A1 (en) Method and apparatus for real-time video interaction by transmitting and displaying user interface correpsonding to user input

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination