CN113946216A - Man-machine interaction method, intelligent device, storage medium and program product

Info

Publication number
CN113946216A
Authority
CN
China
Prior art keywords: target, detection, portrait, hand, human hand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111212024.XA
Other languages
Chinese (zh)
Inventor
邵柏韬 (Shao Baitao)
刘朋浩 (Liu Penghao)
李颖 (Li Ying)
姜飞俊 (Jiang Feijun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba China Co Ltd
Alibaba Cloud Computing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd, Alibaba Cloud Computing Ltd filed Critical Alibaba China Co Ltd
Priority to CN202111212024.XA
Publication of CN113946216A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An embodiment of the present application provides a human-computer interaction method, a smart device, a computer storage medium, and a program product. The human-computer interaction method includes: performing multi-target detection on video images captured in real time, the multi-target detection including at least the detection of multiple target types, including portrait detection and human-hand detection; determining a target portrait and a target human hand according to the detection result of the multi-target detection; obtaining gesture posture information of the target human hand according to the tracking detection results for the target portrait and the target human hand; and interactively controlling the smart device according to the gesture posture information. The solution provided by the embodiments of the present application improves the efficiency of interactive control of the smart device.

Description

Man-machine interaction method, intelligent device, storage medium and program product
Technical Field
Embodiments of the present application relate to the technical field of smart devices, and in particular to a human-computer interaction method, a smart device, a computer storage medium, and a computer program product.
Background
With the development of AIoT (Artificial Intelligence & Internet of Things) technology, more and more smart devices are being used in people's work and daily life. Smart devices with an image capture function, such as smart televisions, smart large screens, and smart glasses, form an important category of smart device.
Currently, most smart devices of this type implement human-computer interaction by recognizing gestures in captured images. In practice, detection and recognition of the user's mid-air and contact gestures is often achieved by having the user wear a hardware sensor such as a wristband.
However, this approach requires additional hardware sensors, which greatly increases the implementation cost of the smart device and hinders its development and large-scale adoption.
Disclosure of Invention
In view of the above, embodiments of the present application provide a human-computer interaction solution to at least partially solve the above problems.
According to a first aspect of the embodiments of the present application, a human-computer interaction method is provided, including: performing multi-target detection on video images captured in real time, the multi-target detection including at least the detection of multiple target types, including portrait detection and human-hand detection; determining a target portrait and a target human hand according to the detection result of the multi-target detection; obtaining gesture posture information of the target human hand according to the tracking detection results for the target portrait and the target human hand; and interactively controlling a smart device according to the gesture posture information.
According to a second aspect of the embodiments of the present application, another human-computer interaction method is provided, including: capturing, in real time, video images of the space in which a smart device is located, through an image capture device provided in the smart device; performing multi-target detection on the video images captured in real time, the multi-target detection including at least the detection of multiple target types, including portrait detection and human-hand detection; determining a target portrait and a target human hand according to the detection result of the multi-target detection; obtaining gesture posture information of the target human hand according to the tracking detection results for the target portrait and the target human hand; determining, according to the gesture posture information, the target content displayed on a display screen of the smart device that the corresponding gesture is directed at, and an interactive control operation for the target content; and performing the interactive control operation on the target content.
According to a third aspect of the embodiments of the present application, a smart device is provided, including an image capture device, a display screen, and a processor. The display screen is configured to obtain content to be displayed from the processor and display it. The image capture device is configured to capture, in real time, video images of the space in which the smart device is located. The processor is configured to: perform multi-target detection on the video images captured in real time, the multi-target detection including at least the detection of multiple target types, including portrait detection and human-hand detection; determine a target portrait and a target human hand according to the detection result of the multi-target detection; obtain gesture posture information of the target human hand according to the tracking detection results for the target portrait and the target human hand; determine, according to the gesture posture information, the target content displayed on the display screen that the corresponding gesture is directed at, and an interactive control operation for the target content; and perform the interactive control operation on the target content. The display screen is further configured to display the result of the interactive control operation.
According to a fourth aspect of embodiments of the present application, there is provided a computer storage medium having a computer program stored thereon, the program, when executed by a processor, implementing the human-computer interaction method according to the first or second aspect.
According to a fifth aspect of embodiments of the present application, there is provided a computer program product, which includes computer instructions for instructing a computing device to perform operations corresponding to the human-computer interaction method according to the first aspect or the second aspect.
According to the human-computer interaction solution provided by the embodiments of the present application, portrait and human-hand tracking detection is performed on the video images, the gesture posture information of the human hand is finally obtained, and the smart device can be interactively controlled according to the gesture posture corresponding to that information. Thus, on the one hand, the smart device does not need to be equipped with a dedicated hardware sensor such as a wristband: as long as it has a suitable image capture device such as a camera, it can obtain gestures from the video images, which greatly reduces the implementation cost of the smart device and promotes its development and large-scale adoption. On the other hand, when detection is performed on the video images, the target portrait and target human hand are determined first, tracking detection then focuses on the target human hand, and the gesture posture of the target human hand is obtained from the result of that tracking detection, which greatly reduces the energy and time consumed by detection computation, improves detection efficiency, and thereby also improves the efficiency of interactive control of the smart device.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments described in the embodiments of the present application, and those skilled in the art can obtain other drawings from them.
FIG. 1 is a schematic diagram of an exemplary system suitable for use with embodiments of the present application;
FIG. 2 is a flowchart illustrating steps of a human-computer interaction method according to a first embodiment of the present application;
FIG. 3A is a flowchart illustrating steps of a human-computer interaction method according to a second embodiment of the present application;
FIG. 3B is a diagram illustrating a gesture location map in the embodiment of FIG. 3A;
FIG. 4A is a flowchart illustrating steps of a human-computer interaction method according to a third embodiment of the present application;
FIG. 4B is a diagram illustrating a tracking detection process in the embodiment of FIG. 4A;
FIG. 4C is a diagram of a hand key point in the embodiment of FIG. 4A;
FIG. 5 is a flowchart illustrating steps of a human-computer interaction method according to a fourth embodiment of the present application;
FIG. 6 is a schematic structural diagram of a smart device according to a fifth embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application shall fall within the protection scope of the embodiments of the present application.
FIG. 1 illustrates an exemplary system to which embodiments of the present application may be applied. As shown in FIG. 1, the system generally includes a smart device 102, illustrated in FIG. 1 as a smart display device.
The smart device 102 is provided with an image capture device and a display screen; in FIG. 1 the image capture device is exemplified as a camera built into the smart device 102, but it should be understood by those skilled in the art that, in some cases, the image capture device may be provided independently of the smart device 102 and communicatively connected to it in a wireless or wired manner.
The smart device 102 may display corresponding content through the display screen, at least part of which is interactive content, that is, the user can interact with the smart device 102 by operating on that part of the content. For example, if the displayed content includes a plurality of video programs, the user may select a desired video program by a gesture operation. The interaction is not limited to this: the user may also use gestures to interactively control things the smart device 102 does not display; for example, the user may adjust the volume of the smart device 102 through a corresponding gesture. The specific correspondence between gestures and interactive control operations may be set by a person skilled in the art as required, and the embodiments of the present application do not limit it. In addition, the smart device 102 may interact with the user in other forms based on user gestures; for example, upon detecting that the user makes a heart gesture, an animation of a heart pattern may be displayed on the display screen, and so on.
A user's gesture operation may be captured as video images by the image capture device in the smart device 102. After the captured video images are detected and processed by a processor in the smart device 102, the corresponding gesture posture information, the interactive control operation corresponding to it, and the target object at which the interactive control operation is directed (for example, some content displayed on the display screen, or content that is not displayed but is adjustable, such as the volume or the brightness of the display screen) are obtained.
Based on the system, the embodiment of the application provides a human-computer interaction method, which is described in the following through a plurality of embodiments.
Example one
Referring to fig. 2, a flowchart illustrating steps of a human-computer interaction method according to a first embodiment of the present application is shown.
The man-machine interaction method of the embodiment comprises the following steps:
step S202: and carrying out multi-target detection on the video images acquired in real time.
Wherein the multi-target detection includes at least detection of multiple target types including portrait detection and human hand detection.
For a smart device, especially a smart display device, when a user needs to interact with it, the user usually stays within the spatial range that the device's image capture device can cover and performs the corresponding interactive control operations there. In some cases, only one user interacts with the smart device; in other cases, there may be multiple users, some of whom interact with it. Consequently, the video images captured in real time by the smart device may contain the portraits of one or more users, and for each portrait, one hand (if the other is occluded), both hands, or neither hand (if both are occluded) may be captured. The image capture device can be a monocular sensor, a binocular sensor, or an RGBD sensor.
Specifically, in this embodiment, multi-target detection means that both portrait detection and human-hand detection are performed on the video image, and when multiple portraits and/or multiple human hands are present, they can all be detected simultaneously. That is, the multi-target detection in the embodiments of the present application can detect several different target types, such as portraits and human hands, while for any one target type it can also detect multiple target objects of that type, such as multiple portraits and multiple human hands.
It should be noted that the specific implementation of the multi-target detection can be realized by those skilled in the art in any appropriate manner according to actual needs, including but not limited to a neural network model for multi-target detection; the embodiments of the present application are not limited in this regard. In the embodiments of the present application, terms such as "multiple" and "a plurality of" mean two or more unless otherwise specified.
Step S204: the target portrait and the target human hand are determined according to the detection result of the multi-target detection.
The target portrait is the portrait corresponding to the user performing interactive control operations on the smart device. For example, if it is detected in the video image that a user performs a "hand-waving" operation, and this operation can wake the smart device for subsequent tracking detection and processing, then the portrait of that user in the video image is the target portrait, and the hand with which the user waved is the target human hand.
The detection result of the multi-target detection usually includes a portrait positioning frame and the category corresponding to the portrait (indicating, for example, whether the portrait is within a preset spatial range, such as at a door far from the smart device or directly in front of it), as well as a human-hand positioning frame and the category corresponding to the hand (indicating the action category of the hand, such as five fingers spread, fist clenched, waving, or an OK sign). Based on this information, the target portrait and the target human hand can be determined.
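As an illustration only, the kind of detection result just described could be represented with structures along these lines (a minimal Python sketch; the field names and category values are assumptions for illustration, not a data format specified by this application):

```python
from dataclasses import dataclass

@dataclass
class PortraitDetection:
    box: tuple          # portrait positioning frame (xmin, ymin, xmax, ymax)
    category: str       # e.g. "in_interaction_zone" / "out_of_zone" (assumed labels)
    score: float        # detection confidence

@dataclass
class HandDetection:
    box: tuple          # hand positioning frame (xmin, ymin, xmax, ymax)
    category: str       # action category, e.g. "five_fingers_open", "fist", "wave", "ok"
    score: float
    portrait_index: int # index of the portrait this hand belongs to (assumed association)
```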
Step S206: gesture posture information of the target human hand is obtained according to the tracking detection results for the target portrait and the target human hand.
Because the human hand is part of the human body, the hand region in the video image also belongs to a portrait. Although the ultimate goal of this embodiment is human-hand tracking detection, the relationship between hand and body means that both the target portrait and the target human hand are tracked: the hand region is tracked within the portrait region, information about the target human hand is obtained from that tracking result, and targeted detection processing is then performed on the target human hand to obtain its gesture posture information.
Step S208: the smart device is interactively controlled according to the gesture posture information.
The gesture posture information can effectively represent the posture of the human hand, and includes but is not limited to specific gesture classification and human hand position information.
Based on this, in one feasible manner, the interaction position or interaction area corresponding to the gesture can be mapped, through a position-mapping algorithm, to the corresponding position or area within the display area on the smart device's display screen; the target content the gesture is directed at is determined, and the interactive control operation corresponding to the gesture is performed on that target content.
In another feasible manner, when the interactive control operation targets content that the smart device does not present through the display screen, such as the volume, the smart device can be controlled directly according to the interactive control operation corresponding to the gesture.
In yet another possible manner, the smart device may perform an interactive operation with the user according to the gesture that is independent of the displayed content or of the device's own functions. For example, if the user makes a heart gesture, the smart device may flash a heart pattern on the display screen in response, and so on.
According to this embodiment, portrait and human-hand tracking detection is performed on the video images, the gesture posture information of the human hand is finally obtained, and the smart device can be interactively controlled according to the gesture posture corresponding to that information. Thus, on the one hand, the smart device does not need to be equipped with a dedicated hardware sensor such as a wristband: as long as it has a suitable image capture device such as a camera, it can obtain gestures from the video images, which greatly reduces the implementation cost of the smart device and promotes its development and large-scale adoption. On the other hand, when detection is performed on the video images, the target portrait and target human hand are determined first, tracking detection then focuses on the target human hand, and the gesture posture of the target human hand is obtained from the result of that tracking detection, which greatly reduces the energy and time consumed by detection computation, improves detection efficiency, and thereby also improves the efficiency of interactive control of the smart device.
Example two
Referring to fig. 3A, a flowchart illustrating steps of a human-computer interaction method according to a second embodiment of the present application is shown.
The man-machine interaction method of the embodiment comprises the following steps:
step S302: and carrying out system initialization on the intelligent equipment.
The system initialization includes, but is not limited to, initializing the camera angle, focal length, resolution, position, and so on of the smart device's image capture device, such as a camera, so that effective image capture can be carried out. For example, the camera angle is 180 degrees, the focal length is about 3.67 mm, the resolution is 1080p or higher, and the camera is positioned about 1-3 meters from the interaction location or interaction zone where the user is expected to be, and so on.
Through system initialization, it is ensured that the video stream of the image capture device under the current settings can be obtained by the smart device and used as input for subsequent data processing.
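As a sketch of what such initialization might look like in code, the parameters above could be collected into a configuration object and checked before capture starts (the names and checks below are illustrative assumptions, not requirements of this embodiment):

```python
from dataclasses import dataclass

@dataclass
class CaptureConfig:
    view_angle_deg: float = 180.0   # camera angle
    focal_length_mm: float = 3.67   # approximate focal length
    width: int = 1920               # 1080p or higher
    height: int = 1080
    distance_m: float = 2.0         # camera ~1-3 m from the interaction zone

def validate(cfg: CaptureConfig) -> None:
    # Reject settings that would make gesture detection unreliable (assumed policy).
    assert cfg.height >= 1080, "resolution should be 1080p or higher"
    assert 1.0 <= cfg.distance_m <= 3.0, "camera should sit 1-3 m from the user"
```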
Step S304: video images are captured in real time by the image capture device of the smart device.
Step S306: multi-target detection is performed on the video images captured in real time.
The multi-target detection includes at least the detection of multiple target types, including portrait detection and human-hand detection.
In one feasible approach, multi-target detection based on an RGB temporal sequence may be employed. Unlike the traditional approach of performing target detection by identifying human-body and human-hand key points, target detection in the embodiments of the present application is performed on RGB video images. Compared with key-point detection, the RGB information of a video image contains rich texture and other information from which semantic features can be extracted, so that the information obtained by processing the current video image can be effectively passed on to the tracking detection of subsequent video images, improving tracking-detection efficiency.
And because the video images in a video stream are temporally ordered, multi-target detection based on an RGB temporal sequence can be realized by performing multi-target detection on consecutive RGB video images.
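One straightforward reading of detection over an RGB temporal sequence is to feed the detector a short window of consecutive frames rather than a single frame. The sketch below assumes a `detector` callable over stacked frames; it illustrates only the windowing idea and is not the model interface of this application:

```python
import numpy as np
from collections import deque

WINDOW = 3                       # number of consecutive RGB frames per input (assumed)
frame_buffer = deque(maxlen=WINDOW)

def detect_on_sequence(frame: np.ndarray, detector):
    """frame: HxWx3 RGB image; detector: model over stacked frames (assumption)."""
    frame_buffer.append(frame)
    if len(frame_buffer) < WINDOW:
        return []                # wait until the temporal window is full
    clip = np.stack(frame_buffer, axis=0)   # (WINDOW, H, W, 3) temporal input
    return detector(clip)        # portrait/hand boxes and categories
```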
Step S308: the target portrait and the target human hand are determined according to the detection result of the multi-target detection, and tracking detection is performed on them.
As described above, the target portrait is the portrait corresponding to the user performing interactive control operations on the smart device, and the target human hand is the hand of that user which performed the preset smart-device wake-up operation. In tracking detection, the target human hand is detected within the target portrait region to which it belongs, which reduces the data-processing burden of tracking detection, improves tracking-detection efficiency, and lowers the hardware performance requirements on the smart device.
Based on this, in one feasible way, this step can be implemented as: obtaining portrait information and human-hand information from the detection result of the multi-target detection; judging, according to the human-hand information, whether there is a hand that has performed the preset smart-device wake-up gesture; and, if so, determining the hand that performed the wake-up gesture as the target human hand and the portrait corresponding to it as the target portrait. In the embodiments of the present application, a wake-up gesture is defined for the smart device; if a user performs this gesture, the user is considered to want to wake the smart device in order to interact with it. By defining a wake-up gesture, the target user, and the hand with which the target user performed the wake-up operation, can be determined efficiently and quickly as they appear in the video image, namely the target portrait and the target human hand corresponding to the target user. Subsequently, no other portraits or hands need to be tracked, which greatly improves tracking-detection efficiency and, in turn, human-computer interaction efficiency. In addition, to improve the interactivity of the human-computer interaction, after the wake-up gesture is detected, corresponding prompt information can be shown on the display screen of the smart device. The prompt may be one that requires interactive confirmation, such as "Do you want to start interacting with me?", with subsequent processing performed after the user confirms; or it may be one that requires no confirmation, such as "Thanks for interacting with me, let's get started!".
The portrait information includes, but is not limited to, the portrait positioning frame and the category corresponding to the portrait; the human-hand information includes, but is not limited to, the human-hand positioning frame and the category corresponding to the hand.
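Combined with the structures sketched earlier, the selection of the target human hand and target portrait might look as follows (a hedged sketch: the wake-gesture category set and the hand-to-portrait association via an index are illustrative assumptions):

```python
WAKE_GESTURES = {"wave"}   # assumed set of wake-up action categories

def select_targets(portraits, hands):
    """Return (target_portrait, target_hand), or (None, None) if no wake gesture."""
    for hand in hands:
        if hand.category in WAKE_GESTURES:
            # The hand performing the wake-up gesture becomes the target hand;
            # the portrait it belongs to becomes the target portrait.
            return portraits[hand.portrait_index], hand
    return None, None
```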
In addition, to further increase data-processing speed and reduce latency, corresponding portrait identifiers and hand identifiers can optionally be assigned to the portraits and hands in the detected information, and tracking detection of the target portrait and target human hand is then performed according to these identifiers. The portrait identifier and/or hand identifier can be displayed by the smart device. The portrait identifier includes at least one of: an icon identifier corresponding to the portrait (such as an avatar, icon, or logo set by the user), an ID identifier corresponding to the portrait, a name identifier corresponding to the portrait (such as a user name or nickname), and a role identifier corresponding to the portrait (such as the user's role at home, e.g., dad, mom, or child). The hand identifier includes at least one of: an icon identifier corresponding to the hand (such as an icon or logo set by the user), an ID identifier corresponding to the hand, and a name identifier corresponding to the hand (such as left hand or right hand). Of course, other implementations of portrait and hand identifiers are equally applicable to the embodiments of the present application.
When tracking the target portrait and target human hand based on the portrait and hand identifiers, multi-target tracking detection can be performed on the video images captured in real time according to these identifiers, the multi-target tracking detection including tracking detection of the target portrait and of the target human hand; single-target tracking detection of the target human hand is then performed based on the result of the multi-target tracking detection. In this way, multi-target tracking detection transitions into single-target tracking detection of the hand, which greatly reduces the amount of data that tracking detection must process, further lowers the hardware performance requirements on the smart device, increases tracking-detection speed and efficiency, and reduces the latency of the human-computer interaction. Optionally, a detection approach based on an RGB temporal sequence may be adopted to obtain richer information and improve detection accuracy.
For example, for a frame of video image A, multi-target tracking detection determines a portrait positioning frame X and, within it, a human-hand positioning frame X'; the image region of the human-hand positioning frame X' can then be handed to a subsequent neural network model for tracking detection of the hand inside X'.
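The handover in this example amounts to cropping the image region of the hand positioning frame X' out of video image A and passing it to the next model; a minimal sketch, assuming boxes are (xmin, ymin, xmax, ymax) pixel coordinates:

```python
import numpy as np

def crop_hand_region(image: np.ndarray, hand_box) -> np.ndarray:
    """Cut the hand positioning frame X' out of video image A for the next model."""
    xmin, ymin, xmax, ymax = [int(v) for v in hand_box]
    h, w = image.shape[:2]
    xmin, ymin = max(0, xmin), max(0, ymin)       # clamp to image bounds
    xmax, ymax = min(w, xmax), min(h, ymax)
    return image[ymin:ymax, xmin:xmax]
```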
Tracking detection of the target portrait can be implemented as: determining the portrait region in the video image according to the portrait identifier; and tracking the target portrait and target human hand based on the portrait region, the portrait identifier, and the hand identifier. Because tracking of the target portrait is still at the multi-target tracking stage, the portrait and its hand need to be tracked based on their corresponding identifiers, so as to provide an effective and accurate basis for the subsequent single-target tracking of the hand.
In addition, to improve the interactivity of the human-computer interaction, in one feasible way, if the detection result of the multi-target detection indicates that multiple portraits or multiple human hands are present in the video image, an information pop-up is shown on the display screen of the smart device to present option information for the portraits and hands, and the target portrait and target human hand are determined according to the user's selection among those options. In this situation there may be multiple users in front of the smart device, at least two of whom performed the same wake-up gesture, for example waving at the same time. To improve the efficiency of subsequent tracking and interaction, the portrait information of the users who performed the wake-up gesture simultaneously can be displayed on the screen for selection; the portrait corresponding to the selected portrait information is taken as the target portrait, and the hand of that portrait which performed the wake-up gesture is taken as the target human hand.
It should be noted that, in the embodiments of the present application, multi-target detection as well as the subsequent multi-target and single-target tracking detection may all be implemented with trained neural network models having the corresponding functions; the embodiments of the present application do not limit the specific training process or implementation structure of these models, as long as the corresponding functions are realized.
Step S310: gesture posture information of the target human hand is obtained according to the tracking detection results for the target portrait and the target human hand.
Based on the tracking detection of the target portrait and target human hand, a corresponding tracking detection result can be obtained. In the embodiments of the present application, the tracking detection result includes at least: the tracking frame of the target portrait, the category of the target portrait, the tracking frame of the target human hand, and the category of the target human hand. Here, "tracking frame" is used to distinguish these results from the "positioning frames" obtained by the earlier, non-tracking multi-target detection.
As described in step S308, tracking of the target portrait and target human hand can transition from multi-target tracking detection to single-target tracking detection of the hand. On this basis, the hand region corresponding to the target human hand can be obtained, gesture posture detection is performed on the image of that hand region, and the corresponding gesture posture information is obtained. In addition, while the target human hand is being tracked, its motion trajectory can be displayed in real time on the display screen of the smart device, so that the user can see more clearly how their gesture is mapped inside the smart device.
The gesture posture information can effectively represent the posture of the human hand, and includes but is not limited to specific gesture classification and human hand position information.
Step S312: the smart device is interactively controlled according to the gesture posture information.
In one feasible manner, if the interactive control operation corresponding to the gesture posture information is directed at non-displayed content of the smart device, such as the volume, the corresponding interactive control operation can be determined from the gesture indicated by the gesture posture information, and the smart device is then controlled accordingly, e.g., the volume is raised or lowered.
In another feasible manner, if the interactive control operation corresponding to the gesture posture information targets content displayed on the display screen of the smart device, the hand's operation in real physical space must ultimately be mapped onto the displayed content. For example, the corresponding interactive control operation is determined from the gesture; the position at which the hand maps onto the display area of the screen is then determined from the hand position; and the target object of the gesture and the operation to perform on it are thereby determined. To make the operation more visible, a corresponding indication icon, such as an arrow, can also be shown in the display area, so that the user clearly sees the specific position and operation of the gesture.
Further, to improve the user experience, in one feasible manner three-dimensional gesture reconstruction and hand-position mapping may be performed according to the gesture posture information, so that the reconstructed three-dimensional gesture is mapped to the position on the display screen of the smart device that corresponds to the hand position.
In one specific example, taking the gesture operation "waving" as an example, as shown in FIG. 3B, multiple "waving" positions can be obtained through tracking detection over multiple frames of video images containing the waving gesture. An initial gesture box (xmin, ymin, xmax, ymax) is then obtained by averaging the waving positions, and its width w and height h follow by simple calculation. From w and h of the initial gesture box, the corresponding gesture control box is set to size 2w × 2h, expanded about the center point of the initial gesture box. In this way, the actual effective interaction area of the gesture operation in the video image is obtained. Further, assuming the display area of the display screen is w_screen × h_screen, equal-proportion mapping according to the size of the gesture control box completes the conversion from the operating range of the gesture to screen coordinates.
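The example above can be written out directly. The following sketch traces the described steps (averaged waving positions → initial gesture box → 2w × 2h control box → equal-proportion conversion to screen coordinates); it is an illustrative reading of the example, not a verbatim algorithm from this application:

```python
def initial_gesture_box(wave_boxes):
    """Average the per-frame 'waving' boxes (xmin, ymin, xmax, ymax)."""
    n = len(wave_boxes)
    return tuple(sum(b[i] for b in wave_boxes) / n for i in range(4))

def control_box(box):
    """Expand the initial gesture box to 2w x 2h about its center point."""
    xmin, ymin, xmax, ymax = box
    w, h = xmax - xmin, ymax - ymin
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    return (cx - w, cy - h, cx + w, cy + h)   # size 2w x 2h

def to_screen(hand_xy, ctrl, w_screen, h_screen):
    """Proportionally map a hand position inside the control box to screen coords."""
    x, y = hand_xy
    xmin, ymin, xmax, ymax = ctrl
    u = (x - xmin) / (xmax - xmin) * w_screen
    v = (y - ymin) / (ymax - ymin) * h_screen
    return u, v
```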
In another feasible manner, interactive operation options corresponding to the gesture posture information are displayed on the display screen of the smart device; a selection operation on the interactive operation options is received, and the smart device is interactively controlled according to the option chosen by the selection operation. By displaying interactive operation options, the user can more flexibly determine the required interactive operation, improving the interactivity of the human-computer interaction. The interactive operation options can be realized by those skilled in the art in any appropriate manner according to actual requirements, such as interactive buttons or interactive questions, displayed as a small pop-up window or a floating layer.
In yet another possible way, an interactive response animation or text responding to the gesture posture information can be displayed on the display screen of the smart device according to the gesture posture information. For example, if the user is detected making a heart gesture, an animation of a heart pattern may be displayed on the display screen, and so on.
According to this embodiment, portrait and human-hand tracking detection is performed on the video images, the gesture posture information of the human hand is finally obtained, and the smart device can be interactively controlled according to the gesture posture corresponding to that information. Thus, on the one hand, the smart device does not need to be equipped with a dedicated hardware sensor such as a wristband: as long as it has a suitable image capture device such as a camera, it can obtain gestures from the video images, which greatly reduces the implementation cost of the smart device and promotes its development and large-scale adoption. On the other hand, when detection is performed on the video images, the target portrait and target human hand are determined first, tracking detection then focuses on the target human hand, and the gesture posture of the target human hand is obtained from the result of that tracking detection, which greatly reduces the energy and time consumed by detection computation, improves detection efficiency, and thereby also improves the efficiency of interactive control of the smart device.
EXAMPLE III
Referring to fig. 4A, a flowchart illustrating steps of a human-computer interaction method according to a third embodiment of the present application is shown.
In this embodiment, a human-computer interaction method according to this embodiment is described by taking an example of implementing human-computer interaction by combining a plurality of neural network models.
The man-machine interaction method of the embodiment comprises the following steps:
step S402: and carrying out multi-target detection on the video images acquired in real time.
Wherein the multi-target detection includes at least detection of multiple target types including portrait detection and human hand detection.
In this embodiment, the multi-target detection of the video image may be implemented by a neural network model with multi-target detection, such as a convolutional neural network model, and optionally, the multi-target detection may be implemented by using a lightweight convolutional neural network model. For convenience of description, the neural network model is referred to as a first neural network model in the present embodiment.
It should be noted that, in this embodiment, if the captured video image contains multiple portraits, i.e., there may be multiple users within the capture range of the smart device's image capture device, this step can be implemented as: performing multi-target detection on the video image captured in real time to obtain multiple detection frames corresponding to multiple candidate objects; merging those detection frames that overlap or whose mutual distance falls within a preset range; and performing multi-target detection again based on the merged detection frames. The preset distance range can be set by a person skilled in the art according to the actual situation, and the embodiments of the present application do not limit it. This approach effectively ensures the efficiency and accuracy of recognition. It is not limiting, however: in practical applications, the conventional approach of detecting each portrait separately is equally applicable.
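The merging step could be sketched as follows (the gap threshold and the greedy merge order are illustrative assumptions; the application only requires merging frames that overlap or lie within a preset distance range):

```python
def boxes_close(a, b, max_gap: float) -> bool:
    """True if boxes overlap or their gap is within the preset distance range."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    gap_x = max(bx0 - ax1, ax0 - bx1, 0.0)
    gap_y = max(by0 - ay1, ay0 - by1, 0.0)
    return max(gap_x, gap_y) <= max_gap       # zero gap means overlap

def merge_boxes(boxes, max_gap=20.0):
    """Greedily merge overlapping/nearby detection frames into enclosing frames."""
    merged = []
    for box in boxes:
        for i, m in enumerate(merged):
            if boxes_close(box, m, max_gap):
                merged[i] = (min(m[0], box[0]), min(m[1], box[1]),
                             max(m[2], box[2]), max(m[3], box[3]))
                break
        else:
            merged.append(box)
    return merged   # multi-target detection is then run again on the merged frames
```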
Step S404: the target portrait and the target human hand are determined according to the detection result of the multi-target detection.
Through the multi-target detection of the first neural network model, one or more portrait positioning frames with the categories of the corresponding portraits, and one or more human-hand positioning frames with the categories of the corresponding hands, can be output. From the category corresponding to a hand, the action the hand is performing can be obtained, and it can thus be judged whether the hand has performed the wake-up gesture for the smart device. If it is determined that a hand has performed the wake-up gesture, that hand is determined as the target human hand and the corresponding portrait as the target portrait.
Step S406: gesture posture information of the target human hand is obtained according to the tracking detection results for the target portrait and the target human hand.
In this embodiment, multi-target tracking detection of the target portrait and target human hand is performed on the video images captured in real time by a multi-target tracking network model; the detection result for the target human hand is obtained from the multi-target tracking results; based on that result, single-target tracking detection of the target human hand is performed on the video images captured in real time by a single-target tracking network model; the hand region in the video image is determined from the single-target result, gesture posture detection based on hand key points is performed on the hand region, and the gesture posture information of the target human hand is obtained from the gesture posture detection result.
The multi-target tracking network model can adopt the same kind of lightweight network structure as the first neural network model, or even the same neural network structure. In the latter case, the first neural network model itself has a tracking-detection function, and only its multi-target detection function is used when performing multi-target detection on the video images captured in real time.
The single-target tracking network model is connected after the multi-target tracking network model and performs single-target tracking detection of the hand using the hand-related part of the multi-target tracking output, such as the position of the target human hand in the video image. The single-target tracking result contains a more accurate region of the hand in the video image. Gesture posture detection based on that hand region can then be carried out by a second neural network model for gesture posture detection, yielding the gesture posture information of the hand.
In one feasible approach, the single-target tracking network model may be implemented as a single-target tracking network model based on a Siamese (twin) network. Because the multi-target tracking network model can adopt a lightweight structure, tracking detection at low computational cost can be realized; and, based on the hand identifier obtained by tracking detection, a Siamese-network-based single-target tracker enables more accurate hand tracking, lowering the hardware performance requirements on the smart device and reducing interaction latency.
The second neural network model may likewise be a lightweight convolutional neural network model usable for gesture posture detection. In one feasible way, it can be implemented as a multi-objective regression network model that simultaneously performs 3D key-point regression over 21 key points and gesture classification on the detected hand. In this form, the key-point regression and gesture classification cooperate when the multi-objective regression network model is trained; in this multi-task setting, the two training objectives mutually reinforce and promote each other.
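A multi-task head of this kind might be sketched in PyTorch as follows (the backbone, feature dimension, and number of gesture classes are assumptions; the application specifies only that 21 key points are regressed in 3D jointly with gesture classification):

```python
import torch
import torch.nn as nn

class HandPoseNet(nn.Module):
    """Joint 3D key-point regression (21 points) and gesture classification."""
    def __init__(self, feat_dim: int = 256, num_gestures: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in lightweight backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.keypoint_head = nn.Linear(feat_dim, 21 * 3)   # 21 x (x, y, z)
        self.gesture_head = nn.Linear(feat_dim, num_gestures)

    def forward(self, hand_crop: torch.Tensor):
        feat = self.backbone(hand_crop)
        keypoints = self.keypoint_head(feat).view(-1, 21, 3)
        gesture_logits = self.gesture_head(feat)
        return keypoints, gesture_logits        # trained with a joint multi-task loss
```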
In addition, optionally, the detected gesture posture information includes the position information of the hand in the video image. In that case, performing gesture posture detection based on hand key points on the hand region may include: obtaining the gesture posture information of the hand in the previous N frames of video images adjacent to the current video image; estimating the gesture posture information of the hand in the current video image from the gesture posture information in those previous N frames; and performing key-point-based gesture posture detection on the hand region in the current video image with the estimated gesture posture information as auxiliary information, where N is a positive integer. When the gesture posture is detected by the second neural network model, the model's own earlier outputs, i.e., the gesture posture information of the hand in previous video images, can serve as a reference: the gesture posture information of the hand in the current video image is estimated from the first N frames preceding it, and the estimate is used as auxiliary information to correct the model's intermediate detection result (i.e., the gesture posture in the current video image), so that the gesture position is more stable and its continuity is ensured.
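The application does not fix how the estimate from the previous N frames is formed; one simple reading, assuming N = 2 and a constant-velocity extrapolation of the key-point coordinates (an assumption, not the stated method), is:

```python
import numpy as np

def extrapolate_pose(pose_t_minus_1: np.ndarray, pose_t: np.ndarray) -> np.ndarray:
    """Predict the hand pose at t+1 from the poses at t-1 and t (constant velocity)."""
    return pose_t + (pose_t - pose_t_minus_1)

def fuse(predicted: np.ndarray, detected: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend the raw detection with the extrapolated estimate to stabilize positions."""
    return alpha * detected + (1.0 - alpha) * predicted
```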
Schematically, FIG. 4B shows one such tracking detection process, in which the DetNet part is responsible for hand tracking detection. According to the hand tracking box (bounding box) output by DetNet, the hand region is cropped from the original video image and used as the input image of KeyNet. KeyNet is responsible for detecting the gesture posture from the input image, i.e., the hand-region image; in this example this includes a 2D heatmap (position coordinate information) and a 1D heatmap (depth information) for each hand key point (a schematic of the hand key points is shown in FIG. 4C), which are post-processed to obtain the 3D hand key points. The key points are then parameterized (the parameters required for three-dimensional reconstruction of a hand are set by a person skilled in the art according to actual requirements), and the gesture in the video image is finally reconstructed. Meanwhile, as can be seen from the figure, when the image at time t+1 is detected, the gesture posture information at time t+1 is estimated from the hand's gesture posture information at time t and at time t-1; this estimate is input into KeyNet as auxiliary information and is used to correct the gesture posture information detected by the model at time t+1.
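The post-processing from KeyNet's outputs to 3D key points could look roughly like the following (a sketch assuming one 2D position heatmap and one scalar depth value per key point; the resolutions and coordinate conventions are assumptions):

```python
import numpy as np

def heatmaps_to_3d(heatmaps: np.ndarray, depths: np.ndarray):
    """heatmaps: (21, H, W) 2D position maps; depths: (21,) 1D depth values."""
    keypoints = []
    for k in range(21):
        idx = np.argmax(heatmaps[k])             # peak of the 2D heatmap
        v, u = np.unravel_index(idx, heatmaps[k].shape)
        keypoints.append((float(u), float(v), float(depths[k])))
    return keypoints                             # 3D hand key points (u, v, depth)
```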
Step S408: the smart device is interactively controlled according to the gesture posture information.
The specific implementation of this step can be seen from the description of the relevant parts in the foregoing embodiment one or two, and is not described herein again.
Step S410: the result of the interactive control is displayed.
For example, a progress indication of volume adjustment is displayed on a display screen of the smart device according to the interactive control, or new content is displayed on the display screen according to the interactive control, or a video program is played according to the interactive control, and so on.
According to this embodiment, portrait and human-hand tracking detection is performed on the video images, the gesture posture information of the human hand is finally obtained, and the smart device can be interactively controlled according to the gesture posture corresponding to that information. Thus, on the one hand, the smart device does not need to be equipped with a dedicated hardware sensor such as a wristband: as long as it has a suitable image capture device such as a camera, it can obtain gestures from the video images, which greatly reduces the implementation cost of the smart device and promotes its development and large-scale adoption. On the other hand, when detection is performed on the video images, the target portrait and target human hand are determined first, tracking detection then focuses on the target human hand, and the gesture posture of the target human hand is obtained from the result of that tracking detection, which greatly reduces the energy and time consumed by detection computation, improves detection efficiency, and thereby also improves the efficiency of interactive control of the smart device.
Example four
Referring to fig. 5, a flowchart illustrating steps of a man-machine interaction method according to a fourth embodiment of the present application is shown.
In this embodiment, the human-computer interaction method of the present application is described taking as an example a smart device that is a smart display device, such as a smart television, a smart large screen, a normal-sized smart screen, or a small smart screen.
The man-machine interaction method of the embodiment comprises the following steps:
step S502: the method comprises the steps of collecting video images of a space where the intelligent equipment is located in real time through an image collecting device arranged in the intelligent equipment.
Wherein, image acquisition device can be the camera, carries out the real-time collection of video image through this camera to being located the space of smart machine place.
Step S504: multi-target detection is performed on the video images captured in real time.
The multi-target detection includes at least the detection of multiple target types, including portrait detection and human-hand detection.
Step S506: the target portrait and the target human hand are determined according to the detection result of the multi-target detection.
The detection result of the multi-target detection includes the corresponding portrait information and human-hand information, for example, the portrait positioning frame and portrait category information, and the human-hand positioning frame and hand category information. Based on the category information of a hand, it can be determined whether the hand has performed the wake-up gesture for the smart device; if so, that hand is determined as the target human hand, and the portrait corresponding to it as the target portrait.
Step S508: gesture posture information of the target human hand is obtained according to the tracking detection results for the target portrait and the target human hand.
After the target portrait and target human hand are determined, tracking detection can be performed on the video images captured in real time. The specific process may include: multi-target tracking detection of the target portrait and target human hand; determining, based on that result, related information of the target human hand, such as its identifier (ID) and/or tracking frame, and transitioning from multi-target tracking detection to single-target tracking detection of the hand; and then determining the hand region from the single-target tracking result and performing gesture posture recognition on it to obtain the gesture posture information.
Step S510: according to the gesture posture information, the target content displayed on the display screen of the smart device that the corresponding gesture is directed at is determined, together with the interactive control operation for the target content.
In this embodiment, the user's gesture posture is assumed to operate on content displayed on the display screen of the smart display device. Therefore, based on the gesture posture information, the position or region to which the hand position maps on the display area of the screen must be determined, together with the interactive control operation corresponding to the gesture posture (such as clicking a video program or switching the displayed page). The target content and the interactive control operation that the gesture is directed at are thereby determined.
Step S512: the interactive control operation is performed on the target content.
According to this embodiment, the smart device performs portrait and human-hand tracking detection on the video images, finally obtains the gesture posture information of the human hand, and can be interactively controlled according to the gesture posture corresponding to that information. Thus, on the one hand, the smart device does not need to be equipped with a dedicated hardware sensor such as a wristband: as long as it has a suitable image capture device such as a camera, it can obtain gestures from the video images, which greatly reduces the implementation cost of the smart device and promotes its development and large-scale adoption. On the other hand, when the smart device performs detection on the video images, the target portrait and target human hand are determined first, tracking detection then focuses on the target human hand, and the gesture posture of the target human hand is obtained from the result of that tracking detection, which greatly reduces the energy and time consumed by detection computation, improves detection efficiency, and thereby also improves the efficiency of interactive control of the smart device.
In addition, it should be noted that in this embodiment the implementation of some steps is similar to that in the foregoing embodiments and is therefore described only briefly; for the corresponding specific implementation, reference may be made to the descriptions of the relevant portions in the foregoing embodiments.
EXAMPLE five
Referring to fig. 6, a schematic structural diagram of an intelligent device according to a fifth embodiment of the present application is shown.
The smart device of this embodiment includes: an image capture device 602, a display screen 604, and a processor 606.
Wherein:
A display screen 604, configured to obtain content to be displayed from the processor 606 and display it.
The image acquisition device 602 is configured to acquire a video image of a space where the smart device is located in real time.
The processor 606 is configured to implement the human-computer interaction method described in any of the foregoing embodiments. For example, the processor 606 performs multi-target detection on the video images captured in real time, the multi-target detection at least including detection of multiple target types including portrait detection and human hand detection; determines a target portrait and a target human hand according to the detection result of the multi-target detection; acquires gesture posture information of the target human hand according to the tracking detection results of the target portrait and the target human hand; determines, according to the gesture posture information, the target content displayed on the display screen 604 that the corresponding gesture is directed at, and an interactive control operation for the target content; and performs the interactive control operation on the target content. An illustrative sketch of this processing loop follows the component descriptions below.
The display screen 604 is further configured to display the result of the interactive control operation.
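Purely as an illustration of how the processor 606 could chain these stages, the loop below ties the earlier sketches together. Every name not defined in the sketches above (multi_target_detect, track_hand_gestures, hit_test, and the camera/screen objects) is a hypothetical placeholder, not a function or component defined by this application:

    def interaction_loop(camera, screen):
        while True:
            frame = camera.read()
            detections = multi_target_detect(frame)        # portraits + hands
            portrait, hand = select_targets(detections)    # wake-gesture check
            if hand is None:
                continue                                   # no wake gesture yet
            for gesture in track_hand_gestures(camera, portrait, hand):
                point = map_hand_to_screen(gesture.position,
                                           camera.image_size, screen.size)
                content = hit_test(screen, point)          # locate target content
                if content is not None:
                    content.apply(gesture.action)          # e.g. click, page switch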
The intelligent device of this embodiment is used to implement the corresponding human-computer interaction methods in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described again here. In addition, for the functional implementation of each module of the intelligent device of this embodiment, reference may be made to the descriptions of the corresponding parts in the foregoing method embodiments, which are likewise not repeated here.
An embodiment of the present application further provides a computer program product comprising computer instructions, wherein the computer instructions instruct an intelligent device to perform the operations corresponding to any one of the human-computer interaction methods in the foregoing method embodiments.
It should be noted that, according to implementation requirements, each component/step described in the embodiments of the present application may be divided into more components/steps, and two or more components/steps or partial operations of components/steps may also be combined into a new component/step to achieve the purpose of the embodiments of the present application.
The above-described methods according to the embodiments of the present application may be implemented in hardware or firmware, or as software or computer code storable in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium. The methods described herein may thus be processed by such software on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the human-computer interaction methods described herein. Further, when a general-purpose computer accesses code for implementing the human-computer interaction methods illustrated herein, the execution of the code transforms the general-purpose computer into a special-purpose computer for performing those methods.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.

Claims (14)

1. A human-computer interaction method, comprising:
carrying out multi-target detection on the video images acquired in real time, wherein the multi-target detection at least comprises detection of multiple target types including portrait detection and human hand detection;
determining a target portrait and a target human hand according to the detection result of the multi-target detection;
acquiring gesture posture information of the target human hand according to the tracking detection results of the target portrait and the target human hand;
and performing interactive control on the intelligent equipment according to the gesture posture information.
2. The method of claim 1, wherein the determining a target portrait and a target human hand according to the detection result of the multi-target detection comprises:
acquiring portrait information and human hand information from the detection result of the multi-target detection;
judging, according to the human hand information, whether there is a human hand performing a preset intelligent equipment wake-up operation gesture;
and if so, determining the human hand performing the intelligent equipment wake-up operation gesture as the target human hand, and determining the portrait corresponding to the target human hand as the target portrait.
3. The method of claim 2, wherein the method further comprises:
setting a corresponding portrait identifier for the portrait corresponding to the portrait information and a corresponding human hand identifier for the human hand corresponding to the human hand information, respectively;
and performing tracking detection on the target portrait and the target human hand according to the portrait identifier and the human hand identifier.
4. The method of claim 3, wherein the method further comprises:
displaying the portrait identifier and/or the human hand identifier through the intelligent equipment, wherein the portrait identifier comprises at least one of the following: an icon identification corresponding to the portrait, an ID identification corresponding to the portrait, a name identification corresponding to the portrait, and a role identification corresponding to the portrait; and the human hand identifier comprises at least one of the following: an icon identification corresponding to the human hand, an ID identification corresponding to the human hand, and a name identification corresponding to the human hand.
5. The method according to claim 3 or 4, wherein the tracking detection of the target portrait and the target human hand according to the portrait identifier and the human hand identifier comprises:
performing, according to the portrait identifier and the human hand identifier, multi-target tracking detection on the video images collected in real time, wherein the multi-target tracking detection comprises target portrait tracking detection and target human hand tracking detection;
and carrying out single-target tracking detection aiming at the target human hand based on the detection result of the multi-target tracking detection.
6. The method according to claim 3 or 4, wherein the tracking detection of the target portrait and the target human hand according to the portrait identifier and the human hand identifier comprises:
determining a portrait area in the video image according to the portrait identifier;
and performing tracking detection on the target portrait and the target human hand based on the portrait area, the portrait identifier, and the human hand identifier.
7. The method of claim 1, wherein the performing interactive control on the intelligent equipment according to the gesture posture information comprises:
performing three-dimensional gesture reconstruction and hand position mapping according to the gesture posture information, so that the reconstructed three-dimensional gesture is mapped to a position on a display screen of the intelligent equipment corresponding to the hand position;
or,
displaying interactive operation options corresponding to the gesture posture information on the display screen of the intelligent equipment; receiving a selection operation on the interactive operation options, and performing interactive control on the intelligent equipment according to the interactive operation option selected by the selection operation;
or,
and displaying, on the display screen of the intelligent equipment and according to the gesture posture information, an interactive response animation or text responding to the gesture posture information.
8. The method of claim 1, wherein the obtaining gesture pose information of the target human hand according to the tracking detection results of the target human figure and the target human hand comprises:
performing multi-target tracking detection aiming at the target portrait and the target human hand on a video image acquired in real time through a multi-target tracking network model;
obtaining a detection result of the target human hand from the detection result of the multi-target tracking detection;
based on the detection result of the target human hand, carrying out single-target tracking detection aiming at the target human hand on a video image acquired in real time through a single-target tracking network model;
determining a human hand region in the video image according to the detection result, performing gesture posture detection based on human hand key points on the human hand region, and obtaining the gesture posture information of the target human hand according to the gesture posture detection result.
9. The method according to claim 8, wherein the gesture posture information comprises position information of the human hand in the video image;
and the performing gesture posture detection based on human hand key points on the human hand region comprises: acquiring gesture posture information of the human hand in the previous N frames of video images adjacent to the current video image; estimating gesture posture information of the human hand in the current video image according to the gesture posture information of the human hand in the previous N frames of video images; and performing gesture posture detection based on human hand key points on the human hand region in the current video image by taking the estimated gesture posture information as auxiliary information, wherein N is a positive integer.
10. The method of claim 1, wherein the determining a target portrait and a target human hand according to the detection result of the multi-target detection comprises:
if the detection result of the multi-target detection indicates that a plurality of portraits or a plurality of hands exist in the video image, displaying an information popup window through a display screen of the intelligent equipment so as to display option information of the portraits and the hands in the information popup window;
and determining the target portrait and the target human hand according to the selection operation of the option information.
11. The method of claim 1, wherein the method further comprises:
and in the tracking detection process of the target human hand, displaying the motion track of the target human hand in real time through a display screen of the intelligent equipment.
12. A human-computer interaction method, comprising:
acquiring a video image of a space where intelligent equipment is located in real time through an image acquisition device arranged in the intelligent equipment;
carrying out multi-target detection on the video images acquired in real time, wherein the multi-target detection at least comprises detection of multiple target types including portrait detection and human hand detection;
determining a target portrait and a target human hand according to the detection result of the multi-target detection;
acquiring gesture posture information of the target human hand according to the tracking detection results of the target portrait and the target human hand;
determining, according to the gesture posture information, target content displayed on a display screen of the intelligent equipment that the gesture is directed at, and an interactive control operation for the target content;
and performing the interactive control operation on the target content.
13. A smart device, comprising: the device comprises an image acquisition device, a display screen and a processor;
wherein,
the display screen is used for obtaining the content to be displayed from the processor and displaying the content;
the image acquisition device is used for acquiring a video image of the space where the intelligent equipment is located in real time;
the processor is used for performing multi-target detection on the video images acquired in real time, wherein the multi-target detection at least comprises detection of multiple target types including portrait detection and human hand detection; determining a target portrait and a target human hand according to the detection result of the multi-target detection; acquiring gesture posture information of the target human hand according to the tracking detection results of the target portrait and the target human hand; determining, according to the gesture posture information, target content displayed on the display screen that the corresponding gesture is directed at, and an interactive control operation for the target content; and performing the interactive control operation on the target content;
and the display screen is also used for displaying the result of the interactive control operation.
14. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements a human-computer interaction method as claimed in any one of claims 1 to 12.
CN202111212024.XA 2021-10-18 2021-10-18 Man-machine interaction method, intelligent device, storage medium and program product Pending CN113946216A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111212024.XA CN113946216A (en) 2021-10-18 2021-10-18 Man-machine interaction method, intelligent device, storage medium and program product

Publications (1)

Publication Number Publication Date
CN113946216A (en) 2022-01-18

Family

ID=79331417

Country Status (1)

Country Link
CN (1) CN113946216A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421591A (en) * 2022-08-15 2022-12-02 珠海视熙科技有限公司 Gesture control device and camera equipment
CN115421590A (en) * 2022-08-15 2022-12-02 珠海视熙科技有限公司 Gesture control method, storage medium and camera device
CN115421591B (en) * 2022-08-15 2024-03-15 珠海视熙科技有限公司 Gesture control device and image pickup apparatus
CN115840507A (en) * 2022-12-20 2023-03-24 北京帮威客科技有限公司 Large-screen equipment interaction method based on 3D image control
CN115840507B (en) * 2022-12-20 2024-05-24 北京帮威客科技有限公司 Large-screen equipment interaction method based on 3D image control
CN116360603A (en) * 2023-05-29 2023-06-30 中数元宇数字科技(上海)有限公司 Interaction method, device, medium and program product based on time sequence signal matching

Similar Documents

Publication Publication Date Title
CN113946216A (en) Man-machine interaction method, intelligent device, storage medium and program product
US8879787B2 (en) Information processing device and information processing method
US8509484B2 (en) Information processing device and information processing method
CN103353935A (en) 3D dynamic gesture identification method for intelligent home system
CN103295016B (en) Behavior recognition method based on depth and RGB information and multi-scale and multidirectional rank and level characteristics
CN106201173B (en) A kind of interaction control method and system of user's interactive icons based on projection
CN102999152A (en) Method and system for gesture recognition
CN104508680B (en) Improved video signal is tracked
JP5598751B2 (en) Motion recognition device
CN109325456A (en) Target identification method, device, target identification equipment and storage medium
CN107230219B (en) Target person finding and following method on monocular robot
CN102222342A (en) Tracking method of human body motions and identification method thereof
CN107665507B (en) Method and device for realizing augmented reality based on plane detection
CN113608663B (en) Fingertip tracking method based on deep learning and K-curvature method
CN112911393A (en) Part recognition method, device, terminal and storage medium
CN107665508A (en) Realize the method and system of augmented reality
CN104038799A (en) Three-dimensional television-oriented gesture manipulation method
CN111199583B (en) Virtual content display method and device, terminal equipment and storage medium
CN114513694A (en) Scoring determination method and device, electronic equipment and storage medium
CN112700568B (en) Identity authentication method, equipment and computer readable storage medium
CN112949689A (en) Image recognition method and device, electronic equipment and storage medium
CN114610156A (en) Interaction method and device based on AR/VR glasses and AR/VR glasses
CN114140828B (en) Real-time lightweight 2D human body posture estimation method
CN112529770B (en) Image processing method, device, electronic equipment and readable storage medium
CN115393962A (en) Motion recognition method, head-mounted display device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination