CN106547356B - Intelligent interaction method and device

Intelligent interaction method and device

Info

Publication number
CN106547356B
Authority
CN
China
Prior art keywords: user, image, user hand, hand, action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611025898.3A
Other languages
Chinese (zh)
Other versions
CN106547356A (en)
Inventor
王天一
刘聪
王智国
胡国平
胡郁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201611025898.3A
Publication of CN106547356A
Application granted
Publication of CN106547356B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an intelligent interaction method and device. The method includes: acquiring a user hand action image, where the image is obtained by photographing the user's hand action; determining the operation category corresponding to the hand action according to the image; and responding to the hand action according to the operation category. The method enables natural and efficient human-machine interaction.

Description

Intelligent interaction method and device
Technical Field
The application relates to the technical field of human-computer interaction, in particular to an intelligent interaction method and device.
Background
As the related technologies of artificial intelligence mature, daily life is becoming increasingly intelligent: smart home devices have gradually entered ordinary households, and augmented reality devices are becoming practical, so interaction between people and machines is becoming commonplace and necessary. During human-computer interaction, what users care about most is whether they can interact with a machine naturally, ideally as naturally as with another person. More and more engineers are therefore studying how to make the human-computer interaction process natural and efficient.
In the related art, when a user interacts intelligently with a machine by hand, the user must first wear a recording device such as a stylus or a handwriting finger stall; two-dimensional or three-dimensional coordinate data of the hand action is then collected through the recording device; finally, the hand motion or its trajectory is recognized from the collected data to determine the user's operation, and the system returns the corresponding response.
However, this interaction mode does not match natural interaction habits and tends to yield inaccurate data, so the interaction effect is unsatisfactory.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present application is to provide an intelligent interaction method, which can achieve natural and efficient human-computer interaction.
Another object of the present application is to provide an intelligent interaction device.
To achieve the above object, an intelligent interaction method provided in an embodiment of the first aspect of the present application includes: acquiring a user hand action image, where the image is obtained by photographing the user's hand action; determining the operation category corresponding to the hand action according to the image; and responding to the hand action according to the operation category.
In the intelligent interaction method of the embodiment of the first aspect, the operation category is determined from the user hand action image and a corresponding response is made according to that category. No special equipment needs to be worn on the hand, which matches natural interaction habits; processing the image also improves the accuracy of the collected data, so natural and efficient human-machine interaction is achieved.
To achieve the above object, an intelligent interaction device provided in an embodiment of the second aspect of the present application includes: an acquisition module, configured to acquire a user hand action image obtained by photographing the user's hand action; a determining module, configured to determine the operation category corresponding to the hand action according to the image; and a response module, configured to respond to the hand action according to the operation category.
The intelligent interaction device of the embodiment of the second aspect determines the operation category from the user hand action image and responds according to that category, so no special equipment needs to be worn on the hand, natural interaction habits are preserved, the accuracy of the collected data is improved by processing the image, and natural and efficient human-machine interaction is achieved.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart diagram illustrating an intelligent interaction method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating an intelligent interaction method according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a network topology of a user hand motion recognition model in an embodiment of the present application;
FIG. 4 is a schematic diagram of a set of images of a user's hand movements in an embodiment of the present application;
FIG. 5 is a schematic diagram showing an image of a skin region in an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of the present application showing an image of a hand region;
FIG. 7 is a schematic diagram illustrating a hand motion of a user during a single-click operation in the embodiment of the present application;
FIG. 8 is a diagram illustrating a corresponding user hand motion when a text operation is selected in an embodiment of the present application;
FIG. 9 is a schematic diagram of a user's hand movements corresponding to various operations in the embodiment of the present application;
FIG. 10 is a schematic structural diagram of an intelligent interaction device according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an intelligent interaction device according to another embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, in which like or similar reference numerals denote modules that are the same or have the same or similar functionality. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present application, and are not to be construed as limiting it. On the contrary, the embodiments of the application include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
Fig. 1 is a schematic flowchart of an intelligent interaction method according to an embodiment of the present application.
As shown in fig. 1, the method of the present embodiment includes:
s11: acquiring a hand motion image of a user, wherein the hand motion image of the user is obtained by shooting the hand motion of the user.
The hand motion of the user refers to a hand motion formed by combining a track of hand movement of the user, finger activity and the like, the hand motion is generally used for operating screen or aerial display content, the display content may be text, images, application programs and the like displayed on the screen or in the air, and the specific scheme is not limited.
The hand action can be a single-hand action of the user, or a double-hand or multi-hand action of the user, and when the multi-hand action is received, a plurality of users participate in interaction; the hand motions are like fist making, palm opening, forefinger stretching and the like.
It should be noted that the hand motions of the user can be implemented by one hand or both hands or even multiple hands, the same operation can be implemented by one or more hand motions, and the specific hand motions can also be determined in various ways according to application requirements, and are not limited to the hand motions described in the present application.
For example, a camera or a video camera is arranged on the smart device, the camera or the video camera captures the hand motion of the user to obtain a hand motion image of the user, and the processing system may acquire the hand motion image of the user from the camera or the video camera. When shooting the hand motion of a user, one or more frames of images can be obtained by shooting, for example, after continuously shooting the hand of the user by using a camera or a video camera, a plurality of frames of images of the hand motion of the user are obtained, specifically, when shooting, a video camera with an RGBD sensor is generally used for shooting, and RGBD data of the hand motion of the user, namely, color data RGB and depth data D, are directly obtained, so that an RGB image and a depth map of the hand motion of the user can be directly obtained. The RGB color scheme is a color standard in the industry, and various colors are obtained by changing three color channels of red (R), green (G) and blue (B) and superimposing the three color channels on each other, where RGB represents colors of the three channels of red, green and blue, and the color standard almost includes all colors that can be perceived by human vision, and is one of the most widely used color systems at present. The distance of points in the scene relative to the camera can be represented by a depth map (depth map), i.e. each pixel value in the depth map represents the distance between a point in the scene and the camera.
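The following minimal sketch illustrates the data an RGBD sensor delivers for one frame; the array shapes and the millimetre unit are assumptions chosen for illustration, not requirements of the patent.
```python
import numpy as np

# One captured frame pair: a color image (H x W x 3) and an aligned depth map
# (H x W), where each depth value is the distance of that scene point from the camera.
height, width = 480, 640
rgb_frame = np.zeros((height, width, 3), dtype=np.uint8)    # R, G, B channels, 0-255
depth_frame = np.zeros((height, width), dtype=np.uint16)    # distance per pixel (e.g. mm)

# The two images are pixel-aligned, so the same (row, col) index refers to the same
# scene point in both arrays.
center_distance = depth_frame[height // 2, width // 2]
```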
It should be noted that the user can perform the hand action bare-handed, that is, without wearing any special recording device on the hand. When the user operates content displayed on the screen or in the air, the operation may be contactless: the displayed content can be operated without the user touching the screen.
It is understood that the embodiments of the present application take hand operation as an example; operations performed with other parts of the user's body, such as head movements or arm movements, can be handled in the same way as hand actions and therefore belong to equivalent implementations of the embodiments of the present application.
S12: and determining an operation type corresponding to the hand action of the user according to the hand action image of the user.
The operation category corresponding to the user hand motion refers to a category in which the user operates the display content on the screen or in the air, such as moving a cursor, grabbing the content, dragging the content, releasing the content, handwriting, clicking and the like.
When the operation type is determined, for example, the hand motion type and the key point position of the hand of the user are identified according to the image, and then the operation type is determined according to the identified hand motion type and the key point position of the user. For details, reference may be made to the following description.
S13: and responding to the hand motion of the user according to the operation type.
The system can respond correspondingly according to the preset response mode of each operation type.
If the current operation type is the handwriting operation, the system switches to the handwriting mode after determining the operation type of the user so as to receive the handwriting content of the user to perform corresponding handwriting recognition and display a recognition result;
and if the current operation type is clicking, after the system determines the operation type of the user, giving a system response result according to the gesture, such as clicking a screen or an application program displayed in the air by the user, and executing corresponding operation.
It can be understood that before responding to the user hand motion according to the operation category, it can be further determined whether a preset condition is met, so that when the preset condition is met, the user hand motion is responded according to the operation category; and not responding when the preset condition is not met. The preset conditions include, for example: the function of responding to the hand action of the user is started currently, and the operation category belongs to a category supporting response by the user.
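A hypothetical dispatch sketch of step S13 follows. The handler names, the context structure, and the response_enabled flag are illustrative assumptions; the point is only that each operation category maps to a preset response mode and that the response is issued when the preset conditions hold.
```python
def enter_handwriting_mode(ctx):
    # Switch to handwriting mode and wait for handwritten input to recognize and display.
    print("handwriting mode on")

def click_at(ctx):
    # Dispatch a click at the key point position, e.g. on a displayed application.
    print(f"click at {ctx['keypoint']}")

RESPONSE_HANDLERS = {          # preset response mode per operation category
    "handwriting": enter_handwriting_mode,
    "click": click_at,
}

def respond(operation_category, ctx, response_enabled=True):
    # Preset conditions: responding to hand actions is enabled and the category is supported.
    if not response_enabled or operation_category not in RESPONSE_HANDLERS:
        return                                   # no response
    RESPONSE_HANDLERS[operation_category](ctx)

respond("click", {"keypoint": (320, 240)})       # example invocation
```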
In this embodiment, the operation category is determined from the user hand action image and a corresponding response is made according to that category. No special equipment needs to be worn on the hand, which matches natural interaction habits; processing the image also improves the accuracy of the collected data, so natural and efficient human-machine interaction is achieved.
Fig. 2 is a flowchart illustrating an intelligent interaction method according to another embodiment of the present application.
As shown in fig. 2, the method of the present embodiment includes:
s21: and constructing a hand motion recognition model of the user.
The method specifically comprises the following steps:
(1) Acquire training data.
Each set of training data includes input data and output data. In this embodiment the input data comprises the RGB image of the user hand region and the corresponding depth image of the user hand region, and the output data comprises the annotated user hand action category and key point position, generally provided by domain experts.
Specifically, a large number of user hand action images may be collected first, for example by photographing users' hand actions with a camera that has an RGBD sensor, which yields a large number of mutually corresponding RGB images and depth images. The RGB image and the depth image in each set of user hand action images are then segmented to obtain the RGB image and the depth image of the user hand region (the segmentation is described below). Finally, the user hand action category and key point position corresponding to each set of user hand action images are annotated.
The user hand action category is determined by the movement of the user's hand region or the activity of the fingers; example categories include making a fist, opening the palm, and extending the index finger.
It is understood that the hand motion category may be preset according to application requirements, and is not limited to the above example.
The key point position can be chosen according to application requirements; for example, the position of the centre of the fist or the position of the index finger may serve as the key point position.
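A minimal sketch of one annotated training sample, as described above, is shown next; the field names, crop size, and label values are assumptions for illustration.
```python
import numpy as np

# One set of training data: segmented hand-region images as input, expert annotation as output.
training_sample = {
    "input": {
        "rgb_hand_region": np.zeros((64, 64, 3), dtype=np.uint8),   # RGB crop of the hand region
        "depth_hand_region": np.zeros((64, 64), dtype=np.uint16),   # matching depth crop
    },
    "output": {
        "action_category": "open_palm",   # annotated hand action category
        "keypoint": (32, 40),             # annotated key point position (x, y) in the crop
    },
}
```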
(2) Determine the structure of the user hand action recognition model.
The model structure can be chosen as required; this embodiment takes a deep neural network structure as an example.
FIG. 3 shows the network topology of the user hand action recognition model. As shown in FIG. 3, the model includes an input layer, a feature transformation layer, a fully connected layer, and an output layer.
The input layer receives the RGB image and the corresponding depth image of the user hand region. The feature transformation layer transforms the input RGB image and depth image separately to obtain the transformed image features and depth features of the user hand region; this layer is generally a convolutional neural network, and the transformation is the same as in the corresponding layers of a convolutional neural network. The transformed image features and depth features then pass through the fully connected layer into the output layer, which outputs the probability that the current user hand action image belongs to each hand action category, together with the key point position of the current hand action.
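A sketch of such a network in PyTorch follows. The layer sizes, the two-branch layout, and the specific heads are assumptions made for illustration; the patent only fixes the overall topology of FIG. 3 (input layer, convolutional feature transformation, fully connected layer, and an output layer producing category probabilities and a key point position).
```python
import torch
import torch.nn as nn

class HandActionNet(nn.Module):
    """Two convolutional branches (RGB and depth), a fully connected layer, and two outputs."""

    def __init__(self, num_categories=3, num_keypoints=1):
        super().__init__()

        def conv_branch(in_channels):
            # Feature transformation for one input image, as in a small CNN.
            return nn.Sequential(
                nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d(4), nn.Flatten())

        self.rgb_branch = conv_branch(3)       # transforms the RGB image of the hand region
        self.depth_branch = conv_branch(1)     # transforms the depth image of the hand region
        self.fc = nn.Sequential(nn.Linear(2 * 32 * 4 * 4, 128), nn.ReLU())
        self.category_head = nn.Linear(128, num_categories)      # fist, open palm, index extended...
        self.keypoint_head = nn.Linear(128, 2 * num_keypoints)   # (x, y) per key point

    def forward(self, rgb, depth):
        features = torch.cat([self.rgb_branch(rgb), self.depth_branch(depth)], dim=1)
        features = self.fc(features)
        # Probability of each hand action category and the key point position.
        return torch.softmax(self.category_head(features), dim=1), self.keypoint_head(features)
```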
(3) Train on the training data with the chosen structure to construct the user hand action recognition model.
For example, the input data of the training data is fed into the model and propagated through the parameters of each layer, giving as model output the probability that the current user hand action image belongs to each hand action category together with the key point position of the current hand action. The category with the highest probability is taken as the predicted hand action category, and this predicted category together with the predicted key point position form the predicted values. The output data of the training data serve as the ground-truth values; a loss function is computed from the ground-truth and predicted values, and the parameters of each layer are obtained by minimizing this loss, which completes training of the model. Specific model training methods can be found in various existing or future technologies and are not detailed here.
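A hedged training sketch is given below, reusing the HandActionNet sketch above: a classification loss on the annotated category plus a regression loss on the annotated key point, minimized jointly. The random tensors stand in for real batches of segmented hand-region images; batch size, crop size, and learning rate are arbitrary.
```python
import torch
import torch.nn as nn

model = HandActionNet(num_categories=3)               # the sketch defined above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
category_loss = nn.NLLLoss()                          # applied to log-probabilities, since the
keypoint_loss = nn.MSELoss()                          # model already outputs softmax probabilities

for step in range(100):                               # placeholder loop over training batches
    rgb = torch.rand(8, 3, 64, 64)                    # RGB crops of the user hand region
    depth = torch.rand(8, 1, 64, 64)                  # matching depth crops
    category_label = torch.randint(0, 3, (8,))        # expert-annotated action categories
    keypoint_label = torch.rand(8, 2)                 # expert-annotated key point positions

    probs, keypoints = model(rgb, depth)
    loss = (category_loss(torch.log(probs + 1e-8), category_label)
            + keypoint_loss(keypoints, keypoint_label))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # minimizing the loss fits the layer parameters
```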
S22: and acquiring a hand motion image of the user.
For example, the camera or video camera captures the user's hand action once it occurs and produces the user hand action image.
When photographing the user's hand action, a camera with an RGBD sensor can shoot continuously, yielding multiple consecutive sets of user hand action images, each set comprising one RGB image and one depth image.
FIG. 4 shows one set of user hand action images, consisting of one RGB frame and one depth frame that correspond to each other: the RGB image is on the left and the depth image on the right. The RGB image in FIG. 4 is shown as a grayscale image; in an actual implementation it is a color image.
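A small sketch of buffering the most recent consecutive sets of user hand action images is shown below; the buffer size of 15 sets matches the example given later, and the callback name is a hypothetical stand-in for whatever the camera driver provides.
```python
from collections import deque

NUM_SETS = 15                          # how many consecutive sets the application keeps
frame_buffer = deque(maxlen=NUM_SETS)  # oldest sets are discarded automatically

def on_new_frame(rgb_frame, depth_frame):
    # Called once per captured frame pair; each entry is one set of user hand action images.
    frame_buffer.append({"rgb": rgb_frame, "depth": depth_frame})
```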
S23: and determining a user hand area in the user hand action image, and segmenting the user hand action image according to the user hand area to obtain a user hand area image.
When determining the hand area of the user, the method may include:
determining a skin area in the hand action image of the user according to the RGB image;
clustering the pixel points in the skin area to obtain different skin areas;
and according to the depth image, obtaining depth values corresponding to different skin areas, and according to the depth values, determining a user hand area in the user hand action image.
To determine the skin region, the RGB image can be converted into a CrCb image using an existing mapping from RGB space to CrCb space, and a preset skin mask in CrCb space is then ANDed with the CrCb image to determine the skin region in the user hand action image. The skin mask may be built in advance from a large number of collected skin images, such as hand skin images. FIG. 5 shows the skin region corresponding to the RGB image of FIG. 4. To simplify image processing, the image may be converted to a binary image in which skin pixels are shown in white and non-skin pixels in black.
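A sketch of this step using OpenCV follows. The rectangular Cr/Cb range used to build the placeholder mask is a commonly quoted skin range and is only an assumption; the patent instead constructs the mask from a corpus of collected skin images. The input is taken in OpenCV's BGR channel order.
```python
import cv2
import numpy as np

def skin_region(bgr_image, skin_mask_crcb=None):
    """Return a binary image (255 = skin, 0 = non-skin) by ANDing a CrCb skin mask with the frame."""
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    cr, cb = ycrcb[:, :, 1], ycrcb[:, :, 2]

    if skin_mask_crcb is None:
        # Placeholder 256x256 lookup table over (Cr, Cb); True marks skin-like values.
        skin_mask_crcb = np.zeros((256, 256), dtype=bool)
        skin_mask_crcb[133:174, 77:128] = True

    skin = skin_mask_crcb[cr, cb]               # AND the per-pixel (Cr, Cb) values with the mask
    return (skin * 255).astype(np.uint8)
```
The binary output corresponds to the white-on-black skin image of FIG. 5.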
After the skin region is determined, its pixels can be clustered to obtain the different skin regions. The clustering method is not limited; k-means, for example, may be used.
After the different skin regions are obtained, the depth value of each region is taken from the depth image, and the user hand region is determined from these values. For example, the depth value of the pixel corresponding to each region's cluster centre, or the average of the depth values of all pixels in the cluster, can serve as the depth value of that region. Since the hand is usually in front of the body, the skin region with the smallest depth value may be taken as the user hand region. FIG. 6 shows the user hand region determined from the skin regions of FIG. 5.
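The sketch below assumes scikit-learn's k-means and illustrates this selection: skin pixels are clustered by position, each cluster is assigned the mean depth of its pixels, and the cluster closest to the camera is kept as the hand region. The number of clusters is an assumption and would normally depend on the scene.
```python
import numpy as np
from sklearn.cluster import KMeans

def hand_region_mask(skin_binary, depth_frame, n_clusters=2):
    """Return a binary mask of the skin cluster with the smallest (closest) mean depth."""
    rows, cols = np.nonzero(skin_binary)                      # coordinates of skin pixels
    coords = np.stack([rows, cols], axis=1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(coords)

    # Mean depth of each clustered skin region; the hand is usually in front of the body.
    mean_depths = [depth_frame[rows[labels == k], cols[labels == k]].mean()
                   for k in range(n_clusters)]
    hand_cluster = int(np.argmin(mean_depths))

    mask = np.zeros_like(skin_binary)
    mask[rows[labels == hand_cluster], cols[labels == hand_cluster]] = 255
    return mask
```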
After the user hand region is determined, the originally captured user hand action image can be segmented to obtain the user hand region image. For example, when the user hand action image consists of an RGB image and a depth image, the hand region is cut out of both the captured RGB image and the captured depth image, giving the RGB image of the user hand region and the depth image of the user hand region.
S24: and identifying the user hand action category and the key point position corresponding to the user hand action image according to the user hand area image and a pre-constructed user hand action identification model.
Assume the user hand region image includes the RGB image and the depth image of the user hand region, matching the model structure shown in FIG. 3. The segmented RGB image and depth image of the hand region are used as the inputs of the user hand action recognition model, and the model output includes the probability of each user hand action category and the key point position. The category with the highest probability is then taken as the recognized hand action category, and the key point position output by the model as the recognized key point position.
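An inference sketch for this step, reusing the HandActionNet sketch above, is shown next; the label names and tensor shapes are assumptions.
```python
import torch

CATEGORY_NAMES = ["fist", "open_palm", "index_extended"]     # illustrative label set

def recognize(model, rgb_crop, depth_crop):
    """Return the most probable hand action category and the regressed key point position."""
    model.eval()
    with torch.no_grad():
        probs, keypoint = model(rgb_crop.unsqueeze(0), depth_crop.unsqueeze(0))
    category = CATEGORY_NAMES[int(probs.argmax(dim=1))]      # highest-probability category
    return category, keypoint.squeeze(0).tolist()            # key point taken as output
```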
S25: and determining the operation type corresponding to the hand action of the user according to the user hand action type and the key point position corresponding to the continuous multiple groups of user hand action images.
Because the number of simple, easy-to-use hand actions is limited, some operation categories must be realized by combining several hand actions. Clicking, for example, requires the palm to open first and then close into a fist. FIG. 7 shows the hand actions corresponding to a click: the open palm on the left of FIG. 7 followed by the fist on the right. Therefore, to determine the operation category of the user's hand action, multiple sets of user hand action images must be acquired, the hand action category and key point position of each set determined, and the user's operation category then derived from them. How many sets are acquired can be decided by the application, for example 15 consecutive sets.
The user's operation category is determined from the predefined action sequence of each operation category, the key point positions, and the content currently displayed on the screen or in the air.
For example, if an information-query interface is currently displayed on the screen or in the air and the user has entered query information, and the recognized action sequence is an open palm followed by a fist, that is, the earlier sets of acquired images correspond to an open palm and the later sets to a fist, the user's operation category can be judged to be a click.
If multi-line text is currently displayed on the screen or in the air, the hand action category of every acquired set is a fist, and the key point positions move in the same direction, the operation category can be determined to be continuous text selection; FIG. 8 shows the corresponding hand action.
Further, when no operation category matches the combination of hand actions in the acquired frames, the current hand action is treated as invalid: the system gives no response, or prompts the user that the hand action is wrong.
Of course, besides the hand actions corresponding to the operation categories above, the hand actions for other operation categories can be predefined as required, as shown in FIG. 9: panel (a) grabs displayed content, such as text or images; panel (b) moves the cursor, for example to reposition it in front of a given character when a large amount of text is displayed; panel (c) releases content and is generally used together with grabbing or other operations, for example dragging grabbed text and then releasing it at the new position; panel (d) is the handwriting operation, generally used to start handwriting mode when the user needs to input content on the screen or in the air. The present application does not restrict which hand actions are predefined for which operation categories; an operation category may be associated with a combination of hand actions or directly with a single hand action.
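A rule-based sketch of step S25 follows; the thresholds, category names, and direction test are assumptions, since the patent only gives the click and text-selection examples above.
```python
import numpy as np

def operation_category(frame_categories, frame_keypoints, displayed_content="text"):
    """Map per-frame hand action categories and key point positions to an operation category."""
    cats = list(frame_categories)
    half = len(cats) // 2

    # Click: mostly open palm in the earlier frames, then mostly fist in the later frames.
    if (cats[:half].count("open_palm") > 0.8 * half
            and cats[half:].count("fist") > 0.8 * len(cats[half:])):
        return "click"

    # Continuous text selection: every frame is a fist and the key point keeps moving in
    # the same direction, provided multi-line text is currently displayed.
    if displayed_content == "text" and all(c == "fist" for c in cats):
        deltas = np.diff(np.asarray(frame_keypoints, dtype=float), axis=0)
        if len(deltas) and np.all(deltas @ deltas.mean(axis=0) > 0):    # consistent direction
            return "select_text"

    return None   # no matching combination: the hand action is treated as invalid
```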
S26: and responding to the hand motion of the user according to the operation type.
The system responds according to the preset response mode of each operation category.
For example, if the current operation category is handwriting, the system switches to handwriting mode after determining the user's operation category, receives the user's handwritten content, performs handwriting recognition, and displays the recognition result.
If the current operation category is clicking, the system gives a response according to the gesture after determining the operation category; for instance, if the user clicks an application displayed on the screen or in the air, the corresponding operation is executed.
In this embodiment, the operation category is determined from the user hand action image and a corresponding response is made according to that category, so no special equipment needs to be worn on the hand, natural interaction habits are preserved, and processing the image improves the accuracy of the collected data, achieving natural and efficient human-machine interaction. Determining the operation category from the hand action categories and key point positions of consecutive sets of images further improves accuracy; segmenting the hand region image out of the hand action image improves processing efficiency; and constructing the recognition model with deep neural network training improves the recognition accuracy of the user's hand actions.
Fig. 10 is a schematic structural diagram of an intelligent interaction device according to an embodiment of the present application.
As shown in fig. 10, the apparatus 100 of the present embodiment includes: an acquisition module 101, a determination module 102 and a response module 103.
The acquisition module 101 is used for acquiring a hand action image of a user, wherein the hand action image of the user is obtained by shooting the hand action of the user;
the determining module 102 is configured to determine, according to the user hand motion image, an operation category corresponding to the user hand motion;
a response module 103, configured to respond to the hand motion of the user according to the operation category.
In some embodiments, referring to fig. 11, the determining module 102 includes:
a segmentation submodule 1021, configured to determine a user hand region in the user hand motion image, and segment the user hand motion image according to the user hand region to obtain a user hand region image;
the identifying sub-module 1022 is configured to identify a user hand motion category and a key point position corresponding to the user hand motion image according to the user hand region image and a pre-constructed user hand motion identifying model;
the determining sub-module 1023 is used for determining the operation category corresponding to the user hand motion according to the user hand motion category and the key point position corresponding to the continuous multiple groups of user hand motion images.
In some embodiments, the single set of user hand motion images includes: a frame of RGB image and a frame of depth image corresponding to each other.
In some embodiments, the segmentation sub-module 1021 is configured to determine a user hand region in the user hand motion image, comprising:
determining a skin area in the hand action image of the user according to the RGB image;
clustering the skin areas to obtain different clustered skin areas;
and according to the depth image, obtaining depth values corresponding to different skin areas, and according to the depth values, determining a user hand area in the user hand action image.
In some embodiments, the segmentation sub-module 1021 is configured to determine a skin region in the user hand motion image according to the RGB image, and includes:
converting the RGB image into a CrCb image;
and performing AND operation on a skin mask in a preset CrCb space and the CrCb image to determine a skin area in the hand action image of the user.
In some embodiments, the user hand area image includes the RGB image and the depth image of the user hand region, and the recognition sub-module 1022 is specifically configured to:
using the RGB image and the depth image of the user hand area as the input of a user hand motion recognition model to obtain model output, wherein the model output comprises: probability and keypoint location for each user hand action category;
and taking the user hand action category with the highest probability as the identified user hand action category, and taking the key point position output by the model as the identified key point position.
In some embodiments, referring to fig. 11, the apparatus 100 further comprises: a building module 104 for building a user hand motion recognition model, the building module 104 being specifically configured to:
obtaining training data, the training data comprising: RGB images and depth images of the user hand region, obtained by segmenting the collected user hand action images, and annotation information corresponding to the collected user hand action images, the annotation information comprising user hand action categories and key point positions;
determining the structure of a hand motion recognition model of a user;
training is carried out based on the training data and the structure, and a user hand action recognition model is constructed.
In some embodiments, the structure comprises: a deep neural network structure.
It is understood that the apparatus of the present embodiment corresponds to the method embodiment described above, and specific contents may be referred to the related description of the method embodiment, and are not described in detail herein.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (16)

1. An intelligent interaction method, comprising:
acquiring a hand action image of a user, wherein the hand action image of the user comprises multiple continuous groups of hand action images of the user, which are shot for the hand action of the user;
respectively identifying the user hand action categories and the key point positions corresponding to each group of user hand action images;
determining an operation category according to user hand action categories and key point positions corresponding to the multiple groups of user hand action images and content currently displayed on a screen or in the air, wherein the operation category is associated with the multiple groups of user hand action categories, the moving directions of the multiple groups of key point positions and the content currently displayed on the screen or in the air;
and responding to the hand action of the user according to the operation category.
2. The method of claim 1, wherein the identifying the user hand motion categories and the key point positions corresponding to each group of user hand motion images respectively comprises:
determining a user hand region in the user hand motion image, and segmenting the user hand motion image according to the user hand region to obtain a user hand region image;
and identifying the user hand action category and the key point position corresponding to the user hand action image according to the user hand area image and a pre-constructed user hand action identification model.
3. The method of claim 2, wherein a single set of user hand motion images comprises: a frame of RGB image and a frame of depth image corresponding to each other.
4. The method of claim 3, wherein the determining a user hand region in the user hand motion image comprises:
determining a skin area in the hand action image of the user according to the RGB image;
clustering the skin areas to obtain different clustered skin areas;
and according to the depth image, obtaining depth values corresponding to different skin areas, and according to the depth values, determining a user hand area in the user hand action image.
5. The method of claim 4, wherein determining the skin region in the image of the user's hand motion from the RGB image comprises:
converting the RGB image into a CrCb image;
and performing AND operation on a skin mask in a preset CrCb space and the CrCb image to determine a skin area in the hand action image of the user.
6. The method of claim 2, wherein the user hand area image comprises: an RGB image and a depth image of the user hand area, and the identifying the user hand action category and the key point position corresponding to the user hand action image according to the user hand area image and the pre-constructed user hand action identification model comprises:
using the RGB image and the depth image of the user hand area as the input of a user hand motion recognition model to obtain model output, wherein the model output comprises: probability and keypoint location for each user hand action category;
and taking the user hand action category with the highest probability as the identified user hand action category, and taking the key point position output by the model as the identified key point position.
7. The method of claim 2, further comprising: constructing a user hand action recognition model, wherein the constructing of the user hand action recognition model comprises the following steps:
obtaining training data, the training data comprising: RGB images and depth images of the user hand area, obtained by segmenting collected user hand action images, and annotation information corresponding to the collected user hand action images, the annotation information comprising user hand action categories and key point positions;
determining the structure of a hand motion recognition model of a user;
training is carried out based on the training data and the structure, and a user hand action recognition model is constructed.
8. The method of claim 7, wherein the structure comprises: a deep neural network structure.
9. An intelligent interaction device, comprising:
the acquisition module is used for acquiring hand motion images of a user, wherein the hand motion images of the user comprise multiple continuous groups of hand motion images of the user, which are shot for the hand motion of the user;
the determining module is used for respectively identifying user hand action categories and key point positions corresponding to each group of user hand action images, and determining operation categories according to the user hand action categories and key point positions corresponding to the multiple groups of user hand action images and the content currently displayed on a screen or in the air, wherein the operation categories are associated with the multiple groups of user hand action categories, the moving directions of the multiple groups of key point positions and the content currently displayed on the screen or in the air;
and the response module is used for responding to the hand action of the user according to the operation category.
10. The apparatus of claim 9, wherein the determining module comprises:
the segmentation sub-module is used for determining a user hand region in the user hand action image and segmenting the user hand action image according to the user hand region to obtain a user hand region image;
and the identification submodule is used for identifying the user hand action category and the key point position corresponding to the user hand action image according to the user hand area image and a pre-constructed user hand action identification model.
11. The apparatus of claim 10, wherein a single set of user hand motion images comprises: a frame of RGB image and a frame of depth image corresponding to each other.
12. The apparatus of claim 11, wherein the segmentation sub-module is configured to determine a user hand region in the user hand motion image, and comprises:
determining a skin area in the hand action image of the user according to the RGB image;
clustering the skin areas to obtain different clustered skin areas;
and according to the depth image, obtaining depth values corresponding to different skin areas, and according to the depth values, determining a user hand area in the user hand action image.
13. The apparatus of claim 12, wherein the segmentation sub-module is configured to determine a skin region in the user hand motion image from the RGB image, and comprises:
converting the RGB image into a CrCb image;
and performing AND operation on a skin mask in a preset CrCb space and the CrCb image to determine a skin area in the hand action image of the user.
14. The apparatus of claim 10, wherein the user hand area image comprises: the RGB image and the depth image of the user hand area, and the recognition submodule is specifically configured to:
using the RGB image and the depth image of the user hand area as the input of a user hand motion recognition model to obtain model output, wherein the model output comprises: probability and keypoint location for each user hand action category;
and taking the user hand action category with the highest probability as the identified user hand action category, and taking the key point position output by the model as the identified key point position.
15. The apparatus of claim 10, further comprising: a construction module for constructing a user hand motion recognition model, the construction module being specifically configured to:
obtaining training data, the training data comprising: RGB images and depth images of the user hand area, obtained by segmenting collected user hand action images, and annotation information corresponding to the collected user hand action images, the annotation information comprising user hand action categories and key point positions;
determining the structure of a hand motion recognition model of a user;
training is carried out based on the training data and the structure, and a user hand action recognition model is constructed.
16. The apparatus of claim 15, wherein the structure comprises: a deep neural network structure.
CN201611025898.3A 2016-11-17 2016-11-17 Intelligent interaction method and device Active CN106547356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611025898.3A CN106547356B (en) 2016-11-17 2016-11-17 Intelligent interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611025898.3A CN106547356B (en) 2016-11-17 2016-11-17 Intelligent interaction method and device

Publications (2)

Publication Number Publication Date
CN106547356A CN106547356A (en) 2017-03-29
CN106547356B true CN106547356B (en) 2020-09-11

Family

ID=58394834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611025898.3A Active CN106547356B (en) 2016-11-17 2016-11-17 Intelligent interaction method and device

Country Status (1)

Country Link
CN (1) CN106547356B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291221B (en) * 2017-05-04 2019-07-16 浙江大学 Across screen self-adaption accuracy method of adjustment and device based on natural gesture
CN108257139B (en) * 2018-02-26 2020-09-08 中国科学院大学 RGB-D three-dimensional object detection method based on deep learning
CN108733287A (en) * 2018-05-15 2018-11-02 东软集团股份有限公司 Detection method, device, equipment and the storage medium of physical examination operation
CN109117746A (en) * 2018-07-23 2019-01-01 北京华捷艾米科技有限公司 Hand detection method and machine readable storage medium
CN109325972B (en) * 2018-07-25 2020-10-27 深圳市商汤科技有限公司 Laser radar sparse depth map processing method, device, equipment and medium
CN110399794A (en) * 2019-06-20 2019-11-01 平安科技(深圳)有限公司 Gesture recognition method, device, equipment and storage medium based on human body
CN110414393A (en) * 2019-07-15 2019-11-05 福州瑞芯微电子股份有限公司 A kind of natural interactive method and terminal based on deep learning
CN112383805A (en) * 2020-11-16 2021-02-19 四川长虹电器股份有限公司 Method for realizing man-machine interaction at television end based on human hand key points
CN112686231B (en) * 2021-03-15 2021-06-01 南昌虚拟现实研究院股份有限公司 Dynamic gesture recognition method and device, readable storage medium and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737235A (en) * 2012-06-28 2012-10-17 中国科学院自动化研究所 Head posture estimation method based on depth information and color image
CN102854983A (en) * 2012-09-10 2013-01-02 中国电子科技集团公司第二十八研究所 Man-machine interaction method based on gesture recognition
CN104598915A (en) * 2014-01-24 2015-05-06 深圳奥比中光科技有限公司 Gesture recognition method and gesture recognition device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9524028B2 (en) * 2013-03-08 2016-12-20 Fastvdo Llc Visual language for human computer interfaces

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737235A (en) * 2012-06-28 2012-10-17 中国科学院自动化研究所 Head posture estimation method based on depth information and color image
CN102854983A (en) * 2012-09-10 2013-01-02 中国电子科技集团公司第二十八研究所 Man-machine interaction method based on gesture recognition
CN104598915A (en) * 2014-01-24 2015-05-06 深圳奥比中光科技有限公司 Gesture recognition method and gesture recognition device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Real-time gesture recognition and virtual handwriting system based on depth information (基于深度信息的实时手势识别和虚拟书写系统); 黄晓林, 董洪伟; Computer Engineering and Applications (《计算机工程与应用》); 2015-09-16; pp. 167-173 *

Also Published As

Publication number Publication date
CN106547356A (en) 2017-03-29

Similar Documents

Publication Publication Date Title
CN106547356B (en) Intelligent interaction method and device
CN106598227B (en) Gesture identification method based on Leap Motion and Kinect
US20190126484A1 (en) Dynamic Multi-Sensor and Multi-Robot Interface System
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
US10092220B2 (en) System and method for motion capture
WO2020125499A1 (en) Operation prompting method and glasses
CN109800676B (en) Gesture recognition method and system based on depth information
CN107168527A (en) The first visual angle gesture identification and exchange method based on region convolutional neural networks
CN103530613A (en) Target person hand gesture interaction method based on monocular video sequence
KR102441171B1 (en) Apparatus and Method for Monitoring User based on Multi-View Face Image
JP2014137818A (en) Method and device for identifying opening and closing operation of palm, and man-machine interaction method and facility
JPH08322033A (en) Method for creating color table in computer unit in order toclassify pixel in image
CN103353935A (en) 3D dynamic gesture identification method for intelligent home system
CN103995595A (en) Game somatosensory control method based on hand gestures
CN114138121B (en) User gesture recognition method, device and system, storage medium and computing equipment
CN205080499U (en) Mutual equipment of virtual reality based on gesture recognition
CN105912126A (en) Method for adaptively adjusting gain, mapped to interface, of gesture movement
CN114072839A (en) Hierarchical motion representation and extraction in monocular still camera video
KR101654311B1 (en) User motion perception method and apparatus
WO2022267653A1 (en) Image processing method, electronic device, and computer readable storage medium
Hartanto et al. Real time hand gesture movements tracking and recognizing system
CN108521594B (en) Free viewpoint video playing method based on motion sensing camera gesture recognition
CN103201706A (en) Method for driving virtual mouse
CN111291746A (en) Image processing system and image processing method
CN111860082A (en) Information processing method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant