CN112328090B - Gesture recognition method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112328090B
Authority
CN
China
Prior art keywords
gesture recognition
region
hand
human body
preset
Prior art date
Legal status
Active
Application number
CN202011363248.6A
Other languages
Chinese (zh)
Other versions
CN112328090A (en
Inventor
赵代平
许佳
孔祥晖
孙德乾
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202011363248.6A priority Critical patent/CN112328090B/en
Publication of CN112328090A publication Critical patent/CN112328090A/en
Priority to PCT/CN2021/086967 priority patent/WO2022110614A1/en
Application granted granted Critical
Publication of CN112328090B publication Critical patent/CN112328090B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — Physics
        • G06 — Computing; calculating or counting
            • G06F — Electric digital data processing
                • G06F3/00 — Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
                    • G06F3/01 — Input arrangements or combined input and output arrangements for interaction between user and computer
                        • G06F3/017 — Gesture based interaction, e.g. based on a set of recognized hand gestures
            • G06V — Image or video recognition or understanding
                • G06V20/00 — Scenes; scene-specific elements
                    • G06V20/40 — Scenes; scene-specific elements in video content
                • G06V40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
                    • G06V40/20 — Movements or behaviour, e.g. gesture recognition
                        • G06V40/28 — Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure relates to a gesture recognition method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a video to be recognized; performing human body detection on the video to obtain the number of first objects included in the video; and recognizing gestures in the video in a gesture recognition mode corresponding to the number of first objects to obtain a gesture recognition result. The disclosure can improve gesture recognition in multi-person scenes.

Description

Gesture recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a gesture recognition method and apparatus, an electronic device, and a storage medium.
Background
Gesture interaction is a human-computer interaction mode in which computer technology is used to recognize human gesture language and convert it into commands for operating and controlling an electronic device (e.g., a smart television or a smart air conditioner). Gesture recognition is a key technology for realizing gesture interaction.
Remote, non-contact gesture recognition scenarios can involve various complications, such as multiple people and multiple hands appearing in a video image. In the related art, however, the recognition accuracy of control gestures in such complex cases is low.
Disclosure of Invention
The disclosure provides a gesture recognition technical scheme.
According to an aspect of the present disclosure, there is provided a gesture recognition method including: acquiring a video to be identified; carrying out human body detection on the video to obtain the number of first objects included in the video; and recognizing the gestures in the video according to a gesture recognition mode corresponding to the number of the first objects to obtain a gesture recognition result.
In one possible implementation, the number of the first objects is greater than or equal to two, and the video includes a first video frame; the recognizing the gestures in the video according to the gesture recognition modes corresponding to the number of the first objects to obtain gesture recognition results comprises the following steps: respectively acquiring a human body area and a hand area of each object in the first video frame; determining a second object from the first object based on a position relation between the human body area and a first preset area, wherein the human body area of each object comprises a first human body area of the second object, the hand area of each object comprises a first hand area of the second object, and the first human body area is located in the first preset area; and under the condition that the position relation between the first hand region and the first human body region meets a preset position condition, performing gesture recognition on the first hand region to obtain a first gesture recognition result.
In one possible implementation, the first gesture recognition result includes one of a valid gesture recognition result and an invalid gesture recognition result.
In one possible implementation, the video includes a second video frame that follows the first video frame; under the condition that the first gesture recognition result comprises an invalid gesture recognition result, recognizing the gestures in the video according to a gesture recognition mode corresponding to the number of the first objects to obtain a gesture recognition result, and further comprising: respectively acquiring a second human body area and a second hand area of the second object in the second video frame; and under the condition that the position relation between the second hand area and the second human body area meets the preset position condition, performing gesture recognition on the second hand area to obtain a second gesture recognition result.
In one possible implementation, the body region of each object includes a third body region of a third object, and the hand region of each object includes a third hand region of the third object; the recognizing the gestures in the video according to the gesture recognition modes corresponding to the number of the first objects to obtain gesture recognition results, further comprising: determining the third object from the first object under the condition that the position relation between the first hand region and the first human body region does not meet the preset position condition, wherein the third human body region is located in a second preset region; and under the condition that the position relation between the third hand area and the third human body area meets the preset position condition, performing gesture recognition on the third hand area to obtain a third gesture recognition result.
In a possible implementation manner, the second preset area is partially overlapped with the first preset area, or the second preset area is adjacent to the first preset area.
In one possible implementation, the video includes a second video frame following the first video frame, and a third video frame following the second video frame; after the obtaining of the third gesture recognition result, the method further includes: and performing gesture recognition on the fourth hand region to obtain a fourth gesture recognition result in response to that the position relation between the fourth hand region of the second object and the fourth human body region of the second object in the third video frame meets the preset position condition.
In one possible implementation, the first preset area includes a central area of a video frame of the video; the determining the second object from the first object based on the position relationship between the human body region and the first preset region includes: determining a human body region with the smallest distance from the first preset region among the plurality of human body regions as the first human body region when the first preset region comprises the plurality of human body regions; and determining the object corresponding to the first human body region as the second object.
In one possible implementation, the method further includes: and if the first hand regions comprise two hand regions and the preset gesture is a one-hand gesture, determining one hand region in the first hand regions as the first hand region.
In one possible implementation manner, the preset position condition includes: a first height difference between a hand region height and a crotch region height of a target object, greater than or equal to a height threshold, the height threshold positively correlated with a second height difference, the second height difference being a height difference between a shoulder region height and the crotch region height of the target object, the target object comprising at least one of the second object and a third object.
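The preset position condition above ("hand raised high enough relative to the torso") can be illustrated with a minimal sketch. The coordinate convention, function names, and the proportionality factor ALPHA are assumptions; the disclosure only states that the height threshold is positively correlated with the shoulder-to-crotch height difference.

```python
# Assumed factor relating the threshold to the shoulder-crotch height;
# the disclosure does not fix a value.
ALPHA = 0.5

def position_condition(hand_y, crotch_y, shoulder_y):
    """Return True when the hand is raised far enough above the crotch.
    y values are in image coordinates (larger y = lower in the image)."""
    first_diff = crotch_y - hand_y        # hand height above the crotch region
    second_diff = crotch_y - shoulder_y   # shoulder height above the crotch region
    threshold = ALPHA * second_diff       # taller torso -> larger threshold
    return first_diff >= threshold
```

With this convention, a hand at the crotch level gives a first height difference of zero and fails the condition, while a hand raised near the shoulders passes it.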
In one possible implementation, the method further includes: and controlling the electronic equipment to execute the operation corresponding to the effective gesture recognition result under the condition that the gesture recognition result is the effective gesture recognition result.
According to an aspect of the present disclosure, there is provided a gesture recognition apparatus including: the acquisition module is used for acquiring a video to be identified; the detection module is used for carrying out human body detection on the video to obtain the number of first objects included in the video; and the recognition module is used for recognizing the gestures in the video according to the gesture recognition modes corresponding to the number of the first objects to obtain gesture recognition results.
In one possible implementation, the number of the first objects is greater than or equal to two, and the video includes a first video frame; the identification module comprises: the first acquisition sub-module is used for respectively acquiring a human body area and a hand area of each object in the first video frame; a first determining sub-module, configured to determine a second object from the first objects based on a position relationship between the body region and a first preset region, where the body region of each object includes a first body region of the second object, the hand region of each object includes a first hand region of the second object, and the first body region is located in the first preset region; the first recognition submodule is used for performing gesture recognition on the first hand region under the condition that the position relation between the first hand region and the first human body region meets a preset position condition to obtain a first gesture recognition result.
In one possible implementation, the first gesture recognition result includes one of a valid gesture recognition result and an invalid gesture recognition result.
In one possible implementation, the video includes a second video frame that follows the first video frame; in a case where the first gesture recognition result includes an invalid gesture recognition result, the recognition module further includes: a second obtaining sub-module, configured to obtain a second human body region and a second hand region of the second object in the second video frame, respectively; and the second recognition submodule is used for performing gesture recognition on the second hand region to obtain a second gesture recognition result under the condition that the position relation between the second hand region and the second human body region meets the preset position condition.
In one possible implementation, the body region of each object includes a third body region of a third object, and the hand region of each object includes a third hand region of the third object; the identification module further comprises: a second determination sub-module configured to determine the third object from the first object when a positional relationship between the first hand region and the first human body region does not satisfy the preset positional condition, the third human body region being located in a second preset region; and the third recognition submodule is used for performing gesture recognition on the third hand area under the condition that the position relation between the third hand area and the third human body area meets the preset position condition to obtain a third gesture recognition result.
In a possible implementation manner, the second preset area partially overlaps with the first preset area, or the second preset area is adjacent to the first preset area.
In one possible implementation, the video includes a second video frame following the first video frame, and a third video frame following the second video frame; after the obtaining of the third gesture recognition result, the apparatus further includes: and the fourth recognition submodule is used for performing gesture recognition on a fourth hand area of the second object in response to the fact that the position relation between the fourth hand area of the second object and the fourth human body area of the second object in the third video frame meets the preset position condition to obtain a fourth gesture recognition result.
In one possible implementation, the first preset area includes a central area of a video frame of the video; the first determination submodule includes: a human body region determining unit, configured to determine, as the first human body region, a human body region of the plurality of human body regions having a smallest distance from the first preset region when the first preset region includes the plurality of human body regions; and the object determining unit is used for determining the object corresponding to the first human body area as the second object.
In one possible implementation, the apparatus further includes: a hand region determination module, configured to determine one hand region of the first hand regions as the first hand region when the first hand regions include two hand regions and the preset gesture is a one-hand gesture.
In one possible implementation manner, the preset position condition includes: a first height difference between a hand region height and a crotch region height of a target object, greater than or equal to a height threshold, the height threshold positively correlated with a second height difference, the second height difference being a height difference between a shoulder region height and the crotch region height of the target object, the target object comprising at least one of the second object and a third object.
In one possible implementation, the apparatus further includes: and the control module is used for controlling the electronic equipment to execute the operation corresponding to the effective gesture recognition result under the condition that the gesture recognition result is the effective gesture recognition result.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiments of the present disclosure, human body detection is performed on the video to obtain the number of first objects included in the video, and the gestures in the video are recognized in a gesture recognition mode corresponding to that number to obtain a gesture recognition result. In this way, a gesture recognition mode can be selected according to the number of people in the video, so that a more targeted mode is adopted for a single-person or multi-person scene. In particular, in a multi-person scene, gesture recognition accuracy can be improved, addressing the technical problem of low recognition accuracy of control gestures in complex situations.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow diagram of a gesture recognition method according to an embodiment of the present disclosure.
Fig. 2a shows a first schematic diagram of a preset gesture according to an embodiment of the present disclosure.
FIG. 2b shows a second schematic diagram of a preset gesture according to an embodiment of the present disclosure.
Fig. 2c shows a third schematic diagram of a preset gesture according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a preset area according to an embodiment of the present disclosure.
FIG. 4 illustrates a schematic diagram of a human body keypoint in accordance with an embodiment of the disclosure.
FIG. 5 shows a schematic diagram of a gesture recognition process according to an embodiment of the present disclosure.
Fig. 6 illustrates a block diagram of a gesture recognition apparatus according to an embodiment of the present disclosure.
Fig. 7 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
FIG. 8 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the set consisting of A, B, and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the subject matter of the present disclosure.
Fig. 1 shows a flowchart of a gesture recognition method according to an embodiment of the present disclosure, as shown in fig. 1, the gesture recognition method includes:
in step S11, a video to be recognized is acquired.
In step S12, human body detection is performed on the video, and the number of first objects included in the video is obtained.
In step S13, the gestures in the video are recognized in a gesture recognition manner corresponding to the number of the first objects, and a gesture recognition result is obtained.
In one possible implementation, the gesture recognition method may be performed by an electronic device such as a terminal device or a server. The terminal device may be user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling computer-readable instructions stored in a memory.
In one possible implementation, the video to be recognized may be acquired in step S11. The video to be identified may be a video captured by a video capture device. The video capture device may be any suitable video capture device known in the art, such as, but not limited to, a general webcam, a depth camera, a digital camera, etc., and the present disclosure is not limited to a particular type of video capture device.
In a possible implementation manner, the video capture device in the present disclosure may be a device in an electronic device that performs a gesture recognition method; or may be a device independent of the electronic apparatus performing the gesture recognition method. The embodiments of the present disclosure are not limited to the relationship between the video capture device and the electronic device.
In one possible implementation manner, in step S12, any known human body detection manner may be used to perform human body detection on the video, for example, the human body detection manner may include: extracting human key points (such as human key points of 13 joint parts) in a video frame of a video, wherein the number and the positions of the human key points can be determined according to actual requirements, and are not limited herein; or, the human body contour in the video frame can be extracted. The embodiment of the present disclosure is not limited to what human body detection method is adopted.
It can be understood that, in the video to be recognized acquired in step S11, some video frames may include a human body while others may not. By performing human body detection on the video, the video frames containing the first object, that is, the video frames containing a human body, can be determined. A video frame may contain one or more human bodies; accordingly, the first object may include one or more objects.
In one possible implementation, the first object may comprise one or more objects, as described above. In step S12, the number of first objects included in the video may be determined according to the human body detection result. The human body detection result can be detected human body key points or human body contours, and the number of human bodies in the video frame can be determined according to the human body key points or the human body contours, so that the number of the first objects in the video frame is determined.
In a possible implementation manner, gesture recognition manners corresponding to the number of the first objects may be preset, and then in step S13, the gestures in the video are recognized according to the gesture recognition manners corresponding to the number of the first objects.
In a possible implementation manner, presetting the gesture recognition modes corresponding to the number of first objects may include: when the number of first objects is equal to one, taking the first object as the control object and performing gesture recognition on the hand region of one hand or both hands of the control object; and when the number of first objects is greater than or equal to two, determining a control object from the first objects according to their positions and performing gesture recognition on the hand region of one hand or both hands of that control object. The single hand may be either the left hand or the right hand. The control object refers to the object that controls the electronic device through gestures.
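The dispatch described above can be sketched as follows. This is an illustrative reading, not the disclosed implementation: all names are assumptions, boxes are (x1, y1, x2, y2) tuples, and "determined by position" is interpreted here as "nearest to the frame centre".

```python
def pick_control_object(person_boxes, frame_size):
    """Choose the person whose body-box centre is nearest the frame centre."""
    cx, cy = frame_size[0] / 2, frame_size[1] / 2
    def dist_sq(box):
        bx = (box[0] + box[2]) / 2
        by = (box[1] + box[3]) / 2
        return (bx - cx) ** 2 + (by - cy) ** 2
    return min(person_boxes, key=dist_sq)

def select_control_object(person_boxes, frame_size):
    """Dispatch on the number of detected first objects."""
    if not person_boxes:
        return None                    # no human body detected in the frame
    if len(person_boxes) == 1:
        return person_boxes[0]         # single-person scene: use that object
    return pick_control_object(person_boxes, frame_size)  # multi-person scene
```

For example, with a 100x100 frame and two detected boxes, the box centred on (50, 50) would be selected as the control object.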
In one possible implementation, the gesture recognition result may include one of a valid gesture recognition result and an invalid gesture recognition result. The valid gesture recognition result can comprise a result matched with a preset gesture; the invalid gesture recognition result may include a result that does not match the preset gesture. The preset gesture can be a preset gesture graph corresponding to the operation of the electronic equipment.
Fig. 2a, 2b and 2c illustrate schematic diagrams of preset gestures according to an embodiment of the present disclosure. The preset gesture may include, but is not limited to, the gestures shown in fig. 2a, 2b and 2c, the gesture of fig. 2a may correspond to a confirmation operation, the gesture of fig. 2b may correspond to a switching operation, and the gesture of fig. 2c may correspond to a closing operation. The gesture recognition result matched with any one of the gestures shown in fig. 2a, 2b and 2c may be a valid gesture recognition result; accordingly, the gesture recognition result that does not match all the gestures shown in fig. 2a, 2b and 2c may be an invalid gesture recognition result.
It should be noted that, although the preset gesture and the operation corresponding to the preset gesture are described as examples, a person skilled in the art can understand that the disclosure should not be limited thereto. In fact, the user can set different preset gestures and set operations corresponding to the different preset gestures completely according to the actual application scene, and the embodiment of the present disclosure is not limited thereto.
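The gesture-to-operation correspondence above amounts to a simple lookup table, sketched below. The label strings are placeholders standing in for the gestures of Figs. 2a-2c, not names from the disclosure.

```python
# Placeholder labels for the preset gestures of Figs. 2a, 2b and 2c.
GESTURE_OPERATIONS = {
    "gesture_2a": "confirm",   # Fig. 2a -> confirmation operation
    "gesture_2b": "switch",    # Fig. 2b -> switching operation
    "gesture_2c": "close",     # Fig. 2c -> closing operation
}

def operation_for(gesture_label):
    """Return the operation for a recognized gesture, or None if the result
    does not match any preset gesture (an invalid gesture recognition result)."""
    return GESTURE_OPERATIONS.get(gesture_label)
```

As the note above says, users could populate this table with entirely different gestures and operations for their application scenario.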
In a possible implementation manner, the gesture recognition performed on the gesture in the video may adopt any known gesture recognition manner, for example, a gesture recognition manner based on geometric features, a gesture recognition manner based on direction histogram, and the like, and the embodiment of the present disclosure is not limited thereto.
In a possible implementation manner, whether the gesture recognition result is matched with the preset gesture may be determined according to the confidence of the gesture recognition result. The gesture recognition result matched with the preset gesture may be a valid gesture recognition result, and the gesture recognition result unmatched with the preset gesture may be an invalid gesture recognition result. The confidence level may be a measure for measuring the confidence level of the gesture recognition result. The confidence may be derived from a pre-trained neural network output. The higher the confidence, the higher the confidence representing the confidence of the recognized gesture recognition result. A confidence threshold may be set for confidence, and a gesture recognition result corresponding to a confidence higher than the confidence threshold may be considered as a valid gesture recognition result. The present disclosure does not limit the specific values of the confidence thresholds.
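A minimal sketch of the confidence check described above. The threshold value, label names, and the network output format (a dict of per-gesture confidences) are assumptions; the disclosure does not fix a specific value or representation.

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed value; the disclosure fixes no number

def classify_result(scores, preset_gestures):
    """scores: dict mapping gesture label -> confidence from the network.
    Returns (label, True) for a valid gesture recognition result,
    or (None, False) for an invalid one."""
    label = max(scores, key=scores.get)
    if label in preset_gestures and scores[label] >= CONFIDENCE_THRESHOLD:
        return label, True     # matches a preset gesture with high confidence
    return None, False         # invalid gesture recognition result
```

A result is thus invalid either because no preset gesture matches or because the best match falls below the confidence threshold.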
In the embodiment of the disclosure, the number of the first objects included in the video is obtained by performing human body detection on the video, the gestures in the video are recognized according to the gesture recognition modes corresponding to the number of the first objects, so as to obtain the gesture recognition result, and the corresponding gesture recognition modes can be selected according to the number of people in the video for gesture recognition, so that the gesture recognition effect of a multi-user scene is improved.
In one possible implementation, the number of first objects may be greater than or equal to two, as described above. The video may include a first video frame. In step S13, recognizing the gestures in the video according to the gesture recognition manners corresponding to the number of the first objects to obtain a gesture recognition result, which may include:
respectively acquiring a human body area and a hand area of each object in a first video frame;
determining second objects from the first objects based on the position relation between the human body areas and the first preset area, wherein the human body area of each object comprises a first human body area of the second object, the hand area of each object comprises a first hand area of the second object, and the first human body area is located in the first preset area;
and under the condition that the position relation between the first hand area and the first human body area meets a preset position condition, performing gesture recognition on the first hand area to obtain a first gesture recognition result.
In one possible implementation, the human body region may be determined according to the human body detection result in the present disclosure. For example, a human body frame can be determined according to the detected human body key points or human body contours, and the region of the human body frame in the video frame is used as the human body region of each object; or the detected region of the human body contour in the video frame may also be directly used as the human body region of each object, which is not limited in this embodiment of the present disclosure.
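The first option above (deriving a body frame from detected keypoints) can be sketched as an axis-aligned bounding box over the joints; the box format, margin, and function name are illustrative assumptions.

```python
def body_region_from_keypoints(keypoints, margin=10):
    """keypoints: list of (x, y) joint positions for one person.
    Returns a box (x1, y1, x2, y2) enclosing them with a small margin."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)
```

The same construction applies to hand keypoints when deriving the hand region discussed below.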
In a possible implementation manner, any known hand detection manner may be adopted to perform hand detection on the video frame, and then the hand area of each object is determined according to the hand detection result. The hand detection method may include, for example: extracting hand key points (such as key points of 20 hand joint parts) in the video frame, wherein the number and the positions of the hand key points can be determined according to actual requirements, and are not limited herein; or may also extract hand contours in the video frame, etc. The embodiment of the present disclosure does not limit what hand detection method is used.
In one possible implementation, the hand region may be determined from the hand detection results in the present disclosure. For example, a hand frame may be determined according to the detected hand key points or hand contours, and the area of the hand frame in the video frame may be used as the hand area of each object; or, the area of the detected hand contour in the video frame may also be directly used as the hand area of each object, which is not limited in this embodiment of the disclosure.
In one possible implementation, the first preset region may be a region for determining the first object in the video frame, for example, the first preset region may be a central region of the video frame. The range, shape, and the like of the first preset area may be set according to actual requirements, and the embodiment of the present disclosure is not limited.
In one possible implementation, for each video frame in the video, human body detection and hand detection may be performed simultaneously, or sequentially in a set detection order. The detection order may be set according to factors that may affect it, such as the processing capability of the device implementing the detection function, the resource occupation of that device, and the limitation on time delay in the application process. The embodiments of the present disclosure are not limited in this respect.
In one possible implementation, the positional relationship between the human body region and the first preset region may be the distance between the human body region of each object and the first preset region, or may be the degree of overlap between the human body region of each object and the first preset region. Correspondingly, when determining the second object from the first objects based on this positional relationship, the object corresponding to the human body region with the shortest distance to the first preset region may be taken as the second object; alternatively, the object corresponding to the human body region with the largest degree of overlap with the first preset region may be taken as the second object. The embodiments of the present disclosure are not limited in this respect.
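The overlap-based selection above can be sketched as follows, assuming axis-aligned rectangular regions represented as (x1, y1, x2, y2); the function names are illustrative.

```python
def overlap_area(a, b):
    """Area of intersection of two axis-aligned boxes (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def pick_second_object(body_regions, preset_region):
    """Index of the body region with the largest overlap with the first
    preset region, or None if no region overlaps it at all."""
    best, best_overlap = None, 0
    for i, region in enumerate(body_regions):
        ov = overlap_area(region, preset_region)
        if ov > best_overlap:
            best, best_overlap = i, ov
    return best
```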
In a possible implementation manner, the first human body area is located in the first preset area, and may be that all areas of the first human body area are located in the first preset area, or that a partial area of the first human body area is located in the first preset area. It may be understood that, in a case where there is an overlapping area of the first human body area and the first preset area, it is determined that the first human body area is located in the first preset area.
In one possible implementation, the positional relationship between the first hand region and the first human body region of the second object may include a height relationship between the first hand region and the first human body region. The preset position condition may be a preset condition for determining whether to perform gesture recognition based on a position relationship between the hand region and the human body region. For example, the preset position condition may be that the hand region is higher than the crotch region, or the hand region is higher than the elbow region, or the like. Specific content of the preset position condition may be set according to a gesture capable of controlling the electronic device to perform a corresponding operation, which is not limited in the embodiment of the present disclosure.
It can be understood that when a manipulation object expects to control the electronic device through a gesture, the control gesture is usually issued in a hand-up state. Therefore, performing gesture recognition on the first hand region only in a case where the positional relationship between the first hand region and the first human body region satisfies the preset position condition, that is, in a case where the second object is in the hand-up state, makes the recognized gesture more likely to be effective and reduces misoperation.
In a possible implementation manner, in a case that a positional relationship between the first hand region and the first human body region satisfies a preset position condition, gesture recognition may be performed on the first hand region to obtain a first gesture recognition result. The gesture recognition on the first hand region may adopt any known gesture recognition mode, for example, a gesture recognition mode based on geometric features, a gesture recognition mode based on direction histogram, and the like, which is not limited in this embodiment of the present disclosure.
In one possible implementation, the first gesture recognition result may include one of a valid gesture recognition result and an invalid gesture recognition result. The valid gesture recognition result may include a result that matches a preset gesture; the invalid gesture recognition result may include a result that does not match the preset gesture. The preset gesture may be a preset gesture pattern corresponding to an operation of the electronic device.
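The valid/invalid distinction can be sketched as a simple membership test against the set of preset gestures; the gesture labels here are illustrative assumptions.

```python
def classify_result(recognized_gesture, preset_gestures):
    """Return 'valid' if the recognized gesture matches one of the preset
    gestures that correspond to operations of the electronic device,
    otherwise 'invalid'."""
    return "valid" if recognized_gesture in preset_gestures else "invalid"
```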
In the embodiment of the disclosure, the second object can be determined from the plurality of first objects, and the hand region of the second object is subjected to gesture recognition when the second object is in a hand-up state, so that an effective control object is determined in a multi-person scene, and then the gesture recognition is performed on the control object, so that the effectiveness of gesture recognition can be improved, and misoperation can be reduced.
In one possible implementation, the method further includes: in a case where the gesture recognition result is a valid gesture recognition result, controlling the electronic device to execute the operation corresponding to the valid gesture recognition result. For example, in a case where the first gesture recognition result indicates that the recognized gesture matches the gesture shown in fig. 2a, the electronic device may be controlled to perform a confirmation operation.
In a possible implementation manner, when the recognized gesture recognition result is a valid gesture recognition result, the electronic device may be controlled to execute a corresponding operation by sending an operation instruction corresponding to a preset gesture.
In one possible implementation, the electronic device being controlled (which may be referred to as a first electronic device) may be the same as or different from the electronic device performing the gesture recognition method according to the embodiments of the present disclosure (which may be referred to as a second electronic device). In a case where the first electronic device and the second electronic device are the same device, the device may be a terminal such as a smart television or a smart phone; human body detection, hand detection, and gesture recognition are realized in the same device, and corresponding operations, such as channel switching and remote photographing, are performed according to the recognized gestures. In a case where the first electronic device and the second electronic device are different devices, the first electronic device may be a terminal such as a smart television or a smart phone, and the second electronic device may be any terminal or server; human body detection, hand detection, and gesture recognition are realized by the second electronic device, which controls the first electronic device to perform the corresponding operation according to the recognized gestures. The embodiments of the present disclosure are not limited in this regard.
In one possible implementation, the video capture device in the present disclosure may be an apparatus belonging to the first electronic device or the second electronic device, or may be a device independent of both. The embodiments of the present disclosure are not limited with respect to the relationship between the video capture device and the electronic device being controlled.
In one possible implementation, in a case where the valid gesture recognition result of the manipulation object in the current video frame is the same as that in the previous video frame, the operation performed by the electronic device may be kept unchanged, or the same operation (for example, switching the channel again) may be executed again; in a case where the valid gesture recognition result of the manipulation object in the current video frame differs from that in the previous video frame, the electronic device may be controlled to execute the operation corresponding to the valid gesture recognition result in the current video frame.
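The per-frame control policy above can be sketched as follows; whether a repeated valid result re-executes the operation or leaves the device unchanged is made configurable, matching the two options described. Names and result representation (None for an invalid result) are illustrative assumptions.

```python
def decide_action(current_result, previous_result, repeat_on_same=False):
    """Return the operation to perform for the current video frame, or
    None to leave the electronic device's state unchanged.

    current_result / previous_result: valid gesture results (e.g. an
    operation name), or None for an invalid result.
    """
    if current_result is None:
        return None  # invalid result: no operation
    if current_result == previous_result:
        # Same valid result as the previous frame: either keep the
        # current operation unchanged, or execute it again.
        return current_result if repeat_on_same else None
    return current_result  # new valid result: execute its operation
```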
According to the embodiment of the disclosure, the electronic equipment can be remotely controlled through gestures, so that the accuracy of gesture recognition can be effectively improved and the user experience is improved in a multi-user scene.
In one possible implementation, the video may include video frames based on a time sequence, and video frames subsequent to a first video frame may be determined based on the time sequence. It is understood that the video frames following the first video frame may be selected according to a certain time interval, frame interval, etc., for example, one video frame may be selected every 2 frames, or one video frame may be selected every 0.2 seconds, or all the video frames may be selected sequentially. The embodiments of the present disclosure do not limit the selection of the video frame after the first video frame.
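Frame selection by frame interval can be sketched as below; interpreting "one video frame every 2 frames" as selecting every third frame is an assumption for illustration, and time-interval sampling would follow the same pattern over timestamps.

```python
def sample_frames(frame_indices, step=3):
    """Select one frame, then skip step-1 frames, and so on.
    step=1 selects all frames in sequence."""
    return frame_indices[::step]

sample_frames(list(range(10)))          # every third frame
sample_frames(list(range(10)), step=1)  # all frames in sequence
```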
In one possible implementation, the video may include a second video frame following the first video frame. In a case where the first gesture recognition result includes an invalid gesture recognition result, in step S13, recognizing gestures in the video in the gesture recognition manner corresponding to the number of first objects to obtain a gesture recognition result may further include:
respectively acquiring a second human body area and a second hand area of a second object in a second video frame;
and under the condition that the position relation between the second hand area and the second human body area meets the preset position condition, performing gesture recognition on the second hand area to obtain a second gesture recognition result.
In one possible implementation, as described above, the gesture recognition result may include one of a valid gesture recognition result and an invalid gesture recognition result; the invalid gesture recognition result may include a result that does not match the preset gesture. The first gesture recognition result includes an invalid gesture recognition result, and it can be understood that the recognized gesture of the manipulation object does not match the preset gesture.
In one possible implementation, in a case where the positional relationship between the second hand region and the second human body region satisfies the preset position condition, it can be understood that the second object is still in the hand-up state in the second video frame. In other words, in a case where the gesture of the second object in the first video frame does not match the preset gesture and the second object is still in the hand-up state in the second video frame, gesture recognition is performed on the hand region of the second object in the second video frame.
In one possible implementation, in the second video frame, the second human body region may be determined according to the human body detection result. The second hand region may be determined based on the hand detection result. The human body detection result and the hand detection result may be determined by using the human body detection method and the hand detection method disclosed in the embodiments of the present disclosure, and the embodiments of the present disclosure are not limited to this.
In a possible implementation manner, the second video frame may be a continuous video frame after the first video frame, or may be a video frame after the first video frame and spaced by a certain number of frames or a certain time, which is not limited in this embodiment of the disclosure.
In the embodiment of the disclosure, the gesture tracking of the control object can be realized, the gesture recognition can be performed more specifically, and the times of switching the control object are reduced, so that the processing efficiency is improved.
As described above, when an object expects to manipulate the electronic device through a gesture, the control gesture is usually issued in the hand-up state. The second object may be in either a hand-up state or a non-hand-up state. In a case where the second object is in the non-hand-up state, it may be considered that the second object is not attempting to manipulate the electronic device through gestures.
In one possible implementation, the human body regions of the objects may include a third human body region of a third object, and the hand regions of the objects may include a third hand region of the third object. In step S13, recognizing gestures in the video in the gesture recognition manner corresponding to the number of first objects to obtain a gesture recognition result may further include:
determining a third object from the first objects in a case where the positional relationship between the first hand region and the first human body region does not satisfy the preset position condition, where the third human body region is located in a second preset region;
and under the condition that the position relation between the third hand area and the third human body area meets the preset position condition, performing gesture recognition on the third hand area to obtain a third gesture recognition result.
In one possible implementation, the third object may be an object other than the second object in the first object. The second preset area may be an area in the video frame for determining the third object.
In a possible implementation, the second preset area may partially overlap with the first preset area, or the second preset area may be adjacent to the first preset area. The overlapping degree between the second preset area and the first preset area can be set according to actual requirements, and the embodiment of the present disclosure is not limited.
In one possible implementation, the first preset region may include one or more regions in a video frame, and the second preset region may include one or more regions in the video frame. The range, shape, number and the like of the second preset area may be set according to actual requirements, and the embodiment of the present disclosure is not limited.
Fig. 3 shows a schematic diagram of a preset area according to an embodiment of the present disclosure. As shown in fig. 3, the area a may be a first predetermined area, and the areas B and C may be second predetermined areas.
In one possible implementation, determining the third object from the first objects in a case where the positional relationship between the first hand region and the first human body region does not satisfy the preset position condition can be understood as follows: in a case where the second object is in the non-hand-up state in the first video frame, a third object located in the second preset region is determined from the first objects. In this way, when the second object does not raise a hand, other objects can be searched for gesture recognition, which is closer to gesture recognition situations in actual scenes and improves the effectiveness of gesture recognition.
In one possible implementation, determining the third object from the first objects may include: determining, among the objects whose human body regions are located in the second preset region, the object whose human body region is closest to the first human body region as the third object; or determining, among those objects, the object whose human body region has the highest degree of overlap with the first human body region as the third object. The embodiments of the present disclosure are not limited in this respect.
In one possible implementation, a coordinate system may be established based on the video frame, and the distance between the human body region of each object located in the second preset region and the first human body region may be determined according to the coordinates of the center points or boundary points of the human body regions, so that the object corresponding to the closest human body region is determined as the third object.
In a possible implementation manner, the third human body area is located in the second preset area, and may be that all areas of the third human body area are located in the second preset area, or that a partial area of the third human body area is located in the second preset area. It is understood that, in the case that there is an overlapping area between the third human body region and the second preset region, it is determined that the third human body region is located in the second preset region.
As described above, the preset position condition may be a condition that is set in advance to determine whether or not to perform gesture recognition based on the positional relationship between the hand region and the human body region.
In one possible implementation, performing gesture recognition on the third hand region to obtain a third gesture recognition result in a case where the positional relationship between the third hand region and the third human body region satisfies the preset position condition can be understood as performing gesture recognition on the third hand region of the third object while the third object is in the hand-up state, so that the recognized gesture is more likely to be effective and misoperation is reduced.
In the embodiment of the disclosure, the third object can be determined when the second object is not lifted, and under the condition that the third object is lifted, gesture recognition is performed on the hand region of the third object, so that an effective control object is determined in a multi-person scene, the effectiveness of gesture recognition is improved, and misoperation is reduced.
It can be understood that when an object desires to manipulate the electronic device, the object will generally be in a central region with respect to the field angle of the electronic device or the video capture device; that is, there is a high probability that an object in the central region of the video frame is the manipulation object. In one possible implementation, in a case where the first preset region is the central region, the probability that the second object in the first preset region is the manipulation object may be considered the greatest, and the probability that the third object in the second preset region is the manipulation object may be considered the next greatest.
In one possible implementation, the video may include a second video frame following the first video frame, and a third video frame following the second video frame;
after obtaining the third gesture recognition result, the gesture recognition method may further include:
and performing gesture recognition on the fourth hand region to obtain a fourth gesture recognition result in response to that the position relation between the fourth hand region of the second object and the fourth human body region of the second object in the third video frame meets a preset position condition.
In one possible implementation, as described above, the second video frame may be a video frame that is consecutive after the first video frame, or a video frame that is separated by a certain number of frames or a certain time after the first video frame; correspondingly, the third video frame may be a continuous video frame after the second video frame, or may be a video frame after the second video frame and spaced by a certain number of frames or a certain time, which is not limited in this embodiment of the disclosure.
It is to be appreciated that the third gesture recognition result of the third object can include one of a valid gesture recognition result and an invalid gesture recognition result.
In one possible implementation, performing gesture recognition on the fourth hand region in response to the positional relationship between the fourth hand region of the second object and the fourth human body region of the second object in the third video frame satisfying the preset position condition can be understood as follows: while gesture recognition is being performed on the hand region of the third object in the second video frame, it is detected that the second object is in the hand-up state in the third video frame; since the probability that the second object is the manipulation object is greater, gesture recognition may be switched back to the hand region of the second object.
In one possible implementation, in the third video frame, the positional relationship between the fourth hand region of the second object and the fourth human body region of the second object may not satisfy the preset position condition. In that case, the positional relationship between a fifth hand region and a fifth human body region of the third object may continue to be determined, and in a case where the preset position condition is satisfied, gesture recognition may be performed on the fifth hand region to obtain a fifth gesture recognition result.
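The switching logic across the second and third video frames can be sketched as a simple priority rule: prefer the second object when it raises a hand, otherwise fall back to the third object. The labels and function name are illustrative assumptions.

```python
def choose_recognition_target(second_hand_up, third_hand_up):
    """Return which object's hand region to recognize, or None if
    neither the second nor the third object is in the hand-up state."""
    if second_hand_up:
        return "second"  # higher probability of being the manipulation object
    if third_hand_up:
        return "third"
    return None
```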
In the embodiment of the disclosure, the hand region of the second object can be switched to perform gesture recognition when the second object in the first preset region is detected to be in the hand raising state, so that the control object with higher probability can be determined more effectively, and the accuracy of gesture recognition is improved.
As described above, the probability that the object in the central region in the video frame is the manipulation object is large. In one possible implementation, the first preset region may include a central region of a video frame of the video; the determining the second object from the first object based on the position relationship between the human body region and the first preset region may include:
determining a human body region with the smallest distance from the first preset region among the plurality of human body regions as a first human body region under the condition that the first preset region comprises the plurality of human body regions; and determining the object corresponding to the first human body region as a second object.
It is understood that the first predetermined area may include one body region or a plurality of body regions. In case that the first preset region includes one body region, the body region may be directly determined as the first body region.
In one possible implementation, the distance between the human body region and the first preset region may be the distance between the center point of the human body region and the center point of the first preset region; alternatively, it may be the distance between the center point of the human body region and the center line of the first preset region, or the distance between a boundary point of the human body region and the center line of the first preset region. The embodiments of the present disclosure are not limited in this respect.
It should be noted that, although the form of the distance between the human body region and the first preset region as above is described by taking the center point, the center line and the boundary point as an example, the present disclosure should not be limited thereto. In fact, the user can set the form of the distance between the human body region and the first preset region completely according to actual requirements, and the embodiment of the present disclosure is not limited.
In a possible implementation manner, the distance between the plurality of human body regions and the first preset region may be determined by coordinates according to establishing a coordinate system based on the video frame. For example, the distance and the like may be determined by center point coordinates of the plurality of human body regions and center point coordinates of the first preset region. The disclosed embodiments are not limited as to how to determine the distances between the plurality of human body regions and the first preset region.
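Using center-point coordinates in a coordinate system established on the video frame, the smallest-distance selection above can be sketched as follows. Center-point distance is only one of the forms listed above; the function names are illustrative.

```python
import math

def center(box):
    """Center point of an axis-aligned box (x1, y1, x2, y2)."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def nearest_to_preset(body_regions, preset_region):
    """Index of the body region whose center point is closest to the
    center point of the first preset region."""
    target = center(preset_region)
    return min(range(len(body_regions)),
               key=lambda i: math.dist(center(body_regions[i]), target))
```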
In the embodiment of the present disclosure, the second object is determined based on the distances between the plurality of human body regions and the first preset region. In a case where the first preset region includes a plurality of human body regions, the object closest to the middle can thus be effectively determined and used as the manipulation object, which is closer to actual application scenes and can improve the accuracy of gesture recognition.
It is known that a human body usually includes two hands. If the first hand region includes two hand regions and the preset gesture includes a two-handed gesture (for example, the two hands forming a heart shape), gesture recognition may be performed directly on both first hand regions; if the first hand region includes two hand regions and the preset gesture includes only a one-handed gesture, one of the hand regions may be selected.
In one possible implementation, the gesture recognition method may further include: in a case where the preset gesture is a one-handed gesture, determining one of the two hand regions included in the first hand region as the first hand region to be recognized.
In one possible implementation, a default hand (the right hand or the left hand) may be set. Then, in a case where the first hand region includes two hand regions and the preset gesture is a one-handed gesture, the hand region of the default left or right hand may be determined as the first hand region.
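The default-hand selection can be sketched as below; representing the detected hands as a dict keyed 'left'/'right' and defaulting to the right hand are assumptions for illustration.

```python
def select_hand_region(hand_regions, preferred="right"):
    """hand_regions: dict mapping 'left'/'right' to hand-region boxes.

    Returns the single region to use when the preset gesture is
    one-handed: the only detected hand, or the preferred default hand."""
    if len(hand_regions) == 1:
        return next(iter(hand_regions.values()))
    return hand_regions[preferred]
```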
In one possible implementation, the identity of the second object may be determined through identity authentication, face recognition, or other means, so that the operation habits of the second object can be determined based on user data, and the hand region of the left hand or the right hand can then be selected for gesture recognition. For example, a right-handed person is used to raising the right hand for operation, while a left-handed person is used to raising the left hand; by recording the operation habits of an object, the hand region can be recognized in a more targeted manner.
In the embodiment of the present disclosure, the effectiveness of gesture recognition can be further improved by determining one of the two hand regions as the hand region to be recognized.
As described above, the preset position condition may be a condition that is set in advance to determine whether or not to perform gesture recognition based on the positional relationship between the hand region and the human body region. According to whether the position relation between the hand area and the human body area meets the preset position condition or not, whether the object is in a hand lifting state or not can be determined.
In one possible implementation manner, the preset position condition may include: a first height difference between a hand region height and a crotch region height of the target object, greater than or equal to a height threshold; the height threshold may be positively correlated with a second height difference, the second height difference being a height difference between a shoulder region height and a crotch region height of a target object, the target object including at least one of a second object and a third object.
In one possible implementation, the first height difference between the hand region height and the crotch region height of the target object may be a height difference between the hand position and the crotch position of the target object. The height difference between the shoulder region height and the crotch region height of the target object may be a height difference between the shoulder position and the crotch position.
As described above, by the human body key point detection, key points of the human body joint portion can be determined. In one possible implementation, by establishing a coordinate system based on video frames, the key point coordinates of each human joint part can be known, so that the positions of the hand region, the crotch region and the shoulder region can be determined. From the determined positions of the hand region, the crotch region and the shoulder region, a first height difference between the height of the hand region and the height of the crotch region of the target object and a height difference between the height of the shoulder region and the height of the crotch region of the target object can be determined.
FIG. 4 shows a schematic diagram of a human body keypoint, according to an embodiment of the disclosure. Numerals 0 to 13 in fig. 4 represent key points of detected joint portions of a human body, for example, 0 represents a head, 1 represents a neck, 2 and 5 represent shoulders, 3 and 6 represent elbows, 4 and 7 represent hands, 8 and 11 represent a crotch, 9 and 12 represent knees, and 10 and 13 represent feet, respectively.
Taking the human body key points shown in fig. 4 as an example, how to determine the first height difference and the second height difference is explained below. In a case where the hand region of the hand 7 is taken as the first hand region, that is, gesture recognition is performed on the hand 7, the position of the hand 7 may be (x7, y7) and the hand region height may be y7; the position of the shoulder 5 may be (x5, y5) and the shoulder region height may be y5; the position of the crotch 11 may be (x11, y11) and the crotch region height may be y11.
After the hand region height, the shoulder region height, and the crotch region height are respectively determined, the first height difference between the hand region height and the crotch region height may be (y7 - y11), and the second height difference between the shoulder region height and the crotch region height may be (y5 - y11).
It should be noted that although the manner of determining the crotch region height, the shoulder region height, the hand region height, the first height difference, and the second height difference as above is described by taking the human body key points as an example as shown in fig. 4, the present disclosure should not be limited thereto. In fact, the user can set the manner of determining the crotch region height, the shoulder region height, the hand region height, the first height difference and the second height difference according to actual requirements, and the embodiment of the disclosure is not limited.
In a possible implementation manner, the height threshold may be set according to actual requirements; for example, the height threshold may be 1/3 of the second height difference, or 1/2 of the second height difference, which is not limited in the embodiments of the disclosure.
In a possible implementation manner, when the first height difference is greater than or equal to the height threshold, it may be determined that the position relationship between the hand region and the human body region satisfies the preset position condition, that is, the target object may be considered to be in a hand-raised state at this time; when the first height difference is smaller than the height threshold, it may be determined that the position relationship between the hand region and the human body region does not satisfy the preset position condition, that is, the target object may be considered to be in a non-hand-raised state at this time.
Taking the human body key points shown in fig. 4 as an example, if the height threshold is set to 1/3 of the second height difference, that is, the height threshold is [1/3 × (y5 - y11)], then in the case of (y7 - y11) ≥ [1/3 × (y5 - y11)], the preset position condition between the hand region and the human body region can be considered to be satisfied; otherwise, the preset position condition is not satisfied.
In the embodiment of the disclosure, by setting the preset position condition, the hand region is recognized only when the hand of the object is lifted to a certain height, so that the recognition of hand regions that do not require a response can be reduced, and the gesture recognition efficiency can be improved.
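As an illustration, the hand-raise condition described above can be sketched in a few lines of Python. The keypoint indices follow the layout of fig. 4 and the 1/3 ratio is the example threshold; the keypoint container and a coordinate system in which y increases with height are assumptions made for illustration, not the patent's actual implementation.

```python
# Keypoint indices as in the Fig. 4 example (assumed layout):
# shoulder 5, hand 7, crotch 11.
SHOULDER, HAND, CROTCH = 5, 7, 11

def is_hand_raised(keypoints, ratio=1/3):
    """Return True if the preset position condition is met: the first
    height difference (hand vs. crotch) is at least `ratio` times the
    second height difference (shoulder vs. crotch).

    `keypoints` maps a keypoint index to an (x, y) pair in a
    coordinate system where y grows upward (an assumption here)."""
    y_hand = keypoints[HAND][1]
    y_shoulder = keypoints[SHOULDER][1]
    y_crotch = keypoints[CROTCH][1]
    first_diff = y_hand - y_crotch                       # (y7 - y11)
    height_threshold = ratio * (y_shoulder - y_crotch)   # 1/3 * (y5 - y11)
    return first_diff >= height_threshold
```

With crotch at y = 100 and shoulder at y = 160, the threshold is 20, so a hand at y = 130 satisfies the condition while a hand at y = 105 does not.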
FIG. 5 shows a schematic diagram of a gesture recognition process according to an embodiment of the present disclosure. As shown in fig. 5, a video to be recognized is input, and human body detection and hand detection are performed on the input video in time-sequence order.
for a first video frame in an input video, directly determining the first video frame as a second object when the first object in the first video frame is one; recognizing the gesture of the second object under the condition that the preset position condition (namely the second object lifts the hand) is met between the hand area and the human body area of the second object;
under the condition that a plurality of first objects are in the first video frame, determining a second object according to the position relation between the human body area and the first preset area;
recognizing the gesture of the second object under the condition that the preset position condition (namely the second object lifts the hand) between the hand area and the human body area of the second object is met;
under the condition that the preset position condition is not met between the hand area and the human body area of the second object (namely the second object does not lift the hand), determining a third object according to the position relation between the human body area of each first object and the human body area of the second object;
and under the condition that the preset position condition (namely the third object lifts the hand) between the hand area and the human body area of the third object is met, recognizing the gesture of the third object.
For video frames after the first video frame, in the case that the gesture recognized in the first video frame is a gesture of the second object, the hand region of the second object is tracked, and the gesture of the second object is recognized whenever the second object raises a hand;
in the case that the gesture recognized in the first video frame is a gesture of the third object, whether the preset position condition is met between the hand region and the human body region of the second object is judged (that is, whether the second object has raised a hand);
in the case that the preset position condition is met between the hand region and the human body region of the second object (that is, the second object raises a hand), recognition is switched to the gesture of the second object;
and in the case that the preset position condition is not met between the hand region and the human body region of the second object (that is, the second object still does not raise a hand) while the preset position condition is met between the hand region and the human body region of the third object (that is, the third object still raises a hand), the gesture of the third object is recognized.
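The per-frame selection logic of fig. 5 can be sketched, under simplifying assumptions, as a single selection function. The dictionary keys (`in_center`, `near_center`, `raised`) and the fallback order are illustrative stand-ins for the human body detection, preset-region tests, and hand tracking described above, not the disclosure's exact implementation.

```python
def select_gesture_target(objects, current_target=None):
    """Pick whose gesture to recognize in one frame.

    objects: list of dicts with keys 'in_center' (body region lies in
    the first preset region), 'near_center' (second preset region) and
    'raised' (preset position condition met).
    current_target: index of a third object whose gesture was
    recognized in the previous frame, if any."""
    if len(objects) == 1:
        second = 0  # a single first object is directly the second object
    else:
        centered = [i for i, o in enumerate(objects) if o['in_center']]
        if not centered:
            return None
        second = centered[0]

    # The second object always has priority when its hand is raised.
    if objects[second]['raised']:
        return second

    # Otherwise fall back to a third object in the second preset region.
    if current_target is not None and objects[current_target]['raised']:
        return current_target
    for i, o in enumerate(objects):
        if i != second and o.get('near_center') and o['raised']:
            return i
    return None
```

Passing the previous frame's target as `current_target` reproduces the switching behavior: the function returns the second object as soon as that object raises a hand, even while a third object is being tracked.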
According to the embodiment of the disclosure, an effective control object can be accurately determined from a plurality of objects, and the gesture made by the control object that requires a response can be recognized accurately and efficiently.
According to the gesture recognition method of the embodiment of the disclosure, a middle-person priority strategy is adopted: a person in the middle can be determined from the plurality of persons detected in a video frame, and the dominant hand of that middle person is determined; gesture recognition is performed when the dominant hand of the middle person is raised. In this way, the controlling person can be accurately determined and the gestures that require a response can be recognized, the recognition of gestures of non-operators and of gestures that do not require a response is reduced, misoperations caused by erroneous recognition are reduced, and the efficiency and accuracy of gesture recognition are improved.
In the related art, recognition is usually performed at the level of individual hands: in a scene with multiple people and multiple hands, the gesture with the highest confidence is preferentially detected, and, lacking detection logic for multiple people and multiple hands, the gesture of the middle person cannot be correctly distinguished, so that recognition efficiency is low and misoperations are easily triggered. According to the gesture recognition method of the embodiment of the disclosure, the hands of the middle person can be preferentially recognized in a multi-person scene, and detection and tracking are performed continuously without losing the target. Compared with the related art, the method of the embodiment of the disclosure can reduce the computation of the detection and recognition algorithm, improve processing performance, and is more targeted, better fitting the usage conditions of actual scenes.
The gesture recognition method of the present disclosure can be applied to remote gesture recognition scenarios, in which hardware devices equipped with a camera, such as a television, an air conditioner, or a refrigerator, are intelligently controlled through gestures. For example, a television may have a built-in or external intelligent camera module in which an artificial intelligence (AI) human-hand detection and gesture recognition algorithm is deployed; a gesture recognition result is obtained by the gesture recognition method of the embodiment of the disclosure, the television is controlled according to the gesture result, and, for example, an automatic photographing function is triggered by the victory gesture of fig. 2 b.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principles and logic; due to space limitations, these are not described in detail in the present disclosure. Those skilled in the art will appreciate that, in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides a gesture recognition apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any one of the gesture recognition methods provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here.
Fig. 6 illustrates a block diagram of a gesture recognition apparatus according to an embodiment of the present disclosure, as illustrated in fig. 6, the apparatus including:
an obtaining module 101, configured to obtain a video to be identified;
the detection module 102 is configured to perform human body detection on the video to obtain the number of first objects included in the video;
the recognition module 103 is configured to recognize the gestures in the video according to a gesture recognition manner corresponding to the number of the first objects, so as to obtain a gesture recognition result.
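The three-module decomposition of fig. 6 can be illustrated with a minimal sketch. The class name and the callable signatures are hypothetical placeholders for the obtaining, detection, and recognition modules, not an API defined by the disclosure.

```python
class GestureRecognitionApparatus:
    """Sketch of the apparatus of Fig. 6: obtaining module 101,
    detection module 102, recognition module 103."""

    def __init__(self, detector, recognizer):
        self.detector = detector      # human body detection (module 102)
        self.recognizer = recognizer  # number-dependent recognition (module 103)

    def acquire(self, source):
        # Obtaining module 101: acquire the video (frames) to be recognized.
        return list(source)

    def run(self, source):
        frames = self.acquire(source)
        num_objects = self.detector(frames)           # count of first objects
        return self.recognizer(frames, num_objects)   # mode depends on the count
```

In use, `detector` would wrap a human body detector returning the number of first objects, and `recognizer` would dispatch to the single-person or multi-person gesture recognition manner accordingly.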
In one possible implementation, the number of the first objects is greater than or equal to two, and the video includes a first video frame; the identification module 103 includes: a first obtaining sub-module, configured to obtain, in the first video frame, a human body region and a hand region of each of the first objects respectively; a first determining sub-module, configured to determine a second object from the first objects based on a position relationship between the body region and a first preset region, where the body region of each object includes a first body region of the second object, the hand region of each object includes a first hand region of the second object, and the first body region is located in the first preset region; the first recognition submodule is used for performing gesture recognition on the first hand region under the condition that the position relation between the first hand region and the first human body region meets a preset position condition to obtain a first gesture recognition result.
In one possible implementation, the first gesture recognition result includes one of a valid gesture recognition result and an invalid gesture recognition result.
In one possible implementation, the video includes a second video frame that follows the first video frame; in a case where the first gesture recognition result includes an invalid gesture recognition result, the recognition module 103 further includes: a second obtaining sub-module, configured to obtain a second human body region and a second hand region of the second object in the second video frame, respectively; and the second recognition submodule is used for performing gesture recognition on the second hand region to obtain a second gesture recognition result under the condition that the position relation between the second hand region and the second human body region meets the preset position condition.
In one possible implementation, the body region of each object includes a third body region of a third object, and the hand region of each object includes a third hand region of the third object; the identification module 103 further includes: a second determination sub-module configured to determine the third object from the first object when a positional relationship between the first hand region and the first human body region does not satisfy the preset positional condition, the third human body region being located in a second preset region; and the third recognition submodule is used for performing gesture recognition on the third hand area under the condition that the position relation between the third hand area and the third human body area meets the preset position condition to obtain a third gesture recognition result.
In a possible implementation manner, the second preset area is partially overlapped with the first preset area, or the second preset area is adjacent to the first preset area.
In one possible implementation, the video includes a second video frame following the first video frame, and a third video frame following the second video frame; after the obtaining of the third gesture recognition result, the apparatus further includes: and the fourth recognition submodule is used for performing gesture recognition on a fourth hand area of the second object in response to the fact that the position relation between the fourth hand area of the second object and the fourth human body area of the second object in the third video frame meets the preset position condition to obtain a fourth gesture recognition result.
In one possible implementation, the first preset area includes a central area of a video frame of the video; the first determination submodule includes: a human body region determining unit, configured to determine, as the first human body region, a human body region of the plurality of human body regions having a smallest distance to the first preset region, if the first preset region includes the plurality of human body regions; and the object determining unit is used for determining the object corresponding to the first human body area as the second object.
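The nearest-to-center rule of the human body region determining unit can be sketched as follows, assuming axis-aligned bounding boxes `(x1, y1, x2, y2)` and a first preset region represented by its center point; both representations are assumptions made for illustration.

```python
def nearest_body_region(body_regions, preset_center):
    """Return the index of the body region whose center is closest to
    the center of the first preset region; the object owning that
    region would then be determined as the second object."""
    def distance(box):
        cx = (box[0] + box[2]) / 2
        cy = (box[1] + box[3]) / 2
        return ((cx - preset_center[0]) ** 2 +
                (cy - preset_center[1]) ** 2) ** 0.5
    return min(range(len(body_regions)), key=lambda i: distance(body_regions[i]))
```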
In one possible implementation, the apparatus further includes: a hand region determination module, configured to determine one hand region of the first hand region as the first hand region in a case where the first hand region includes two hand regions and the preset gesture is a one-handed gesture.
In one possible implementation manner, the preset position condition includes: a first height difference between a hand region height and a crotch region height of a target object, greater than or equal to a height threshold, the height threshold being positively correlated with a second height difference, the second height difference being a height difference between a shoulder region height and the crotch region height of the target object, the target object including at least one of the second object and a third object.
In one possible implementation, the apparatus further includes: and the control module is used for controlling the electronic equipment to execute the operation corresponding to the effective gesture recognition result under the condition that the gesture recognition result is the effective gesture recognition result.
In the embodiment of the disclosure, human body detection is performed on the video to obtain the number of first objects included in the video, and the gestures in the video are recognized according to the gesture recognition manner corresponding to the number of first objects to obtain a gesture recognition result. A corresponding gesture recognition manner can thus be selected according to the number of people in the video, improving the gesture recognition effect in multi-person scenes.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
Embodiments of the present disclosure also provide a computer program product, which includes computer readable code, and when the computer readable code runs on a device, a processor in the device executes instructions for implementing a gesture recognition method provided in any of the above embodiments.
The embodiments of the present disclosure also provide another computer program product for storing computer readable instructions, which when executed cause a computer to perform the operations of the gesture recognition method provided in any of the above embodiments.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 7 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like terminal.
Referring to fig. 7, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 8 illustrates a block diagram of an electronic device 1900 in accordance with an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 8, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), Apple's graphical user interface-based operating system (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open source Unix-like operating system (Linux™), the open source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry that can execute the computer-readable program instructions, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), implements aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK) or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. A method of gesture recognition, the method comprising:
acquiring a video to be recognized;
performing human body detection on the video to obtain the number of first objects included in the video; and
recognizing gestures in the video according to a gesture recognition mode corresponding to the number of the first objects to obtain a gesture recognition result;
the method for recognizing the gestures in the video includes the steps that when the number of the first objects is greater than or equal to two, the gestures in the video are recognized according to a gesture recognition mode corresponding to the number of the first objects, and a gesture recognition result is obtained, and the method includes the steps of:
respectively acquiring a human body area and a hand area of each object in the first video frame;
determining a second object from the first object based on a position relationship between the human body area and a first preset area, wherein the human body area of each object comprises a first human body area of the second object, the hand area of each object comprises a first hand area of the second object, and the first human body area is located in the first preset area;
and under the condition that the position relation between the first hand region and the first human body region meets a preset position condition, performing gesture recognition on the first hand region to obtain a first gesture recognition result, wherein the preset position condition comprises a preset condition for judging whether to perform gesture recognition or not based on the position relation between the hand region and the human body region.
2. The method of claim 1, wherein the first gesture recognition result comprises one of a valid gesture recognition result and an invalid gesture recognition result.
3. The method of claim 1 or 2, wherein the video comprises a second video frame following the first video frame;
in a case where the first gesture recognition result comprises an invalid gesture recognition result, the recognizing the gestures in the video according to the gesture recognition mode corresponding to the number of the first objects to obtain the gesture recognition result further comprises:
acquiring, in the second video frame, a second human body region and a second hand region of the second object; and
in a case where a position relationship between the second hand region and the second human body region satisfies the preset position condition, performing gesture recognition on the second hand region to obtain a second gesture recognition result.
4. The method of any one of claims 1 to 3, wherein the human body region of each object comprises a third human body region of a third object, and the hand region of each object comprises a third hand region of the third object;
the recognizing the gestures in the video according to the gesture recognition mode corresponding to the number of the first objects to obtain the gesture recognition result further comprises:
determining the third object from the first objects in a case where the position relationship between the first hand region and the first human body region does not satisfy the preset position condition, wherein the third human body region is located in a second preset region; and
in a case where a position relationship between the third hand region and the third human body region satisfies the preset position condition, performing gesture recognition on the third hand region to obtain a third gesture recognition result.
5. The method of claim 4, wherein the second predetermined area partially overlaps the first predetermined area or is adjacent to the first predetermined area.
6. The method according to claim 4 or 5, wherein the video comprises a second video frame following the first video frame, and a third video frame following the second video frame;
after the obtaining of the third gesture recognition result, the method further includes:
in response to a position relationship between a fourth hand region of the second object and a fourth human body region of the second object in the third video frame satisfying the preset position condition, performing gesture recognition on the fourth hand region to obtain a fourth gesture recognition result.
7. The method according to any one of claims 1 to 6, wherein the first preset area comprises a central area of a video frame of the video;
the determining the second object from the first objects based on the position relationship between the human body regions and the first preset region comprises:
in a case where the first preset region comprises a plurality of human body regions, determining, among the plurality of human body regions, a human body region having the smallest distance from the first preset region as the first human body region; and
determining the object corresponding to the first human body region as the second object.
8. The method according to any one of claims 1 to 7, further comprising:
in a case where the first hand region comprises two hand regions and a preset gesture is a one-hand gesture, determining one of the two hand regions as the first hand region.
9. The method according to any one of claims 1 to 8, wherein the preset position condition comprises:
a first height difference between a hand region height and a crotch region height of a target object is greater than or equal to a height threshold, wherein the height threshold is positively correlated with a second height difference, the second height difference is the height difference between a shoulder region height and the crotch region height of the target object, and the target object comprises at least one of the second object and a third object.
10. The method according to any one of claims 1 to 9, further comprising:
in a case where the gesture recognition result is a valid gesture recognition result, controlling an electronic device to execute an operation corresponding to the valid gesture recognition result.
11. A gesture recognition apparatus, comprising:
an acquisition module configured to acquire a video to be recognized;
the detection module is used for carrying out human body detection on the video to obtain the number of first objects included in the video;
the recognition module is used for recognizing the gestures in the video according to a gesture recognition mode corresponding to the number of the first objects to obtain a gesture recognition result;
wherein the video comprises a first video frame, and in a case where the number of the first objects is greater than or equal to two, the recognition module comprises: a first obtaining sub-module, configured to acquire, in the first video frame, a human body region and a hand region of each of the first objects; a first determining sub-module, configured to determine a second object from the first objects based on a position relationship between the human body regions and a first preset region, wherein the human body region of each object comprises a first human body region of the second object, the hand region of each object comprises a first hand region of the second object, and the first human body region is located in the first preset region; and a first recognition sub-module, configured to perform gesture recognition on the first hand region to obtain a first gesture recognition result in a case where a position relationship between the first hand region and the first human body region satisfies a preset position condition, wherein the preset position condition comprises a preset condition for determining, based on the position relationship between a hand region and a human body region, whether to perform gesture recognition.
12. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any one of claims 1 to 10.
13. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any one of claims 1 to 10.
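The multi-object control flow recited in claims 1, 7, and 9 (select the body nearest the central preset region, then gate recognition on how far the hand is raised relative to the shoulder-to-crotch span) can be sketched as follows. This is an illustrative reading of the claims, not the patented implementation: the `Body` fields, the 0.5 threshold ratio, and the `recognize_gesture` callback are assumptions supplied for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Body:
    """Per-object detections for one video frame (image y grows downward)."""
    center: Tuple[float, float]  # human-body-region center (x, y)
    hand_y: float                # hand-region height
    shoulder_y: float            # shoulder-region height
    crotch_y: float              # crotch-region height

def satisfies_position_condition(b: Body, ratio: float = 0.5) -> bool:
    """Preset position condition of claim 9: the hand must be raised above the
    crotch by at least a threshold positively correlated with the
    shoulder-to-crotch height difference (ratio 0.5 is an assumption)."""
    second_diff = b.crotch_y - b.shoulder_y   # shoulder/crotch height difference
    first_diff = b.crotch_y - b.hand_y        # how far the hand is raised
    return first_diff >= ratio * second_diff

def select_target(bodies: List[Body],
                  frame_center: Tuple[float, float]) -> Optional[Body]:
    """Claim 7 selection, simplified: among detected bodies, pick the one
    whose region center is closest to the central preset region."""
    if not bodies:
        return None
    cx, cy = frame_center
    return min(bodies,
               key=lambda b: (b.center[0] - cx) ** 2 + (b.center[1] - cy) ** 2)

def recognize_frame(bodies: List[Body],
                    frame_center: Tuple[float, float],
                    recognize_gesture: Callable[[Body], str]) -> Optional[str]:
    """Claim 1 flow for a frame with two or more detected objects: select the
    second object, gate on the preset position condition, then recognize."""
    target = select_target(bodies, frame_center)
    if target is None or not satisfies_position_condition(target):
        return None  # condition not met: no gesture recognition is performed
    return recognize_gesture(target)
```

Per claim 3, a `None` (invalid) result would simply cause the same gating to be re-run on the next video frame rather than aborting recognition.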
CN202011363248.6A 2020-11-27 2020-11-27 Gesture recognition method and device, electronic equipment and storage medium Active CN112328090B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011363248.6A CN112328090B (en) 2020-11-27 2020-11-27 Gesture recognition method and device, electronic equipment and storage medium
PCT/CN2021/086967 WO2022110614A1 (en) 2020-11-27 2021-04-13 Gesture recognition method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011363248.6A CN112328090B (en) 2020-11-27 2020-11-27 Gesture recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112328090A CN112328090A (en) 2021-02-05
CN112328090B true CN112328090B (en) 2023-01-31

Family

ID=74308350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011363248.6A Active CN112328090B (en) 2020-11-27 2020-11-27 Gesture recognition method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112328090B (en)
WO (1) WO2022110614A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328090B (en) * 2020-11-27 2023-01-31 北京市商汤科技开发有限公司 Gesture recognition method and device, electronic equipment and storage medium
CN113031464B (en) * 2021-03-22 2022-11-22 北京市商汤科技开发有限公司 Device control method, device, electronic device and storage medium
CN112987933A (en) * 2021-03-25 2021-06-18 北京市商汤科技开发有限公司 Device control method, device, electronic device and storage medium
CN113946216A (en) * 2021-10-18 2022-01-18 阿里云计算有限公司 Man-machine interaction method, intelligent device, storage medium and program product
CN114167980B (en) * 2021-11-18 2024-05-07 深圳市鸿合创新信息技术有限责任公司 Gesture processing method, gesture processing device, electronic equipment and readable storage medium
US20230196771A1 (en) * 2021-12-22 2023-06-22 At&T Intellectual Property I, L.P. Detecting and sharing events of interest using panoptic computer vision systems
CN114463781A (en) * 2022-01-18 2022-05-10 影石创新科技股份有限公司 Method, device and equipment for determining trigger gesture
CN114546114A (en) * 2022-02-15 2022-05-27 美的集团(上海)有限公司 Control method and control device for mobile robot and mobile robot

Citations (6)

Publication number Priority date Publication date Assignee Title
WO2013038734A1 (en) * 2011-09-15 2013-03-21 オムロン株式会社 Gesture recognition device, electronic apparatus, gesture recognition device control method, control program, and recording medium
CN108304817A (en) * 2018-02-09 2018-07-20 深圳市沃特沃德股份有限公司 The method and apparatus for realizing gesture operation
CN110490794A (en) * 2019-08-09 2019-11-22 三星电子(中国)研发中心 Character image processing method and processing device based on artificial intelligence
CN110619300A (en) * 2019-09-14 2019-12-27 韶关市启之信息技术有限公司 Correction method for simultaneous recognition of multiple faces
CN110781765A (en) * 2019-09-30 2020-02-11 腾讯科技(深圳)有限公司 Human body posture recognition method, device, equipment and storage medium
CN111062312A (en) * 2019-12-13 2020-04-24 RealMe重庆移动通信有限公司 Gesture recognition method, gesture control method, device, medium and terminal device

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN105843371B * 2015-01-13 2018-11-02 上海速盟信息技术有限公司 Human-machine contactless interaction method and system
CN104750252B * 2015-03-09 2018-02-27 Lenovo (Beijing) Co., Ltd. Information processing method and electronic device
CN106108760B * 2016-08-25 2018-11-06 BOE Technology Group Co., Ltd. Intelligent toilet
CN111209050A (en) * 2020-01-10 2020-05-29 北京百度网讯科技有限公司 Method and device for switching working mode of electronic equipment
CN112328090B (en) * 2020-11-27 2023-01-31 北京市商汤科技开发有限公司 Gesture recognition method and device, electronic equipment and storage medium


Non-Patent Citations (1)

Title
Gesture recognition in multi-person scenes based on Kinect; Meng Xianwei et al.; Computer and Modernization; 2015-11-30 (No. 11); full text *

Also Published As

Publication number Publication date
CN112328090A (en) 2021-02-05
WO2022110614A1 (en) 2022-06-02

Similar Documents

Publication Publication Date Title
CN112328090B (en) Gesture recognition method and device, electronic equipment and storage medium
US20210406523A1 (en) Method and device for detecting living body, electronic device and storage medium
EP3200125B1 (en) Fingerprint template input method and device
US20170123587A1 (en) Method and device for preventing accidental touch of terminal with touch screen
CN106484284B (en) Method and device for switching single-hand mode
CN105488464A (en) Fingerprint identification method and fingerprint identification apparatus
CN110569777B (en) Image processing method and device, electronic device and storage medium
CN110889382A (en) Virtual image rendering method and device, electronic equipment and storage medium
CN112219224B (en) Image processing method and device, electronic equipment and storage medium
KR20150029463A (en) Method, apparatus and recovering medium for controlling user interface using a input image
CN111523485A (en) Pose recognition method and device, electronic equipment and storage medium
CN113807253A (en) Face recognition method and device, electronic equipment and storage medium
CN111988522B (en) Shooting control method and device, electronic equipment and storage medium
CN111783752A (en) Face recognition method and device, electronic equipment and storage medium
CN106990893B (en) Touch screen operation processing method and device
CN114550261A (en) Face recognition method and device, electronic equipment and storage medium
CN114445298A (en) Image processing method and device, electronic equipment and storage medium
CN114445753A (en) Face tracking recognition method and device, electronic equipment and storage medium
CN114565962A (en) Face image processing method and device, electronic equipment and storage medium
CN114519794A (en) Feature point matching method and device, electronic equipment and storage medium
CN113315904B (en) Shooting method, shooting device and storage medium
CN112949568A (en) Method and device for matching human face and human body, electronic equipment and storage medium
CN110544335B (en) Object recognition system and method, electronic device, and storage medium
CN111787215A (en) Shooting method and device, electronic equipment and storage medium
CN111610921A (en) Gesture recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040127

Country of ref document: HK

GR01 Patent grant