CN114924645A - Interaction method and system based on gesture recognition - Google Patents

Interaction method and system based on gesture recognition

Info

Publication number
CN114924645A
Authority
CN
China
Prior art keywords
human
gesture
area
speaker
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210542825.0A
Other languages
Chinese (zh)
Inventor
徐东升
丁为国
朱雷震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhuangsheng Xiaomeng Information Technology Co., Ltd.
Original Assignee
Shanghai Zhuangsheng Xiaomeng Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhuangsheng Xiaomeng Information Technology Co., Ltd.
Priority to CN202210542825.0A
Publication of CN114924645A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an interaction method and system based on gesture recognition. The interaction method comprises the following steps: acquiring a video data stream of a monitoring area, and obtaining image frames from the video data stream; performing human-shape detection on the image frames to determine the human-shaped areas in the image frames; and performing gesture detection on the human-shaped areas, and determining a focusing area according to the gesture detection result. Because the focusing area is determined according to the gesture detection result, the target area in a conference is focused in real time. If the target area is set to the human-shaped area of the speaker or of a participant, the speaker or the participant in the conference can be effectively focused in real time.

Description

Interaction method and system based on gesture recognition
Technical Field
The invention relates to the technical field of artificial intelligence recognition interaction, in particular to an interaction method and system based on gesture recognition.
Background
With the rapid development of artificial intelligence, the field of computer vision has achieved major breakthroughs, and vision algorithms such as face recognition, object detection, and object tracking are widely applied across industries. For conference interaction, intelligent interaction is the trend of future development. The pure-voice and pure-video interaction of a traditional online conference is too monotonous: when the displayed picture contains too much background, the participants cannot be effectively focused, and the speaker cannot be highlighted in a multi-person conference scene.
Therefore, the invention provides an interaction method and system based on gesture recognition, so as to effectively focus on a speaker and participants in a conference.
Disclosure of Invention
The invention provides an interaction method and an interaction system based on gesture recognition, which are used to effectively focus on the speaker and participants in a conference.
In a first aspect, the present invention provides an interaction method based on gesture recognition, including: acquiring a video data stream of a monitoring area, and acquiring an image frame from the video data stream; performing human shape detection on the image frame to determine human shape regions in the image frame; and executing gesture detection on the human-shaped area, and determining a focusing area according to the gesture detection result.
The beneficial effects are that: because the focusing area is determined according to the gesture detection result, the target area in the conference is focused in real time. If the target area is set to the human-shaped area of the speaker or of a participant, the speaker or the participant in the conference can be effectively focused in real time.
Optionally, the performing gesture detection on the human-shaped areas includes: performing gesture detection on the human-shaped areas; if a first gesture is detected, determining that the human-shaped area containing the first gesture is the human-shaped area of the speaker and that the human-shaped areas not containing the first gesture are the human-shaped areas of the participants, and performing real-time focusing processing on the human-shaped area of the speaker; and if a second gesture is detected, performing focusing processing on the human-shaped areas of the participants. The beneficial effects are that: the focusing area can be switched freely according to the first and second gestures made by the speaker, improving the effect of the online conference.
Further optionally, the performing gesture detection on the human-shaped areas further includes: if neither the first gesture nor the second gesture is detected, performing real-time focusing processing on the human-shaped areas of the participants. The beneficial effects are that: performing real-time focusing on the human-shaped areas of the participants protects the participants' privacy and effectively masks useless or distracting information.
Optionally, the performing real-time focusing processing on the human-shaped area of the speaker includes: performing face detection on the human-shaped area of the speaker, and determining the facial features of the speaker according to the face detection result; when the human-shaped area of the speaker is not detected, performing face recognition on the image frame based on the facial features of the speaker; and determining the human-shaped area containing the speaker based on the face recognition result, and performing real-time focusing processing on it. The beneficial effects are that: when human-shape detection of the speaker fails, for example because of occlusion, the human-shaped area of the speaker can be re-determined from the face recognition result and focused in real time, preventing loss of the focusing target.
Further optionally, the performing face recognition on the image frame based on the facial features of the speaker includes: if the face of the speaker is not recognized, exiting the real-time focusing processing on the human-shaped area of the speaker, and performing real-time focusing processing on the human-shaped areas of the participants. The beneficial effects are that: if the face of the speaker is not detected, the speaker may have temporarily left the conference, so focusing is switched to the human-shaped areas of the participants.
Optionally, the interaction method based on gesture recognition further includes: arranging anti-shake areas around the human-shaped areas of the speaker and the participants; if the position of the speaker exceeds the anti-shake area, re-performing human-shape detection on the image frame and re-determining the human-shaped area of the speaker according to the detection result; and if the position of a participant exceeds the anti-shake area, re-performing human-shape detection on the image frame and re-determining the human-shaped area of that participant according to the detection result. The beneficial effects are that: since a person's position may change (for example, lowering the head to take notes, picking up a cup, or suddenly standing up or sitting down), an anti-shake range must be set to keep the human-shaped area within a reasonable range and to reduce the number of human-shape re-detections.
Further optionally, if the position of the speaker does not exceed the anti-shake area, the human-shaped area of the speaker is locked; and if the position of a participant does not exceed the anti-shake area, the human-shaped area of that participant is locked.
Optionally, the performing real-time focusing on the human-shaped area of the speaker includes: performing feature extraction on the human-shaped area of the speaker, and predicting the motion trajectory of the speaker in the next frame based on the detected first gesture, thereby achieving real-time focusing processing on the human-shaped area of the speaker.
In a second aspect, the present invention provides an interaction system based on gesture recognition, configured to perform the interaction method based on gesture recognition according to any one of the first aspect; the system comprises modules/units for performing the method according to any possible design of the first aspect. These modules/units may be implemented by hardware, or by hardware executing corresponding software.
As for the advantageous effects of the above second aspect, reference may be made to the description of the above first aspect.
Drawings
FIG. 1 is a flowchart of an embodiment of an interaction method based on gesture recognition according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of an interactive system based on gesture recognition according to the present invention;
FIG. 3 is a schematic diagram of a screenshot of an online conference provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application. In the description of the embodiments of the present application, the terminology used in the following embodiments is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of the present application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the following embodiments of the present application, "at least one" and "one or more" mean one, two, or more than two. The term "and/or" describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may represent: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather mean "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise. The term "coupled" includes direct coupling and indirect coupling, unless otherwise noted. "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated.
In the embodiments of the present application, the words "exemplary" or "such as" are used herein to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
The present invention provides an interaction method based on gesture recognition, the flow of which is shown in FIG. 1. The method comprises the following steps:
s101: acquiring a video data stream of a monitoring area, and acquiring an image frame from the video data stream;
s102: performing human shape detection on the image frame to determine human shape regions in the image frame;
s103: and executing gesture detection on the human-shaped area, and determining a focusing area according to the gesture detection result.
In a possible embodiment, the performing gesture detection on the human-shaped areas includes: performing gesture detection on the human-shaped areas; if a first gesture is detected, determining that the human-shaped area containing the first gesture is the human-shaped area of the speaker and that the human-shaped areas not containing the first gesture are the human-shaped areas of the participants, and performing real-time focusing processing on the human-shaped area of the speaker; and if a second gesture is detected, performing focusing processing on the human-shaped areas of the participants. In this embodiment, the focusing area can be switched freely according to the first and second gestures made by the speaker, improving the effect of the online conference.
In another possible embodiment, the performing gesture detection on the human-shaped areas further includes: if neither the first gesture nor the second gesture is detected, performing real-time focusing processing on the human-shaped areas of the participants. In this embodiment, real-time focusing on the human-shaped areas of the participants protects the participants' privacy and effectively masks useless or distracting information.
Illustratively, the picture captured by a monocular camera is transmitted to a display to present the video-conference picture, and the image is fed into the human-shape and gesture detection models to recognize the human shapes and gestures in the picture. If no first gesture is found, a dynamic real-time focused picture of the participants is displayed; if the first gesture is found, the picture dynamically focuses on the gesture maker, i.e., the speaker. Only when the speaker makes the second gesture is the speaker focusing mode closed and the display switched back to the dynamic real-time focused picture of the participants; this switching logic is sketched below.
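A two-state summary of that switching rule, reusing the union helper from the sketch above; the Gesture names are invented here for illustration and are not specified by the patent:

```python
from enum import Enum

class Gesture(Enum):
    NONE = 0
    FIRST = 1   # claims the speaker role and opens speaker focusing
    SECOND = 2  # made by the speaker to close speaker focusing

def next_focus(current_focus, gesture, gesture_box, participant_boxes):
    """Return the area to focus next, per the first/second-gesture rules."""
    if gesture == Gesture.FIRST:
        return gesture_box                  # focus the gesture maker, i.e. the speaker
    if gesture == Gesture.SECOND and current_focus == gesture_box:
        return union(participant_boxes)     # only the speaker's second gesture closes
                                            # speaker focusing: back to the participants
    if current_focus is None:
        return union(participant_boxes)     # no gesture yet: dynamic participant view
    return current_focus                    # otherwise keep the current focusing area
```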
In a further possible embodiment, the performing real-time focusing processing on the human-shaped area of the speaker includes: performing face detection on the human-shaped area of the speaker, and determining the facial features of the speaker according to the face detection result; when the human-shaped area of the speaker is not detected, performing face recognition on the image frame based on the facial features of the speaker; and determining the human-shaped area containing the speaker based on the face recognition result, and performing real-time focusing processing on it. In this embodiment, when human-shape detection of the speaker fails due to occlusion or similar causes, the human-shaped area of the speaker can be re-determined from the face recognition result and focused in real time, preventing loss of the focusing target.
In one possible embodiment, the performing face recognition on the image frame based on the facial features of the speaker includes: if the face of the speaker is not recognized, exiting the real-time focusing processing on the human-shaped area of the speaker, and performing real-time focusing processing on the human-shaped areas of the participants. In this embodiment, if the face of the speaker is not detected, the speaker may have temporarily left the conference, so focusing is switched to the participants; a sketch of this fallback logic follows.
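A minimal sketch of this occlusion-and-departure fallback; the detect_speaker_shape and match_face callables and the participants_view argument are hypothetical stand-ins for components the patent leaves unspecified:

```python
def focus_speaker(frame, speaker_face_features, detect_speaker_shape, match_face,
                  participants_view):
    """Track the speaker's area; fall back to face recognition, then to participants."""
    box = detect_speaker_shape(frame)
    if box is not None:
        return box                                  # human-shape detection succeeded
    box = match_face(frame, speaker_face_features)  # shape lost (e.g. occlusion)
    if box is not None:
        return box                                  # speaker re-acquired by face recognition
    return participants_view                        # face not found: speaker likely left,
                                                    # so switch focusing to the participants
```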
In one possible embodiment, anti-shake areas are arranged around the human-shaped area of the speaker and the human-shaped areas of the participants; if the position of the speaker exceeds the anti-shake area, human-shape detection is performed on the image frame again and the human-shaped area of the speaker is re-determined according to the detection result; and if the position of a participant exceeds the anti-shake area, human-shape detection is performed on the image frame again and the human-shaped area of that participant is re-determined according to the detection result. In this embodiment, since a person's position may change (for example, lowering the head to take notes, picking up a cup, or suddenly standing up or sitting down), an anti-shake range must be set to keep the human-shaped area within a reasonable range and to reduce the number of human-shape re-detections.
In yet another possible embodiment, the human-shaped area of the speaker is locked if the position of the speaker does not exceed the anti-shake area; and if the position of the participant does not exceed the anti-shake area, locking the human-shaped area of the participant.
Illustratively, because the position of the detected person changes constantly, the focused picture may jitter slightly, so the target position needs to be controlled within a reasonable range, i.e., an anti-shake range must be set. If the target position detected in the current frame falls outside this range, the newly detected position is adopted; otherwise, the target position of the previous frame is kept. Let the current frame be i, the target position x, and the shake range k; then:
$$x_i = \begin{cases} \hat{x}_i, & |\hat{x}_i - x_{i-1}| > k \\ x_{i-1}, & |\hat{x}_i - x_{i-1}| \le k \end{cases}$$

where $\hat{x}_i$ denotes the position detected in frame $i$.
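In code, the anti-shake rule is a single comparison per tracked quantity. A minimal sketch, assuming a one-dimensional position (the same rule can be applied to each box coordinate):

```python
def stabilize(x_detected, x_prev, k):
    """Anti-shake: hold last frame's position unless the detection moved beyond range k."""
    if x_prev is None or abs(x_detected - x_prev) > k:
        return x_detected   # outside the anti-shake range: adopt the new detection
    return x_prev           # inside the range: keep the previous position (no jitter)
```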
Optionally, redundancy processing may also be performed: based on the position of the speaker or of the participants, the region is widened so that the display is more reasonable. If the current frame is i, the target size is y, and the redundancy coefficient is r, then $y_i = y_i \cdot r$ (with $r \ge 1$). Focusing processing is then performed on the target human-shaped area as a sliding process: when the focus moves from one position to another, the picture also moves slowly from one position to the other, producing a smooth dynamic transition. If the current frame is i, the sliding coefficient is a with range (0, 1), and the target has width w and center point (cx, cy), then sliding by width and center point gives:
$$w_i = a\,\hat{w}_i + (1-a)\,w_{i-1}, \qquad cx_i = a\,\hat{cx}_i + (1-a)\,cx_{i-1}, \qquad cy_i = a\,\hat{cy}_i + (1-a)\,cy_{i-1}$$

where hatted quantities denote the (widened) target values of frame $i$.
the actual picture being a final output displayThe height of the picture with fixed size is calculated according to the size proportion of the actual picture by the calculated width, and finally the picture is determined by the width, the height and the central point and is scaled in the same proportion to obtain the actual picture.
In one possible embodiment, the performing real-time focusing on the human-shaped area of the speaker includes: performing feature extraction on the human-shaped area of the speaker, and predicting the motion trajectory of the speaker in the next frame based on the detected first gesture, thereby achieving real-time focusing processing on the human-shaped area of the speaker.
Illustratively, the motion trajectory of the speaker in the next frame is predicted by a method combining deep learning with traditional tracking ideas. First, the tracking mode is started through gesture recognition and the speaker is identified by a detection algorithm; the speaker's next-frame trajectory and feature information are then predicted by Kalman filtering and a ReID (person re-identification) model; finally, the Hungarian algorithm matches the predicted trajectory and feature information against the speaker detected and recognized in the current frame.
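The final matching step is an assignment problem. Below is a minimal sketch using scipy's Hungarian solver; the cost function (a mix of center distance and ReID cosine dissimilarity, weighted by alpha) is an assumption, as the patent does not specify one:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks(pred_pos, pred_feat, det_pos, det_feat, alpha=0.5):
    """Match Kalman-predicted tracks to current-frame detections.
    pred_pos/det_pos: arrays of box centers; pred_feat/det_feat: ReID feature vectors."""
    n, m = len(pred_pos), len(det_pos)
    cost = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            dist = np.linalg.norm(pred_pos[i] - det_pos[j])        # motion cue (Kalman)
            sim = np.dot(pred_feat[i], det_feat[j]) / (
                np.linalg.norm(pred_feat[i]) * np.linalg.norm(det_feat[j]) + 1e-9)
            cost[i, j] = alpha * dist + (1 - alpha) * (1.0 - sim)  # appearance cue (ReID)
    rows, cols = linear_sum_assignment(cost)   # Hungarian: minimum-cost one-to-one match
    return list(zip(rows, cols))               # (track index, detection index) pairs
```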
The interaction method based on gesture recognition provided by the invention frees users from the traditional remote-control mode and achieves intelligent control through gesture commands performed at a distance.
The invention provides an interaction system based on gesture recognition, configured to perform the interaction method based on gesture recognition according to any one of the above embodiments. As shown in FIG. 2, the interaction system comprises an acquisition module 201, a human-shape detection module 202, a gesture detection module 203, and a focusing module 204. The acquisition module 201 is configured to acquire a video data stream of a monitoring area and obtain image frames from the video data stream; the human-shape detection module 202 is configured to perform human-shape detection on the image frames to determine the human-shaped areas in the image frames; the gesture detection module 203 is configured to perform gesture detection on the human-shaped areas; and the focusing module 204 is configured to determine a focusing area according to the gesture detection result.
In one possible embodiment, the gesture detection module comprises: a setting unit and a detection unit; the setting unit is used for setting a first gesture and a second gesture; the detection unit is used for executing gesture detection on the human-shaped area; if the first gesture is detected, determining that a humanoid area containing the first gesture is a humanoid area of a speaker, and a humanoid area not containing the first gesture is a humanoid area of a participant, wherein the focusing module executes real-time focusing processing on the humanoid area of the speaker; and if the second gesture is detected, the focusing module performs focusing processing on the human-shaped area of the participant.
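As an illustration of how the four modules of FIG. 2 might be composed, a skeleton follows; the class and method names are assumptions for illustration only:

```python
class GestureInteractionSystem:
    """Wires the modules of FIG. 2 into one per-frame pipeline (illustrative sketch)."""

    def __init__(self, acquisition, human_detector, gesture_detector, focuser):
        self.acquisition = acquisition            # module 201: stream -> image frames
        self.human_detector = human_detector      # module 202: frame -> human-shaped areas
        self.gesture_detector = gesture_detector  # module 203: area -> first/second/none
        self.focuser = focuser                    # module 204: gesture result -> focusing area

    def step(self):
        """Process one frame and return the focusing area chosen by the focusing module."""
        frame = self.acquisition.next_frame()
        areas = self.human_detector.detect(frame)
        gestures = [self.gesture_detector.detect(frame, a) for a in areas]
        return self.focuser.select(areas, gestures)
```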
Illustratively, as shown in FIG. 3, the full picture of the online conference contains five participants including the speaker: human-shaped area 1 of the speaker and human-shaped areas 2-4 of the participants. If the first gesture is detected, the focusing module performs real-time focusing processing on human-shaped area 1 of the speaker; if the second gesture is detected, the focusing module performs focusing processing on human-shaped areas 2-4 of the participants.
The above description is only a specific implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any changes or substitutions within the technical scope disclosed in the embodiments of the present application should be covered within the scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An interaction method based on gesture recognition is characterized by comprising the following steps:
acquiring a video data stream of a monitoring area, and acquiring an image frame from the video data stream;
performing human shape detection on the image frame to determine human shape regions in the image frame;
and executing gesture detection on the human-shaped area, and determining a focusing area according to the gesture detection result.
2. The gesture recognition based interaction method according to claim 1, wherein the performing gesture detection on the humanoid region comprises:
performing gesture detection on the human-shaped area; if a first gesture is detected, determining that the human-shaped area containing the first gesture is the human-shaped area of a speaker and that the human-shaped areas not containing the first gesture are the human-shaped areas of participants, and performing real-time focusing processing on the human-shaped area of the speaker; and if a second gesture is detected, performing focusing processing on the human-shaped areas of the participants.
3. The gesture recognition based interaction method according to claim 2, wherein the performing gesture detection on the humanoid region further comprises: and if neither the first gesture nor the second gesture is detected, performing real-time focusing processing on the human-shaped area of the participant.
4. The interaction method based on gesture recognition according to claim 2, wherein the performing of real-time focusing processing on the humanoid region of the speaker comprises: executing face detection on the human-shaped area of the speaker, and determining the facial features of the speaker according to the face detection result;
performing face recognition on the image frame based on facial features of the speaker when a human-shaped region of the speaker is not detected;
and determining a humanoid area containing the speaker based on the result of the face recognition, and performing real-time focusing processing on the humanoid area containing the speaker.
5. The gesture recognition based interaction method according to claim 4, wherein the performing of the face recognition on the image frame based on the facial features of the speaker comprises:
and if the face of the speaker is not recognized, exiting the real-time focusing processing on the human-shaped area of the speaker, and executing the real-time focusing processing on the human-shaped area of the participant.
6. The gesture recognition based interaction method according to claim 2, further comprising: arranging anti-shake areas around the human-shaped area of the speaker and the human-shaped area of the participant; if the human shape position of the speaker exceeds the anti-shake area, human shape detection is carried out on the image frame again, and the human shape area of the speaker is determined again according to a detection result; and if the position of the human figure of the participant exceeds the anti-shake area, re-performing human figure detection on the image frame, and re-determining the human figure area of the participant according to the detection result.
7. The interaction method based on gesture recognition according to claim 6, wherein if the position of the speaker does not exceed the anti-shake area, the human-shaped area of the speaker is locked; and if the position of the participant does not exceed the anti-shake area, locking the human-shaped area of the participant.
8. The interaction method based on gesture recognition according to claim 2, wherein the performing of real-time focusing on the humanoid region of the speaker comprises:
and performing feature extraction on the humanoid area of the speaker, and predicting the action track of the speaker in the next frame based on the detected first gesture so as to realize real-time focusing processing on the humanoid area of the speaker.
9. A gesture recognition based interaction system configured to perform the gesture recognition based interaction method according to any one of claims 1 to 8, comprising: the system comprises an acquisition module, a human shape detection module, a gesture detection module and a focusing module;
the acquisition module is used for acquiring a video data stream of a monitoring area and acquiring an image frame from the video data stream;
the human shape detection module is used for performing human shape detection on the image frame so as to determine a human shape area in the image frame;
the gesture detection module is used for executing gesture detection on the humanoid area, and the focusing module is used for determining a focusing area according to a gesture detection result.
10. The gesture recognition based interaction system according to claim 9, wherein the gesture detection module comprises: a setting unit and a detection unit; the setting unit is used for setting a first gesture and a second gesture; the detection unit is used for executing gesture detection on the human-shaped area; if the first gesture is detected, determining that a humanoid area containing the first gesture is a humanoid area of a speaker, and a humanoid area not containing the first gesture is a humanoid area of a participant, wherein the focusing module executes real-time focusing processing on the humanoid area of the speaker; and if the second gesture is detected, the focusing module executes focusing processing on the human-shaped area of the participant.
CN202210542825.0A 2022-05-18 2022-05-18 Interaction method and system based on gesture recognition Pending CN114924645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210542825.0A CN114924645A (en) 2022-05-18 2022-05-18 Interaction method and system based on gesture recognition


Publications (1)

Publication Number Publication Date
CN114924645A 2022-08-19

Family

ID=82808675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210542825.0A Pending CN114924645A (en) 2022-05-18 2022-05-18 Interaction method and system based on gesture recognition

Country Status (1)

Country Link
CN (1) CN114924645A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103281508A (en) * 2013-05-23 2013-09-04 深圳锐取信息技术股份有限公司 Video picture switching method, video picture switching system, recording and broadcasting server and video recording and broadcasting system
CN105049764A (en) * 2015-06-17 2015-11-11 武汉智亿方科技有限公司 Image tracking method and system for teaching based on multiple positioning cameras
CN108664853A (en) * 2017-03-30 2018-10-16 北京君正集成电路股份有限公司 Method for detecting human face and device
CN109257559A (en) * 2018-09-28 2019-01-22 苏州科达科技股份有限公司 A kind of image display method, device and the video conferencing system of panoramic video meeting
CN111079686A (en) * 2019-12-25 2020-04-28 开放智能机器(上海)有限公司 Single-stage face detection and key point positioning method and system
CN112052805A (en) * 2020-09-10 2020-12-08 深圳数联天下智能科技有限公司 Face detection frame display method, image processing device, equipment and storage medium
CN112689092A (en) * 2020-12-23 2021-04-20 广州市迪士普音响科技有限公司 Automatic tracking conference recording and broadcasting method, system, device and storage medium
CN112954451A (en) * 2021-02-05 2021-06-11 广州市奥威亚电子科技有限公司 Method, device and equipment for adding information to video character and storage medium
CN113784045A (en) * 2021-08-31 2021-12-10 北京安博盛赢教育科技有限责任公司 Focusing interaction method, device, medium and electronic equipment
CN113784046A (en) * 2021-08-31 2021-12-10 北京安博盛赢教育科技有限责任公司 Follow-up shooting method, device, medium and electronic equipment
CN113705510A (en) * 2021-09-02 2021-11-26 广州市奥威亚电子科技有限公司 Target identification tracking method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
EP3855731B1 (en) Context based target framing in a teleconferencing environment
US9396399B1 (en) Unusual event detection in wide-angle video (based on moving object trajectories)
US6894714B2 (en) Method and apparatus for predicting events in video conferencing and other applications
CN103595953B (en) A kind of method and apparatus for controlling video capture
CN104169842B (en) For controlling method, the method for operating video clip, face orientation detector and the videoconference server of video clip
US20220319032A1 (en) Optimal view selection in a teleconferencing system with cascaded cameras
CN111488774A (en) Image processing method and device for image processing
US20220327732A1 (en) Information processing apparatus, information processing method, and program
US20220319034A1 (en) Head Pose Estimation in a Multi-Camera Teleconferencing System
WO2021253259A1 (en) Presenter-tracker management in a videoconferencing environment
CN113591562A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN114924645A (en) Interaction method and system based on gesture recognition
JP6859641B2 (en) Evaluation system, information processing equipment and programs
Kumano et al. Collective first-person vision for automatic gaze analysis in multiparty conversations
CN114816045A (en) Method and device for determining interaction gesture and electronic equipment
JP4831750B2 (en) Communication trigger system
Komiya et al. Image-based attention level estimation of interaction scene by head pose and gaze information
CN112287877A (en) Multi-role close-up shot tracking method
JPH10149447A (en) Gesture recognition method/device
JP2009106325A (en) Communication induction system
Nishida et al. SOANets: Encoder-decoder based Skeleton Orientation Alignment Network for White Cane User Recognition from 2D Human Skeleton Sequence.
Al-Hames et al. Automatic multi-modal meeting camera selection for video-conferences and meeting browsers
US11805225B2 (en) Tracker activation and deactivation in a videoconferencing system
WO2023137715A1 (en) Gimbal control method and apparatus, and movable platform and computer-readable medium
Nishida et al. Exemplar-based Pseudo-Viewpoint Rotation for White-Cane User Recognition from a 2D Human Pose Sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination