CN115016641A - Conference control method, device, conference system and medium based on gesture recognition - Google Patents

Conference control method, device, conference system and medium based on gesture recognition

Info

Publication number
CN115016641A
CN115016641A
Authority
CN
China
Prior art keywords
gesture recognition
gesture
recognition model
conference control
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210617660.9A
Other languages
Chinese (zh)
Inventor
黄鑫
陈龙
蒋海洋
马澎家
杨望宇
蔡俊
张子恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Institute of Technology
Original Assignee
Wuhan Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Institute of Technology filed Critical Wuhan Institute of Technology
Priority to CN202210617660.9A priority Critical patent/CN115016641A/en
Publication of CN115016641A publication Critical patent/CN115016641A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a conference control method, an apparatus, a conference system and a medium based on gesture recognition. The method comprises the following steps: acquiring a fully trained gesture recognition model, wherein the gesture recognition model comprises a feature extraction module and an inference library module; acquiring video stream information to be detected; performing feature extraction on the video stream information based on the feature extraction module to obtain video frames to be recognized; performing gesture action recognition on the video frames to be recognized based on the inference library module to obtain gesture action recognition information; and generating a corresponding conference control instruction according to the gesture action recognition information. The invention can recognize the gesture actions of conference participants using only an ordinary camera and data processing equipment, determines gesture actions by tracking hand key-point information and generates corresponding conference control instructions, completes human-computer interaction more conveniently and quickly, improves the speed and accuracy of recognizing participants' gesture actions, and provides technical support for online conferences with multi-person collaborative interaction.

Description

Conference control method, device, conference system and medium based on gesture recognition
Technical Field
The invention relates to the technical field of human-computer interaction, in particular to a conference control method and device based on gesture recognition, a conference system and a computer readable storage medium.
Background
With the development of artificial intelligence technology, great progress has been made in various computer fields, including human behavior recognition, target detection, target tracking, speech recognition and the like. Human-computer interaction has always been a research hotspot: from early punched paper tape to mouse and keyboard operation, and from today's touch-screen technology to speech-recognition interaction, the way humans communicate with machines has become increasingly natural and humanized. The recent rise of virtual reality and augmented reality technologies has also driven the development of gesture-recognition interaction technology.
Traditional gesture recognition algorithms mainly include threshold segmentation, edge image segmentation, region-based segmentation and the like. In the daily operation of a company, multiple departments often need to discuss a project together. However, the traditional approach only allows a single speaker to present, does not allow other participants to comment, makes real-time discussion difficult, and is not conducive to efficient collaborative consultation. In addition, traditional gesture recognition algorithms have drawbacks, such as: high speed but low accuracy; and, for the adaptive series of multi-threshold segmentation, a large amount of computation, results that are sensitive to the threshold, high resource usage, and the like.
Therefore, a conference control method based on gesture recognition is needed to solve the problems of slow gesture recognition and high resource usage in existing conference systems that control multi-person collaborative interaction through gestures.
Disclosure of Invention
In view of this, it is necessary to provide a conference control method, apparatus, conference system and computer-readable storage medium based on gesture recognition, so as to solve the problems in the prior art of slow gesture recognition and high resource usage when controlling multi-person collaborative interaction through gestures.
In order to solve the above problem, the present invention provides a conference control method based on gesture recognition, including:
acquiring a gesture recognition model which is trained completely, wherein the gesture recognition model comprises a feature extraction module and an inference library module;
acquiring video stream information to be detected;
performing feature extraction on the video stream information based on the feature extraction module to obtain a video frame to be recognized;
performing gesture action recognition on the video frame to be recognized based on the inference library module to obtain gesture action recognition information;
and generating a corresponding conference control instruction according to the gesture action identification information.
Further, the obtaining of the fully trained gesture recognition model includes:
creating an initial gesture recognition model;
acquiring a gesture video sample data set, and dividing the sample data set into a training set and a verification set;
training the initial gesture recognition model by using the training set to obtain a trained gesture recognition model;
and performing performance evaluation on the trained gesture recognition model by using the verification set, and obtaining the fully trained gesture recognition model when the trained gesture recognition model reaches a preset performance standard.
Further, the gesture recognition model includes a plurality of channel-separable convolution blocks;
the channel-separable convolution block includes a plurality of convolution layers and an SE channel attention layer;
the SE channel attention layer is connected to the plurality of convolution layers.
Further, the activation function of the channel-separable convolution block is the Swish function;
the Swish function is applied to the output data of the plurality of convolution layers.
Further, the network parameters of the gesture recognition model include depth, width and resolution;
and the scaling weights corresponding to the depth, the width and the picture size are adjusted using the MnasNet grid search method.
Further, training the initial gesture recognition model using the training set includes:
carrying out first-stage training on the initial gesture recognition model by using a random data enhancement training mode to obtain a gesture recognition model after preliminary optimization;
and performing second-stage training on the preliminarily optimized gesture recognition model by using an adversarial-sample training mode to obtain the trained gesture recognition model.
Further, the first stage training includes adjusting network parameters of the model; the second stage training includes adjusting network parameters and scale of the model.
The invention also provides a conference control device based on gesture recognition, which comprises:
the model acquisition module is used for acquiring a gesture recognition model which is completely trained, and the gesture recognition model comprises a feature extraction module and an inference library module;
the video information acquisition module is used for acquiring video stream information to be detected;
the extraction module is used for performing feature extraction on the video stream information based on the feature extraction module to obtain a video frame to be recognized;
the recognition module is used for performing gesture action recognition on the video frame to be recognized based on the inference library module to obtain gesture action recognition information;
and the instruction generating module is used for generating a corresponding conference control instruction according to the gesture action identification information.
The invention also provides a conference system, which comprises a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the conference control method based on gesture recognition in any of the above technical solutions is implemented.
The invention also provides a computer-readable storage medium, wherein the storage medium stores computer program instructions, and when the computer program instructions are executed by a computer, the computer is caused to execute any one of the above conference control methods based on gesture recognition.
Compared with the prior art, the invention has the following beneficial effects. Firstly, a fully trained gesture recognition model is established; secondly, video stream information to be detected is acquired; thirdly, gesture actions in the video stream information are recognized through the gesture recognition model; and finally, corresponding conference control instructions are generated according to the gesture action recognition information. The invention can recognize the gesture actions of conference participants using only an ordinary camera and data processing equipment, without dedicated matching hardware; it determines gesture actions by tracking hand key-point information and generates corresponding conference control instructions, completes human-computer interaction more conveniently and quickly, improves the speed and accuracy of recognizing participants' gesture actions, and provides technical support for online conferences with multi-person collaborative interaction.
Drawings
Fig. 1 is a schematic flowchart of an embodiment of a conference control method based on gesture recognition according to the present invention;
FIG. 2 is a schematic flowchart illustrating an embodiment of a method for obtaining a fully trained gesture recognition model according to the present invention;
FIG. 3 is a schematic structural diagram of an initial gesture recognition model according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an embodiment of an MBConv volume block provided by the present invention;
fig. 5 is a schematic structural diagram of an embodiment of a conference control apparatus based on gesture recognition according to the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
The invention provides a conference control method based on gesture recognition, which comprises the following steps:
step S101: acquiring a gesture recognition model which is trained completely, wherein the gesture recognition model comprises a feature extraction module and an inference library module;
step S102: acquiring video stream information to be detected;
step S103: performing feature extraction on the video stream information based on the feature extraction module to obtain a video frame to be recognized;
step S104: performing gesture action recognition on the video frame to be recognized based on the inference library module to obtain gesture action recognition information;
step S105: and generating a corresponding conference control instruction according to the gesture action identification information.
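The flow of steps S101 to S105 can be sketched as a short pipeline. The stubbed frames, gesture labels and the gesture-to-command map below are illustrative assumptions, not part of the patent; they only show how recognized gestures would be turned into conference control instructions.

```python
# Minimal sketch of steps S101-S105, with stubbed components standing in
# for the trained model; all names here are illustrative, not from the patent.

GESTURE_TO_COMMAND = {          # hypothetical gesture -> conference command map
    "pan_right": "NEXT_SLIDE",
    "pan_left":  "PREV_SLIDE",
    "zoom_in":   "MAGNIFY",
}

def extract_frames(video_stream):
    """S103: the feature extraction module yields frames to be recognized."""
    return list(video_stream)   # stub: treat the stream as an iterable of frames

def recognize_gesture(frame):
    """S104: the inference library module labels one frame (stubbed)."""
    return frame.get("gesture")  # stub: the frame already carries its label

def control_conference(video_stream):
    """S101-S105 end to end: frames -> gesture labels -> control commands."""
    commands = []
    for frame in extract_frames(video_stream):
        gesture = recognize_gesture(frame)
        if gesture in GESTURE_TO_COMMAND:       # S105: map gesture to command
            commands.append(GESTURE_TO_COMMAND[gesture])
    return commands

print(control_conference([{"gesture": "pan_right"}, {"gesture": "fist"}]))
```

In a real system the two stubs would be replaced by the trained feature extraction and inference modules; only the final mapping to conference instructions would stay this simple.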
As a specific embodiment, the feature extraction module may use the EfficientNet model framework or the MobileNet network model framework.
As a specific embodiment, the inference library module is a strong-classification neural network, Realtimenet. In current applications of Realtimenet, the neural network architecture can identify which gesture an action is, can detect abnormal behaviors such as unusual movement and fighting, and can also estimate calories consumed. Thus, it can be used in the following scenarios: 1. gesture control (in smart home devices, smart kiosks, automobiles); 2. human action recognition (in smart home devices, automobiles, public places, video calls); 3. fitness tracking; 4. human-computer interaction (e.g., determining whether the user is talking to the system or to someone else); 5. AR (recognizing gestures from the first-person perspective); 6. interaction between digital or virtual humans and users.
In the conference control method based on gesture recognition provided by this embodiment, first, a fully trained gesture recognition model is established; secondly, video stream information to be detected is acquired; thirdly, gesture actions in the video stream information are recognized through the gesture recognition model; and finally, corresponding conference control instructions are generated according to the gesture action recognition information. The method can recognize the gesture actions of conference participants using only an ordinary camera and data processing equipment, without dedicated matching hardware; it determines gesture actions by tracking hand key-point information and generates corresponding conference control instructions, completes human-computer interaction more conveniently and quickly, improves the speed and accuracy of recognizing participants' gesture actions, and provides technical support for online conferences with multi-person collaborative interaction.
As a preferred embodiment, in step S101, as shown in fig. 2, acquiring a fully trained gesture recognition model includes:
step S201: creating an initial gesture recognition model;
step S202: acquiring a gesture video sample data set, and dividing the sample data set into a training set and a verification set;
step S203: training the initial gesture recognition model by using the training set to obtain a trained gesture recognition model;
step S204: performing performance evaluation on the trained gesture recognition model by using the verification set, and obtaining the fully trained gesture recognition model when the trained gesture recognition model reaches a preset performance standard.
As a specific embodiment, in step S202, the videos and classification labels in the gesture video sample data set are first converted into images (video frames) and corresponding classification labels; alternatively, a short video can be labeled with a classification category as a whole, without per-frame labeling. Specifically: video frames (as pictures) are cut from each video (both training and test videos) at a certain FPS (the number of frames per second) and stored as the training set and test set, and the classification performance on the images is taken as the classification performance of the corresponding videos;
after training is finished, the model is loaded to check all video frames in the test set, and the top-five results by accuracy on the full test set are output.
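The frame-sampling step described above (cutting frames from each video at a fixed FPS) can be sketched as follows. The index arithmetic is the testable core; the commented OpenCV calls show one plausible way to wire it to a real video file, and all names are illustrative assumptions.

```python
# Sketch of sampling video frames at a fixed FPS for the training/test sets.

def sample_indices(total_frames, src_fps, target_fps):
    """Indices of frames to keep when resampling src_fps video at target_fps."""
    step = src_fps / target_fps          # e.g. 30 fps source, 5 fps target -> every 6th
    kept, next_keep = [], 0.0
    for i in range(total_frames):
        if i >= next_keep:
            kept.append(i)
            next_keep += step
    return kept

# With OpenCV this would drive the actual extraction (not executed here):
#   cap = cv2.VideoCapture("gesture.mp4")
#   for i in sample_indices(int(cap.get(cv2.CAP_PROP_FRAME_COUNT)),
#                           cap.get(cv2.CAP_PROP_FPS), 5):
#       cap.set(cv2.CAP_PROP_POS_FRAMES, i)
#       ok, frame = cap.read()
#       if ok:
#           cv2.imwrite(f"frames/{i:06d}.png", frame)

print(sample_indices(12, 30, 5))  # 30/5 -> keep every 6th frame
```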
As a preferred embodiment, the network parameters of the gesture recognition model include depth, width and resolution size;
and adjusting the weights corresponding to the depth, the width and the picture size by using a MnasNet grid searching method.
As a specific example, when the initial gesture recognition model is created based on the EfficientNet model, the baseline model EfficientNet-B0 is generated using the MnasNet method implemented with a reinforcement learning algorithm. The models in the EfficientNet series range from EfficientNet-B0 to EfficientNet-L2, with increasingly higher accuracy.
Using a compound scaling method, under preset memory and computation constraints, the depth, width (number of channels of the feature map) and picture size of the EfficientNet-B0 model are scaled simultaneously; the scaling ratios of the three dimensions are obtained by grid search, and finally the initial gesture recognition model established based on EfficientNet is output, as shown in fig. 3. The meanings of the subgraphs in fig. 3 are:
(a) the baseline model.
(b) Width scaling on the basis of the baseline model, i.e., increasing the number of channels.
(c) Depth scaling on the basis of the baseline model, i.e., increasing the number of network layers.
(d) Scaling the picture size on the basis of the baseline model.
(e) Scaling the depth, width and picture size simultaneously on the basis of the baseline model.
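A rough sketch of the compound scaling idea shown in fig. 3(e): a single coefficient scales depth, width and picture size together. The base ratios below (alpha=1.2, beta=1.1, gamma=1.15, with alpha*beta^2*gamma^2 ≈ 2) are the ones reported for EfficientNet and are used here purely for illustration.

```python
# Sketch of EfficientNet-style compound scaling: depth, width, and input
# resolution are scaled jointly by one coefficient phi.

ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution ratios

def compound_scale(phi, base_depth=1.0, base_width=1.0, base_res=224):
    depth = base_depth * ALPHA ** phi      # more layers
    width = base_width * BETA ** phi       # more channels per feature map
    res = round(base_res * GAMMA ** phi)   # larger input pictures
    return depth, width, res

d, w, r = compound_scale(1)
print(round(d, 2), round(w, 2), r)
```

Increasing phi grows all three dimensions in a fixed ratio, which is what keeps the grid search over scaling factors tractable compared with searching each dimension independently.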
As a preferred embodiment, the gesture recognition model comprises a plurality of channel-separable convolution blocks;
the channel-separable convolution block includes a plurality of convolution layers and an SE channel attention layer;
the SE channel attention layer is connected to the plurality of convolution layers.
As a preferred embodiment, the activation function of the channel-separable convolution block is the Swish function;
the Swish function is applied to the output data of the plurality of convolution layers.
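For reference, the Swish activation named above is simply x·sigmoid(x) — a smooth, non-monotonic replacement for ReLU. A minimal sketch:

```python
import math

# Swish activation: x * sigmoid(x). Unlike ReLU it is smooth at zero and
# can take small negative values for negative inputs.

def swish(x):
    return x * (1.0 / (1.0 + math.exp(-x)))

print(round(swish(1.0), 4))   # vs ReLU(1.0) = 1.0
```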
As a specific embodiment, as shown in fig. 4, the interior of the EfficientNet model is implemented by a plurality of MBConv convolution blocks, and the specific structure of each MBConv block is shown in fig. 4. The ReLU activation function of the MBConv block is replaced by the Swish activation function. The MBConv block also uses a structure similar to a residual connection, except that an SE layer is used in the shortcut-connection part.
In addition, the DropConnect method is used instead of the conventional Dropout method. DropConnect differs from Dropout in that, when training the neural network model, instead of randomly dropping the outputs of hidden nodes, it randomly drops their inputs (i.e., individual connection weights). Both DropConnect and Dropout serve to prevent the model from overfitting in deep neural networks; in comparison, DropConnect generally performs better.
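The Dropout/DropConnect contrast described above can be illustrated on a toy two-unit linear layer. Fixed masks replace random sampling so the difference is visible; all names and numbers are illustrative assumptions.

```python
# Toy contrast between Dropout and DropConnect on a 2-unit linear layer:
# Dropout zeroes an input for EVERY output unit, DropConnect zeroes
# individual weights (connections) independently.

def dropout_layer(x, W, input_mask):
    # Dropout: a masked input disappears for every output unit
    return [sum(w * xi * m for w, xi, m in zip(row, x, input_mask))
            for row in W]

def dropconnect_layer(x, W, weight_masks):
    # DropConnect: each weight gets its own keep/drop mask
    return [sum(w * m * xi for w, m, xi in zip(row, mask, x))
            for row, mask in zip(W, weight_masks)]

x = [1.0, 2.0]
W = [[1.0, 1.0], [1.0, 1.0]]
print(dropout_layer(x, W, [0, 1]))                # input 0 dropped everywhere
print(dropconnect_layer(x, W, [[0, 1], [1, 1]]))  # only one connection dropped
```

With the same drop budget, DropConnect's per-connection masking gives a much larger family of thinned sub-networks, which is one intuition for why it can regularize better.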
As a preferred embodiment, training the initial gesture recognition model by using the training set includes:
carrying out first-stage training on the initial gesture recognition model by using a random data enhancement training mode to obtain a gesture recognition model after preliminary optimization;
and performing second-stage training on the preliminarily optimized gesture recognition model using an adversarial-sample training mode to obtain the trained gesture recognition model.
As a specific example, the gesture recognition model of the present invention is considered from two aspects.
(1) Aspect of model structure scale:
from the EfficientNet-B0 version to the EfficientNet-L2 version in the EfficientNet series models, the models are higher and higher in precision and larger in scale, and the requirement for the memory is increased accordingly.
The scale of the model is mainly determined by the scaling parameters of three dimensions: width, depth and resolution. These three dimensions are not independent of one another: for higher-resolution input pictures, a deeper network is required to obtain a larger receptive field; likewise, higher-resolution pictures need more channels to capture more fine-grained features.
The scaling parameters for each version are shown in table 1, and it can be seen that as the scaling parameters of the model become larger, the drop rate parameter of dropout also increases. This is because the more parameters in the model, the stronger the fitting effect of the model, and the easier it is to generate overfitting.
TABLE 1
Version           Width scaling   Depth scaling   Resolution   Dropout rate
EfficientNet-B0   1.0             1.0             224          0.2
EfficientNet-B1   1.0             1.1             240          0.2
EfficientNet-B2   1.1             1.2             260          0.3
EfficientNet-B3   1.2             1.4             300          0.3
EfficientNet-B4   1.4             1.8             380          0.4
EfficientNet-B5   1.6             2.2             456          0.4
EfficientNet-B6   1.8             2.6             528          0.5
EfficientNet-B7   2.0             3.1             600          0.5
EfficientNet-B8   2.2             3.6             672          0.5
EfficientNet-L2   4.3             8.3             800          0.5
To avoid the over-fitting problem, increasing the drop rate of dropout alone is not enough. There is also a need to improve the generalization capability of the model by means of an improvement in the training mode.
(2) Training aspects of the model:
before the EfficientNet-B7 version, the EfficientNet series model mainly improves the precision by adjusting the scaling parameters and increasing the network scale. After the EfficientNet-B7 version, the model precision is improved mainly by improving the training mode and increasing the network size 2 methods in parallel. The main training method is as follows:
1) Random data enhancement, called RandAugment, a more efficient data enhancement method. It is used in the EfficientNet-B7 version.
2) Training the model with adversarial samples: applied in the EfficientNet-B8 and EfficientNet-L2 versions; hereinafter this training approach for versions B8 through L2 is referred to as AdvProp.
Random data enhancement directly replaces the original AutoAugment method in the original training framework, while AdvProp and Noisy Student are training methods that apply random data enhancement within a new training framework.
3) Using a self-training framework: the application is in the Noisy Student version.
The random data enhancement method in this embodiment is a new data enhancement method that is simpler and more effective than the AutoAugment method.
In this embodiment, overfitting is reduced by a training method that uses adversarial samples. In implementation, a separate auxiliary batch normalization is used to process the adversarial samples, i.e., an additional auxiliary BN acts on the adversarial samples alone. Adversarial samples are those generated by adding an imperceptible perturbation to an image, which may cause convolutional neural networks (ConvNets) to make erroneous predictions.
In the AdvProp model, three adversarial-sample generation algorithms are used: PGD, I-FGSM and GD.
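As a hedged illustration of how such adversarial samples are generated, here is a single FGSM step (the building block of the I-FGSM and PGD generators named above) on a toy linear model with a squared-error loss. The model, loss and numbers are assumptions for illustration, not the patent's implementation.

```python
# One FGSM step: nudge the input by eps in the direction of the sign of the
# loss gradient with respect to the input, which raises the loss.

def loss(x, w, y):
    """Squared error of a linear model: (w . x - y)^2."""
    return (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2

def fgsm_step(x, w, y, eps):
    """Return the adversarially perturbed input x + eps * sign(dL/dx)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    grad = [2 * (pred - y) * wi for wi in w]       # dL/dx_i for the loss above
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

x, w, y = [1.0, -1.0], [0.5, 0.5], 1.0
x_adv = fgsm_step(x, w, y, eps=0.1)
print(loss(x, w, y), loss(x_adv, w, y))   # the perturbation raises the loss
```

I-FGSM and PGD iterate this step several times (PGD additionally projecting back into an eps-ball), which is why a single step is the natural unit to sketch.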
Based on the foundational and applied research results of existing gesture recognition technology related to computer vision, software libraries such as the MediaPipe Hands framework and OpenCV are integrated, and a gesture recognition method that can be used on a laptop with a camera is designed. Gestures are determined by tracking hand key-point information and corresponding decisions are made, so that human-computer interaction is completed more conveniently and quickly.
As a specific embodiment, the gesture actions that can be recognized by this embodiment are: clicking, translating, zooming, grabbing and rotating.
The click gesture action includes: the index finger of one hand is extended and the other four fingers are closed to form a clicking shape; the index finger locates the mouse coordinates, and the thumb controls whether the mouse is in a pressed (clicked) state, so the gesture can be used to click a file, a hyperlink and the like.
The panning gesture action includes: the five fingers of one hand are opened and moved in parallel, used for advancing to the next PPT slide or returning to the previous one; translating the five fingers in the positive direction (from left to right) advances to the next slide.
The zoom gesture action includes: the five fingers of the two hands are opened, and simultaneously move outwards in an expanding way or move inwards in a contracting way, so that the specific areas of pictures, texts and the like can be enlarged or reduced according to a specific scale.
The grab gesture action includes: the index finger and middle finger of one hand are extended and the other three fingers are closed to locate an object to grab; the thumb is then extended to enter the grabbing state, which can be used to drag a text box, text, pictures and the like to a specified position.
The rotation gesture action includes: the five fingers of both hands are opened and moved clockwise a certain distance simultaneously; the rotation angle is recognized from the positions to which the two hands move, and can be used to implement functions such as rotating a picture.
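As an illustrative sketch (not the patent's implementation), the two-hand zoom gesture above can be decided from tracked hand key points by comparing the spread between the two hands' palm centers across frames. The landmark layout (e.g., MediaPipe's 21 points per hand) and the threshold value are assumptions.

```python
import math

# Decide zoom-in/zoom-out from the change in spread between two hands'
# palm centers over consecutive frames. Landmarks are (x, y) tuples.

def palm_center(landmarks):
    """Mean (x, y) of one hand's key points."""
    xs, ys = zip(*landmarks)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def zoom_gesture(prev_hands, cur_hands, threshold=0.1):
    """Return ('zoom_in' | 'zoom_out' | None, scale) from the spread change."""
    def spread(hands):
        (x1, y1), (x2, y2) = palm_center(hands[0]), palm_center(hands[1])
        return math.hypot(x2 - x1, y2 - y1)
    scale = spread(cur_hands) / spread(prev_hands)
    if scale > 1 + threshold:
        return "zoom_in", scale       # hands moving apart -> magnify
    if scale < 1 - threshold:
        return "zoom_out", scale      # hands moving together -> shrink
    return None, scale
```

The returned scale factor could directly drive the proportional enlargement or reduction of the selected picture or text region described above.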
In a preferred embodiment, the first stage training includes adjusting network parameters of the model; the second stage training includes adjusting network parameters and scale of the model.
The present invention also provides a conference control device based on gesture recognition, a block diagram of which is shown in fig. 5, and the conference control device 500 based on gesture recognition includes:
the model acquisition module 501 is used for acquiring a fully trained gesture recognition model, and the gesture recognition model comprises a feature extraction module and an inference library module;
a video information obtaining module 502, configured to obtain video stream information to be detected;
an extracting module 503, configured to perform feature extraction on the video stream information based on the feature extraction module to obtain a video frame to be recognized;
the recognition module 504, configured to perform gesture action recognition on the video frame to be recognized based on the inference library module to obtain gesture action recognition information;
and the instruction generating module 505 is configured to generate a corresponding conference control instruction according to the gesture motion recognition information.
The invention also correspondingly provides a conference system, which comprises a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the conference control method based on gesture recognition in any of the above technical solutions is implemented.
The present embodiment also provides a computer-readable storage medium, where the computer-readable storage medium stores computer program instructions, and when the computer program instructions are executed by a computer, the computer is caused to execute any one of the above-mentioned conference control methods based on gesture recognition.
According to the computer-readable storage medium and the computing device provided by the above embodiments of the present invention, the content specifically described for implementing the conference control method based on gesture recognition according to the present invention can be referred to, and the beneficial effects similar to those of the conference control method based on gesture recognition as described above are obtained, and are not repeated herein.
The invention discloses a conference control method, apparatus, conference system and computer-readable storage medium based on gesture recognition. Firstly, a fully trained gesture recognition model is established; secondly, video stream information to be detected is acquired; thirdly, gesture actions in the video stream information are recognized through the gesture recognition model; and finally, corresponding conference control instructions are generated according to the gesture action recognition information. The invention can recognize the gesture actions of conference participants using only an ordinary camera and data processing equipment, without dedicated matching hardware; it determines gesture actions by tracking hand key-point information and generates corresponding conference control instructions, completes human-computer interaction more conveniently and quickly, improves the speed and accuracy of recognizing participants' gesture actions, and provides technical support for online conferences with multi-person collaborative interaction.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. A conference control method based on gesture recognition is characterized by comprising the following steps:
acquiring a fully trained gesture recognition model, wherein the gesture recognition model comprises a feature extraction module and an inference library module;
acquiring video stream information to be detected;
extracting the characteristics of the video stream information based on the characteristic extraction module to obtain a video frame to be identified;
performing gesture action recognition on the video frame to be recognized based on the inference library module to obtain gesture action recognition information;
and generating a corresponding conference control instruction according to the gesture action recognition information.
2. The conference control method based on gesture recognition according to claim 1, wherein acquiring the fully trained gesture recognition model comprises:
creating an initial gesture recognition model;
acquiring a gesture video sample data set, and dividing the sample data set into a training set and a verification set;
training the initial gesture recognition model by using the training set to obtain a trained gesture recognition model;
and performing performance evaluation on the trained gesture recognition model by using the verification set, and taking the trained gesture recognition model as the fully trained gesture recognition model when it reaches a preset performance standard.
3. The conference control method based on gesture recognition according to claim 1, wherein the gesture recognition model comprises a plurality of channel-separable convolution blocks;
each channel-separable convolution block comprises a plurality of convolution layers and an SE channel attention layer;
the SE channel attention layer is connected to the plurality of convolution layers.
4. The conference control method based on gesture recognition according to claim 3, wherein the activation function of the channel-separable convolution block is a Swish function;
the Swish function is used to activate the output data of the plurality of convolution layers.
5. The conference control method based on gesture recognition according to claim 1, wherein the network parameters of the gesture recognition model include depth, width, and resolution;
and the weights corresponding to the depth, the width, and the resolution are adjusted by using a MnasNet grid search method.
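Jointly scaling depth, width, and resolution as in claim 5 resembles EfficientNet-style compound scaling, where a grid search (MnasNet-style) fixes per-dimension factors and a single coefficient scales all three. The factor values and base sizes below are illustrative assumptions:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15,
                   base_depth=10, base_width=16, base_resolution=224):
    """Scale network depth, width, and input resolution together.

    phi is the compound coefficient; alpha/beta/gamma are per-dimension
    factors (values here are illustrative, as a grid search might find
    them under a fixed compute budget).
    """
    depth = round(base_depth * alpha ** phi)
    width = round(base_width * beta ** phi)
    resolution = round(base_resolution * gamma ** phi)
    return depth, width, resolution
```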
6. The conference control method based on gesture recognition according to claim 5, wherein training the initial gesture recognition model by using the training set comprises:
performing a first-stage training on the initial gesture recognition model by using a random data enhancement training mode to obtain a gesture recognition model after preliminary optimization;
and performing second-stage training on the preliminarily optimized gesture recognition model by using an adversarial-sample training mode to obtain the trained gesture recognition model.
7. The conference control method based on gesture recognition according to claim 6, wherein the first stage training includes adjusting network parameters of a model; the second stage training includes adjusting network parameters and scale of the model.
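Adversarial-sample training, as in the second stage of claims 6–7, typically perturbs each input a small step along the sign of the loss gradient (FGSM-style) and trains on the perturbed input. A toy sketch on a 1-D logistic model (model, weights, and epsilon are illustrative, not from the disclosure):

```python
import math

def fgsm_perturb(x, y, w, b, epsilon=0.1):
    """FGSM-style adversarial example for a 1-D logistic model.

    Moves input x by epsilon in the direction that increases the
    binary cross-entropy loss (sign of the input gradient).
    """
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
    grad_x = (p - y) * w                      # d(loss)/dx for BCE loss
    sign = 1.0 if grad_x > 0 else (-1.0 if grad_x < 0 else 0.0)
    return x + epsilon * sign
```

During second-stage training, the model would then be updated on these perturbed samples alongside (or instead of) the clean ones.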
8. A conference control apparatus based on gesture recognition, comprising:
the model acquisition module is used for acquiring a fully trained gesture recognition model, the gesture recognition model comprising a feature extraction module and an inference library module;
the video information acquisition module is used for acquiring video stream information to be detected;
the extraction module is used for extracting features of the video stream information based on the feature extraction module to obtain a video frame to be recognized;
the recognition module is used for performing gesture action recognition on the video frame to be recognized based on the inference library module to obtain gesture action recognition information;
and the instruction generation module is used for generating a corresponding conference control instruction according to the gesture action recognition information.
9. A conferencing system comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements a gesture recognition based conference control method according to any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer program instructions which, when executed by a computer, cause the computer to perform the conference control method based on gesture recognition according to any one of claims 1-7.
CN202210617660.9A 2022-06-01 2022-06-01 Conference control method, device, conference system and medium based on gesture recognition Pending CN115016641A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210617660.9A CN115016641A (en) 2022-06-01 2022-06-01 Conference control method, device, conference system and medium based on gesture recognition


Publications (1)

Publication Number Publication Date
CN115016641A true CN115016641A (en) 2022-09-06

Family

ID=83072640



Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117671592A (en) * 2023-12-08 2024-03-08 中化现代农业有限公司 Dangerous behavior detection method, dangerous behavior detection device, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination