CN114564104A - Conference demonstration system based on dynamic gesture control in video - Google Patents

Conference demonstration system based on dynamic gesture control in video

Info

Publication number
CN114564104A
Authority
CN
China
Prior art keywords
gesture
video
module
redundancy
conference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210145445.3A
Other languages
Chinese (zh)
Inventor
苗启广
宋建锋
史媛媛
刘如意
苗凯彬
李宇楠
刘向增
葛道辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210145445.3A
Publication of CN114564104A
Legal status: Pending

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures

Abstract

The invention discloses a conference presentation system based on dynamic gesture control in video, composed of a real-time video acquisition module, a continuous gesture segmentation module, a video redundancy removal module, a gesture recognition module, and a conference presentation system response module. The real-time video acquisition module continuously captures real-time video stream data; the continuous gesture segmentation module splits the continuous gestures in the video stream; the video redundancy removal module removes the redundant regions of each single-gesture video clip; the gesture recognition module recognizes each received independent single-gesture video; and the conference presentation system response module converts the gesture signal into a control instruction for the conference system and invokes the corresponding instruction function to open the presentation, start the slide show, and turn pages. The system frees the presenter from the mouse, keyboard, and page-turning pen, enhances the interactivity of conference presentations, and improves the fluency of the presentation process.

Description

Conference demonstration system based on dynamic gesture control in video
Technical Field
The invention belongs to the technical field of computer applications, relates to a presentation control system for office, teaching, and conference use, and particularly relates to a conference presentation system based on dynamic gesture control in video.
Background
With the spread of computer technology and equipment, projection devices are now commonly used for presentations in office and teaching settings, and the tools that operate them are mostly mice, keyboards, or page-turning pens. Office and teaching scenarios emphasize interaction between the presenter and the audience, and these devices impose real limitations. For example, when a presenter has walked off the platform to interact with the audience and then needs to show a slide or turn a page, the presenter must break off the interaction and return to the platform to operate the conference system, which destroys the fluency of the exchange. As another example, a remote-control device interrupts communication whenever the presenter greets a participant with a handshake or a signature. All of these problems cause considerable inconvenience in practice.
Disclosure of Invention
To solve the technical problem that, during a conference presentation, the slides can only be controlled through a keyboard, a mouse, a page-turning pen, or similar devices, which is neither convenient nor fast, the invention aims to provide a contactless conference presentation system based on dynamic gesture control in video, in which the presenter, unconstrained by distance and needing no external control equipment, can freely and smoothly open the conference system, start the slide show, turn pages, and so on.
To accomplish this task, the invention adopts the following technical solution:
the utility model provides a meeting presentation system based on gesture control in video which characterized in that, comprises real-time video acquisition module, continuous gesture segmentation module, video redundancy module, gesture recognition module and the meeting presentation system response module that connect gradually, wherein:
the real-time video acquisition module is used for acquiring the current video stream in real time with a camera;
the continuous gesture segmentation module is used for splitting the continuous gestures in the video stream, segmenting them into independent gesture clips, and sending the independent gesture video clips to the video redundancy removal module;
the video redundancy removal module is used for removing the redundant regions of each single-gesture video clip, screening the effective information in the clip through a coarse redundancy removal unit followed by a fine redundancy removal unit, and sending the condensed independent gesture video clips to the gesture recognition module;
the gesture recognition module is used for recognizing each received independent single-gesture video: a gesture recognition model is trained on a pre-recorded data set, the resulting gesture feature model is used to predict and classify the detected hand video, and the prediction result is finally sent to the conference presentation system response module;
the conference presentation system response module is used for converting the received gesture category prediction into a control instruction and sending that instruction to the processor to open the presentation, start the slide show, and turn pages.
According to the invention, the continuous gesture segmentation module uses a hand discriminator algorithm to judge whether a hand is continuously visible in the presentation area, segments the continuous gestures into independent gesture clips according to the degree of hand visibility, and sends the independent gesture video clips to the video redundancy removal module.
Furthermore, the video redundancy removal module screens the effective information in each video clip through two units, coarse redundancy removal and fine redundancy removal, wherein:
the coarse redundancy removal unit is used for screening out and deleting the irrelevant gesture segments at the start and end of the video;
the fine redundancy removal unit is designed to filter similar frames within the video, condensing the video information in order to speed up the gesture recognition module.
Specifically, the gesture recognition module comprises a recorded data set unit, a gesture recognition model unit, and a gesture category prediction unit, wherein:
the gesture recognition model unit is used for training on the constructed gesture data set, learning the feature information of the different gesture categories, and saving it as a gesture feature model;
the gesture category prediction unit is used for predicting the gesture category of a hand motion video;
the recorded data set unit is obtained by collecting, recording, and organizing data from 18 demonstrators against a white-wall background under normal indoor illumination; each demonstrator sits 1 m from the camera and performs three gesture actions: click, grab, and translate.
Each gesture is a one-handed action and can be performed with either the left or the right hand.
The camera in the real-time video acquisition module is an ordinary camera.
The conference presentation system based on dynamic gesture control in video can be widely applied in office, teaching, and similar environments; it frees the presenter from keyboard, mouse, and page-turning pen, removes spatial constraints, allows real-time control of the system, enhances the interactivity of conference presentations, and improves the fluency of the presentation.
Drawings
Fig. 1 is a schematic diagram of the overall structure of the conference presentation system based on gesture control in video according to the invention.
Fig. 2 is a flow diagram of the continuous gesture segmentation module based on sliding-window detection.
Fig. 3 is a schematic diagram of the sliding-window segmentation algorithm in the continuous gesture segmentation module.
Fig. 4 is a plot of similarity between successive frames in a video stream.
Fig. 5 is a performance plot of the trained gesture recognition model.
The invention is explained in more detail below with reference to the figures and examples.
Detailed Description
It should be noted that the conference presentation system based on dynamic gesture control in video provided by this embodiment is built on a PC, so that the PC can conveniently control the entire conference system. Moreover, the system is controlled by dynamic gestures in video, not by static gestures in images: the gesture recognition model unit trains a three-dimensional convolutional neural network directly on dynamic gesture videos, which is closer to real-life scenarios and favors practical adoption. The camera in the real-time video acquisition module is an ordinary camera, which eases wide deployment of the system.
In the design, the continuous gesture segmentation module addresses the streaming-video scenarios of real life. Earlier presentation systems generally dealt with still pictures of gestures, which can be treated as isolated: each picture contains only one gesture to analyze and recognize. In a real human-computer interaction scenario, by contrast, the first problem of gesture recognition is to separate and extract individual gestures from a continuously acquired gesture video stream. This module therefore implements a continuous gesture segmentation method based on sliding-window detection to split the continuous gestures in the camera's video stream. At the same time, because practical applications place heavy demands on real-time performance, a multi-threaded framework is adopted alongside the continuous gesture segmentation.
Furthermore, the video redundancy removal module consists of two units: coarse redundancy removal and fine redundancy removal. First, the coarse redundancy removal unit applies an adaptive inter-frame similarity criterion to adaptively screen out and delete the irrelevant gesture information at the start and end of each video clip; then, to further speed up the gesture recognition module, the fine redundancy removal unit applies a uniform proportional sampling algorithm to thin out the remaining redundancy and condense the video information.
The recorded data set was collected, recorded, and organized by the applicant. It records 18 demonstrators against a white-wall background under normal indoor illumination. Each demonstrator sits 1 m from the camera and performs three gesture actions: click, grab, and translate.
To enlarge the data volume, each demonstrator repeated each gesture 5 times, giving 90 samples per gesture and 270 gesture videos in total. The data set has 3 classes (click, grab, translate), with 200 videos used as the training set and 70 as the test set.
The control gestures and detailed information are as follows:
[Table of control gestures and their detailed descriptions; reproduced as an image in the original publication.]
Each gesture is a one-handed motion and can be performed with either the left or the right hand.
As shown in fig. 1, the present embodiment provides a conference presentation system based on dynamic gesture control in video, comprising a real-time video acquisition module, a continuous gesture segmentation module, a video redundancy removal module, a gesture recognition module, and a conference presentation system response module.
The real-time video acquisition module mainly acquires the current video stream in real time with a camera.
The continuous gesture segmentation module mainly splits the continuous gestures in the video stream into independent single gestures and sends each independent gesture video to the video redundancy removal module.
In this embodiment, the continuous gesture segmentation module uses a hand discriminator algorithm to judge whether a hand is continuously visible in the presentation area and segments the continuous gestures into individual independent gesture clips according to the degree of hand visibility.
The video redundancy removal module mainly removes the redundant regions of each single-gesture video clip: it screens the effective information in the clip through a coarse redundancy removal unit followed by a fine redundancy removal unit and sends the condensed independent gesture video clips to the gesture recognition module, wherein:
the coarse redundancy removal unit screens out and deletes the irrelevant gesture segments at the start and end of the video;
the fine redundancy removal unit filters similar frames within the video, condensing the video information in order to speed up the gesture recognition module.
The gesture recognition module mainly covers acquiring the hand video, training a model on dynamic gesture videos, and loading the model for prediction. The conference presentation system response module converts the gesture signal into a control instruction for the conference system and invokes the corresponding instruction function to open the presentation, start the slide show, and turn pages.
Specifically, the gesture recognition module comprises a recorded data set unit, a gesture recognition model unit, and a gesture category prediction unit, wherein:
the gesture recognition model unit trains on the constructed gesture data set, learns the feature information of the different gesture categories, and saves it as a gesture feature model;
the gesture category prediction unit predicts the gesture category of a hand motion video;
the recorded data set unit is obtained by collecting, recording, and organizing data from 18 demonstrators against a white-wall background under normal indoor illumination; each demonstrator sits 1 m from the camera and performs three gesture actions: click, grab, and translate.
The conference presentation system response module converts the received gesture category prediction into a control instruction and sends that instruction to the processor to open the presentation, start the slide show, and turn pages.
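Taken together, the five modules form a linear pipeline. The following minimal Python sketch shows one way the data flow might be wired up; every name in it is an illustrative assumption, since the filing specifies modules rather than code:

```python
from typing import Any, Iterable, Iterator, List

Frame = Any  # stand-in for an image array (e.g., a numpy.ndarray from OpenCV)

def acquire_video_stream() -> Iterator[Frame]:
    """Real-time video acquisition module (stub)."""
    return iter(())  # a real implementation would yield camera frames

def segment_continuous_gestures(stream: Iterable[Frame]) -> Iterator[List[Frame]]:
    """Continuous gesture segmentation module (stub): yields 100-frame clips."""
    return iter(())

def remove_redundancy(clip: List[Frame]) -> List[Frame]:
    """Video redundancy removal module (stub): coarse trim, then fine sampling."""
    return clip

def recognize_gesture(clip: List[Frame]) -> str:
    """Gesture recognition module (stub): returns a predicted class label."""
    return "pan"

def dispatch_command(gesture: str) -> None:
    """Conference presentation system response module (stub)."""
    print(f"issue control instruction for gesture: {gesture}")

def run_pipeline() -> None:
    # The five modules connected in sequence, as in claim 1.
    for clip in segment_continuous_gestures(acquire_video_stream()):
        dispatch_command(recognize_gesture(remove_redundancy(clip)))

if __name__ == "__main__":
    run_pipeline()
```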
Referring to fig. 1, the conference presentation system based on dynamic gesture control in video of the present embodiment operates in the following steps:
Step 1: open the camera; the presenter performs continuous gesture actions while the camera captures the real-time video stream.
Step 2: segment the continuously captured dynamic gesture videos into independent single-gesture videos.
in this embodiment, a continuous gesture segmentation method based on sliding window detection is designed, so as to segment a continuous gesture of a video stream acquired by a camera. As shown in fig. 2, in the present embodiment, a multi-thread processing method based on sliding window segmentation is designed, so as to implement segmented sampling of a video stream acquired by a camera. The whole process is realized by the cooperation of the two threads, and by the method, the time delay accumulation caused by the gesture recognition process can be avoided, the whole processing efficiency is further improved, and the real-time performance of the human-computer interaction system is ensured.
Thread 1 is primarily responsible for video capture. It maintains a sliding detection window of length n that runs a detection pass every t seconds; if a hand is detected in n consecutive frames, the next 100 frames are judged to be valid gesture action information. To keep recognition robust, a sampling queue of 100 frames is also maintained, into which the frame sequence of the sliding window is placed (a sequence of 100 frames is read in the first time). Thread 1 sends an activation signal to thread 2 each time it completes a read. (The 100-frame sample length is a threshold determined from experimental statistics of the time and number of video frames needed to perform one independent gesture.)
Thread 2 is primarily responsible for data processing and gesture prediction. Once the sample queue is full, the frame sequence in it is sent to the video redundancy removal module.
Fig. 3 shows the sampling details of thread 1. A sliding-window detection unit is placed over the real-time video stream: if a hand is detected in 10 consecutive frames, the following 100 frames are judged to be valid gesture information and the video redundancy removal stage begins; if no hand appears in 10 consecutive frames, no gesture action is considered to have started, this window is discarded, and sliding-window detection continues in the next round. Each time a hand is detected for 10 consecutive frames, the following 100-frame segment is passed to the next module. This sliding-window detection scheme splits out the independent gestures.
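A minimal sketch of this two-thread sliding-window scheme, assuming OpenCV capture; `detect_hand` is a placeholder for the hand discriminator (the filing does not specify its algorithm), and the queue stands in for the activation signal between the threads:

```python
import collections
import queue
import threading

import cv2
import numpy as np

WINDOW_LEN = 10   # n: consecutive frames that must contain a hand
SAMPLE_LEN = 100  # frames treated as one complete gesture clip

def detect_hand(frame: np.ndarray) -> bool:
    """Placeholder hand discriminator; a real one might use skin color or a
    detector network. Here it trivially returns False."""
    return False

def capture_thread(sample_queue: "queue.Queue[list]") -> None:
    """Thread 1: slide a detection window over the live stream; once a hand is
    seen in WINDOW_LEN consecutive frames, collect SAMPLE_LEN frames as one
    gesture clip and hand it to thread 2 via the queue."""
    cap = cv2.VideoCapture(0)
    window = collections.deque(maxlen=WINDOW_LEN)
    try:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            window.append(detect_hand(frame))
            if len(window) == WINDOW_LEN and all(window):
                clip = [frame]
                while len(clip) < SAMPLE_LEN:
                    ok, f = cap.read()
                    if not ok:
                        break
                    clip.append(f)
                sample_queue.put(clip)  # acts as the activation signal
                window.clear()          # start the next round of detection
    finally:
        cap.release()

def recognition_thread(sample_queue: "queue.Queue[list]") -> None:
    """Thread 2: consume full clips; redundancy removal and recognition
    would run here."""
    while True:
        clip = sample_queue.get()
        print(f"received gesture clip of {len(clip)} frames")

if __name__ == "__main__":
    q: "queue.Queue[list]" = queue.Queue()
    threading.Thread(target=recognition_thread, args=(q,), daemon=True).start()
    capture_thread(q)
```

Decoupling capture from recognition through the queue is what prevents the latency of recognition from accumulating in the capture loop.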
Step 3: perform redundancy removal on each independent gesture video clip; a coarse redundancy removal unit and a fine redundancy removal unit screen the effective information in the clip.
Further, the coarse redundancy removal unit screens out and deletes the irrelevant gesture segments at the start and end of the video. The independent gesture clips are unified to 100 frames, but analysis of single-gesture videos shows that they still contain redundant information: the gesture action is concentrated in the middle of the clip, while the head and tail carry essentially no useful gesture information. Statistical analysis of the clips (fig. 4) shows that, of the 100 frames, a single gesture performance takes about 2 s (60 frames); for roughly the first 10 frames the presenter is waiting and has not begun to move, and by roughly the last 10 frames the gesture performance is essentially finished. Using inter-frame similarity, an adaptive frame-sampling threshold is set to screen the frames and select the middle 60, reducing the information redundancy.
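As a sketch of this coarse stage: score adjacent frames by a similarity measure and trim the near-static head and tail. The histogram-correlation metric and the fixed threshold below are assumptions, since the filing does not spell out its adaptive criterion:

```python
import cv2
import numpy as np

def frame_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Histogram correlation between two frames (one possible similarity
    measure; the filing's adaptive criterion is not specified)."""
    ha = cv2.calcHist([cv2.cvtColor(a, cv2.COLOR_BGR2GRAY)], [0], None, [64], [0, 256])
    hb = cv2.calcHist([cv2.cvtColor(b, cv2.COLOR_BGR2GRAY)], [0], None, [64], [0, 256])
    return cv2.compareHist(ha, hb, cv2.HISTCMP_CORREL)

def trim_static_ends(frames: list, sim_thresh: float = 0.98) -> list:
    """Coarse redundancy removal: frames at the head/tail whose similarity to
    their neighbour exceeds sim_thresh are treated as 'waiting' frames and
    dropped, leaving the middle span (~60 of 100 frames) with real motion."""
    sims = [frame_similarity(a, b) for a, b in zip(frames, frames[1:])]
    start = 0
    while start < len(sims) and sims[start] > sim_thresh:
        start += 1
    end = len(frames) - 1
    while end > start and sims[end - 1] > sim_thresh:
        end -= 1
    return frames[start:end + 1]
```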
Furthermore, to speed up the gesture recognition module, the fine redundancy removal unit uses uniform sampling at equal intervals to filter similar frames and condense the video information: one frame is kept out of every m frames and the remaining m-1 are deleted, yielding a video with the standard frame count. m is obtained from the following formula:
m = ⌊total / s⌋ (1)

where s is the standard frame count and total is the actual frame count of the original video.
For a video with fewer frames than the standard, the ratio between the standard and actual frame counts is computed:

ratio = ⌊s / total⌋ (2)
then, for each frame in the video, it is replicated ratio times, interpolated after the position of the frame. The difference between the video frame number and the standard frame number at this time is:
dif = s - total × ratio (3)
if dif is greater than 0, then the dif frame is randomly selected from the original total frame to be copied once, and the dif frame is sequentially placed behind the position of the random frame. At this point, the video with the frame number smaller than the standard completes the frame expansion/completion operation.
Step 4: the gesture data built from the recorded data set are fed into a three-dimensional neural network and trained for 20 iterations to obtain the corresponding network model. The I3D network is trained for 20 epochs, and the training program is set to save the gesture recognition model once per epoch. The learning rate uses exponential decay, computed by the following formula:
l_n = l_o × γ^epoch

where l_n denotes the newly updated learning rate, l_o denotes the learning rate before the update, and γ is a decay parameter. γ is set to 0.1, and through exponential decay the learning rate finally converges to 0.001. The exponential decay schedule accelerates convergence of the network, helping it converge to a better, near-optimal solution.
Repeated experiments by the inventors determined that setting batch_size to 8 improves video-memory utilization and the parallel efficiency of the large matrix multiplications, reduces the number of iterations needed for training, and thus speeds up training on the same data volume.
During training, to assess the model as a whole, the test data is evaluated after every epoch to identify the epochs with higher accuracy (as shown in fig. 5), and further analysis and comparison pick out the numerically best model. The model from the 12th epoch is finally taken as the optimal model.
Further, four models — the dual-stream network, 3DRes-18, Yolov3+Res-18, and I3D — are each trained for the same number of epochs, and the same test data is used to obtain each model's accuracy and recognition time. The performance evaluation covers the models' recognition ability, recognition speed, and robustness.
Detection accuracy and speed comparison of the four models:

Model                 Acc    Time/ms
Dual-stream network   0.85   146
3DRes-18              0.69   129
Yolov3+Res-18         0.95   210
I3D                   0.92   130
At this stage, the dual-stream network requires optical-flow data to be extracted from the RGB video in advance, which means a gesture recognition model built on it cannot run in real time. Training 3DRes-18 on the video data shows low recognition accuracy with much room for improvement. Training Yolov3+Res-18 shows good accuracy, but gesture detection takes too long for good real-time response. Training I3D shows that the I3D model performs well in both recognition speed and accuracy; the test-set accuracy and recognition time of each model are listed in the table above. Comparing all aspects, this embodiment adopts I3D as the base model for further training and result optimization.
Step 5: load the gesture recognition model to predict and classify the hand motion videos.
Step 6: convert the recognized gesture category into the corresponding control function of the conference system, so that gestures control opening the presentation, starting the slide show, and turning pages.
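Step 6 amounts to a small dispatch table from gesture class to presentation control. A sketch assuming the `pyautogui` library drives the slide software with key presses; the gesture-to-key mapping below is an illustrative assumption, since the filing's mapping table is only an image:

```python
import pyautogui

# Hypothetical mapping from predicted gesture class to a presentation action.
GESTURE_ACTIONS = {
    "click": lambda: pyautogui.press("f5"),     # e.g., start the slide show
    "grab":  lambda: pyautogui.press("esc"),    # e.g., exit the slide show
    "pan":   lambda: pyautogui.press("right"),  # e.g., turn to the next page
}

def dispatch_command(gesture: str) -> None:
    """Conference presentation system response module: turn a predicted
    gesture label into a control instruction for the slide software."""
    action = GESTURE_ACTIONS.get(gesture)
    if action is not None:
        action()

dispatch_command("pan")  # example: advance one slide
```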
With this gesture-recognition-based presentation control system, the presenter needs no external device: an ordinary camera on the PC captures the dynamic gesture video, the corresponding control instruction is issued, and the convenience and fluency of the presentation are improved.
It should be understood that the above embodiment is a preferred example of the invention, and the invention is not limited to it; those skilled in the art may add or substitute technical features without departing from the technical solution of the invention, and the resulting technical solutions also fall within the scope of protection of the invention.

Claims (6)

1. A conference presentation system based on dynamic gesture control in video, characterized in that it consists of a real-time video acquisition module, a continuous gesture segmentation module, a video redundancy removal module, a gesture recognition module, and a conference presentation system response module connected in sequence, wherein:
the real-time video acquisition module is used for acquiring the current video stream in real time with a camera;
the continuous gesture segmentation module is used for splitting the continuous gestures in the video stream, segmenting them into independent gesture clips, and sending the independent gesture video clips to the video redundancy removal module;
the video redundancy removal module is used for removing the redundant regions of each single-gesture video clip, screening the effective information in the clip through a coarse redundancy removal unit followed by a fine redundancy removal unit, and sending the condensed independent gesture video clips to the gesture recognition module;
the gesture recognition module is used for recognizing each received independent single-gesture video: a gesture recognition model is trained on a pre-recorded data set, the resulting gesture feature model is used to predict and classify the detected hand video, and the prediction result is finally sent to the conference presentation system response module;
the conference presentation system response module is used for converting the received gesture category prediction into a control instruction and sending that instruction to the processor to open the presentation, start the slide show, and turn pages.
2. The conference presentation system based on dynamic gesture control in video according to claim 1, characterized in that the continuous gesture segmentation module uses a hand discriminator algorithm to judge whether a hand is continuously visible in the presentation area, segments the continuous gestures into independent gesture clips according to the degree of hand visibility, and sends the independent gesture video clips to the video redundancy removal module.
3. The conference presentation system based on dynamic gesture control in video according to claim 1, characterized in that the video redundancy removal module screens the effective information in each video clip through two units, coarse redundancy removal and fine redundancy removal, wherein:
the coarse redundancy removal unit is used for screening out and deleting the irrelevant gesture segments at the start and end of the video;
the fine redundancy removal unit is designed to filter similar frames within the video, condensing the video information in order to speed up the gesture recognition module.
4. The conference presentation system based on dynamic gesture control in video according to claim 1, characterized in that the gesture recognition module comprises a recorded data set unit, a gesture recognition model unit, and a gesture category prediction unit, wherein:
the gesture recognition model unit is used for training on the constructed gesture data set, learning the feature information of the different gesture categories, and saving it as a gesture feature model;
the gesture category prediction unit is used for predicting the gesture category of a hand motion video;
the recorded data set unit is obtained by collecting, recording, and organizing data from 18 demonstrators against a white-wall background under normal indoor illumination; each demonstrator sits 1 m from the camera and performs three gesture actions: click, grab, and translate.
5. The conference presentation system based on dynamic gesture control in video according to claim 4, characterized in that each gesture is a one-handed action performed with either the left or the right hand.
6. The conference presentation system based on dynamic gesture control in video according to claim 1, characterized in that the camera in the real-time video acquisition module is an ordinary camera.
CN202210145445.3A 2022-02-17 2022-02-17 Conference demonstration system based on dynamic gesture control in video Pending CN114564104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210145445.3A CN114564104A (en) 2022-02-17 2022-02-17 Conference demonstration system based on dynamic gesture control in video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210145445.3A CN114564104A (en) 2022-02-17 2022-02-17 Conference demonstration system based on dynamic gesture control in video

Publications (1)

Publication Number Publication Date
CN114564104A true CN114564104A (en) 2022-05-31

Family

ID=81713262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210145445.3A Pending CN114564104A (en) 2022-02-17 2022-02-17 Conference demonstration system based on dynamic gesture control in video

Country Status (1)

Country Link
CN (1) CN114564104A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968178A (en) * 2012-11-07 2013-03-13 电子科技大学 Gesture-based PPT (Power Point) control system
JP2018124801A (en) * 2017-02-01 2018-08-09 株式会社エクスビジョン Gesture recognition device and gesture recognition program
CN107092349A (en) * 2017-03-20 2017-08-25 重庆邮电大学 A kind of sign Language Recognition and method based on RealSense
WO2019023921A1 (en) * 2017-08-01 2019-02-07 华为技术有限公司 Gesture recognition method, apparatus, and device
WO2021012513A1 (en) * 2019-07-19 2021-01-28 平安科技(深圳)有限公司 Gesture operation method and apparatus, and computer device
CN113408328A (en) * 2020-03-16 2021-09-17 哈尔滨工业大学(威海) Gesture segmentation and recognition algorithm based on millimeter wave radar

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Nagashree R N, Stafford Michahial, Aishwarya G N, Beebi Hajira Azeez, Jayalakshmi M R, R Krupa Rani: "Hand gesture recognition using support vector machine", The International Journal of Engineering and Science (IJES), vol. 4, no. 6
姬晓飞; 王治博; 王昱: "Design and implementation of an interactive demonstration system for video gesture recognition", Journal of Shenyang Aerospace University, no. 02

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116980683A (en) * 2023-09-25 2023-10-31 深圳硬之城信息技术有限公司 Slide show method, device and storage medium based on video
CN116980683B (en) * 2023-09-25 2024-04-16 深圳硬之城信息技术有限公司 Slide show method, device and storage medium based on video

Similar Documents

Publication Publication Date Title
CN108352174B (en) Electronic device, storage device, and method for image processing
US10424341B2 (en) Dynamic video summarization
JP4499380B2 (en) System and method for whiteboard and audio capture
JP4258090B2 (en) Video frame classification method, segmentation method, and computer-readable storage medium
CN111488791A (en) On-device classification of fingertip movement patterns as gestures in real time
JP2012506589A (en) Method, system and related modules, and software components for providing an image sensor human machine interface
Chatila et al. Integrated planning and execution control of autonomous robot actions
CN110334753B (en) Video classification method and device, electronic equipment and storage medium
JP2000298498A (en) Segmenting method of audio visual recording substance, computer storage medium and computer system
CN110619284B (en) Video scene division method, device, equipment and medium
CN110942011A (en) Video event identification method, system, electronic equipment and medium
CN110708606A (en) Method for intelligently editing video
CN114564104A (en) Conference demonstration system based on dynamic gesture control in video
Kota et al. Automated detection of handwritten whiteboard content in lecture videos for summarization
CN108377407B (en) Panoramic video processing method and device and electronic equipment
Xu et al. Content extraction from lecture video via speaker action classification based on pose information
JP4110323B2 (en) Information output method and apparatus, program, and computer-readable storage medium storing information output program
CN112115131A (en) Data denoising method, device and equipment and computer readable storage medium
CN114245232B (en) Video abstract generation method and device, storage medium and electronic equipment
CN111062284A (en) Visual understanding and diagnosing method of interactive video abstract model
CN114245032B (en) Automatic switching method and system for video framing, video player and storage medium
CN111860086A (en) Gesture recognition method, device and system based on deep neural network
Ye et al. Vics: A modular vision-based hci framework
CN114666503A (en) Photographing method and device, storage medium and electronic equipment
CN114333056A (en) Gesture control method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination