CN115167666A - Gesture-interactive AR smart helmet device - Google Patents

Gesture-interactive AR smart helmet device

Info

Publication number
CN115167666A
CN115167666A (application CN202210723856.6A)
Authority
CN
China
Prior art keywords
gesture
module
gesture recognition
soc
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210723856.6A
Other languages
Chinese (zh)
Inventor
童飞飞
葛晨阳
李辉
杨亚林
张小娜
王骞
卫倩倩
王辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Costar Group Co Ltd
Original Assignee
Henan Costar Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Costar Group Co Ltd filed Critical Henan Costar Group Co Ltd
Priority to CN202210723856.6A priority Critical patent/CN115167666A/en
Publication of CN115167666A publication Critical patent/CN115167666A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G: PHYSICS
    • G02: OPTICS
    • G02B: OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B27/00: Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B27/01: Head-up displays
    • G02B27/017: Head mounted
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G: PHYSICS
    • G02: OPTICS
    • G02B: OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B27/00: Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B27/01: Head-up displays
    • G02B27/0101: Head-up displays characterised by optical features
    • G02B2027/0138: Head-up displays characterised by optical features comprising image capture systems, e.g. camera
    • G: PHYSICS
    • G02: OPTICS
    • G02B: OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B27/00: Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B27/01: Head-up displays
    • G02B27/0101: Head-up displays characterised by optical features
    • G02B2027/014: Head-up displays characterised by optical features comprising information/image processing systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Optics & Photonics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an AR smart helmet device capable of gesture interaction, comprising an AR helmet and a gesture recognition module connected to it. The AR helmet comprises a multi-camera module, an SoC (system-on-chip) computing board, a binocular micro-display, and an optical engine module. The multi-camera module captures gesture images; gesture recognition running on the SoC computing board, which is based on an ARM + NPU (CPU plus neural processing unit) architecture, recognizes gesture control commands for function selection, clicking, exiting, and page turning, and the results are displayed synchronously on the binocular micro-display and optical engine module. The gesture recognition module runs on the SoC computing board and performs hand detection and static gesture recognition on the gesture images captured by the multi-camera module, with hand detection handled by a RetinaHand-based hand detection network. Compared with the prior art, the device realizes gesture-interactive recognition with low latency, accurate hand-motion recognition, and real-time interaction, which is of significance for communication between people, between people and machines, and even between human-like intelligent machines.

Description

Gesture-interactive AR smart helmet device
Technical Field
The invention relates to the technical field of VR/AR and natural interaction, and in particular to an AR smart helmet device capable of gesture interaction.
Background
Gestures are a natural mode of interaction inherent to people and an important bridge for communication between people, between people and machines, and even between human-like intelligent machines. They are in urgent demand in many fields, such as communication for the deaf and mute, smart homes, robotics, medicine, and national defense. Achieving high-precision, high-accuracy gesture recognition has therefore become a key topic in gesture-interaction research. The large-field-of-view, high-immersion AR helmet, as a portable large-screen display device, can meet display requirements of wide field of view, high immersion, and high resolution, and is currently developing toward multi-sensor integration, fusion of virtual and real images, and superposition of digital information. Existing AR helmets generally rely on buttons, voice control, or a connected mobile phone for interaction. If an AR helmet could also support gesture interaction, this would be of great significance for communication between people, between people and machines, and even between human-like intelligent machines.
Disclosure of Invention
To address these technical shortcomings, the invention aims to provide an AR smart helmet device capable of gesture interaction that realizes gesture-interactive recognition with low latency, accurate hand-motion recognition, and real-time interaction, which is of significance for communication between people, between people and machines, and even between human-like intelligent machines.
To achieve this purpose, the invention adopts the following technical scheme: an AR smart helmet device capable of gesture interaction comprises an AR helmet and a gesture recognition module. The AR helmet comprises a multi-camera module, an SoC (system-on-chip) computing board, a binocular micro-display, and an optical engine module. The multi-camera module captures gesture images; gesture recognition processing on the SoC computing board, which is based on an ARM + NPU (CPU plus neural processing unit) architecture, recognizes gesture control commands for function selection, clicking, exiting, and page turning, and the results are displayed synchronously on the binocular micro-display and optical engine module;
the gesture recognition module runs on the SoC computing board and performs hand detection and static gesture recognition on the gesture images captured by the multi-camera module, wherein hand detection uses a RetinaHand-based hand detection network and static gesture recognition uses a static gesture classification network.
Furthermore, the multi-camera module comprises an RGB high-definition camera and an IR detection camera, and the video images they capture are transmitted to the SoC computing board through a MIPI interface.
Alternatively, the multi-camera module comprises two low-light high-definition cameras and an IR detection camera; the two low-light cameras are combined to expand the field of view (FoV), and the video images they capture are transmitted to the SoC computing board through a MIPI interface.
The SoC computing board carries an SoC chip with an ARM core + NPU core architecture. The multiple video streams from the multi-camera module are input through the MIPI interface and, after ISP processing, undergo RGB + IR or dual low-light + IR image fusion; the goal of fusion is to make targets under different illumination conditions more prominent and easier to recognize. The NPU core runs the target detection and gesture recognition algorithms. The fused image, the target detection result, and the gesture recognition result are output synchronously through MIPI to the binocular micro-display and optical engine module for display.
The binocular micro-display is an OLED or LCoS micro-display; the optical engine module is a near-eye optical system or an optical waveguide diffraction device, used for near-eye AR enhanced display.
Hand detection and static gesture recognition run mainly on the NPU core of the SoC on the SoC computing board. The RetinaHand-based hand detection network comprises a backbone network (backbone) for feature extraction, a feature-processing fusion module (FPN), and a regression head module (head), which regresses the specific category and coordinate information of the target from the features processed by the FPN.
The static gesture recognition result is used on the SoC computing board to control functions and menu selection, APP clicking, exiting, and page turning. The static gesture classification network that realizes static gesture recognition comprises a feature extraction module and a normalized exponential (softmax) function; the feature extraction module consists of fully connected layers, batch normalization layers, and nonlinear activation layers. Given the detected hand region, the static gesture classification network outputs a C-dimensional feature representing the probabilities that the static gesture belongs to each of the C categories; the normalized exponential function normalizes these probabilities to [0, 1].
Through the above technical scheme, an AR smart helmet device capable of gesture interaction is realized on the basis of the AR helmet and the gesture recognition module. It offers low latency, accurate hand-motion recognition, and real-time interaction, and improves the natural interaction capability of the AR smart helmet.
Drawings
Fig. 1 is a block schematic diagram of the present invention.
Fig. 2 is a schematic diagram of the structure of the RetinaHand-based hand detection network in the gesture recognition module of the present invention.
Fig. 3 is a flow chart of the Attention-FPN in the present invention.
FIG. 4 is a schematic diagram of a static gesture classification network structure according to the present invention.
Detailed Description
Referring to Fig. 1, an embodiment of the present invention discloses a gesture-interactive AR smart helmet device, which includes an AR helmet and a gesture recognition module. The AR helmet comprises a multi-camera module, an SoC (system-on-chip) computing board, a binocular micro-display, and an optical engine module. Gesture images are captured by the multi-camera module, gesture recognition processing is performed on the SoC computing board based on an ARM + NPU (CPU plus neural processing unit) architecture, gesture control commands are recognized for function selection, clicking, exiting, and page turning, and the results are displayed synchronously on the binocular micro-display and optical engine module.
The multi-camera module comprises an RGB high-definition camera and an IR detection camera and transmits the captured video images to the SoC computing board through a MIPI interface;
the multi-camera module can also comprise two low-light-level high-definition cameras and an IR detection camera, the two low-light-level high-definition cameras are combined to expand the view angle FoV, and the acquired video image is sent to the SoC computing board through a mipi interface;
the SoC computing board generally comprises an SoC chip, the chip adopts an ARM core + NPU core architecture, a plurality of paths of video signals of a plurality of camera modules are input through a mipi interface, image fusion processing of RGB + IR or two paths of low-light level + IR is carried out after the processing of ISP, and the fusion target is that the target under different illumination conditions can be more dominant and is convenient to identify. The NPU core is used for running a target detection and gesture recognition algorithm. And the fused image, the target detection result and the gesture recognition result are synchronously output to a binocular micro-display screen and an optical machine module through mipi for display.
The binocular micro-display is generally an OLED or LCoS micro-display.
The optical engine module is a near-eye optical system, which may be an optical waveguide diffraction device, used for near-eye AR enhanced display.
The gesture recognition module runs on the SoC computing board and performs hand detection and static gesture classification on the gesture images captured by the multi-camera module; hand detection uses a RetinaHand-based hand detection network, and gesture recognition uses a static gesture classification network. The static gesture recognition result is used on the SoC computing board to control functions and menu selection, APP clicking, exiting, page turning, and similar actions.
Hand detection and classification run mainly on the NPU core of the SoC on the SoC computing board.
The RetinaHand detection network is a single-stage target detection network. Taking the RetinaFace framework as a structural reference, it improves and upgrades several modules: a more lightweight network is introduced as the backbone, the feature pyramid network (FPN) is improved, the positive/negative sample generation strategy is changed, the neck part is simplified, different loss functions are tried, and so on.
The RetinaHand-based hand detection network follows the classic backbone-neck-head design of target detection algorithms; its network structure is shown in Fig. 2. The structure consists of three main parts (a minimal wiring sketch is given after the list):
1) The backbone network used for feature extraction, commonly referred to as the backbone.
2) The feature-processing fusion module FPN, also called the neck module of the network.
3) The regression head part, generally called the head module, which regresses the specific category, coordinates, and other information of the target from the features processed by the neck module.
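As a rough illustration of this three-part layout, the PyTorch sketch below wires a backbone and a neck (for example the Attention-FPN described later) to per-level heads. The output channel count and the number of anchors per cell are assumptions for illustration, not values given in the patent.
```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """1x1 conv heads producing 2 scores (fg/bg) and 4 box offsets per prior anchor."""
    def __init__(self, in_ch: int, num_anchors: int):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, num_anchors * 2, kernel_size=1)
        self.box = nn.Conv2d(in_ch, num_anchors * 4, kernel_size=1)

    def forward(self, x):
        n = x.shape[0]
        cls = self.cls(x).permute(0, 2, 3, 1).reshape(n, -1, 2)
        box = self.box(x).permute(0, 2, 3, 1).reshape(n, -1, 4)
        return cls, box

class RetinaHandSketch(nn.Module):
    """Backbone -> FPN neck -> per-level heads, wired as described in the text.

    The feature-refinement module mentioned later in the text is omitted for brevity.
    """
    def __init__(self, backbone: nn.Module, neck: nn.Module,
                 out_ch: int = 64, num_anchors: int = 2):
        super().__init__()
        self.backbone = backbone  # returns 3 feature maps at strides 8/16/32
        self.neck = neck          # fuses them into 3 maps with out_ch channels
        self.heads = nn.ModuleList(DetectionHead(out_ch, num_anchors) for _ in range(3))

    def forward(self, x):
        feats = self.neck(self.backbone(x))
        outs = [head(f) for head, f in zip(self.heads, feats)]
        cls = torch.cat([o[0] for o in outs], dim=1)  # (N, total_anchors, 2)
        box = torch.cat([o[1] for o in outs], dim=1)  # (N, total_anchors, 4)
        return cls, box
```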
The processing flow of the RetinaHand-based hand detection network comprises three steps.
the first step is as follows: the generation of an a priori anchor frame (anchor) and the matching of the anchor frame and a target frame (GT). The basic principle of all single-stage target detection algorithms based on the prior anchor frame can be summarized into classification and regression after dense sampling for the original image, so that the generation of the anchor frame is an essential step, and although the geometric meaning of the anchor frame is relative to the original image, the specific generation of the anchor frame needs to be performed by combining a feature map. In the Retina-hand, three layers of feature maps in the network are reserved, and the down-sampling ratios to the original image are 1/8, 1/16 and 1/32 respectively.
Combining the characteristics of the infrared gesture image dataset with speed considerations, in one example the input infrared image is 224x224, so the three feature maps have sizes 28x28, 14x14, and 7x7, and each pixel on the three feature maps corresponds to an 8x8, 16x16, or 32x32 region of the original image. Traditional algorithms such as Faster R-CNN, SSD, and RetinaNet generate k anchor boxes of different scales and aspect ratios at each pixel of the feature map, typically with k = 9 (3 scales by 3 aspect ratios). Because the infrared gesture data of the invention are close to square, the anchor design can be simplified by considering only scale and ignoring aspect ratio. At the same time, when the dataset is processed, all labels can be forced into squares by lengthening the short edge.
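A minimal sketch of the square-anchor generation described above: the strides and grid sizes follow the 224x224 example in the text, while the side lengths per level are illustrative assumptions.
```python
import numpy as np

def square_anchors(img_size: int = 224,
                   strides=(8, 16, 32),
                   sizes_per_level=((16, 32), (64, 128), (192, 256))):
    """Generate square prior anchors as (cx, cy, w, h) on the three feature maps.

    Square-only anchors follow the text (aspect ratio is ignored); the side
    lengths per level are assumptions, not values taken from the patent.
    """
    anchors = []
    for stride, sizes in zip(strides, sizes_per_level):
        grid = img_size // stride  # 28, 14, 7 for a 224x224 input
        for gy in range(grid):
            for gx in range(grid):
                cx = (gx + 0.5) * stride
                cy = (gy + 0.5) * stride
                for s in sizes:
                    anchors.append((cx, cy, float(s), float(s)))
    return np.asarray(anchors, dtype=np.float32)  # (N, 4)
```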
After the anchors are generated, only the dense sampling of the original image is complete; next, a supervision target must be constructed for each sample, specifically the position of the target box relative to the anchor and the category of each anchor, i.e. whether the anchor belongs to the foreground or the background. If it belongs to the foreground, a specific position must be assigned, expressed as the offset of the anchor relative to the target box. The offset has two parts: the offset of the target box's center relative to the anchor's center, and the transformation of the target box's width and height relative to the anchor's width and height, where the transformation is the log-transformed scale ratio between the target box and the anchor.
Note that, to eliminate the influence of the anchor's own size and treat all anchors equally, the center offset of the target box relative to the anchor must be normalized by the anchor's width and height. Without this normalization, a large anchor would tolerate a large deviation while a small anchor would be very sensitive to it, which hinders training; converting regression from absolute scale to relative scale solves the problem. Another important step is to transform the ratio of the target box's width and height to the anchor's width and height into log space; without this transformation the model could only output positive width and height values, which raises the demands on the model and increases optimization difficulty, and the conversion to log space resolves this.
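The encoding just described (center offsets normalized by the anchor size, log-transformed width/height ratios) can be written compactly as follows, assuming boxes in (cx, cy, w, h) form.
```python
import numpy as np

def encode_boxes(gt: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """Encode matched target boxes as regression offsets relative to anchors.

    Both arrays are (N, 4) in (cx, cy, w, h) form. Centre offsets are divided
    by the anchor size and the size ratios are log-transformed, so anchors of
    different sizes are treated equally and the network may output signed values.
    """
    dx = (gt[:, 0] - anchors[:, 0]) / anchors[:, 2]
    dy = (gt[:, 1] - anchors[:, 1]) / anchors[:, 3]
    dw = np.log(gt[:, 2] / anchors[:, 2])
    dh = np.log(gt[:, 3] / anchors[:, 3])
    return np.stack([dx, dy, dw, dh], axis=1)
```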
The second step: the mapping from input to output of the whole network. An input image of size 3x224x224 first passes through a backbone built from stacked convolutional layers for feature extraction; intermediate features are extracted and sent to the FPN for processing. The last three levels of the backbone are used; for MobileNetV1 x0.25 as the backbone, the three feature maps have shapes 64x28x28, 128x14x14, and 256x7x7.
After FPN feature fusion, three levels of features are obtained, each carrying a large number of prior anchors. To improve the expressive power of the features, the feature maps are further processed by a feature-refinement module composed of large convolution kernels, which enlarges the receptive field of the feature maps.
The FPN is an Attention-FPN. The feature pyramid is an essential component of current mainstream target detection models and effectively improves localization of targets at different scales. For the hand detection task, the apparent size of the hand varies drastically with the distance and orientation of the subject relative to the camera in real scenes: a target close to the camera can reach 400x400 pixels, while the farthest target may be only 20x20. This drastic scale variation requires the detection network to handle both large and small targets well. A traditional FPN simply adds the upsampled high-level features to the low-level features element-wise; the present design implements an improved FPN that integrates the attention idea.
Inspired by MobileViT, the invention extends the self-attention mechanism and introduces it into the FPN module, where Query, Key, and Value no longer come from the same input: Query comes from a nonlinear transformation of the shallow feature map, while Key and Value come from a linear transformation of the upsampled deep feature map. The element-wise addition in the original FPN is thus replaced by fusion with an attention mechanism. From the viewpoint of attention, this operation expresses each pixel of the shallow feature map as a weighted sum of all pixels of the deep feature map. Representing the shallow level through a deep attention mechanism effectively injects global information into every pixel of the shallow feature map, while convolution focuses more on local information, so the fused feature map retains both global and local information, which benefits model learning. Finally, a new feature map fusing shallow and deep features through attention is obtained, and the attention mechanism is applied once more to further transform the feature map and improve its expressive power.
The specific operation is as follows: the deeper feature map is upsampled (7x7 to 14x14) and its channel count is aligned with the shallower level using a 1x1 convolution (256 mapped to 128), giving 128x14x14. The attention operation is then performed on the resulting feature maps following the MobileViT approach: the feature map is first sliced into patches, self-attention is computed over all pixels within each patch, and the result is inverse-transformed back to the shape of the original input feature map, completing the attention computation. Fig. 3 shows the complete implementation flow of the Attention-FPN.
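A simplified PyTorch sketch of this fusion step is given below. It follows the description (upsampling, 1x1 channel alignment, Query from a nonlinear transform of the shallow map, Key/Value from a linear transform of the upsampled deep map) but, for brevity, attends over all pixels of the level at once instead of MobileViT-style patch slices, omits the second attention pass, and assumes the head count.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFuse(nn.Module):
    """Fuse a shallow feature map with an upsampled deeper one via cross-attention.

    Query comes from a nonlinear transform of the shallow map; Key and Value come
    from a linear transform of the upsampled, channel-aligned deep map, replacing
    the element-wise addition of a vanilla FPN.
    """
    def __init__(self, shallow_ch: int, deep_ch: int, heads: int = 4):
        super().__init__()
        self.align = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)  # e.g. 256 -> 128
        self.q_proj = nn.Sequential(nn.Conv2d(shallow_ch, shallow_ch, 1),
                                    nn.ReLU(inplace=True))
        self.kv_proj = nn.Conv2d(shallow_ch, shallow_ch, 1)
        self.attn = nn.MultiheadAttention(shallow_ch, heads, batch_first=True)

    def forward(self, shallow, deep):
        n, c, h, w = shallow.shape
        deep_up = F.interpolate(deep, size=(h, w), mode="nearest")  # 7x7 -> 14x14
        deep_up = self.align(deep_up)
        q = self.q_proj(shallow).flatten(2).transpose(1, 2)   # (N, H*W, C)
        kv = self.kv_proj(deep_up).flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, kv, kv)  # each shallow pixel as a weighted sum of deep pixels
        return fused.transpose(1, 2).reshape(n, c, h, w)
```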
The third step: each feature map passes through a target-box regression branch and a confidence classification branch, yielding the final coordinates and the foreground/background probabilities. If the total number of anchors is N, the final output of the classification branch is 2N values and the final output of the box regression branch is 4N values, representing respectively the probability that each anchor belongs to the foreground or background and, for foreground anchors, the offset of the target center relative to the anchor and the log-transformed ratio of the target's width and height to the anchor's width and height.
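At inference time these regression outputs are turned back into boxes by inverting the encoding shown earlier; a minimal sketch, again assuming (cx, cy, w, h) anchors:
```python
import numpy as np

def decode_boxes(pred: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """Invert the offset encoding: (dx, dy, dw, dh) back to (cx, cy, w, h)."""
    cx = pred[:, 0] * anchors[:, 2] + anchors[:, 0]
    cy = pred[:, 1] * anchors[:, 3] + anchors[:, 1]
    w = np.exp(pred[:, 2]) * anchors[:, 2]
    h = np.exp(pred[:, 3]) * anchors[:, 3]
    return np.stack([cx, cy, w, h], axis=1)
```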
To improve localization accuracy, the mean-absolute-error loss for regressing the target-box coordinates is replaced by the IoU loss. When the distance between output and target is measured with an absolute error, the regressed quantities are treated as independent of one another and their inherent geometric constraints are ignored, whereas directly optimizing the intersection-over-union between the predicted box and the ground-truth box models these geometric relations and can also be seen as directly optimizing the evaluation metric.
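A minimal sketch of the IoU loss for boxes in corner form; the exact variant used in the patent is not specified, so this is the plain 1 - IoU formulation.
```python
import torch

def iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """1 - IoU, averaged over the batch, for boxes given as (x1, y1, x2, y2).

    Unlike a per-coordinate L1 loss, optimising the overlap couples the four
    regressed quantities geometrically, which is the motivation given above.
    """
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]).clamp(min=0) * (pred[:, 3] - pred[:, 1]).clamp(min=0)
    area_t = (target[:, 2] - target[:, 0]).clamp(min=0) * (target[:, 3] - target[:, 1]).clamp(min=0)
    iou = inter / (area_p + area_t - inter + eps)
    return (1.0 - iou).mean()
```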
In another embodiment, the static gesture classification network includes a feature extraction module and a normalized exponential (softmax) function. The feature extraction module comprises fully connected layers, batch normalization layers, and nonlinear activation layers. Given the detected hand region, the static gesture classification network outputs a C-dimensional feature representing the probabilities that the static gesture belongs to each of the C categories; the normalized exponential function normalizes these probabilities to [0, 1].
In this embodiment, the static gesture classification network, shown in Fig. 4, mainly comprises a feature extraction module and a normalized exponential function. The feature extraction module is mainly formed by stacking fully connected, batch normalization, and nonlinear activation layers. The input of the static gesture classification network is a sequence of K groups of key-point positions, and the output is a C-dimensional feature (C is the number of classes) representing the probabilities that the gesture belongs to each of the C classes. To compare the maximum output probability with a preset threshold, the normalized exponential function is used to normalize the probabilities to [0, 1].
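A minimal sketch of such a classifier in PyTorch; the input dimension and hidden widths are illustrative assumptions, while the layer types and the softmax normalization follow the description above.
```python
import torch
import torch.nn as nn

class StaticGestureClassifier(nn.Module):
    """Stacked (Linear -> BatchNorm -> ReLU) blocks followed by a softmax over C classes.

    in_dim stands for the flattened input (hand-region feature or key-point vector);
    its value and the hidden widths are assumptions, not values from the patent.
    """
    def __init__(self, in_dim: int, num_classes: int, hidden=(256, 128)):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.BatchNorm1d(h), nn.ReLU(inplace=True)]
            prev = h
        layers.append(nn.Linear(prev, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        logits = self.net(x)
        # Probabilities in [0, 1]; the maximum is compared against a preset threshold.
        return torch.softmax(logits, dim=-1)
```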
The above-described embodiments are only some of the embodiments of the present invention, and the concept and scope of the present invention are not limited to the details of these exemplary embodiments. Various changes and modifications of the present invention that do not depart from its design concept shall be covered by the appended claims.

Claims (7)

1. A gesture-interactive AR smart helmet device, characterized in that: the device comprises an AR helmet and a gesture recognition module; the AR helmet comprises a multi-camera module, an SoC (system-on-chip) computing board, a binocular micro-display, and an optical engine module; the multi-camera module captures gesture images, gesture recognition processing on the SoC computing board based on an ARM + NPU (CPU plus neural processing unit) architecture recognizes gesture control commands for function selection, clicking, exiting, and page turning, and the results are displayed synchronously on the binocular micro-display and optical engine module;
the gesture recognition module runs on the SoC computing board and performs hand detection and static gesture recognition on the gesture images captured by the multi-camera module, wherein hand detection uses a RetinaHand-based hand detection network and static gesture recognition uses a static gesture classification network.
2. The gesture-interactive AR smart helmet device of claim 1, wherein: the multi-camera module comprises an RGB high-definition camera and an IR detection camera, and the video images they capture are transmitted to the SoC computing board through a MIPI interface.
3. The gesture-interactive AR smart helmet device of claim 1, wherein: the multi-camera module comprises two low-light high-definition cameras and an IR detection camera; the two low-light cameras are combined to expand the field of view (FoV), and the video images they capture are transmitted to the SoC computing board through a MIPI interface.
4. The gesture-interactive AR smart helmet device of claim 1, wherein: the SoC computing board carries an SoC chip with an ARM core + NPU core architecture; the multiple video streams from the multi-camera module are input through the MIPI interface and undergo RGB + IR or dual low-light + IR image fusion, so that targets under different illumination conditions are more prominent and easier to recognize; the NPU core runs the target detection and gesture recognition algorithms; and the fused image, the target detection result, and the gesture recognition result are output synchronously through MIPI to the binocular micro-display and optical engine module for display.
5. The gesture-interactive AR smart helmet device of claim 4, wherein: the binocular micro-display is an OLED or LCoS micro-display; the optical engine module is a near-eye optical system or an optical waveguide diffraction device, used for near-eye AR enhanced display.
6. The gesture-interactive AR smart helmet device of claim 1, wherein: hand detection and static gesture recognition run mainly on the NPU core of the SoC on the SoC computing board; the RetinaHand-based hand detection network comprises a backbone network for feature extraction, a feature-processing fusion module FPN, and a regression head module, which regresses the specific category and coordinate information of the target from the features processed by the FPN.
7. The gesture-interactive AR smart helmet device of claim 6, wherein: the static gesture recognition result is used on the SoC computing board to control functions and menu selection, APP clicking, exiting, and page turning; the static gesture classification network that realizes static gesture recognition comprises a feature extraction module and a normalized exponential function, the feature extraction module comprising fully connected layers, batch normalization layers, and nonlinear activation layers; given the detected hand region, the static gesture classification network outputs a C-dimensional feature representing the probabilities that the static gesture belongs to each of the C categories; and the normalized exponential function normalizes these probabilities to [0, 1].
CN202210723856.6A 2022-06-24 2022-06-24 Gesture-interactive AR smart helmet device Pending CN115167666A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210723856.6A CN115167666A (en) 2022-06-24 2022-06-24 Gesture-interactive AR smart helmet device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210723856.6A CN115167666A (en) 2022-06-24 2022-06-24 But interactive AR intelligence helmet device of gesture

Publications (1)

Publication Number Publication Date
CN115167666A true CN115167666A (en) 2022-10-11

Family

ID=83486901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210723856.6A Pending CN115167666A (en) Gesture-interactive AR smart helmet device

Country Status (1)

Country Link
CN (1) CN115167666A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination