CN114973414A - Dynamic gesture recognition method and intelligent vehicle-mounted equipment - Google Patents
Dynamic gesture recognition method and intelligent vehicle-mounted equipment
- Publication number
- CN114973414A (application number CN202210618569.9A)
- Authority
- CN
- China
- Prior art keywords
- frame image
- action
- images
- dynamic gesture
- current frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Evolutionary Computation (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The specification discloses a dynamic gesture recognition method and an intelligent vehicle-mounted device. The method comprises the following steps: acquiring a plurality of frames of images containing dynamic gestures in a dynamic gesture recognition mode; calling an action recognition model to perform feature transfer fusion and classification on the plurality of frames of images to obtain action categories of the plurality of frames of images; and determining an action category corresponding to the dynamic gesture from the action categories of the plurality of frames of images.
Description
Technical Field
The specification relates to the field of intelligent driving, in particular to a dynamic gesture recognition method and intelligent vehicle-mounted equipment.
Background
At present, dynamic gesture recognition mainly relies on a depth camera to obtain a depth map or a three-dimensional point cloud, and an action classification result is then computed using three-dimensional convolution (3D CNN) together with a recurrent neural network (RNN/LSTM) or a variant of the recurrent neural network.
However, the above recognition scheme requires a large amount of computing resources and computing power, and is not suitable for the resource-constrained scenario of intelligent vehicle-mounted equipment.
Disclosure of Invention
The specification provides a dynamic gesture recognition method and an intelligent vehicle-mounted device, aiming to solve, or partially solve, the technical problem that the conventional dynamic gesture recognition scheme consumes huge computing resources and is therefore not suitable for the resource-constrained scenario of intelligent vehicle-mounted equipment.
In order to solve the above technical problem, the present specification provides a dynamic gesture recognition method, including:
acquiring a plurality of frames of images containing dynamic gestures in a dynamic gesture recognition mode;
calling an action recognition model to perform feature transfer fusion and classification on the plurality of frames of images to obtain action categories of the plurality of frames of images;
and determining a motion category corresponding to the dynamic gesture from the motion categories of the plurality of frames of images.
Preferably, before the acquiring the plurality of frames of images containing the dynamic gesture, the method further includes:
shooting the hand action by utilizing the camera module;
detecting whether the hand motion has a trigger condition by using a hand detection model;
and if so, starting the dynamic gesture recognition mode.
Preferably, the detecting, by using the hand detection model, whether the hand motion has the trigger condition specifically includes:
detecting whether the hovering time of the hand action exceeds a preset time;
if yes, the hand motion is provided with the trigger condition.
Preferably, the invoking the motion recognition model to perform feature transfer fusion and classification on the plurality of frame images to obtain motion categories of the plurality of frame images specifically includes:
in the plurality of frame images, associating the spatial dimension features of the previous adjacent frame image so as to perform feature transfer fusion, in the time dimension, on the spatial dimension features in the current frame image, to obtain the space-time dimension features of the current frame image;
and performing convolution calculation on the space-time dimension characteristics of the current frame image to obtain the action category of the current frame image.
Preferably, the associating the spatial dimension features of the previous adjacent frame image to perform feature transfer fusion on the spatial dimension features in the current frame image to obtain the spatio-temporal dimension features of the current frame image specifically includes:
performing convolution processing on the current frame image to obtain the spatial dimension characteristics of the current frame image;
and replacing part of the spatial dimension characteristics of the current frame image by the part of the spatial dimension characteristics in the previous adjacent frame image to obtain the space-time dimension characteristics of the current frame image.
Preferably, after replacing part of the spatial dimension features of the current frame image with part of the spatial dimension features in the previous frame image, the method further includes:
and storing the replaced partial spatial dimension characteristics in the current frame image to replace partial spatial dimension characteristics of the next frame image.
Preferably, the determining the motion category corresponding to the dynamic gesture from the motion categories of the plurality of frames of images specifically includes:
selecting a target action category from the action categories of the plurality of frames of images as an action category corresponding to the dynamic gesture; and the number of the images corresponding to the target action category is above a preset number threshold.
Preferably, the determining the motion category corresponding to the dynamic gesture from the motion categories of the plurality of frames of images specifically includes:
determining a time window, wherein the time window is a time window corresponding to a set frame number;
determining an action type corresponding to the time window from the action types of the plurality of frames of images according to the time window;
and determining the target action category based on the action category corresponding to the time window, and determining the action category corresponding to the dynamic gesture based on the target action category.
Preferably, after determining the motion category corresponding to the dynamic gesture from the motion categories of the several frames of images, the method further includes:
transmitting the action instruction corresponding to the dynamic gesture to a downstream object for execution;
and exiting the dynamic gesture recognition mode.
The present specification provides an intelligent vehicle-mounted device, including:
the camera module is used for acquiring a plurality of frame images containing dynamic gestures in a dynamic gesture recognition mode;
the space-time conversion module is used for calling an action recognition model to perform feature transfer fusion and classification on the plurality of frame images to obtain action categories of the plurality of frame images;
and the determining module is used for determining the action category corresponding to the dynamic gesture from the action categories of the plurality of frames of images.
Preferably, the intelligent vehicle-mounted device further comprises:
the camera module is used for shooting hand motions;
the hand detection module is used for detecting whether the hand action has a trigger condition by using a hand detection model;
and if so, starting the dynamic gesture recognition mode.
Preferably, the hand detection module is specifically configured to:
detecting whether the hovering time of the hand action exceeds a preset time;
if yes, the hand motion is provided with the trigger condition.
Preferably, the space-time transform module is specifically configured to:
in the plurality of frame images, correlating the space dimension characteristics of the previous and adjacent frame images to perform characteristic transmission fusion on the space dimension characteristics in the current frame image in the time dimension to obtain the space-time dimension characteristics of the current frame image;
and performing convolution calculation on the space-time dimension characteristics of the current frame image to obtain the action category of the current frame image.
Preferably, the space-time transform module is specifically configured to:
performing convolution processing on the current frame image to obtain the spatial dimension characteristics of the current frame image;
and replacing part of the spatial dimension characteristics of the current frame image by using the part of the spatial dimension characteristics in the previous adjacent frame image to obtain the space-time dimension characteristics of the current frame image.
Preferably, the intelligent vehicle-mounted device further comprises:
and the storage module is used for storing the replaced partial spatial dimension characteristics in the current frame image so as to replace partial spatial dimension characteristics of a post-adjacent frame image.
Preferably, the space-time transform module is specifically configured to:
selecting a target action category from the action categories of the plurality of frames of images as an action category corresponding to the dynamic gesture; and the number of the images corresponding to the target action category is above a preset number threshold.
Preferably, the determining module is specifically configured to:
determining a time window, wherein the time window is a time window corresponding to a set frame number;
determining an action type corresponding to the time window from the action types of the plurality of frames of images according to the time window;
and determining the target action category based on the action category corresponding to the time window, and determining the action category corresponding to the dynamic gesture based on the target action category.
Preferably, the intelligent vehicle-mounted device further comprises:
the transmission module is used for transmitting the action instruction corresponding to the dynamic gesture to a downstream object for execution;
and the exit module is used for exiting the dynamic gesture recognition mode.
The present specification discloses a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
The present specification discloses an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.
Through one or more embodiments of the present specification, the present specification has the following advantages or beneficial effects:
According to the scheme, a plurality of frames of images containing dynamic gestures are first obtained in a dynamic gesture recognition mode; an action recognition model is then called to perform feature transfer fusion and classification on the plurality of frames of images. Through the feature transfer fusion, feature extraction in the time dimension and the space dimension (referred to in this specification as spatio-temporal dimension features) is realized at the same time, and classification is performed on that basis. There is therefore no need to perform the spatio-temporal computation layer by layer, multiple times, through dimension conversion layers, convolution layers, batch normalization (BN) layers, rectified linear unit (ReLU) layers, max-pooling layers, feature combination layers and the like, so a large amount of computing resources can be saved, and the method is thereby adapted to the resource-constrained scenario of intelligent vehicle-mounted equipment. In addition, the action category corresponding to the dynamic gesture is determined from the action categories of the plurality of classified frame images, which ensures the accuracy of action recognition.
The above description is only an outline of the technical solution of the present specification. Embodiments of the present specification are described below so that the technical means of the present specification can be understood more clearly, and so that the above and other objects, features and advantages of the present specification can be understood more fully.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the specification. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a process diagram for dynamic gesture recognition in accordance with one embodiment of the present description;
FIG. 2 illustrates an implementation of feature transfer fusion in accordance with one embodiment of the present description;
FIG. 3 illustrates an implementation process diagram according to an example of the present specification;
FIG. 4 is a schematic diagram illustrating a configuration of an intelligent vehicle-mounted device according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of an electronic device in accordance with one embodiment of the present description.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Because the current dynamic gesture recognition schemes consume a huge amount of computing resources and are not suitable for the resource-constrained scenario of vehicle-mounted intelligent devices, the embodiments of this specification provide a dynamic gesture recognition method that is mainly applicable to intelligent vehicle-mounted equipment. In the scheme, a plurality of frames of images containing dynamic gestures are first obtained in a dynamic gesture recognition mode; an action recognition model is then called to perform feature transfer fusion and classification on the plurality of frames of images. Through the feature transfer fusion, feature extraction in the time dimension and the space dimension (referred to in this specification as spatio-temporal dimension features) is realized at the same time, and classification is performed on that basis. There is therefore no need to perform the spatio-temporal computation layer by layer, multiple times, through dimension conversion layers, convolution layers, batch normalization (BN) layers, rectified linear unit (ReLU) layers, max-pooling layers, feature combination layers and the like, so a large amount of computing resources can be saved, and the method is thereby adapted to the resource-constrained scenario of intelligent vehicle-mounted equipment. In addition, the action category corresponding to the dynamic gesture is determined from the action categories of the plurality of classified frame images, which ensures the accuracy of action recognition.
Based on the above inventive concept, the method of the present specification comprises two stages: a triggering phase and an identification classification phase.
In the triggering stage, the camera module is used to capture the hand motion, and the hand detection model is then used to detect whether the hand motion meets the trigger condition. If the trigger condition is met, the dynamic gesture recognition mode is started and the recognition and classification stage is entered. Because the computing resources of intelligent vehicle-mounted equipment are very limited, this embodiment sets a trigger condition so that the recognition and classification stage for action recognition is entered only when the trigger condition is met; a large amount of useless recognition is thus prevented from occupying computing resources, and the load on the vehicle-mounted end can be reduced to a certain extent. It should be noted that the hand motions of this embodiment are mid-air (contactless) hand motions, for example a hovering motion of a hand inside the vehicle, so as to avoid the potential safety hazard caused by attention deviating from the road when operating the central control, the screen, buttons and the like.
In a specific implementation process, the camera module is used to capture the hand motion and obtain a hand motion image. Optionally, the hand motion may be a hand hovering motion, such as a hovering motion after the palm is opened (or a fist is made). Specifically, this embodiment uses a monocular camera module to capture the hand motion. By using the monocular camera module as the input source, no additional hardware is required, so the cost of the whole device is lower; moreover, the images collected by the monocular camera module occupy fewer resources and are processed more quickly, which improves processing efficiency.
Further, the hand motion image is input into the hand detection model for processing. There are multiple possible processing manners; this embodiment describes two of them by way of explanation, without limitation. Processing manner 1: the hand detection model detects whether the hovering time of the hand motion exceeds a preset time, for example 0.5 s. If it does, the hand motion meets the trigger condition and the recognition and classification stage is entered, which reduces the load on the vehicle-mounted end to a certain extent. Of course, besides detecting the hand hovering time, a predetermined action may also be used as the trigger to start the dynamic gesture recognition mode. Processing manner 2: the hand detection model detects whether the hand motion is a preset gesture (for example, a hand hovering motion); if it is, the recognition and classification stage is entered, which likewise reduces the load on the vehicle-mounted end to a certain extent.
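As a minimal sketch of processing manner 1, the hover-time check can be kept outside the hand detection model itself: the model only reports whether a hand is present in the current frame, and a small tracker accumulates the hover duration. The class name, method names and threshold below are illustrative assumptions, not part of the original scheme.

```python
import time

class HoverTrigger:
    """Accumulates how long a hand has been continuously detected and fires once
    the hover time exceeds a preset threshold (e.g. 0.5 s, as in the text)."""

    def __init__(self, threshold_s=0.5):
        self.threshold_s = threshold_s
        self.hover_start = None  # timestamp of the first frame of the current hover

    def update(self, hand_detected, now=None):
        """Call once per frame with the hand detection result; returns True when
        the trigger condition is met and the recognition mode should start."""
        now = time.monotonic() if now is None else now
        if not hand_detected:
            self.hover_start = None          # hand left the frame: reset the timer
            return False
        if self.hover_start is None:
            self.hover_start = now           # hover just started
        return (now - self.hover_start) >= self.threshold_s
```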
In addition, the hand detection model is developed with a lightweight backbone framework to adapt to the limited computing power of the vehicle-mounted system, although other hand detection models may also be adopted. For example, the hand detection model may use a lightweight, slimmed version of YOLOX. In the training process of the hand detection model, a number of video segments annotated with action start frames and action end frames are obtained, the video segments are converted into image data, and a selected initial model is trained with the annotated image data to obtain the hand detection model. The training process of the action recognition model is similar and is not repeated below. In practical application, the hand detection model can be configured in a palm detector responsible for hand motion detection in the triggering stage, and the action recognition model can be configured in an action recognizer responsible for recognizing the dynamic gesture. When the palm detector detects that the hand motion meets the trigger condition, the action recognizer is triggered to start, so as to reduce the load on the vehicle-mounted end.
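The following sketch illustrates, under assumptions, how one annotated video segment could be converted into labelled image data for the training step described above; the function name, arguments and the use of OpenCV are illustrative, not prescribed by this specification.

```python
import cv2

def video_segment_to_frames(video_path, start_frame, end_frame, label):
    """Reads the frames between the annotated action start and end frames of one
    video segment and pairs each frame with the segment's label."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)   # seek to the action start frame
    samples = []
    for _ in range(start_frame, end_frame + 1):
        ok, frame = cap.read()
        if not ok:
            break                                    # segment ended early or read error
        samples.append((frame, label))
    cap.release()
    return samples
```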
In some alternative embodiments, the palm detector and the action recognizer share a cooling period, during which neither hand motion detection nor dynamic gesture recognition is responded to. If the palm detector and the action recognizer are not in the cooling period, which indicates they are in a normal working state, the following detection process is executed: when the palm detector detects that the hand motion lasts 0.5 s to 1 s, the dynamic gesture recognition mode is triggered and the recognition and classification stage is entered. At this point it is first determined whether the action recognizer has timed out, that is, whether its running time exceeds a preset time; if it has timed out, the action recognizer is controlled to enter the cooling period. If it has not timed out, the action recognition model is started for classification and recognition, and after recognition is finished the dynamic gesture recognition mode is exited and the cooling period is entered, so as to avoid misrecognition caused by interference from residual motions.
The implementation of the recognition and classification stage is shown in FIG. 1 and comprises the following steps:
step 101, acquiring a plurality of frames of images including dynamic gestures in a dynamic gesture recognition mode.
In the present embodiment, a monocular camera module is used to capture the plurality of frames of images. By using the monocular camera module as the input source, no additional hardware is required, so the cost of the whole device is lower; moreover, the images collected by the monocular camera module occupy fewer resources and are processed more quickly, which improves processing efficiency. The frames of images can be arranged and combined into a video stream containing the dynamic gesture. The dynamic gesture in this embodiment may take various specific forms, such as grabbing, releasing, clicking, waving back and forth, sliding left and right, and the like, and is not limited to gestures that are easy to recognize. It should be noted that the dynamic gesture of this embodiment is a mid-air (contactless) hand motion, so as to avoid the potential safety hazard caused by attention deviating from the road when operating the central control, the screen, buttons and the like.
And 102, calling a motion recognition model to perform feature transfer fusion and classification on the plurality of frames of images to obtain motion categories of the plurality of frames of images.
Before the action recognition model is called for classification, it is first determined whether the action recognizer has timed out; if it has, the action recognizer is controlled to enter the cooling period. If it has not timed out, the action recognition model is started for classification and recognition, and after recognition is finished the dynamic gesture recognition mode is exited and the cooling period is entered, so as to avoid misrecognition caused by interference from residual motions.
The feature transfer fusion transfers and fuses features of the previous adjacent frame image into the current frame image, so that the features of the current frame image and those of the previous adjacent frame image are correlated in the time dimension. The features in the current frame image can therefore carry both the time dimension and the space dimension without a large amount of calculation, which realizes feature extraction of the current frame image in the time and space dimensions.
Feature transfer fusion is carried out on the plurality of frame images frame by frame so as to realize feature extraction of the plurality of frame images in a time dimension and a space dimension respectively. After the features are transmitted and fused, the features in each frame of image of the plurality of frames of images all have space-time dimensions, so that the plurality of frames of images all have space-time dimension features without carrying out a large amount of calculation, and a large amount of calculation resources can be saved.
In the embodiment, in order to adapt to a limited resource scene of the intelligent vehicle-mounted device, a plurality of frame images are processed in a characteristic transfer fusion mode to obtain the space-time dimension characteristics of each frame image. In the process of feature transfer fusion and classification of a plurality of frame images in this embodiment, each frame image is taken as a current frame image during processing, so the following processing is performed for each frame image: and associating the space dimension characteristics of the previous and adjacent frame images to perform characteristic transfer fusion on the space dimension characteristics in the current frame image in the time dimension to obtain the space-time dimension characteristics of the current frame image. Specifically, the spatial dimension characteristics of the current frame image can be obtained by performing convolution processing on the current frame image. The number of convolution processes is not limited in this embodiment, and the obtained spatial dimension feature includes spatial information of width and height. Furthermore, partial space dimension characteristics in the previous adjacent frame image are used for replacing partial space dimension characteristics of the current frame image, so that the characteristic transmission on the time dimension is realized, and the space-time dimension characteristics of the current frame image are obtained. And during classification, performing convolution calculation on the space-time dimension characteristics of the current frame image to obtain the action category of the current frame image. Thus, classification of the current frame image can be realized. By performing the above processing on each frame image, the motion category of each frame image can be obtained, so that the motion categories of several frame images are obtained.
And for the replaced partial spatial dimension features in the current frame image, the following processing can be performed: and storing the replaced partial spatial dimension characteristics in the current frame image to replace partial spatial dimension characteristics of the next frame image. Naturally, the processing procedure may also be implemented before the replacement, for example, a part of the spatial dimension features of the current frame image is extracted and stored, and then a part of the spatial dimension features of the current frame image is replaced by the part of the spatial dimension features of the previous adjacent frame image.
For ease of illustration and explanation of the above, reference is made to fig. 2 for example.
Among the N frame images (the value of N is not limited in this application), the t-th frame image F_t, the (t+1)-th frame image F_{t+1} and the (t+2)-th frame image are taken as an example, where F_t is any one of the N frame images and F_{t+1} is the adjacent frame image following F_t.
The t-th frame image F_t is passed through multi-layer convolution to extract its corresponding spatial dimension features, which exist in matrix form. A part of the spatial dimension features of F_t is extracted and stored in a computer cache area; the previously stored partial feature matrix (from the frame preceding F_t) then replaces that part of the spatial dimension features of F_t, from which the spatio-temporal dimension features of F_t are obtained. The spatio-temporal dimension features of F_t are then processed by further multi-layer convolution to obtain the classification result y_t.
For the (t+1)-th frame image F_{t+1}, the corresponding spatial dimension features are likewise extracted through multi-layer convolution and exist in matrix form. A part of the spatial dimension features of F_{t+1} is extracted and stored in the computer cache, and the partial feature matrix stored from F_t replaces that part of the spatial dimension features of F_{t+1}, from which the spatio-temporal dimension features of F_{t+1} are obtained. Multi-layer convolution on these spatio-temporal dimension features yields the classification result y_{t+1}.
Specifically, each frame image is convolved to obtain spatial dimension features, which comprise width and height information in space, and feature transfer fusion is carried out on each frame image, frame by frame, in the time dimension to obtain its spatio-temporal dimension features. Feature extraction in the time and space dimensions can therefore be realized simultaneously within the same classification model, which in turn realizes the classification of dynamic actions. Compared with traditional spatio-temporal classification approaches such as 3DCNN/RNN/LSTM, performing feature transfer fusion on each frame achieves feature extraction in the time and space dimensions without a large amount of calculation; the computation load is low, resource occupation is small, and efficiency is higher, so the method can adapt to the resource-constrained scenario of intelligent vehicle-mounted equipment.
It is noted that, for the first frame image among the plurality of frame images, there is no previous adjacent frame; a partial spatial dimension feature matrix whose values may be taken as 0 is used in the foregoing manner to replace the partial spatial dimension features at the corresponding positions in the first frame image, so that feature extraction in the time and space dimensions is also realized for the first frame image.
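The following is a minimal NumPy sketch of the feature transfer fusion described above: a fixed slice of the current frame's spatial feature channels is swapped with the slice cached from the previous frame (zeros for the first frame), and the fused spatio-temporal features are then classified. The class name, the fraction of channels replaced and the single-linear-map classifier head are illustrative assumptions, not the exact implementation of this specification.

```python
import numpy as np

class FeatureTransferFusion:
    """Replaces part of the current frame's spatial features with the part cached
    from the previous adjacent frame, so the features also carry a time dimension."""

    def __init__(self, num_channels, shift_fraction=0.125):
        self.k = max(1, int(num_channels * shift_fraction))  # how many channels to transfer
        self.cache = None  # partial spatial features stored from the previous frame

    def __call__(self, spatial_features):
        # spatial_features: array of shape (C, H, W) from the spatial convolution layers
        fused = spatial_features.copy()
        outgoing = spatial_features[:self.k].copy()   # part of F_t's features, kept for F_{t+1}
        if self.cache is None:
            fused[:self.k] = 0.0                      # first frame: replacing features taken as 0
        else:
            fused[:self.k] = self.cache               # replace with part of F_{t-1}'s features
        self.cache = outgoing
        return fused                                  # spatio-temporal features of the current frame


def classify_frame(fusion, spatial_features, classifier_weights):
    """Applies the fusion, then a stand-in linear classifier head, to obtain the
    per-frame action category y_t."""
    st_features = fusion(spatial_features)
    logits = classifier_weights @ st_features.reshape(-1)
    return int(np.argmax(logits))


# Example usage on random features for 3 frames and 5 action categories:
fusion = FeatureTransferFusion(num_channels=16)
weights = np.random.randn(5, 16 * 8 * 8)
categories = [classify_frame(fusion, np.random.randn(16, 8, 8), weights) for _ in range(3)]
```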
And 103, determining the motion category corresponding to the dynamic gesture from the motion categories of the plurality of frame images.
After the classification, each frame image in the plurality of frame images corresponds to a respective action category, and therefore, the target action category is selected from the action categories of the plurality of frame images as the action category corresponding to the dynamic gesture. The number of images corresponding to the target action category is above a preset number threshold, and further, the number of images corresponding to the target action category is the largest. Specifically, since each of the plurality of frame images has an action type, if a plurality of frame images correspond to the same action type, the action type is set as a target action type. For example, if the number of frame images corresponding to the motion category is the largest, the motion category is set as the target motion category. For example, if the motion type of 15 images out of 20 images is the same motion type, the motion type is set as the target motion type.
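A minimal sketch of this selection rule is given below, assuming the per-frame action categories have already been produced; the function name and the threshold value are illustrative.

```python
from collections import Counter

def target_action_category(frame_categories, min_count):
    """Returns the action category predicted for the most frames, provided that
    count reaches the preset number threshold; otherwise returns None."""
    if not frame_categories:
        return None
    category, count = Counter(frame_categories).most_common(1)[0]
    return category if count >= min_count else None

# e.g. 15 of 20 frames share one category and the threshold is 15:
# target_action_category(categories, min_count=15)
```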
In some optional embodiments, a time window is determined, and the time window corresponds to the set frame number. For example, 10 frames correspond to a time window, but may be set to another number of frames. And determining the action type corresponding to the time window from the action types of the plurality of frames of images according to the time window. And determining the target action category based on the action category corresponding to the time window, for example, determining the action category with the largest number of frames in the time window as the target action category. And then, determining the action category corresponding to the dynamic gesture based on the target action category, for example, directly using the target action category as the action category corresponding to the dynamic gesture, thereby improving the accuracy of action recognition.
In some optional embodiments, after the time window is determined, the action category corresponding to the time window of each frame segment is determined frame by frame, with the time window as the unit, from the action categories of the plurality of frame images. For example, if there are 10 frames of images and 5 frames are set to correspond to one time window, the action category is determined frame by frame from the 1st frame in units of 5 frames: the action category corresponding to the time window of frames 1-5, that of frames 2-6, that of frames 3-7, that of frames 4-8, that of frames 5-9, and that of frames 6-10 are determined in turn. Further, based on the action category corresponding to the time window of each frame segment, a target action category corresponding to the time window of each frame segment is determined; for example, the action category with the largest number of frames within the time window of each frame segment is taken as its target action category, so that the time window of each frame segment has one corresponding target action category. The action category corresponding to the dynamic gesture is then determined based on the target action categories corresponding to the time windows of the frame segments. Specifically, the target action categories corresponding to the time windows of the frame segments are compared; if they are all consistent, the target action category of any one frame segment's time window is taken as the action category corresponding to the dynamic gesture. If they are not consistent, the target action category covering the largest number of frames is taken as the action category corresponding to the dynamic gesture, thereby improving the accuracy of action recognition.
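A sketch of this sliding-window variant is shown below, reusing the per-frame categories; resolving disagreement between windows by the most frequent per-window winner is one reasonable reading of "the target action category with the largest frame number" and is an assumption.

```python
from collections import Counter

def sliding_window_category(frame_categories, window_size=5):
    """Majority vote inside every sliding time window, then a final decision over
    the per-window winners (e.g. 6 windows for 10 frames with a 5-frame window)."""
    if len(frame_categories) < window_size:
        windows = [frame_categories]                  # fewer frames than one window
    else:
        windows = [frame_categories[i:i + window_size]
                   for i in range(len(frame_categories) - window_size + 1)]
    window_winners = [Counter(w).most_common(1)[0][0] for w in windows]
    # if every window agrees, that category wins; otherwise take the most frequent winner
    return Counter(window_winners).most_common(1)[0][0]
```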
In some optional embodiments, after the motion category corresponding to the dynamic gesture is determined from the motion categories of the frames of images, the motion instruction corresponding to the dynamic gesture is transmitted to a downstream object for execution, and then the dynamic gesture recognition mode exits, so as to avoid a false recognition situation caused by interference of remaining motions.
It is noted that the above scheme is applicable to both RGB images and infrared (IR) images.
To further illustrate and explain aspects of the present description, reference is made to fig. 3, which illustrates a specific example.
When a user drives a vehicle, the intelligent vehicle-mounted equipment is in a working state, and the monocular camera shoots the user in real time.
When the user makes a hand hovering action in the air, S301 is executed, and the monocular camera captures the hand hovering action and transmits it to the palm detector.
S302, determining whether the palm detector is in the cooling period. If so, no response is made. If not, go to step S303. Optionally, in this step it may be determined whether both the palm detector and the action recognizer are in the cooling period; if neither is, both can respond normally and S303 is performed.
S303, detecting whether the hand hovering action meets the triggering condition.
If not, no response is made.
If yes, execute step S304 to start the dynamic gesture recognition mode.
S305, determining whether the action recognizer has timed out. If yes, go to step S306. If not, go to step S307.
And S306, controlling the action recognizer to enter a cooling period.
S307, the monocular camera captures a plurality of frames of images containing the dynamic gesture and transmits them to the action recognizer. Specifically, the plurality of frames of images are acquired by the monocular camera while the user makes the dynamic gesture in mid-air.
And S308, calling a motion recognition model in the motion recognizer to perform feature transfer fusion and classification on the plurality of frames of images to obtain motion categories of the plurality of frames of images.
S309, determining a time window, and determining an action type corresponding to the time window from the action types of the plurality of frame images according to the time window.
S310, determining the action category with the largest number of frames within the time window as the action category of the dynamic gesture.
S311, exiting the dynamic gesture recognition mode, executing S306, and controlling the motion recognizer to enter a cooling state.
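As a rough, assumption-laden sketch of the control flow in FIG. 3, the loop below strings together the cooling-period check (S302), the trigger check (S303/S304), the timeout check (S305/S306), frame-by-frame classification (S307/S308) and the final decision (S309-S311). The detector and recognizer interfaces, the timeout and cooling values, and the use of a live frame iterator are all hypothetical.

```python
import time

COOLDOWN_SECONDS = 2.0              # assumed length of the cooling period
RECOGNIZER_TIMEOUT_SECONDS = 3.0    # assumed limit on the recognizer's running time

def run_pipeline(frame_source, palm_detector, action_recognizer):
    """Yields one recognized action category per completed trigger-recognize cycle.
    frame_source is assumed to be a live frame iterator (e.g. a camera stream)."""
    cooldown_until = 0.0
    for frame in frame_source:
        if time.monotonic() < cooldown_until:        # S302: in the cooling period, no response
            continue
        if not palm_detector.has_trigger(frame):     # S303: hover trigger condition not met
            continue
        start = time.monotonic()                     # S304: recognition mode started
        frame_categories = []
        for gesture_frame in frame_source:           # S307/S308: classify frame by frame
            if time.monotonic() - start > RECOGNIZER_TIMEOUT_SECONDS:
                break                                # S305/S306: timed out, stop and cool down
            frame_categories.append(action_recognizer.classify(gesture_frame))
            if action_recognizer.enough_frames(frame_categories):
                break
        if frame_categories:
            yield action_recognizer.decide(frame_categories)  # S309/S310: window vote
        cooldown_until = time.monotonic() + COOLDOWN_SECONDS  # S311: exit and cool down
```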
Although the above scheme needs to detect the hand hovering motion before the dynamic gesture is recognized, for the user who successively makes the hand hovering motion and the subsequent dynamic gesture in mid-air, the intelligent vehicle-mounted device can detect and recognize them in real time and execute the corresponding instruction promptly, so the user receives a timely response and a smooth operation experience.
The embodiment of the present disclosure introduces a specific implementation process of dynamic gesture recognition, and based on the same inventive concept as that in the foregoing embodiment, the embodiment of the present disclosure further provides an intelligent vehicle-mounted device.
Referring to fig. 4, the intelligent vehicle-mounted device in this embodiment includes:
the camera module 401 is configured to obtain a plurality of frames of images including a dynamic gesture in a dynamic gesture recognition mode;
a spatiotemporal conversion module 402, configured to invoke a motion recognition model to perform feature transfer fusion and classification on the plurality of frame images, so as to obtain motion categories of the plurality of frame images;
a determining module 403, configured to determine, from the motion categories of the several frames of images, a motion category corresponding to the dynamic gesture.
In some optional embodiments, the smart vehicle device further includes:
the camera module 401 is used for shooting hand movements;
the hand detection module is used for detecting whether the hand action has a trigger condition by using a hand detection model;
if yes, starting a dynamic gesture recognition mode.
In some optional embodiments, the hand detection module is specifically configured to:
detecting whether the hovering time of the hand action exceeds a preset time;
if yes, the hand motion is provided with the trigger condition.
In some optional embodiments, the spatio-temporal transformation module 402 is specifically configured to:
in the plurality of frame images, correlating the space dimension characteristics of the previous and adjacent frame images to perform characteristic transmission fusion on the space dimension characteristics in the current frame image in the time dimension to obtain the space-time dimension characteristics of the current frame image;
and performing convolution calculation on the space-time dimension characteristics of the current frame image to obtain the action category of the current frame image.
In some optional embodiments, the spatio-temporal transformation module 402 is specifically configured to:
performing convolution processing on the current frame image to obtain the spatial dimension characteristics of the current frame image;
and replacing part of the spatial dimension characteristics of the current frame image by using the part of the spatial dimension characteristics in the previous adjacent frame image to obtain the space-time dimension characteristics of the current frame image.
In some optional embodiments, the smart vehicle device further includes:
and the storage module is used for storing the replaced partial spatial dimension characteristics in the current frame image and replacing partial spatial dimension characteristics of the next frame image.
In some optional embodiments, the spatio-temporal transformation module 402 is specifically configured to:
selecting a target action category from the action categories of the plurality of frames of images as an action category corresponding to the dynamic gesture; and the number of the images corresponding to the target action category is above a preset number threshold.
In some optional embodiments, the determining module 403 is specifically configured to:
determining a time window, wherein the time window is a time window corresponding to a set frame number;
determining an action type corresponding to the time window from the action types of the plurality of frames of images according to the time window;
and determining the target action category based on the action category corresponding to the time window, and determining the action category corresponding to the dynamic gesture based on the target action category.
In some optional embodiments, the smart vehicle device further includes:
the transmission module is used for transmitting the action instruction corresponding to the dynamic gesture to a downstream object for execution;
and the exit module is used for exiting the dynamic gesture recognition mode.
Based on the same inventive concept as in the previous embodiments, the present specification further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of any of the methods described above.
Based on the same inventive concept as in the previous embodiment, an embodiment of the present specification further provides an electronic device, as shown in fig. 5, including a memory 504, a processor 502, and a computer program stored on the memory 504 and executable on the processor 502, where the processor 502 implements the steps of any one of the methods described above when executing the program.
Where in fig. 5 a bus architecture (represented by bus 500) is shown, bus 500 may include any number of interconnected buses and bridges, and bus 500 links together various circuits including one or more processors, represented by processor 502, and memory, represented by memory 504. The bus 500 may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface 505 provides an interface between the bus 500 and the receiver 501 and transmitter 503. The receiver 501 and the transmitter 503 may be the same element, i.e. a transceiver, providing a unit for communicating with various other terminal devices over a transmission medium. The processor 502 is responsible for managing the bus 500 and general processing, and the memory 504 may be used for storing data used by the processor 502 in performing operations.
Through one or more embodiments of the present specification, the present specification has the following advantages or beneficial effects:
According to the scheme, feature transfer fusion and classification are performed on a plurality of frames of images containing dynamic gestures by calling the action recognition model. Through the feature transfer fusion, feature extraction in the time dimension and the space dimension (referred to in this specification as spatio-temporal dimension features) is realized at the same time, and classification is performed on that basis. The large amount of computation required by 3DCNN/RNN/LSTM and their variants is thereby avoided, and no computation on three-dimensional point clouds or the like is needed, which is very friendly to on-device computation: a large amount of computing resources can be saved, real-time inference can be guaranteed, and the scheme is thus adapted to the resource-constrained scenario of intelligent vehicle-mounted equipment.
Furthermore, the dynamic gestures that this scheme can recognize are more diverse, such as grabbing, releasing, clicking, waving back and forth, sliding left and right, and the like.
Furthermore, the gestures recognized by this scheme are all mid-air (contactless) gestures, which avoids the potential safety hazard caused by attention deviating from the road when operating the central control, the screen, buttons and the like.
Furthermore, the camera module of this scheme is a monocular camera module, which reduces cost and improves processing efficiency, and the hand detection model is developed with a lightweight backbone framework to adapt to the limited computing power of the vehicle-mounted system.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, this description is not intended for any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present specification and that specific languages are described above to disclose the best modes of the specification.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the present description may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the specification, various features of the specification are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed to reflect the intent: that is, the present specification as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this specification.
Those skilled in the art will appreciate that the modules in the devices in an embodiment may be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the description and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of this description may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components of a gateway, proxy server, system in accordance with embodiments of the present description. The present description may also be embodied as an apparatus or device program (e.g., computer program and computer program product) for performing a portion or all of the methods described herein. Such programs implementing the description may be stored on a computer-readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the specification, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The description may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
Claims (20)
1. A method of dynamic gesture recognition, the method comprising:
acquiring a plurality of frames of images containing dynamic gestures in a dynamic gesture recognition mode;
calling an action recognition model to perform feature transfer fusion and classification on the plurality of frames of images to obtain action categories of the plurality of frames of images;
and determining a motion category corresponding to the dynamic gesture from the motion categories of the plurality of frames of images.
2. The method of claim 1, prior to said capturing a plurality of frames of images containing dynamic gestures, the method further comprising:
shooting the hand action by utilizing the camera module;
detecting whether the hand motion has a trigger condition by using a hand detection model;
and if so, starting the dynamic gesture recognition mode.
3. The method as claimed in claim 2, wherein the detecting whether the hand motion has the trigger condition by using the hand detection model specifically includes:
detecting whether the hovering time of the hand action exceeds a preset time;
if yes, the hand motion is provided with the trigger condition.
4. The method according to claim 1, wherein the invoking of the motion recognition model performs feature transfer fusion and classification on the plurality of frame images to obtain motion categories of the plurality of frame images, specifically comprises:
in the plurality of frame images, associating the spatial dimension features of the previous adjacent frame image so as to perform feature transfer fusion, in the time dimension, on the spatial dimension features in the current frame image, to obtain the space-time dimension features of the current frame image;
and performing convolution calculation on the space-time dimension characteristics of the current frame image to obtain the action category of the current frame image.
5. The method according to claim 4, wherein associating the spatial dimension features of the previous and adjacent frame images to perform feature transfer fusion on the spatial dimension features of the current frame image to obtain the spatio-temporal dimension features of the current frame image, specifically comprises:
performing convolution processing on the current frame image to obtain the spatial dimension characteristics of the current frame image;
and replacing part of the spatial dimension characteristics of the current frame image by the part of the spatial dimension characteristics in the previous adjacent frame image to obtain the space-time dimension characteristics of the current frame image.
6. The method of claim 5, after replacing the partial spatial dimension feature of the current frame image with the partial spatial dimension feature in the previous frame image, the method further comprising:
and storing the replaced partial spatial dimension characteristics in the current frame image to replace partial spatial dimension characteristics of the next frame image.
7. The method according to claim 1, wherein the determining, from the motion categories of the plurality of frames of images, the motion category corresponding to the dynamic gesture specifically includes:
selecting a target action category from the action categories of the plurality of frames of images as an action category corresponding to the dynamic gesture; and the number of the images corresponding to the target action category is above a preset number threshold.
8. The method according to claim 1, wherein the determining, from the motion categories of the plurality of frames of images, the motion category corresponding to the dynamic gesture specifically includes:
determining a time window, wherein the time window is a time window corresponding to a set frame number;
determining an action type corresponding to the time window from the action types of the plurality of frames of images according to the time window;
and determining the target action category based on the action category corresponding to the time window, and determining the action category corresponding to the dynamic gesture based on the target action category.
9. The method according to claim 1, wherein after determining the action category corresponding to the dynamic gesture from the action categories of the plurality of frames of images, the method further comprises:
transmitting the action instruction corresponding to the dynamic gesture to a downstream object for execution;
and exiting the dynamic gesture recognition mode.
10. An intelligent vehicle-mounted device, comprising:
the camera module is used for acquiring a plurality of frame images containing dynamic gestures in a dynamic gesture recognition mode;
the space-time transformation module is used for calling an action recognition model to perform feature transfer fusion and classification on the plurality of frame images to obtain action categories of the plurality of frame images;
and the determining module is used for determining the action category corresponding to the dynamic gesture from the action categories of the plurality of frames of images.
11. The intelligent vehicle-mounted device of claim 10, further comprising a hand detection module, wherein:
the camera module is further used for shooting a hand motion;
the hand detection module is used for detecting, by using a hand detection model, whether the hand motion has a trigger condition;
and if so, the dynamic gesture recognition mode is started.
12. The intelligent vehicle-mounted device of claim 11, wherein the hand detection module is specifically configured to:
detecting whether the hovering time of the hand motion exceeds a preset duration;
and if so, determining that the hand motion has the trigger condition.
13. The intelligent vehicle-mounted device of claim 10, wherein the space-time transformation module is specifically configured to:
in the plurality of frame images, associating the spatial dimension features of the previous adjacent frame image with the spatial dimension features of the current frame image to perform feature transfer fusion in the time dimension, so as to obtain the spatio-temporal dimension features of the current frame image;
and performing convolution calculation on the spatio-temporal dimension features of the current frame image to obtain the action category of the current frame image.
14. The intelligent vehicle-mounted device of claim 13, wherein the space-time transformation module is specifically configured to:
performing convolution processing on the current frame image to obtain the spatial dimension features of the current frame image;
and replacing part of the spatial dimension features of the current frame image with part of the spatial dimension features of the previous adjacent frame image to obtain the spatio-temporal dimension features of the current frame image.
15. The intelligent vehicle-mounted device of claim 14, further comprising:
and the storage module is used for storing the replaced part of the spatial dimension features of the current frame image, so as to replace part of the spatial dimension features of the next frame image.
16. The intelligent vehicle-mounted device of claim 10, wherein the determining module is specifically configured to:
selecting a target action category from the action categories of the plurality of frames of images as the action category corresponding to the dynamic gesture, wherein the number of images corresponding to the target action category is not less than a preset number threshold.
17. The intelligent vehicle-mounted device of claim 10, wherein the determining module is specifically configured to:
determining a time window, wherein the time window corresponds to a set number of frames;
determining, according to the time window, an action category corresponding to the time window from the action categories of the plurality of frames of images;
and determining a target action category based on the action category corresponding to the time window, and determining the action category corresponding to the dynamic gesture based on the target action category.
18. The intelligent vehicle-mounted device of claim 10, further comprising:
the transmission module is used for transmitting the action instruction corresponding to the dynamic gesture to a downstream object for execution;
and the exit module is used for exiting the dynamic gesture recognition mode.
19. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method of any one of claims 1 to 9.
20. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 1-9 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210618569.9A CN114973414A (en) | 2022-06-01 | 2022-06-01 | Dynamic gesture recognition method and intelligent vehicle-mounted equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210618569.9A CN114973414A (en) | 2022-06-01 | 2022-06-01 | Dynamic gesture recognition method and intelligent vehicle-mounted equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114973414A true CN114973414A (en) | 2022-08-30 |
Family
ID=82959795
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210618569.9A Pending CN114973414A (en) | 2022-06-01 | 2022-06-01 | Dynamic gesture recognition method and intelligent vehicle-mounted equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114973414A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115798054A (en) * | 2023-02-10 | 2023-03-14 | 国网山东省电力公司泰安供电公司 | Gesture recognition method based on AR/MR technology and electronic device |
CN115798054B (en) * | 2023-02-10 | 2023-11-10 | 国网山东省电力公司泰安供电公司 | Gesture recognition method based on AR/MR technology and electronic equipment |
Similar Documents
Publication | Title
---|---
CN113762252B | Unmanned aerial vehicle intelligent following target determining method, unmanned aerial vehicle and remote controller
JP6488380B2 | Object detection by neural network
US11222239B2 | Information processing apparatus, information processing method, and non-transitory computer-readable storage medium
CN108197589B | Semantic understanding method, apparatus, equipment and the storage medium of dynamic human body posture
KR102459221B1 | Electronic apparatus, method for processing image thereof and computer-readable recording medium
JP2017538999A5 |
CN111291650B | Automatic parking assisting method and device
JP2018527660A | Object detection by neural network
US20190057244A1 | Method for determining target through intelligent following of unmanned aerial vehicle, unmanned aerial vehicle and remote control
CN113516227B | Neural network training method and device based on federal learning
CN112418360B | Convolutional neural network training method, pedestrian attribute identification method and related equipment
US11144796B2 | Method and apparatus for distributed edge learning
CN112487844A | Gesture recognition method, electronic device, computer-readable storage medium, and chip
KR102093899B1 | Client terminal that improves the efficiency of machine learning through cooperation with a server and a machine learning system including the same
CN109034052B | Face detection method and device
US11893766B2 | Neural network system and operating method thereof
Rekabdar et al. | Dilated convolutional neural network for predicting driver's activity
CN115862136A | Lightweight filler behavior identification method and device based on skeleton joint
CN114973414A | Dynamic gesture recognition method and intelligent vehicle-mounted equipment
CN115268285A | Device control method, device, electronic device, and storage medium
CN112405530A | Robot vision tracking control system and control method based on wearable vision
CN108881846B | Information fusion method and device and computer readable storage medium
JP7269694B2 | LEARNING DATA GENERATION METHOD/PROGRAM, LEARNING MODEL AND EVENT OCCURRENCE ESTIMATING DEVICE FOR EVENT OCCURRENCE ESTIMATION
CN112333511A | Control method, device and equipment of smart television and computer readable storage medium
CN112037255A | Target tracking method and device
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination