CN117173677A - Gesture recognition method, device, equipment and storage medium - Google Patents

Gesture recognition method, device, equipment and storage medium

Info

Publication number
CN117173677A
Authority
CN
China
Prior art keywords
image data
gesture recognition
preset
target
recognition model
Legal status
Pending
Application number
CN202311182181.XA
Other languages
Chinese (zh)
Inventor
李勉 (Li Mian)
刘世超 (Liu Shichao)
严立康 (Yan Likang)
徐坚江 (Xu Jianjiang)
Current Assignee
Avatr Technology Chongqing Co Ltd
Original Assignee
Avatr Technology Chongqing Co Ltd
Application filed by Avatr Technology Chongqing Co Ltd
Priority to CN202311182181.XA
Publication of CN117173677A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a gesture recognition method, a gesture recognition device, gesture recognition equipment and a storage medium, wherein the gesture recognition method comprises the following steps: acquiring initial image data, marking a target area in the initial image data based on a preset detection algorithm, and obtaining sample image data; the labeling mode of the target area comprises a line segment penetrating through the target area; training a preset gesture recognition model based on sample image data to obtain a target gesture recognition model; acquiring hand image data of a user, identifying the hand image data based on a target gesture identification model, and determining a target gesture corresponding to the hand image data; and acquiring feedback data of the user, and optimizing and updating the target gesture recognition model based on the feedback data. Therefore, the labeling cost can be reduced, meanwhile, the gesture recognition precision can be improved, the personal habit and preference of the user can be matched, and the actual requirements of the user can be met.

Description

Gesture recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of intelligent vehicle technologies, and in particular, to a gesture recognition method, device, apparatus, and storage medium.
Background
With the continuous development of technologies such as internet and artificial intelligence, more and more intelligent functions of vehicles, such as gesture recognition functions in an intelligent cabin of a vehicle, greatly improve the use experience of users.
In the related art, the gesture recognition algorithm of a vehicle is usually trained on manually annotated data, so the manual annotation cost is high; moreover, the recognition accuracy of the algorithm is low and cannot meet the actual requirements of the driver (user).
Disclosure of Invention
The application provides a gesture recognition method, a gesture recognition device, gesture recognition equipment and a storage medium, which can reduce the labeling cost, improve gesture recognition accuracy, and meet the actual requirements of users.
In a first aspect, an embodiment of the present application provides a gesture recognition method, including:
acquiring initial image data, and marking a target area in the initial image data based on a preset detection algorithm to obtain sample image data;
training a preset gesture recognition model based on the sample image data to obtain a target gesture recognition model;
acquiring hand image data of a user, identifying the hand image data based on the target gesture identification model, and determining a target gesture corresponding to the hand image data;
and acquiring feedback data of a user, and optimizing and updating the target gesture recognition model based on the feedback data.
In one possible embodiment, the sample image data comprises annotation image data and depth image data; the marking the target area in the initial image data based on the preset detection algorithm to obtain sample image data comprises the following steps:
determining a target area based on a preset detection algorithm in the initial image data, and labeling the target area according to a preset labeling mode to obtain labeled image data; the preset labeling mode comprises a line segment penetrating through the target area;
and acquiring depth image data corresponding to the marked image data based on a preset image processing algorithm.
In a possible implementation manner, the training the preset gesture recognition model based on the sample image data to obtain the target gesture recognition model includes:
training a preset gesture recognition model based on the sample image data to obtain a candidate gesture recognition model;
and acquiring current user history data, and adjusting and optimizing the candidate gesture recognition model based on the current user history data to obtain a target gesture recognition model.
In a possible implementation manner, the training the preset gesture recognition model based on the sample image data to obtain a candidate gesture recognition model includes:
extracting the sample image data based on an asymmetric feature extractor to obtain image features corresponding to the sample image data;
decoding the image features and optimizing the features based on a preset decoder to obtain target feature information corresponding to the image features;
inputting the target characteristic information into a preset classifier to obtain sample classification corresponding to the sample image data;
and performing iterative training on the preset gesture recognition model based on the sample classification and a preset loss function to obtain the candidate gesture recognition model.
In one possible implementation manner, the extracting the sample image data based on the asymmetric feature extractor to obtain the image features corresponding to the sample image data includes:
extracting the marked image data based on a first backbone network to obtain red, green and blue image features corresponding to the marked image data;
and extracting the depth image data based on the second backbone network to obtain depth image features corresponding to the depth image data.
In one possible embodiment, the acquiring hand image data of the user includes:
collecting real-time image data through a shooting module; the real-time image data comprises gesture actions of a user;
tracking a hand region in the real-time image data through a preset tracking algorithm to obtain hand image data of a user.
In one possible implementation manner, the obtaining feedback data of the user and performing optimization updating on the target gesture recognition model based on the feedback data includes:
collecting feedback behaviors of a user, obtaining feedback data, and dividing the feedback data into a training set and a testing set; the feedback data comprises positive sample data and negative sample data;
training the target gesture recognition model based on the training set to obtain a training model;
testing the training model based on the test set to obtain a test result;
if the test result meets the preset condition, determining the training model as an updated target gesture recognition model;
and if the test result does not meet the preset condition, training the target gesture recognition model again until the test result meets the preset condition.
In a second aspect, an embodiment of the present application provides a gesture recognition apparatus, including:
the labeling module is used for acquiring initial image data, labeling a target area in the initial image data based on a preset detection algorithm, and obtaining sample image data; the labeling mode of the target area comprises a line segment penetrating through the target area;
the training module is used for training a preset gesture recognition model based on the sample image data to obtain a target gesture recognition model;
the recognition module is used for collecting hand image data of a user, recognizing the hand image data based on the target gesture recognition model and determining a target gesture corresponding to the hand image data;
and the optimization module is used for acquiring feedback data of a user and optimizing and updating the target gesture recognition model based on the feedback data.
In one possible embodiment, the sample image data comprises annotation image data and depth image data; the labeling module is specifically configured to:
determining a target area based on a preset detection algorithm in the initial image data, and labeling the target area according to a preset labeling mode to obtain labeled image data; the preset labeling mode comprises a line segment penetrating through the target area;
and acquiring depth image data corresponding to the marked image data based on a preset image processing algorithm.
In a possible implementation manner, the training module is specifically configured to:
training a preset gesture recognition model based on the sample image data to obtain a candidate gesture recognition model;
and acquiring current user history data, and adjusting and optimizing the candidate gesture recognition model based on the current user history data to obtain a target gesture recognition model.
In a possible implementation manner, the training module is specifically configured to:
extracting the sample image data based on an asymmetric feature extractor to obtain image features corresponding to the sample image data;
decoding the image features and optimizing the features based on a preset decoder to obtain target feature information corresponding to the image features;
inputting the target characteristic information into a preset classifier to obtain sample classification corresponding to the sample image data;
and performing iterative training on the preset gesture recognition model based on the sample classification and a preset loss function to obtain the candidate gesture recognition model.
In a possible implementation manner, the training module is specifically configured to:
extracting the marked image data based on a first backbone network to obtain red, green and blue image features corresponding to the marked image data;
and extracting the depth image data based on the second backbone network to obtain depth image features corresponding to the depth image data.
In a possible embodiment, the identification module is specifically configured to:
collecting real-time image data through a shooting module; the real-time image data comprises gesture actions of a user;
tracking a hand region in the real-time image data through a preset tracking algorithm to obtain hand image data of a user.
In a possible embodiment, the optimization module is specifically configured to:
collecting feedback behaviors of a user, obtaining feedback data, and dividing the feedback data into a training set and a testing set; the feedback data comprises positive sample data and negative sample data;
training the target gesture recognition model based on the training set to obtain a training model;
testing the training model based on the test set to obtain a test result;
if the test result meets the preset condition, determining the training model as an updated target gesture recognition model;
and if the test result does not meet the preset condition, training the target gesture recognition model again until the test result meets the preset condition.
In a third aspect, an embodiment of the present application provides a gesture recognition apparatus, including: a processor, a memory;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the gesture recognition method according to any one of the first aspects.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions for implementing the gesture recognition method of any one of the first aspects when the computer-executable instructions are executed.
In a fifth aspect, an embodiment of the present application provides a vehicle, including the gesture recognition apparatus of the third aspect.
According to the gesture recognition method, the gesture recognition device, the gesture recognition equipment and the gesture recognition storage medium, initial image data are obtained, target areas are marked in the initial image data based on a preset detection algorithm, and sample image data are obtained; the labeling mode of the target area comprises a line segment penetrating through the target area; training a preset gesture recognition model based on sample image data to obtain a target gesture recognition model; acquiring hand image data of a user, identifying the hand image data based on a target gesture identification model, and determining a target gesture corresponding to the hand image data; and acquiring feedback data of the user, and optimizing and updating the target gesture recognition model based on the feedback data. According to the method, the initial image data can be simply marked based on the preset detection algorithm before model training to obtain sample image data, and the model training is performed based on the sample image data, so that a large number of sample marks are not required to be manually performed, and the marking cost can be reduced; meanwhile, the vehicle can optimize and update the target gesture recognition model based on the feedback data of the user, so that the gesture recognition accuracy can be improved, the vehicle is more suitable for personal habits and preferences of the user, and the actual requirements of the user can be met.
Drawings
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a gesture recognition method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating another gesture recognition method according to an embodiment of the present application;
FIG. 4 is a training schematic diagram of a gesture recognition model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of optimizing and updating a gesture recognition model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a gesture recognition apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a gesture recognition apparatus according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail below with reference to the drawings and embodiments, so that those skilled in the art can better understand its technical solutions. It should be understood that the specific embodiments and figures described herein are for purposes of illustration only and are not intended to limit the scope of the present application.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with related laws and regulations and standards, and provide corresponding operation entries for the user to select authorization or rejection.
With the rapid development of automobile technology, vehicles offer more and more intelligent functions that can effectively improve the user experience, notably the gesture recognition function in the intelligent cabin of a vehicle.
In the related art, gesture recognition algorithms for vehicles have various drawbacks. On the one hand, their recognition accuracy is not high. Because gestures are diverse and every user has a different hand shape and different action habits, the algorithm can misrecognize or miss gestures; when capturing hand images, the camera is affected by the lighting and environment inside the vehicle, so the captured hand images are of low quality and recognition accuracy suffers; and for gesture actions that are complex or of small amplitude, the algorithm struggles to recognize them accurately, giving users a poor experience.

On the other hand, the training cost of vehicle gesture recognition algorithms in the related art is high. The training process requires massive data covering complex scenarios such as different angles, exposure levels and skin tones, and relies on manual data annotation, so the manual labeling cost is high; in addition, training involves a large amount of image processing and computation, requiring substantial computing power and storage resources, so the equipment cost is also high.

In short, the gesture recognition algorithms in the related art have low recognition accuracy, cannot meet users' actual requirements, and incur high training costs such as manual labeling.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. Referring to fig. 1, a vehicle 101 is included. As shown in fig. 1, in the related art, the vehicle gesture recognition algorithm is trained on manually annotated data, the labor cost is high, and the recognition accuracy of the algorithm is also not high.
In the embodiment of the application, the vehicle simply marks the initial data based on the preset detection algorithm, and the manual marking is not needed, so that the manual marking cost can be reduced; meanwhile, the vehicle can optimize and update the gesture recognition model based on the feedback behavior of the user, so that the gesture recognition accuracy can be improved.
The following describes the embodiments of the present application in detail by way of specific examples. It should be noted that the following embodiments may exist independently or may be combined with each other, and for the same or similar content, the description will not be repeated in different embodiments.
Fig. 2 is a flow chart of a gesture recognition method according to an embodiment of the present application. Referring to fig. 2, the gesture recognition method may include:
S201, acquiring initial image data, and marking a target area in the initial image data based on a preset detection algorithm to obtain sample image data.
The execution body of the embodiment of the application may be a vehicle, and specifically may refer to a controller in the vehicle, or the like, or may be a gesture recognition device provided in the vehicle. The gesture recognition apparatus may be implemented by software, or may be implemented by a combination of software and hardware. For ease of understanding, hereinafter, the execution body will be described as an example of a vehicle.
In the embodiment of the present application, the initial image data may refer to original image data of a plurality of users. The preset detection algorithm may refer to a pre-trained image detection algorithm, and specifically may refer to a saliency target detection algorithm and the like. The target area may refer to an image area including a user gesture. Sample image data may refer to sample data for model training.
In this step, in order to realize gesture recognition, the vehicle needs to perform model training first. The vehicle may acquire a large number of original pictures of the user as initial image data, which does not contain labeling information. The vehicle can label the initial image data based on a preset detection algorithm, determine a target area including the hand of the user in the initial image data, and simply label according to a preset labeling mode, for example, line segment labeling and the like can be adopted to obtain labeled image data with labels. In addition, the vehicle may perform other preprocessing on the labeled image data, such as image scaling, gray-scale conversion, etc., to obtain sample image data that is ultimately used for training.
Therefore, compared with the mode of manually marking the original images one by one in the related art, the embodiment of the application realizes automatic marking of the original image data based on the preset detection algorithm, and can save the manual marking cost.
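For illustration only, the following Python sketch shows one way such automatic scribble-style labeling could be implemented, assuming a saliency map for the frame is already available from the preset detection algorithm; the function name, the threshold and the line-through-the-region heuristic are assumptions rather than the patent's exact procedure.

```python
# Hypothetical sketch: weakly label the detected (hand) region with a line
# segment, given an image and a precomputed saliency map in [0, 1].
import cv2
import numpy as np

def scribble_label(image: np.ndarray, saliency: np.ndarray, thresh: float = 0.5):
    mask = (saliency > thresh).astype(np.uint8)   # binarize the salient region
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return image, None                        # no target area detected
    y_c = int(ys.mean())                          # vertical center of the region
    x_min, x_max = int(xs.min()), int(xs.max())
    labeled = image.copy()
    # draw a line segment passing through the target area as weak supervision
    cv2.line(labeled, (x_min, y_c), (x_max, y_c), (0, 255, 0), thickness=2)
    return labeled, ((x_min, y_c), (x_max, y_c))
```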
S202, training a preset gesture recognition model based on sample image data to obtain a target gesture recognition model.
In the embodiment of the present application, the preset gesture recognition model may refer to an initial, untrained gesture recognition model. The preset gesture recognition model can be built from a combination of several models, such as a feature extraction model, a deep learning model and a classification model, where the feature extraction model may include a feature extractor and the like. The deep learning model may specifically include convolutional neural networks (Convolutional Neural Networks, CNN), recurrent neural networks (Recurrent Neural Network, RNN), and the like. The classification model may include a K-nearest neighbors classifier (K-Nearest Neighbors, KNN) or the like. Of course, the preset gesture recognition model may also include other models or be constructed based on other algorithms, and the embodiment of the present application does not limit its specific form.
The target gesture recognition model may refer to a trained model that may be used for gesture recognition. Specifically, after the sample image data is determined, the vehicle may train the preset gesture recognition model based on the sample image data, and after training is completed, the target gesture recognition model may be obtained. It should be noted that, in the embodiment of the present application, the specific training process of the target gesture recognition model may be directly implemented in the vehicle, or the model training may be performed in the server first, and then the trained target gesture recognition model is configured in the vehicle.
S203, acquiring hand image data of a user, identifying the hand image data based on a target gesture identification model, and determining a target gesture corresponding to the hand image data.
In the embodiment of the application, the hand image data may refer to a gesture image of a current user acquired by a vehicle in real time. The target gesture may refer to a gesture instruction of a user, and the vehicle may perform a corresponding functional operation based on the target gesture.
In this step, after model training is completed, the vehicle may perform gesture recognition based on the target gesture recognition model. Specifically, the vehicle can shoot an image including the hand of the user based on a shooting module such as a camera to obtain hand image data of the user, and then the hand image data is input into a target gesture recognition model, so that a target gesture corresponding to the hand image data of the user can be determined. Then, the vehicle can determine the function instruction corresponding to the target gesture of the user based on the mapping relation between the pre-configured target gesture and the function instruction, and execute the function instruction so as to meet the actual requirement of the user.
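As a minimal illustration of this step, the sketch below maps a predicted gesture label to a function instruction through a preconfigured table; the gesture names, commands and the `execute` callback are hypothetical, not taken from the patent.

```python
# Hypothetical gesture-to-command mapping and dispatch for the inference step.
GESTURE_COMMANDS = {
    "swipe_left": "previous_track",
    "swipe_right": "next_track",
    "palm_open": "pause_media",
}

def handle_gesture(model, hand_image, execute):
    gesture = model.predict(hand_image)       # target gesture from the model
    command = GESTURE_COMMANDS.get(gesture)   # preconfigured function instruction
    if command is not None:
        execute(command)                      # vehicle performs the function
    return gesture, command
```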
S204, acquiring feedback data of the user, and optimizing and updating the target gesture recognition model based on the feedback data.
In the embodiment of the application, the user may refer to the driver of the vehicle or the like. The feedback data may include various feedback behavior data of the user, specifically the user's evaluation information, instruction execution information and emotion information. The evaluation information may be a positive or negative evaluation based on touch-operation feedback such as a "like" or a "click" by the user. The instruction execution information may refer to feedback behavior indicating whether the user accepts the gesture recognition result, for example, the user repeatedly performs the gesture action, the user confirms execution of the function instruction corresponding to the target gesture, or the user refuses its execution. The user emotion information may refer to the user's facial expression information, voice information and the like. Of course, the feedback data may also include other feedback behaviors of the user, such as manual operation behaviors, and the embodiment of the application does not limit the specific kinds of feedback behavior or the way they are acquired.
In this step, as the number of uses of the gesture recognition function of the user increases, the vehicle may collect various feedback data of the current user, which may reflect the operation habits and personal preferences of the user. The vehicle can update and iterate the target gesture recognition model based on the feedback data, so that the target gesture recognition model is more suitable for personal habits of users, and the accuracy of gesture recognition can be improved.
According to the gesture recognition method provided by the embodiment of the application, a vehicle acquires initial image data, and marks a target area in the initial image data based on a preset detection algorithm to obtain sample image data; the labeling mode of the target area comprises a line segment penetrating through the target area; training a preset gesture recognition model based on sample image data to obtain a target gesture recognition model; acquiring hand image data of a user, identifying the hand image data based on a target gesture identification model, and determining a target gesture corresponding to the hand image data; and acquiring feedback data of the user, and optimizing and updating the target gesture recognition model based on the feedback data. According to the method, the initial image data can be simply marked based on the preset detection algorithm before model training to obtain sample image data, and the model training is performed based on the sample image data, so that a large number of sample marks are not required to be manually performed, and the marking cost can be reduced; meanwhile, the vehicle can optimize and update the target gesture recognition model based on the feedback data of the user, so that the gesture recognition accuracy can be improved, the vehicle is more suitable for personal habits and preferences of the user, and the actual requirements of the user can be met.
Based on the above embodiments, fig. 3 is a schematic flow chart of another gesture recognition method according to an embodiment of the present application. Referring to fig. 3, the gesture recognition method may include:
s301, acquiring initial image data, and marking a target area in the initial image data based on a preset detection algorithm to obtain sample image data.
In one possible embodiment, the sample image data includes annotation image data and depth image data. Step S301 may be implemented as follows:
(1) Determining a target area based on a preset detection algorithm in the initial image data, and labeling the target area according to a preset labeling mode to obtain labeled image data; the preset labeling mode comprises a line segment penetrating through the target area.
(2) And acquiring depth image data corresponding to the marked image data based on a preset image processing algorithm.
In the embodiment of the present application, the preset labeling mode may refer to a preset annotation form for the target area, such as a line segment passing through the target area; other forms may also be used and can be set flexibly based on actual requirements, which is not limited in the embodiment of the present application. The annotation image data may be image data in which the hand region has been annotated. The preset image processing algorithm may refer to a preset depth image acquisition algorithm. The depth image data may be the depth image corresponding to the annotation image data. A depth image is an image whose pixel values are the distances (depths) from the image collector (photographing device) to points in the scene, and it reflects the geometry of the visible surface of an object.
In this step, when preprocessing the initial image data, the vehicle may first input the initial image data, which does not include any label, into the preset detection algorithm; the vehicle determines a target area in the initial image data through the preset detection algorithm and labels the target area according to the preset labeling mode to obtain the annotation image data. Since the annotation image data is a red, green and blue (RGB) image, the vehicle can further determine the corresponding depth image data based on the preset image processing algorithm, thereby improving the comprehensiveness of subsequent feature extraction.
In the embodiment of the application, no matter how complex the actual scene distribution in the initial image data is, the vehicle can, based on the preset detection algorithm, annotate images with a simple scribble-style mark on the target area (or target position), obtain a pixel-level representation of the initial hand region, and use that pixel labeling as supervision information to assist the final gesture recognition. Therefore, the annotation approach based on the preset detection algorithm avoids a great amount of manual data labeling work and simplifies the labeling process; moreover, because the labeling information is concise, model training becomes more efficient and faster, computing resources are saved, and model convergence can be accelerated.
S302, training a preset gesture recognition model based on sample image data to obtain a candidate gesture recognition model.
In one possible embodiment, step S302 may specifically include the following steps (3) to (6):
(3) And extracting the sample image data based on the asymmetric feature extractor to obtain the image features corresponding to the sample image data.
In the embodiment of the application, the asymmetric feature extractor can realize feature extraction of sample image data. The image features may refer to image attribute features corresponding to sample image data, such as RGB image features corresponding to annotation image data and depth image features corresponding to depth image data. Specifically, during model training, the vehicle may first extract image features corresponding to sample image data through an asymmetric feature extractor with a simple encoder. The asymmetric feature extractor has higher accuracy of feature extraction than conventional symmetric feature extractors.
In one possible embodiment, step (3) may be implemented specifically by:
extracting the marked image data based on the first backbone network to obtain red, green and blue image features corresponding to the marked image data; and extracting the depth image data based on the second backbone network to obtain depth image features corresponding to the depth image data.
In the embodiment of the present application, the first backbone network may refer to a residual network (ResNet), specifically a ResNet50 backbone network or the like. The second backbone network may refer to a visual geometry group network (Visual Geometry Group, VGG) or the like, specifically VGG32 or the like.
In this step, since the RGB image (annotation image data) and the depth image respond differently to the same backbone network, the first backbone network generates more reliable foreground activations for the RGB image and the second backbone network generates more reliable foreground activations for the depth image; in particular, the deep backbone features can be visualized by performing channel summation and min-max normalization. Under the setting of saliency target detection on RGB and depth images, the backbone features of the second backbone network attend more to details, while those of the first backbone network attend more to the target regions. By using an asymmetric feature extractor (asymmetric encoder), the embodiment of the present application can extract semantically meaningful activation information from the RGB image while obtaining structurally accurate activation information from the depth image, improving the comprehensiveness and accuracy of feature extraction.
Illustratively, the first backbone network of the embodiment of the present application may be a pretrained ResNet50 used to extract the RGB image features corresponding to the annotation image data, which may be denoted as F_r = E_res(x_r), where E_res is a feature extractor based on the ResNet50 backbone; the second backbone network may be a pretrained VGG network used to extract the depth image features corresponding to the depth image data, which may be denoted as F_d = E_vgg(x_d), where E_vgg is a feature extractor based on the VGG backbone. Of course, the asymmetric feature extractor may also adopt other backbone networks as the first and second backbone networks, which is not limited in the embodiment of the present application.
In this way, in the embodiment of the application, the image features are extracted according to different backbone networks aiming at different image types in the sample image data, so that the image feature extraction is more accurate and comprehensive, and the accuracy of model training can be improved.
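The following PyTorch sketch illustrates such an asymmetric two-backbone extractor, using a pretrained ResNet50 for the RGB branch and a pretrained VGG16 for the depth branch (as in the Fig. 4 example below); the layer slicing and the replication of a single-channel depth input to three channels are assumptions.

```python
# A minimal sketch of the asymmetric feature extractor, assuming torchvision.
import torch
import torch.nn as nn
from torchvision import models

class AsymmetricExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # E_res: ResNet50 trunk without the pooling/classifier head
        self.e_res = nn.Sequential(*list(resnet.children())[:-2])
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.e_vgg = vgg.features                 # E_vgg: VGG convolutional trunk

    def forward(self, x_r: torch.Tensor, x_d: torch.Tensor):
        f_r = self.e_res(x_r)                     # RGB image features F_r
        f_d = self.e_vgg(x_d.repeat(1, 3, 1, 1))  # depth features F_d (1->3 channels)
        return f_r, f_d
```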
(4) And decoding the image features based on a preset decoder and optimizing the features to obtain target feature information corresponding to the image features.
In the embodiment of the application, the preset decoder can comprise a pre-trained feature fusion algorithm, a feature optimization algorithm and the like. The target feature information may refer to feature information obtained by processing the image feature.
Specifically, in the model training process, the vehicle may first extract the image features corresponding to the sample image data and generate an initial prediction based on the asymmetric feature extractor with a simple encoder, and then perform feature fusion and feature optimization using the preset decoder; in particular, the initial prediction may be further refined based on a minimum spanning tree energy loss and an attention-based fusion module to obtain the target feature information.
(5) Inputting the target characteristic information into a preset classifier to obtain sample classification corresponding to the sample image data.
In the embodiment of the application, the preset classifier can be used to classify the features, and may specifically be a KNN classifier or the like. The sample classification may refer to the class to which the sample image data corresponds. In this step, after determining the target feature information, the vehicle may input the target feature information into the preset classifier to obtain the sample class corresponding to the sample image data. Specifically, through the preset decoder, the vehicle can map the recognized foreground target hand-region features and the other background features into the same metric space, construct a KNN classifier in that space, and obtain a saliency map of the hand, which is used as strong supervision information for subsequent segmentation, effectively improving segmentation accuracy and efficiency.
Taking a KNN classifier as an example of the preset classifier: for a training data set D containing m sample image data, where each sample corresponds to one class y and has n features, the training data set D may be expressed as:

D = {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_m, y_m)}

where x_i is the feature vector of the i-th sample image data and y_i is its corresponding class.
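A brief scikit-learn sketch of this classification step is given below; the random arrays merely stand in for the n-dimensional feature vectors produced by the decoder, and m, n and the class count are placeholder values.

```python
# Hypothetical KNN classification over decoder features (placeholder data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(200, 64)            # m = 200 samples, n = 64 features each
y = np.random.randint(0, 5, size=200)  # class label y for each sample

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)                          # build D = {(x_i, y_i)} in feature space
pred = knn.predict(X[:1])              # class of a query feature vector
```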
(6) And performing iterative training on the preset gesture recognition model based on the sample classification and the preset loss function to obtain an alternative gesture recognition model.
In the embodiment of the present application, the preset loss function may refer to a loss function preset for model training, specifically an absolute-value loss function (Mean Absolute Error, L1 loss) or the like, and the embodiment of the present application does not limit the specific form of the loss function. The candidate gesture recognition model may refer to the model trained based on the sample image data. Specifically, after the sample classification is obtained through training, a loss value can be calculated from the sample classification and the true classification of the sample image data in combination with the preset loss function, and iterative training is performed based on the loss value to obtain the trained candidate gesture recognition model.
For example, fig. 4 is a training schematic diagram of a gesture recognition model according to an embodiment of the present application. As shown in fig. 4, in the model training process, the vehicle first performs feature extraction on the sample image data with the asymmetric feature extractor; specifically, it may extract the RGB image features of the annotation image data with the first backbone network and the depth image features of the depth image data with the second backbone network. Illustratively, the E_vgg encoder may be initialized with VGG16 and the E_res encoder with ResNet50; the E_res encoder extracts the RGB image features F_r, and the E_vgg encoder extracts the depth image features F_d.

The preset gesture recognition model may then decode the image features through a preset decoder, which may specifically include a decoder D_res for the RGB image features F_r and a decoder D_vgg corresponding to the depth image features F_d. In particular, the RGB image features F_r may be input into the decoder D_res to preliminarily generate a coarse prediction and the corresponding feature information.

Meanwhile, the preset gesture recognition model can pass the RGB image features F_r and the depth image features F_d through a cascading module, obtaining the fusion features F_rgb-d by concatenation (concat). For the fusion features F_rgb-d, the D_vgg decoder outputs a preliminary prediction and the corresponding feature information.

After obtaining the corresponding feature information from the decoding operation, the preset gesture recognition model can refine the feature information through a preset refinement module (prediction refinement) to output the corresponding target feature information. Specifically, the refinement introduces an affine matrix for each branch and multiplies it by the respective feature matrix to obtain two refined feature matrices; these two matrices are then fused with the corresponding coarse predictions and feature information to obtain the target feature information. Finally, the preset gesture recognition model can input the target feature information into a preset classifier to obtain the sample classification corresponding to the sample image data, and iterative training is performed according to a preset loss function to obtain the trained candidate gesture recognition model. In addition, during model training, techniques such as cross-validation can be adopted to optimize and adjust the model, further improving its accuracy and generalization ability.
In the related art, since a vehicle driver usually uses a gesture recognition function in a dynamic driving process, a model training process needs a large amount of abundant in-cabin scene data, and needs to rely on manual large-scale data labeling work to a large extent, so that the cost of manual labeling is high.
In the embodiment of the application, the vehicle realizes the gesture detection and recognition task under a weakly supervised learning framework through the preset detection algorithm, achieving automatic labeling of the target area, avoiding a large amount of data-labeling work, and simplifying the labeling process; moreover, the labeling information is simpler, making model training more efficient and faster. By training and optimizing on a large amount of gesture data in a weakly supervised manner, the method can recognize and adapt to different gesture postures, angles, illumination conditions and personalized features, improving the accuracy and robustness of gesture recognition.
S303, acquiring current user history data, and adjusting and optimizing the candidate gesture recognition model based on the current user history data to obtain a target gesture recognition model.
In the embodiment of the application, the history data may refer to the historical gesture data of the current user, and may include information such as the current user's gesture types, frequency and action amplitude. The vehicle can adjust and optimize the trained candidate gesture recognition model based on this history data, specifically through operations such as parameter adjustment, convolution kernel updating and model fusion; meanwhile, techniques such as incremental learning can be adopted to update and optimize the candidate gesture recognition model rapidly, further improving the practicality and adaptability of the model. In this way, updating and adjusting the parameters of the candidate gesture recognition model based on the current user's history data improves how well the target gesture recognition model fits the current user's personal habits, thereby improving gesture recognition accuracy.
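One simple, hypothetical form of such incremental adjustment is sketched below: the backbone is frozen and only the classifier head is fine-tuned on the current user's history data at a small learning rate. The `backbone` and `classifier` attribute names are assumptions.

```python
# Hypothetical personalization of the candidate model on user history data.
import torch

def personalize(model, user_history_loader, epochs: int = 3):
    for p in model.backbone.parameters():   # freeze general features
        p.requires_grad = False
    opt = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in user_history_loader:
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model   # adjusted target gesture recognition model
```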
S304, acquiring real-time image data through a shooting module; the real-time image data comprises gesture actions of a user; tracking the hand area in the real-time image data through a preset tracking algorithm to obtain hand image data of the user.
In the embodiment of the application, the shooting module may refer to shooting equipment such as a camera in a vehicle. Real-time image data may refer to image data including user gestures. The preset tracking algorithm may refer to a pre-trained tracking algorithm, and specifically may refer to a target tracking CamShift algorithm and the like. The CamShift algorithm is a target tracking algorithm based on a color histogram, and is suitable for the situation that a single target has obvious color characteristics. In gesture tracking, the gesture may be tracked using the CamShift algorithm using skin tone features of the hand region.
In this step, after the training of the target gesture recognition model is completed, in the actual use process, the vehicle needs to collect the hand image data of the user first. Specifically, the vehicle may first acquire real-time image data, that is, an image sequence including a gesture of a user, based on a photographing module such as a camera. Thereafter, the vehicle may initialize the tracking window, i.e., mark the hand area with a rectangular frame by an image detection algorithm or the like, as an initial position of the tracking window. The vehicle may then calculate a color histogram of the image using the selected hand region, and may specifically select different color channels to calculate the histogram. The vehicle may then execute a CamShift algorithm, which may specifically use the calculated histogram and the initial tracking window as inputs, to calculate the position and size of the user's hand from the image histogram, and to move the tracking window to a new position. In each image frame, a CamShift algorithm is used to calculate a new position of the user's hand, and the tracking window is moved to that position, and the size and shape of the tracking window can be updated and adjusted in real time. When the gesture motion is finished or the target moves out of the image frame, the tracking is stopped.
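A condensed OpenCV sketch of this tracking loop is shown below; the camera index, the initial tracking window and the HSV skin-tone range are placeholder assumptions.

```python
# CamShift hand tracking: histogram back-projection plus adaptive window.
import cv2

cap = cv2.VideoCapture(0)                  # in-cabin camera (placeholder index)
ok, frame = cap.read()
track_window = (200, 150, 100, 100)        # initial hand box: x, y, w, h (assumed)
x, y, w, h = track_window
hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, (0, 60, 32), (180, 255, 255))   # rough skin-tone mask
roi_hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break                              # gesture ended or target left the frame
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # CamShift updates the window position and size in each image frame
    _, track_window = cv2.CamShift(back_proj, track_window, criteria)
    x, y, w, h = track_window
    hand_image = frame[y:y + h, x:x + w]   # hand image data for recognition
```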
In the embodiment of the application, when acquiring the hand image data of the user, the vehicle performs gesture tracking with a preset tracking algorithm such as CamShift, which has a certain robustness to illumination changes, background interference and the like, and can track gestures in complex environments. The method has a small computation load, high real-time performance and strong adaptability to changes in the target's shape; it can adaptively adjust the size and shape of the tracking window, is highly extensible and easy to implement, and can accurately acquire gesture images, further improving gesture recognition accuracy.
S305, recognizing the hand image data based on the target gesture recognition model, and determining the target gesture corresponding to the hand image data.
S306, acquiring feedback data of a user, and optimizing and updating the target gesture recognition model based on the feedback data.
In the embodiment of the application, after the hand image data of the user is obtained, the vehicle can input the hand image data into the target gesture recognition model to obtain the target gesture corresponding to the hand image data, and then the corresponding function instruction can be determined according to the target gesture, and the corresponding function is executed. After the use times of the user are gradually increased, the vehicle can optimize and update the target gesture recognition model based on the feedback data of the user, so that the adaptability of the target gesture recognition model to the current user is further improved.
In one possible embodiment, step S306 may specifically include the following steps (7) to (11):
(7) Collecting feedback behaviors of a user, obtaining feedback data, and dividing the feedback data into a training set and a testing set; the feedback data includes positive sample data and negative sample data.
In the embodiment of the application, the feedback behavior can refer to various interactive behaviors of the user, such as an acceptance confirmation behavior, a rejection behavior, an evaluation behavior and the like. The feedback behavior may include positive feedback behavior and negative feedback behavior, and the corresponding feedback data also includes positive sample data and negative sample data.
Specifically, the vehicle may collect feedback behavior of the user, and generate feedback data according to the type of feedback behavior, where positive feedback behavior may be added to the positive sample data set, and negative feedback behavior may be added to the negative sample data set. Then, the vehicle can segment the feedback data to obtain a training set and a testing set, wherein the training set is used for model training, and the testing set is used for model testing.
It should be noted that, for updating and optimizing the target gesture recognition model, the vehicle may perform updating and iterating according to a preset period, where the preset period may be one week or one month, or may be determined according to the data amount of the feedback data, and when the data amount reaches a preset number threshold, updating and iterating the model may be performed.
(8) Training the target gesture recognition model based on the training set to obtain a training model.
(9) And testing the training model based on the test set to obtain a test result.
(10) And if the test result meets the preset condition, determining the training model as an updated target gesture recognition model.
(11) And if the test result does not meet the preset condition, training the target gesture recognition model again until the test result meets the preset condition.
In the embodiment of the application, the vehicle can input the training set into the target gesture recognition model for model training to obtain a training model. The training model may then be tested based on the test set to obtain test results. If the test result meets the preset conditions, such as accuracy reaching standards, the vehicle can determine the training model as an updated target gesture recognition model. If the test result does not meet the preset condition, the vehicle can retrain the target gesture recognition model to obtain a new training model and test the new training model until the test result meets the preset condition. Therefore, the vehicle forms a feedback data set comprising positive and negative samples based on the feedback behavior of the user, and carries out iterative training on the target gesture recognition model based on the feedback data set, so that updating and optimizing of the model can be realized based on personal habits of the user, the degree of agreement between the model and the current user is higher, the accuracy of gesture recognition of the model can be improved, and the use experience of the user is improved.
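The update cycle can be summarized by the hypothetical sketch below; the helper functions `train_fn` and `eval_fn`, the 80/20 split and the accuracy threshold are illustrative assumptions rather than values from the patent.

```python
# Feedback-driven update: retrain, test, and accept only if the preset
# condition (here, an assumed accuracy threshold) is met.
from sklearn.model_selection import train_test_split

def update_with_feedback(model, feedback_x, feedback_y, train_fn, eval_fn,
                         threshold: float = 0.95, max_rounds: int = 5):
    x_tr, x_te, y_tr, y_te = train_test_split(feedback_x, feedback_y,
                                              test_size=0.2)
    for _ in range(max_rounds):
        candidate = train_fn(model, x_tr, y_tr)     # train on the training set
        accuracy = eval_fn(candidate, x_te, y_te)   # test on the test set
        if accuracy >= threshold:                   # test result meets condition
            return candidate                        # updated target model
    return model                                    # otherwise keep the old model
```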
For example, fig. 5 is a schematic diagram of the optimization updating of a gesture recognition model according to an embodiment of the present application. As shown in fig. 5, the vehicle may collect the user's gesture recognition feedback behavior, adding it to a positive sample data set when the feedback is positive and to a negative sample data set when it is negative. The vehicle can divide the feedback data set into a training set and a test set, and input the training set into a neural network, i.e. the target gesture recognition model, for training to obtain a training model. During training, gestures must be distinguished by a classifier, for example a normalized exponential (softmax) classifier, and the number of iterations must be checked: if the iteration count has not been reached, training continues; once it is reached, the training model is generated.
Specifically, for an input x, when it must be determined which of N categories x belongs to, the model can score each of the N categories; a higher score means that x is more likely to belong to that category, and the highest-scoring category can be taken as the correct one. However, raw scores can take any value, and softmax is a normalization function that converts a set of scores ranging over (-∞, +∞) into a set of probabilities that sum to 1. The function is also order-preserving: a larger original score converts to a larger probability, and a smaller score to a smaller one. The specific formula of softmax is as follows:

p_i = exp(s_i) / Σ_{j=1..N} exp(s_j)

where s_i is the model's score for input x on the i-th class. The exponential maps scores from (-∞, +∞) to (0, +∞) without affecting their relative magnitude relationship, and the sum in the denominator normalizes the values so that they add up to 1; this converts a set of scores into a set of probabilities whose total is 1 while maintaining the relative magnitude relationship. Based on the softmax classifier, the embodiment of the present application can determine gestures of different categories and complete the update training process of the target gesture recognition model to obtain the training model.
Then, the vehicle can test the training model based on the test set, and if the test result of the training model reaches the standard, the training model can be used as a target gesture recognition model after updating and optimizing; if the test result of the training model does not reach the standard, the model training can be carried out again until the test result reaches the standard. Therefore, the vehicle realizes updating of the target gesture recognition model based on the user feedback data, so that the accuracy of gesture recognition can be further improved, and the actual requirements of the user are met.
It should be noted that, in the embodiments of the present application, the sequence number of each step does not mean that the execution sequence of each step should be determined by the function and the internal logic, and should not limit the implementation process of the embodiments of the present application.
In the embodiment of the application, a saliency map that can serve as supervision information can be generated automatically from simple annotations based on a preset detection algorithm such as saliency target detection, achieving the same task effect without a large number of pixel-level labels; this can improve image segmentation efficiency and accuracy, realize gesture detection and recognition under a weakly supervised learning framework, avoid a large amount of data-labeling work, simplify the labeling process, and reduce the manual labeling cost. In addition, because the labeling information is concise, model training becomes more efficient and faster; compared with training on strong labels in the related art, training in the weakly supervised mode consumes fewer computing resources and can accelerate model convergence.
The gesture recognition method provided by the embodiment of the application can perform fast and efficient model training based on a large amount of data, can adapt to changes in lighting, personalized physical characteristics and usage habits, improves the robustness of recognition, and responds quickly, effectively improving the user experience. In the embodiment of the application, the vehicle can automatically adjust the algorithm model according to user feedback data, improving recognition accuracy, continuously adapting to new data and user requirements, and better matching the user's personal preferences.
Fig. 6 is a schematic structural diagram of a gesture recognition apparatus according to an embodiment of the present application. Referring to fig. 6, the gesture recognition apparatus 60 may include:
the labeling module 61 is configured to acquire initial image data and label a target area in the initial image data based on a preset detection algorithm to obtain sample image data, where the labeling mode of the target area comprises a line segment penetrating through the target area;
the training module 62 is configured to train a preset gesture recognition model based on the sample image data to obtain a target gesture recognition model;
the recognition module 63 is configured to collect hand image data of a user and recognize the hand image data based on the target gesture recognition model to determine a target gesture corresponding to the hand image data;
the optimization module 64 is configured to acquire feedback data of the user and optimize and update the target gesture recognition model based on the feedback data. A sketch of how these four modules might be composed is given below.
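By way of a non-authoritative illustration, the four modules might be composed as follows; the class and method names are hypothetical and are not taken from this application:

class GestureRecognitionDevice:
    """Illustrative wiring of the four modules described above."""

    def __init__(self, labeler, trainer, recognizer, optimizer):
        self.labeler = labeler        # labeling module 61
        self.trainer = trainer        # training module 62
        self.recognizer = recognizer  # recognition module 63
        self.optimizer = optimizer    # optimization module 64

    def build_model(self, initial_images):
        # Label target areas, then train on the resulting samples.
        samples = self.labeler.label(initial_images)
        return self.trainer.train(samples)

    def recognize_and_update(self, model):
        # Recognize the current gesture, then refine the model from feedback.
        hand_images = self.recognizer.capture()
        gesture = self.recognizer.recognize(model, hand_images)
        feedback = self.optimizer.collect_feedback()
        model = self.optimizer.update(model, feedback)
        return gesture, model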
In one possible embodiment, the sample image data includes labeled image data and depth image data; the labeling module 61 is specifically configured to:
determine a target area in the initial image data based on a preset detection algorithm, and label the target area according to a preset labeling mode to obtain the labeled image data, where the preset labeling mode comprises a line segment penetrating through the target area;
and acquire depth image data corresponding to the labeled image data based on a preset image processing algorithm. One possible realization of this depth acquisition step is sketched below.
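The application does not name the preset image processing algorithm. As one hedged possibility, a monocular depth estimation network such as the publicly available MiDaS model could derive a depth map from the labeled RGB image (the use of MiDaS here is an assumption, not a detail of this application):

import torch

# Assumption: MiDaS via torch.hub as the depth estimator.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

def depth_from_rgb(rgb_image):
    # rgb_image: HxWx3 uint8 array in RGB order.
    batch = midas_transforms(rgb_image)
    with torch.no_grad():
        prediction = midas(batch)
    # Relative (unitless) depth map at the model's working resolution.
    return prediction.squeeze().cpu().numpy()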
In one possible implementation, the training module 62 is specifically configured to:
train a preset gesture recognition model based on the sample image data to obtain an alternative gesture recognition model;
and acquire current user history data, and adjust and optimize the alternative gesture recognition model based on the current user history data to obtain the target gesture recognition model.
In one possible implementation, the training module 62 is specifically configured to:
perform feature extraction on the sample image data based on an asymmetric feature extractor to obtain image features corresponding to the sample image data;
decode and optimize the image features based on a preset decoder to obtain target feature information corresponding to the image features;
input the target feature information into a preset classifier to obtain a sample classification corresponding to the sample image data;
and iteratively train the preset gesture recognition model based on the sample classification and a preset loss function to obtain the alternative gesture recognition model. An illustrative training loop is sketched below.
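A hedged PyTorch-style sketch of this extract/decode/classify/train loop follows. The extract, decode and classify methods and the use of cross-entropy as the preset loss function are assumptions, not details fixed by this application:

import torch
import torch.nn as nn

def train_alternative_model(model, loader, epochs=10, lr=1e-4):
    # Cross-entropy is one conventional loss choice for classification.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            features = model.extract(images)  # asymmetric feature extractor
            decoded = model.decode(features)  # preset decoder + feature optimization
            logits = model.classify(decoded)  # preset classifier (softmax is
            loss = criterion(logits, labels)  # folded into the loss here)
            loss.backward()
            optimizer.step()
    return model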
In one possible implementation, the training module 62 is specifically configured to:
perform feature extraction on the labeled image data based on a first backbone network to obtain RGB (red, green and blue) image features corresponding to the labeled image data;
and perform feature extraction on the depth image data based on a second backbone network to obtain depth image features corresponding to the depth image data. An example of such a two-backbone extractor is sketched below.
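One hedged way to build such an asymmetric two-backbone extractor is sketched here; the application names neither network, so ResNet-34 for RGB and a lighter ResNet-18 for depth are purely illustrative choices:

import torch
import torch.nn as nn
import torchvision.models as models

class AsymmetricFeatureExtractor(nn.Module):
    """Two independent backbones: a heavier one for RGB, a lighter one for depth."""

    def __init__(self):
        super().__init__()
        rgb = models.resnet34(weights=None)
        depth = models.resnet18(weights=None)
        # Depth maps are single-channel, so replace the first convolution.
        depth.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Drop the final classification layer of each backbone.
        self.rgb_backbone = nn.Sequential(*list(rgb.children())[:-1])
        self.depth_backbone = nn.Sequential(*list(depth.children())[:-1])

    def forward(self, rgb_image, depth_image):
        rgb_feat = self.rgb_backbone(rgb_image).flatten(1)        # RGB features
        depth_feat = self.depth_backbone(depth_image).flatten(1)  # depth features
        return torch.cat([rgb_feat, depth_feat], dim=1)           # fused features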
In one possible implementation, the recognition module 63 is specifically configured to:
collect real-time image data through a camera module, where the real-time image data includes gesture actions of the user;
and track the hand region in the real-time image data through a preset tracking algorithm to obtain the hand image data of the user. An example capture-and-track loop is sketched below.
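A minimal sketch of this capture-and-track step, assuming OpenCV's CSRT tracker as the preset tracking algorithm and a caller-supplied initial bounding box (both assumptions; the application specifies neither):

import cv2

def track_hand(video_source=0, init_box=(0, 0, 100, 100)):
    cap = cv2.VideoCapture(video_source)
    tracker = cv2.TrackerCSRT_create()  # one off-the-shelf tracker choice
    ok, frame = cap.read()
    if not ok:
        raise RuntimeError("camera read failed")
    tracker.init(frame, init_box)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        ok, box = tracker.update(frame)
        if ok:
            x, y, w, h = map(int, box)
            # Crop of the tracked hand region: the "hand image data".
            yield frame[y:y + h, x:x + w]
    cap.release()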
In one possible implementation, the optimization module 64 is specifically configured to:
collect feedback behaviors of the user to obtain feedback data, and divide the feedback data into a training set and a test set, where the feedback data includes positive sample data and negative sample data;
train the target gesture recognition model based on the training set to obtain a training model;
test the training model based on the test set to obtain a test result;
if the test result meets the preset condition, determine the training model as the updated target gesture recognition model;
and if the test result does not meet the preset condition, train the target gesture recognition model again until the test result meets the preset condition. A sketch of this feedback loop follows.
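A hedged sketch of this loop, assuming an accuracy threshold as the preset condition, a bounded number of retraining rounds, and caller-supplied train_fn/eval_fn hooks; every name here is hypothetical:

from sklearn.model_selection import train_test_split

def update_with_feedback(model, samples, labels, train_fn, eval_fn,
                         accuracy_target=0.95, max_rounds=5):
    # Split the user's positive/negative feedback into train and test sets.
    x_train, x_test, y_train, y_test = train_test_split(
        samples, labels, test_size=0.2)
    for _ in range(max_rounds):
        model = train_fn(model, x_train, y_train)  # retrain on feedback
        accuracy = eval_fn(model, x_test, y_test)  # test the training model
        if accuracy >= accuracy_target:            # preset condition met
            return model                           # updated target model
    raise RuntimeError("training model did not reach the preset standard")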
The gesture recognition apparatus 60 provided in the embodiment of the present application can execute the technical solution shown in the foregoing method embodiments; its implementation principle and beneficial effects are similar and are not repeated here.
Fig. 7 is a schematic structural diagram of a gesture recognition device according to an embodiment of the present application. Referring to fig. 7, the gesture recognition device 70 may include: a memory 71 and a processor 72. The memory 71 and the processor 72 are interconnected, for example, by a bus 73.
the memory 71 is configured to store program instructions;
the processor 72 is configured to execute the program instructions stored in the memory to implement the gesture recognition method described in the foregoing embodiments.
The gesture recognition device 70 shown in the embodiment of fig. 7 can implement the technical solution shown in the foregoing method embodiments; its implementation principle and beneficial effects are similar and are not repeated here.
Embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions for implementing the gesture recognition method described above when the computer-executable instructions are executed by a processor.
The embodiment of the application also provides a vehicle which comprises the gesture recognition device and can realize the gesture recognition method.
It should be noted that the processor mentioned in the embodiments of the present application may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It should be understood that the memory referred to in the embodiments of the present application may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DR RAM). Note that when the processor is a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, the memory (storage module) may be integrated into the processor. It should also be noted that the memory described herein is intended to include, without being limited to, these and any other suitable types of memory.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Each apparatus and each module/unit included in the products described in the foregoing embodiments may be a software module/unit, a hardware module/unit, or partly a software module/unit and partly a hardware module/unit. The apparatuses and products may be applied to or integrated in a chip, a chip module, or a terminal device. For example, for each apparatus or product applied to or integrated on a chip, each module/unit it includes may be implemented by hardware such as a circuit; alternatively, at least some of the modules/units may be implemented by a software program running on a processor integrated inside the chip, with the remaining modules/units implemented by hardware such as a circuit.
In the present disclosure, the term "include" and its variations denote non-limiting inclusion, and the term "or" and its variations denote "and/or". The terms "first", "second", and the like are used herein to distinguish between similar objects and do not necessarily describe a particular sequence or chronological order. In the present application, "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations shall also fall within the protection scope of the present application.

Claims (10)

1. A method of gesture recognition, comprising:
acquiring initial image data, and marking a target area in the initial image data based on a preset detection algorithm to obtain sample image data;
Training a preset gesture recognition model based on the sample image data to obtain a target gesture recognition model;
acquiring hand image data of a user, identifying the hand image data based on the target gesture identification model, and determining a target gesture corresponding to the hand image data;
and acquiring feedback data of a user, and optimizing and updating the target gesture recognition model based on the feedback data.
2. The method of claim 1, wherein the sample image data comprises labeled image data and depth image data; the labeling a target area in the initial image data based on a preset detection algorithm to obtain sample image data comprises:
determining a target area in the initial image data based on a preset detection algorithm, and labeling the target area according to a preset labeling mode to obtain the labeled image data; the preset labeling mode comprises a line segment penetrating through the target area;
and acquiring depth image data corresponding to the labeled image data based on a preset image processing algorithm.
3. The method according to claim 1, wherein training a preset gesture recognition model based on the sample image data to obtain a target gesture recognition model comprises:
training a preset gesture recognition model based on the sample image data to obtain an alternative gesture recognition model;
and acquiring current user history data, and adjusting and optimizing the alternative gesture recognition model based on the current user history data to obtain the target gesture recognition model.
4. The method according to claim 3, wherein the training a preset gesture recognition model based on the sample image data to obtain an alternative gesture recognition model comprises:
extracting the sample image data based on an asymmetric feature extractor to obtain image features corresponding to the sample image data;
decoding the image features and optimizing the features based on a preset decoder to obtain target feature information corresponding to the image features;
inputting the target characteristic information into a preset classifier to obtain sample classification corresponding to the sample image data;
and performing iterative training on the preset gesture recognition model based on the sample classification and a preset loss function to obtain the alternative gesture recognition model.
5. The method of claim 4, wherein the extracting the sample image data based on the asymmetric feature extractor to obtain the image features corresponding to the sample image data comprises:
extracting the labeled image data based on a first backbone network to obtain RGB (red, green and blue) image features corresponding to the labeled image data;
and extracting the depth image data based on a second backbone network to obtain depth image features corresponding to the depth image data.
6. The method of claim 1, wherein the acquiring hand image data of the user comprises:
collecting real-time image data through a camera module; the real-time image data comprises gesture actions of the user;
tracking a hand region in the real-time image data through a preset tracking algorithm to obtain hand image data of the user.
7. The method of claim 1, wherein the acquiring feedback data of a user and optimizing and updating the target gesture recognition model based on the feedback data comprises:
collecting feedback behaviors of a user to obtain feedback data, and dividing the feedback data into a training set and a test set; the feedback data comprises positive sample data and negative sample data;
training the target gesture recognition model based on the training set to obtain a training model;
testing the training model based on the test set to obtain a test result;
if the test result meets the preset condition, determining the training model as an updated target gesture recognition model;
and if the test result does not meet the preset condition, training the target gesture recognition model again until the test result meets the preset condition.
8. A gesture recognition apparatus, comprising:
the labeling module is used for acquiring initial image data, labeling a target area in the initial image data based on a preset detection algorithm, and obtaining sample image data; the labeling mode of the target area comprises a line segment penetrating through the target area;
the training module is used for training a preset gesture recognition model based on the sample image data to obtain a target gesture recognition model;
the recognition module is used for collecting hand image data of a user, recognizing the hand image data based on the target gesture recognition model and determining a target gesture corresponding to the hand image data;
and the optimization module is used for acquiring feedback data of a user and optimizing and updating the target gesture recognition model based on the feedback data.
9. A gesture recognition device, comprising: a processor and a memory;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory to implement the gesture recognition method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored therein computer-executable instructions for implementing the gesture recognition method of any one of claims 1 to 7 when the computer-executable instructions are executed.
CN202311182181.XA 2023-09-13 2023-09-13 Gesture recognition method, device, equipment and storage medium Pending CN117173677A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311182181.XA CN117173677A (en) 2023-09-13 2023-09-13 Gesture recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311182181.XA CN117173677A (en) 2023-09-13 2023-09-13 Gesture recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117173677A true CN117173677A (en) 2023-12-05

Family ID=88946631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311182181.XA Pending CN117173677A (en) 2023-09-13 2023-09-13 Gesture recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117173677A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117707746A (en) * 2024-02-05 2024-03-15 四川物通科技有限公司 Method and system for scheduling interactive holographic data
CN117707746B (en) * 2024-02-05 2024-04-16 四川物通科技有限公司 Method and system for scheduling interactive holographic data

Similar Documents

Publication Publication Date Title
CN111931701B (en) Gesture recognition method and device based on artificial intelligence, terminal and storage medium
CN108960245B (en) Tire mold character detection and recognition method, device, equipment and storage medium
CN111814902A (en) Target detection model training method, target identification method, device and medium
CN112966691B (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
US10445602B2 (en) Apparatus and method for recognizing traffic signs
CN110163069B (en) Lane line detection method for driving assistance
CN109902662B (en) Pedestrian re-identification method, system, device and storage medium
CN109033978B (en) Error correction strategy-based CNN-SVM hybrid model gesture recognition method
CN113158862A (en) Lightweight real-time face detection method based on multiple tasks
WO2021233041A1 (en) Data annotation method and device, and fine granularity identification method and device
CN111680753A (en) Data labeling method and device, electronic equipment and storage medium
CN114787844A (en) Model training method, video processing method, device, storage medium and electronic equipment
CN117173677A (en) Gesture recognition method, device, equipment and storage medium
CN112884147A (en) Neural network training method, image processing method, device and electronic equipment
CN111275051A (en) Character recognition method, character recognition device, computer equipment and computer-readable storage medium
CN112712068A (en) Key point detection method and device, electronic equipment and storage medium
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN111310837A (en) Vehicle refitting recognition method, device, system, medium and equipment
CN113221695B (en) Method for training skin color recognition model, method for recognizing skin color and related device
CN114758399A (en) Expression control method, device, equipment and storage medium of bionic robot
CN113378706B (en) Drawing system for assisting children in observing plants and learning biological diversity
CN111178310A (en) Palm feature recognition method and device, computer equipment and storage medium
CN117765432A (en) Motion boundary prediction-based middle school physical and chemical life experiment motion detection method
CN116168443B (en) Information difference knowledge distillation-based shielding facial emotion recognition method
CN116994049A (en) Full-automatic flat knitting machine and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination