CN117373121A - Gesture interaction method and related equipment in intelligent cabin environment - Google Patents

Gesture interaction method and related equipment in intelligent cabin environment

Info

Publication number
CN117373121A
Authority
CN
China
Prior art keywords
gesture
key frame
model
preset
extraction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311336297.4A
Other languages
Chinese (zh)
Inventor
胡敏
王磊
宁欣
唐小江
周嵘
李爽
李冬冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Ruitu Technology Co ltd
Original Assignee
Beijing Zhongke Ruitu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Ruitu Technology Co ltd filed Critical Beijing Zhongke Ruitu Technology Co ltd
Priority to CN202311336297.4A priority Critical patent/CN117373121A/en
Publication of CN117373121A publication Critical patent/CN117373121A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a gesture interaction method and related equipment for an intelligent cabin environment. The method comprises the following steps: acquiring gesture video data in the intelligent cabin and labeling it to generate a sample data set; training a preset key frame extraction model on the sample data set and generating a target key frame extraction model once training completes, where the preset key frame extraction model extracts key frames based on a reinforcement learning algorithm; when a gesture video to be recognized is detected in the intelligent cabin, extracting key frame data from the gesture video with the target key frame extraction model; and inputting the key frame data into a first gesture recognition model and, if the recognition result output by the first gesture recognition model matches a preset gesture, causing the intelligent cabin to execute the instruction corresponding to that gesture. By generating the target key frame extraction model through reinforcement learning and using it to extract key frames, gesture interaction in the intelligent cabin environment is achieved more efficiently and accurately.

Description

Gesture interaction method and related equipment in intelligent cabin environment
Technical Field
The application relates to the technical field of intelligent automobiles, in particular to a gesture interaction method and related equipment in an intelligent cabin environment.
Background
With the continuous development of artificial intelligence technology, its application scenarios have become increasingly diverse and specialized. The intelligent cabin, an innovative feature of the modern automobile, is an important application scenario for artificial intelligence and is of great significance for improving driving experience, operating convenience, driving safety, and comfort.
Gesture interaction, as a novel interaction mode, can improve the driving experience and operating convenience of the intelligent cabin. Gesture interaction depends on accurate gesture recognition. In current gesture recognition, most techniques recognize gestures from a single picture; this adapts well to static gestures but cannot handle the recognition of dynamic gesture motion. Other methods perform dynamic gesture recognition on video data, but if too many frames are input during recognition the data are redundant and computing resources are wasted, while if too few frames are input the complete gesture cannot be recognized, causing false recognition and low accuracy.
Therefore, how to perform gesture interaction in the intelligent cabin environment more efficiently and accurately is a technical problem to be solved at present.
Disclosure of Invention
The embodiment of the application provides a gesture interaction method and related equipment in an intelligent cabin environment, which are used for more efficiently and accurately carrying out gesture interaction in the intelligent cabin environment.
In a first aspect, a gesture interaction method in an intelligent cabin environment is provided, the method comprising: acquiring gesture video data in an intelligent cabin, and labeling the gesture video data to generate a sample data set; training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training is completed, wherein the preset key frame extraction model performs key frame extraction based on a reinforcement learning algorithm; if the gesture video to be recognized in the intelligent cabin is detected, extracting key frame data from the gesture video based on the target key frame extraction model; inputting the key frame data into a first gesture recognition model, and if the recognition result output by the first gesture recognition model accords with a preset gesture, enabling the intelligent cabin to execute an instruction corresponding to the preset gesture.
In a second aspect, a gesture interaction device in an intelligent cabin environment is provided, the device comprising: the acquisition module is used for acquiring gesture video data in the intelligent cabin, marking the gesture video data and generating a sample data set; the training module is used for training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training is completed, wherein the preset key frame extraction model is used for extracting key frames based on a reinforcement learning algorithm; the extraction module is used for extracting key frame data from the gesture video based on the target key frame extraction model if the gesture video to be identified in the intelligent cabin is detected; and the identification module is used for inputting the key frame data into a first gesture identification model, and if the identification result output by the first gesture identification model accords with a preset gesture, the intelligent cabin executes an instruction corresponding to the preset gesture.
In a third aspect, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the gesture interaction method in the intelligent cockpit environment of the first aspect via execution of the executable instructions.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the gesture interaction method in the intelligent cabin environment according to the first aspect.
By applying this technical scheme, gesture video data in the intelligent cabin are collected and labeled to generate a sample data set; a preset key frame extraction model, which extracts key frames based on a reinforcement learning algorithm, is trained on the sample data set, and a target key frame extraction model is generated once training completes; if a gesture video to be recognized is detected in the intelligent cabin, key frame data are extracted from it with the target key frame extraction model; and the key frame data are input into a first gesture recognition model, the intelligent cabin executing the corresponding instruction if the recognition result matches a preset gesture. Because the target key frame extraction model is generated through reinforcement learning, the extracted key frames retain the main gesture motion features, and gesture recognition on those key frames automatically extracts the key frames from the gesture video, improving both the accuracy and the speed of gesture recognition, so gesture interaction in the intelligent cabin environment can be carried out more efficiently and accurately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 shows a schematic flow chart of a gesture interaction method in an intelligent cabin environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of training a preset keyframe extraction model in an embodiment of the invention;
FIG. 3 is a schematic diagram of training a preset keyframe extraction model according to an embodiment of the present invention;
fig. 4 shows a schematic structural diagram of a gesture interaction device in an intelligent cabin environment according to an embodiment of the present invention;
fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the disclosed embodiments without inventive effort fall within the scope of protection of the present disclosure.
It is noted that other embodiments of the present application will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise construction set forth herein below and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
The subject application is operational with numerous general purpose or special purpose computing device environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor devices, distributed computing environments that include any of the above systems or devices, and the like.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In the gesture interaction method for the intelligent cabin environment provided by the embodiment of the application, a target key frame extraction model is generated through reinforcement learning and used to extract key frames, so that the main gesture motion features are retained; gesture recognition is then performed on the extracted key frames. Key frames are thus extracted from the gesture video automatically, the accuracy and speed of gesture recognition are improved, and gesture interaction in the intelligent cabin environment becomes more efficient and accurate.
As shown in fig. 1, the method comprises the steps of:
step S101, acquiring gesture video data in an intelligent cabin, and labeling the gesture video data to generate a sample data set.
In this embodiment, a video acquisition device (such as a camera) is provided in the intelligent cabin, and gesture video data in the cabin are collected through it. The gesture video data include multiple segments of video of different gesture actions, each performed by different people under different illumination conditions, the people covering different genders and ages. After the gesture video data are acquired, they are labeled according to the actual category of each gesture video, and a sample data set is generated once labeling is complete.
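As a concrete illustration, one labeled sample in such a data set might be stored as follows; all field names and values here are assumptions made for illustration, not a format prescribed by this application.

```python
# Illustrative structure of one labeled sample; every field name and value
# is hypothetical, chosen only to show what the annotation could capture.
sample = {
    "video_path": "cabin_clips/clip_0001.mp4",  # one segment of gesture video
    "gesture_label": "swipe_left",              # actual category of the gesture
    "lighting": "night",                        # illumination condition at capture
    "subject": {"gender": "female", "age_group": "30-40"},
}
```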
Step S102, training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training, wherein the preset key frame extraction model performs key frame extraction based on a reinforcement learning algorithm.
DQN (Deep Q Network) is a deep reinforcement learning algorithm that uses a neural network to learn a Q-value function, i.e. a function mapping states and actions to Q values, where a Q value represents the expected return obtained by performing an action in a particular state. The basic reinforcement learning loop is as follows: the environment gives a state; the DQN obtains the Q values for that state from its value-function network and selects an action accordingly; on receiving the action, the environment returns a reward and the next state; and the parameters of the value-function network are updated according to the reward. A preset key frame extraction model based on this reinforcement learning algorithm is constructed in advance for key frame extraction. After the sample data set is obtained, the preset key frame extraction model is trained on it, the target key frame extraction model is generated once training completes, and that model is used to extract key frame sequences from gesture videos to be recognized.
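A minimal sketch of this DQN value-function update is given below, assuming a PyTorch implementation; the network shape, discount factor, and helper names are illustrative assumptions rather than the concrete design of this application.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state encoding to one Q value per candidate action (sketch)."""
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the temporal-difference error."""
    state, action, reward, next_state, done = batch
    q = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)  # Q(s, a)
    with torch.no_grad():
        q_next = target_net(next_state).max(dim=1).values       # max_a' Q(s', a')
        target = reward + gamma * q_next * (1.0 - done)         # Bellman target
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here the target network is a periodically synchronized copy of the value network, a standard DQN stabilization device; the application text only requires that the value-function network be updated according to the reward.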
Step S103, if the gesture video to be recognized in the intelligent cabin is detected, key frame data are extracted from the gesture video based on the target key frame extraction model.
A video acquisition device in the intelligent cabin captures video data in real time. If a gesture video to be recognized is detected, the gesture video is input into the target key frame extraction model, which extracts the key frames and outputs the key frame data.
Step S104, inputting the key frame data into a first gesture recognition model, and if the recognition result output by the first gesture recognition model accords with a preset gesture, enabling the intelligent cabin to execute an instruction corresponding to the preset gesture.
In this embodiment, a first gesture recognition model is trained in advance, and the key frame data are input into it for gesture recognition. If the recognition result matches a preset gesture, a valid interaction gesture is present, and the intelligent cabin executes the instruction corresponding to that gesture, for example playing music, starting navigation, or turning on the in-vehicle lights. Because the target key frame extraction model is generated through reinforcement learning and the extracted key frames retain the main gesture motion features, gesture recognition on those key frames realizes more efficient and accurate gesture interaction in the intelligent cabin environment.
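For illustration only, the mapping from recognized preset gestures to cabin instructions could be as simple as a dictionary dispatch; the gesture names and cabin methods below are hypothetical, since this application does not fix a particular command set.

```python
# Hypothetical gesture-to-instruction dispatch; all names are illustrative.
COMMANDS = {
    "swipe_right": lambda cabin: cabin.play_music(),
    "palm_open":   lambda cabin: cabin.start_navigation(),
    "thumbs_up":   lambda cabin: cabin.turn_on_interior_lights(),
}

def dispatch(recognized_gesture: str, cabin) -> bool:
    """Execute the instruction bound to a preset gesture, if any."""
    action = COMMANDS.get(recognized_gesture)
    if action is None:
        return False  # no valid interaction gesture; keep listening
    action(cabin)
    return True
```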
It can be understood that if the recognition result output by the first gesture recognition model does not match any preset gesture, the intelligent cabin does not execute a gesture interaction instruction; the system continues to check for a new gesture video to be recognized, or may issue a prompt that gesture recognition failed so that the user can adjust the gesture. In addition, once the target key frame extraction model has been generated, steps S101-S102 are no longer performed; steps S103-S104 are executed directly for gesture recognition and interaction.
In some embodiments of the present application, the first gesture recognition model includes at least one feature extraction module, a fully connected layer, and a Transformer neural network, connected in sequence. Each feature extraction module consists of a 3D depthwise separable convolution module and an SE module connected in parallel, and the Transformer neural network extracts joint spatial-temporal features from the output of the fully connected layer and outputs the recognition result.
In this embodiment, the SE module is a channel attention module: it performs channel-wise feature enhancement on the input feature map without changing its size. The key frame data are first input into each feature extraction module, where spatial and temporal features are extracted by the 3D depthwise separable convolution module and the SE module; the extracted features are then passed to the fully connected layer; finally, the Transformer neural network extracts the joint spatial-temporal features and outputs the recognition result. This improves the model's recognition precision on gesture actions and thus the accuracy of the recognition result.
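The following PyTorch sketch shows one way to realize this architecture: feature extraction modules whose 3D depthwise separable convolution and SE branches run in parallel, followed by a fully connected layer and a Transformer encoder over the time axis. Channel widths, layer counts, and fusing the two branches by addition are assumptions, since this application does not fix those details.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv3d(c_in, c_in, kernel_size=3, padding=1, groups=c_in)
        self.pointwise = nn.Conv3d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class SEModule3d(nn.Module):
    """Channel attention: re-weights channels without changing feature size."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, channels), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, T, H, W)
        w = x.mean(dim=(2, 3, 4))               # squeeze: global pooling per channel
        w = self.fc(w).view(x.size(0), -1, 1, 1, 1)
        return x * w                            # excite: channel re-weighting

class FeatureExtraction(nn.Module):
    """3D depthwise separable conv branch in parallel with an SE branch."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = DepthwiseSeparableConv3d(c_in, c_out)
        self.se = SEModule3d(c_in)
        self.proj = nn.Conv3d(c_in, c_out, kernel_size=1)  # match channels for fusion

    def forward(self, x):
        return torch.relu(self.conv(x) + self.proj(self.se(x)))

class GestureRecognizer(nn.Module):
    def __init__(self, num_classes, dim=64):
        super().__init__()
        self.blocks = nn.Sequential(
            FeatureExtraction(3, 32),
            FeatureExtraction(32, dim),
        )
        self.fc = nn.Linear(dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                       # x: (B, 3, T, H, W) key frames
        f = self.blocks(x)                      # (B, C, T, H, W)
        f = f.mean(dim=(3, 4)).transpose(1, 2)  # pool space -> (B, T, C)
        f = self.fc(f)
        f = self.transformer(f)                 # joint spatial-temporal features
        return self.head(f.mean(dim=1))         # (B, num_classes)
```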
In some embodiments of the present application, the training the preset keyframe extraction model based on the sample dataset, after the training is completed, generates a target keyframe extraction model, as shown in fig. 2, including the following steps:
step S21, a video frame sequence of a first preset frame number is obtained from the sample data set, and key frame extraction is carried out on the video frame sequence based on the preset key frame extraction model, so that a key frame sequence of a second preset frame number is obtained.
In this embodiment, a video frame sequence of a first preset frame number in the sample data set is input into a preset key frame extraction model to perform key frame extraction, and a key frame sequence of a second preset frame number is determined according to a Q value output by the preset key frame extraction model, where it can be understood that the first preset frame number is greater than the second preset frame number, for example, the first preset frame number is 32, and the second preset frame number is 8.
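Under the assumption that the value network (the QNetwork sketched earlier) scores each frame and the highest-scoring frames are kept in temporal order, the selection step could look like this; the 32 and 8 frame counts mirror the example above.

```python
import torch

def extract_keyframes(q_net, frame_features: torch.Tensor, k: int = 8) -> torch.Tensor:
    """frame_features: (T, D) per-frame state encodings, e.g. T = 32."""
    with torch.no_grad():
        scores = q_net(frame_features).max(dim=-1).values  # one Q score per frame
    top = torch.topk(scores, k).indices
    keep = torch.sort(top).values        # restore temporal order of kept frames
    return frame_features[keep]          # (k, D) key frame sequence
```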
And S22, inputting the key frame sequence into a second gesture recognition model to perform gesture recognition, and generating rewards according to gesture recognition results.
In this embodiment, the second gesture recognition model is a pre-trained gesture recognition model. The extracted key frame sequence is input into it for gesture recognition, and a reward is generated from the recognition result; the reward characterizes the accuracy of the extracted key frame sequence, and the recognition result is either that the gesture matches the preset gesture or that it does not.
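One plausible reward signal, continuing the sketches above, gives +1 when the frozen second recognition model classifies the extracted key frames correctly and -1 otherwise; the exact reward values are an assumption, since this application only states that the reward characterizes the accuracy of the key frame sequence.

```python
def compute_reward(recognizer, keyframes, true_label: int) -> float:
    """Reward from the frozen second gesture recognition model (sketch).

    `keyframes` is whatever tensor the recognizer consumes (e.g. stacked
    key frames); a batch dimension is added before the forward pass.
    """
    logits = recognizer(keyframes.unsqueeze(0))
    predicted = logits.argmax(dim=1).item()
    return 1.0 if predicted == true_label else -1.0
```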
Alternatively, the model structure of the second gesture recognition model may be the same as or different from the first gesture recognition model. When the model structure of the second gesture recognition model is the same as that of the first gesture recognition model, the first gesture recognition model may be obtained by retraining the second gesture recognition model, or the second gesture recognition model may be used as the first gesture recognition model.
Step S23, updating the value function network in the preset key frame extraction model according to the rewards, and acquiring a new video frame sequence from the sample data set.
In this embodiment, the training is aimed at maximizing rewards, updating the value function network in the preset key frame extraction model according to rewards, and acquiring a new video frame sequence from the sample dataset.
And step S24, carrying out iterative updating on the value function network according to rewards corresponding to the new video frame sequences, and generating the target key frame extraction model when the preset training completion condition is met.
In this embodiment, steps S21 to S23 are repeated, and the value function network is iteratively updated according to the reward corresponding to each new video frame sequence. When a preset training completion condition is satisfied (reaching a preset number of iterations, or convergence of the model), training is complete and the target key frame extraction model is generated, which improves the accuracy of the target key frame extraction model.
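Tying the pieces together, the outer loop of steps S21-S24 could be sketched as follows; the clip-sampling and transition-building helpers are hypothetical, and the fixed iteration budget and target-network synchronization interval are simplifying assumptions standing in for the preset training completion condition.

```python
def train_keyframe_extractor(q_net, target_net, recognizer, dataset,
                             optimizer, num_iterations=10_000, sync_every=100):
    for it in range(num_iterations):
        frames, label = dataset.sample_clip()        # S21 (hypothetical helper): 32-frame sequence
        keyframes = extract_keyframes(q_net, frames, k=8)
        reward = compute_reward(recognizer, keyframes, label)   # S22
        batch = build_transitions(frames, keyframes, reward)    # S23 (hypothetical helper)
        dqn_update(q_net, target_net, optimizer, batch)
        if it % sync_every == 0:                     # periodic target-network sync
            target_net.load_state_dict(q_net.state_dict())
    return q_net  # S24: the target key frame extraction model
```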
In some embodiments of the present application, after generating the target keyframe extraction model, the method further comprises:
extracting a plurality of groups of key frames from the sample data set based on the target key frame extraction model to generate a key frame data set;
training the second gesture recognition model based on the key frame data set, and generating the first gesture recognition model after training is completed;
and deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
In this embodiment, the second gesture recognition model uses the same model structure as the first gesture recognition model. After the target key frame extraction model is generated, multiple groups of key frames are extracted from the sample data set with it to generate a key frame data set; the second gesture recognition model is then trained on the key frame data set, and the first gesture recognition model is generated once training completes. Finally, the output of the target key frame extraction model is connected to the input of the first gesture recognition model and both are deployed in the intelligent cabin, so that subsequent gesture recognition is performed directly with these two models.
Extracting the key frame data set with the target key frame extraction model and training the first gesture recognition model on it improves the accuracy of the first gesture recognition model, and thus makes gesture interaction in the intelligent cabin more accurate.
As an alternative, in some embodiments of the present application, after generating the reward according to the gesture recognition result, the method further comprises:
updating model parameters of the second gesture recognition model according to the gesture recognition results, carrying out iterative updating on the model parameters according to the gesture recognition results corresponding to the new video frame sequences, and generating the first gesture recognition model when generating the target key frame extraction model;
and deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
In this embodiment, the second gesture recognition model uses the same model structure as the first gesture recognition model. After a reward is generated from each gesture recognition result, the model parameters of the second gesture recognition model are updated according to that result, and these parameters are updated iteratively as new video frame sequences arrive; thus, when training of the preset key frame extraction model finishes and the target key frame extraction model is obtained, the first gesture recognition model has been generated at the same time. The output of the target key frame extraction model is connected to the input of the first gesture recognition model and both are deployed in the intelligent cabin, so that subsequent gesture recognition is performed directly with these two models.
Training the second gesture recognition model in parallel with the preset key frame extraction model yields the target key frame extraction model and the first gesture recognition model at the same time, which improves training efficiency.
In the gesture interaction method for the intelligent cabin environment provided by the embodiment of the application, gesture video data in the intelligent cabin are collected and labeled to generate a sample data set; a preset key frame extraction model, which extracts key frames based on a reinforcement learning algorithm, is trained on the sample data set, and a target key frame extraction model is generated once training completes; if a gesture video to be recognized is detected in the intelligent cabin, key frame data are extracted from it with the target key frame extraction model; and the key frame data are input into a first gesture recognition model, the intelligent cabin executing the corresponding instruction if the recognition result matches a preset gesture. Generating the target key frame extraction model through reinforcement learning retains the main gesture motion features in the extracted key frames, and performing gesture recognition on those key frames automatically extracts the key frames from the gesture video, improving the accuracy and speed of gesture recognition, so gesture interaction in the intelligent cabin environment can be carried out more efficiently and accurately.
In order to further explain the technical idea of the invention, the technical scheme of the invention is described with specific application scenarios.
The embodiment of the application provides a gesture interaction method in an intelligent cabin environment, which comprises the following steps:
step S201, gesture video data in the intelligent cabin are collected through a video collecting device.
The gesture video data comprise multiple segments of video of different gesture actions, each performed by different people under different illumination conditions, the people covering different genders and ages. The gesture video data are labeled according to the actual category of each gesture video, and a sample data set is generated once labeling is complete.
Step S202, a preset key frame extraction model and a second gesture recognition model are constructed.
The preset key frame extraction model is used for extracting key frames based on a reinforcement learning algorithm. The second gesture recognition model is a pre-trained gesture recognition model and is used for gesture recognition of a key frame sequence extracted by a preset key frame extraction model.
As shown in fig. 3, the preset key frame extraction model includes three convolution layers and two fully connected layers. The second gesture recognition model includes two feature extraction modules, one fully connected layer, and one Transformer neural network; each feature extraction module consists of a 3D depthwise separable convolution module and an SE module connected in parallel, the feature extraction modules, the fully connected layer, and the Transformer neural network are connected in sequence, and the Transformer neural network extracts joint spatial-temporal features from the output of the fully connected layer and outputs the recognition result.
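A sketch of that key frame extraction network (three convolution layers followed by two fully connected layers) is given below; kernel sizes, channel widths, and the per-frame input format are assumptions, since fig. 3 fixes only the layer counts.

```python
import torch.nn as nn

class KeyframeQNet(nn.Module):
    """Three conv layers + two fully connected layers, as in fig. 3 (sketch)."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # collapse spatial dimensions
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_actions),      # one Q value per candidate action
        )

    def forward(self, x):                     # x: (B, 3, H, W) batch of frames
        return self.fc(self.conv(x))
```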
Step S203, training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training is completed. The specific process may refer to steps S21-S24, which are not described herein.
Step S204, a first gesture recognition model is generated.
Specifically, multiple groups of key frames are extracted from a sample data set through a target key frame extraction model, a key frame data set is generated, then a second gesture recognition model is trained through the key frame data set, and after training is completed, a first gesture recognition model is generated.
Step S205, connecting the output end of the target key frame extraction model with the input end of the first gesture recognition model, and disposing the target key frame extraction model in the intelligent cabin.
In step S206, the video acquisition device in the intelligent cabin captures video data in real time. If a gesture video to be recognized is detected, the gesture video is input into the target key frame extraction model, the extracted key frame data are passed to the first gesture recognition model, and if the recognition result output by the first gesture recognition model matches a preset gesture, the intelligent cabin executes the instruction corresponding to that gesture.
Corresponding to the gesture interaction method in the intelligent cabin environment in the embodiment of the present application, the embodiment of the present application further provides a gesture interaction device in the intelligent cabin environment, as shown in fig. 4, where the device includes: the acquisition module 401 is used for acquiring gesture video data in the intelligent cabin, labeling the gesture video data and generating a sample data set; the training module 402 is configured to train a preset key frame extraction model based on the sample data set, and generate a target key frame extraction model after training is completed, where the preset key frame extraction model performs key frame extraction based on a reinforcement learning algorithm; the extracting module 403 is configured to extract key frame data from the gesture video based on the target key frame extraction model if the gesture video to be identified in the intelligent cockpit is detected; and the recognition module 404 is configured to input the key frame data into a first gesture recognition model, and if a recognition result output by the first gesture recognition model meets a preset gesture, make the intelligent cabin execute an instruction corresponding to the preset gesture.
In a specific application scenario, the training module 402 is specifically configured to: acquiring a video frame sequence of a first preset frame number from the sample data set, and extracting key frames of the video frame sequence based on the preset key frame extraction model to acquire a key frame sequence of a second preset frame number; inputting the key frame sequence into a second gesture recognition model to perform gesture recognition, and generating rewards according to gesture recognition results; updating a value function network in the preset key frame extraction model according to the rewards, and acquiring a new video frame sequence from the sample data set; and carrying out iterative updating on the value function network according to rewards corresponding to the new video frame sequences, and generating the target key frame extraction model when the preset training completion condition is met.
In a specific application scenario, the training module 402 is further configured to: extracting a plurality of groups of key frames from the sample data set based on the target key frame extraction model to generate a key frame data set; training the second gesture recognition model based on the key frame data set, and generating the first gesture recognition model after training is completed; and deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
In a specific application scenario, the training module 402 is further configured to: updating model parameters of the second gesture recognition model according to the gesture recognition results, carrying out iterative updating on the model parameters according to the gesture recognition results corresponding to the new video frame sequences, and generating the first gesture recognition model when generating the target key frame extraction model; and deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
In a specific application scenario, the first gesture recognition model includes at least one feature extraction module, a fully connected layer, and a Transformer neural network, connected in sequence; the feature extraction module consists of a 3D depthwise separable convolution module and an SE module connected in parallel, and the Transformer neural network extracts joint spatial-temporal features from the output of the fully connected layer and outputs the recognition result.
The embodiment of the invention also provides an electronic device. As shown in fig. 5, it comprises a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 communicate with one another through the communication bus 504:
a memory 503 for storing executable instructions of the processor;
a processor 501 configured to execute via execution of the executable instructions:
acquiring gesture video data in the intelligent cabin, and labeling the gesture video data to generate a sample data set; training a preset key frame extraction model based on a sample data set, and generating a target key frame extraction model after training, wherein the preset key frame extraction model performs key frame extraction based on a reinforcement learning algorithm; if a gesture video to be identified in the intelligent cabin is detected, extracting key frame data from the gesture video based on a target key frame extraction model; inputting the key frame data into a first gesture recognition model, and if the recognition result output by the first gesture recognition model accords with a preset gesture, enabling the intelligent cabin to execute an instruction corresponding to the preset gesture.
The communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is drawn in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include RAM (Random Access Memory) or non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a CPU (Central Processing Unit), an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present invention, a computer readable storage medium is provided, in which a computer program is stored, which when executed by a processor implements a gesture interaction method in a smart cabin environment as described above.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the gesture interaction method in a smart cockpit environment as described above.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions; when the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means. The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., solid state disk), or the like.
It is noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any such actual relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may refer to one another, and the description of each embodiment focuses on its differences from the others.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A gesture interaction method in an intelligent cabin environment, the method comprising:
acquiring gesture video data in an intelligent cabin, and labeling the gesture video data to generate a sample data set;
training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training is completed, wherein the preset key frame extraction model performs key frame extraction based on a reinforcement learning algorithm;
if the gesture video to be recognized in the intelligent cabin is detected, extracting key frame data from the gesture video based on the target key frame extraction model;
inputting the key frame data into a first gesture recognition model, and if the recognition result output by the first gesture recognition model accords with a preset gesture, enabling the intelligent cabin to execute an instruction corresponding to the preset gesture.
2. The method of claim 1, wherein training a preset keyframe extraction model based on the sample dataset, generating a target keyframe extraction model after training is completed, comprises:
acquiring a video frame sequence of a first preset frame number from the sample data set, and extracting key frames of the video frame sequence based on the preset key frame extraction model to acquire a key frame sequence of a second preset frame number;
inputting the key frame sequence into a second gesture recognition model to perform gesture recognition, and generating rewards according to gesture recognition results;
updating a value function network in the preset key frame extraction model according to the rewards, and acquiring a new video frame sequence from the sample data set;
and carrying out iterative updating on the value function network according to rewards corresponding to the new video frame sequences, and generating the target key frame extraction model when the preset training completion condition is met.
3. The method of claim 2, wherein after generating the target keyframe extraction model, the method further comprises:
extracting a plurality of groups of key frames from the sample data set based on the target key frame extraction model to generate a key frame data set;
training the second gesture recognition model based on the key frame data set, and generating the first gesture recognition model after training is completed;
and deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
4. The method of claim 2, wherein after generating the reward based on the gesture recognition result, the method further comprises:
updating model parameters of the second gesture recognition model according to the gesture recognition results, carrying out iterative updating on the model parameters according to the gesture recognition results corresponding to the new video frame sequences, and generating the first gesture recognition model when generating the target key frame extraction model;
and deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
5. The method of claim 1, wherein the first gesture recognition model includes at least one feature extraction module, a fully connected layer, and a Transformer neural network, each of the feature extraction module, the fully connected layer, and the Transformer neural network being connected in sequence, the feature extraction module including a 3D depth separable convolution module and a SE module connected in parallel, the Transformer neural network being configured to extract spatial and temporal joint features from the fully connected layer and output a recognition result.
6. A gesture interaction device in an intelligent cockpit environment, the device comprising:
the acquisition module is used for acquiring gesture video data in the intelligent cabin, marking the gesture video data and generating a sample data set;
the training module is used for training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training is completed, wherein the preset key frame extraction model is used for extracting key frames based on a reinforcement learning algorithm;
the extraction module is used for extracting key frame data from the gesture video based on the target key frame extraction model if the gesture video to be identified in the intelligent cabin is detected;
and the identification module is used for inputting the key frame data into a first gesture identification model, and if the identification result output by the first gesture identification model accords with a preset gesture, the intelligent cabin executes an instruction corresponding to the preset gesture.
7. The apparatus of claim 6, wherein the training module is specifically configured to:
acquiring a video frame sequence of a first preset frame number from the sample data set, and extracting key frames of the video frame sequence based on the preset key frame extraction model to acquire a key frame sequence of a second preset frame number;
inputting the key frame sequence into a second gesture recognition model to perform gesture recognition, and generating rewards according to gesture recognition results;
updating a value function network in the preset key frame extraction model according to the rewards, and acquiring a new video frame sequence from the sample data set;
and carrying out iterative updating on the value function network according to rewards corresponding to the new video frame sequences, and generating the target key frame extraction model when the preset training completion condition is met.
8. The apparatus of claim 6, wherein the first gesture recognition model comprises at least one feature extraction module, a fully connected layer, and a Transformer neural network, each of the feature extraction module, the fully connected layer, and the Transformer neural network being connected in sequence, the feature extraction module comprising a 3D depth separable convolution module and a SE module connected in parallel, the Transformer neural network being configured to extract spatial and temporal joint features from the fully connected layer and output a recognition result.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the gesture interaction method in the intelligent cockpit environment of any one of claims 1-5 via execution of the executable instructions.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the gesture interaction method in the intelligent cabin environment of any one of claims 1 to 5.
CN202311336297.4A 2023-10-16 2023-10-16 Gesture interaction method and related equipment in intelligent cabin environment Pending CN117373121A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311336297.4A CN117373121A (en) 2023-10-16 2023-10-16 Gesture interaction method and related equipment in intelligent cabin environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311336297.4A CN117373121A (en) 2023-10-16 2023-10-16 Gesture interaction method and related equipment in intelligent cabin environment

Publications (1)

Publication Number Publication Date
CN117373121A (en) 2024-01-09

Family

ID=89405360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311336297.4A Pending CN117373121A (en) 2023-10-16 2023-10-16 Gesture interaction method and related equipment in intelligent cabin environment

Country Status (1)

Country Link
CN (1) CN117373121A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN112183217A (en) * 2020-09-02 2021-01-05 鹏城实验室 Gesture recognition method, interaction method based on gesture recognition and mixed reality glasses
CN112733823A (en) * 2021-03-31 2021-04-30 南昌虚拟现实研究院股份有限公司 Method and device for extracting key frame for gesture recognition and readable storage medium
CN113792635A (en) * 2021-09-07 2021-12-14 盐城工学院 Gesture recognition method based on lightweight convolutional neural network
CN114360067A (en) * 2022-01-12 2022-04-15 武汉科技大学 Dynamic gesture recognition method based on deep learning
CN114898457A (en) * 2022-04-11 2022-08-12 厦门瑞为信息技术有限公司 Dynamic gesture recognition method and system based on hand key points and transform
CN116028889A (en) * 2023-02-02 2023-04-28 中国科学技术大学 Multi-mode progressive hierarchical fusion method for natural gesture recognition



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination