CN117373121B - Gesture interaction method and related equipment in an intelligent cabin environment
- Publication number: CN117373121B
- Application number: CN202311336297.4A
- Authority: CN (China)
- Prior art keywords: gesture, key frame, model, preset, extraction model
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
- G06F3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
- G06V10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- Y02T10/40: Engine management systems (climate change mitigation technologies related to transportation)
Abstract
The invention discloses a gesture interaction method and related equipment for an intelligent cabin environment. The method comprises the following steps: acquiring gesture video data in the intelligent cabin and labeling the data to generate a sample data set; training a preset key frame extraction model on the sample data set and generating a target key frame extraction model once training is complete, wherein the preset key frame extraction model extracts key frames based on a reinforcement learning algorithm; if a gesture video to be recognized is detected in the intelligent cabin, extracting key frame data from the gesture video with the target key frame extraction model; and inputting the key frame data into a first gesture recognition model and, if the recognition result output by the first gesture recognition model matches a preset gesture, causing the intelligent cabin to execute the instruction corresponding to that gesture. Because the target key frame extraction model is generated through reinforcement learning and used to extract the key frames, gesture interaction in the intelligent cabin environment is achieved more efficiently and accurately.
Description
Technical Field
The application relates to the technical field of intelligent automobiles, and in particular to a gesture interaction method and related equipment in an intelligent cabin environment.
Background
With the continuous development of artificial intelligence technology, its application scenarios have become increasingly diverse and specialized. The intelligent cabin, an innovative feature of the modern automobile, is an important application scenario of artificial intelligence technology and is of great significance for improving driving experience, operating convenience, driving safety, and comfort.
Gesture interaction, as a novel interaction mode, can improve the driving experience and operating convenience of the intelligent cabin. Gesture interaction depends on accurate gesture recognition. Most existing techniques recognize gestures from a single picture; this adapts well to static gestures but cannot handle dynamic gesture motion. Other methods perform dynamic gesture recognition on video data, but if too many frames are input during recognition, the data are redundant and computing resources are wasted, whereas if too few frames are input, the complete gesture cannot be recognized, causing false recognition and low recognition accuracy.
Therefore, how to perform gesture interaction in the intelligent cabin environment more efficiently and accurately is a technical problem to be solved at present.
Disclosure of Invention
The embodiment of the application provides a gesture interaction method and related equipment in an intelligent cabin environment, which are used for more efficiently and accurately carrying out gesture interaction in the intelligent cabin environment.
In a first aspect, a gesture interaction method in an intelligent cabin environment is provided, the method comprising: acquiring gesture video data in an intelligent cabin, and labeling the gesture video data to generate a sample data set; training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training is completed, wherein the preset key frame extraction model performs key frame extraction based on a reinforcement learning algorithm; if the gesture video to be recognized in the intelligent cabin is detected, extracting key frame data from the gesture video based on the target key frame extraction model; inputting the key frame data into a first gesture recognition model, and if the recognition result output by the first gesture recognition model accords with a preset gesture, enabling the intelligent cabin to execute an instruction corresponding to the preset gesture.
In a second aspect, a gesture interaction device in an intelligent cabin environment is provided, the device comprising: the acquisition module is used for acquiring gesture video data in the intelligent cabin, marking the gesture video data and generating a sample data set; the training module is used for training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training is completed, wherein the preset key frame extraction model is used for extracting key frames based on a reinforcement learning algorithm; the extraction module is used for extracting key frame data from the gesture video based on the target key frame extraction model if the gesture video to be identified in the intelligent cabin is detected; and the identification module is used for inputting the key frame data into a first gesture identification model, and if the identification result output by the first gesture identification model accords with a preset gesture, the intelligent cabin executes an instruction corresponding to the preset gesture.
In a third aspect, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the gesture interaction method in the intelligent cabin environment of the first aspect via execution of the executable instructions.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the gesture interaction method in the intelligent cabin environment according to the first aspect.
By applying this technical scheme, gesture video data in the intelligent cabin are collected and labeled to generate a sample data set; a preset key frame extraction model, which extracts key frames based on a reinforcement learning algorithm, is trained on the sample data set, and a target key frame extraction model is generated once training is complete; if a gesture video to be recognized is detected in the intelligent cabin, key frame data are extracted from it by the target key frame extraction model; and the key frame data are input into a first gesture recognition model, and if the recognition result matches a preset gesture, the intelligent cabin executes the corresponding instruction. Because the target key frame extraction model is generated through reinforcement learning, the extracted key frames retain the main characteristics of the gesture action, and gesture recognition is performed on those key frames. Key frames are thus extracted from the gesture video automatically, the accuracy and speed of gesture recognition are improved, and gesture interaction in the intelligent cabin environment can be performed more efficiently and accurately.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; for a person skilled in the art, other drawings may be obtained from them without inventive effort.
Fig. 1 shows a schematic flow chart of a gesture interaction method in an intelligent cabin environment according to an embodiment of the present invention;
Fig. 2 shows a schematic flow chart of training a preset key frame extraction model in an embodiment of the invention;
Fig. 3 shows a schematic diagram of training a preset key frame extraction model according to an embodiment of the present invention;
Fig. 4 shows a schematic structural diagram of a gesture interaction device in an intelligent cabin environment according to an embodiment of the present invention;
Fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It is noted that other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise constructions described hereinafter and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.
The application is operational with numerous general-purpose or special-purpose computing device environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, distributed computing environments that include any of the above systems or devices, and the like.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiment of the application provides a gesture interaction method in an intelligent cabin environment. A target key frame extraction model is generated through reinforcement learning and used to extract key frames, which retains the main characteristics of the gesture action; gesture recognition is then performed on the extracted key frames. Key frames in a gesture video are thus extracted automatically, and the accuracy and speed of gesture recognition are improved, enabling more efficient and accurate gesture interaction in the intelligent cabin environment.
As shown in fig. 1, the method comprises the steps of:
Step S101, acquiring gesture video data in an intelligent cabin, and labeling the gesture video data to generate a sample data set.
In this embodiment, a video acquisition device (such as a camera) is provided in the intelligent cabin, and gesture video data are collected through it. The gesture video data include multiple segments of video of different gesture actions, each action performed by different people under different illumination conditions, the people covering different genders and ages. After the gesture video data are acquired, they are labeled according to the actual category of each gesture video, and the sample data set is generated once labeling is complete.
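For illustration, a minimal sketch of assembling such a labelled sample set is given below; the gesture class names and directory layout are hypothetical assumptions, since the patent does not enumerate the preset gestures.

```python
from pathlib import Path

# Hypothetical gesture classes; the disclosure does not name the preset gestures.
GESTURE_LABELS = {"swipe_left": 0, "swipe_right": 1, "palm_open": 2}

def build_sample_set(root: str) -> list[tuple[str, int]]:
    """Pair each recorded cabin video with the label of its gesture category."""
    samples = []
    for class_name, class_id in GESTURE_LABELS.items():
        for video in sorted(Path(root, class_name).glob("*.mp4")):
            samples.append((str(video), class_id))
    return samples
```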
Step S102, training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training, wherein the preset key frame extraction model performs key frame extraction based on a reinforcement learning algorithm.
DQN (Deep Q-Network) is a deep reinforcement learning algorithm that uses a neural network to learn a Q-value function, i.e., a function mapping states and actions to Q values that represent the expected return obtained by performing an action in a particular state. The basic reinforcement learning process is as follows: the environment provides a state; the DQN obtains the Q values for that state from the value-function network and acts accordingly; after receiving the action, the environment returns a reward and the next state; and the parameters of the value-function network are updated according to the reward. A preset key frame extraction model based on this reinforcement learning algorithm is constructed in advance for key frame extraction. After the sample data set is obtained, the preset key frame extraction model is trained on it; the target key frame extraction model is generated once training is complete and is then used to extract key frame sequences from the gesture videos to be recognized.
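The value-function update just described can be sketched in PyTorch as follows. This is a generic DQN step under the assumption that q_net maps a batch of states to one Q value per action; it is not the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN step: fit Q(s, a) toward r + gamma * max_a' Q_target(s', a')."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target network is held fixed during the step
        best_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * best_next * (1 - dones)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```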
Step S103, if the gesture video to be recognized in the intelligent cabin is detected, key frame data are extracted from the gesture video based on the target key frame extraction model.
The video acquisition device in the intelligent cabin acquires video data in real time. If a gesture video to be recognized is detected, the gesture video is input into the target key frame extraction model, which performs key frame extraction and outputs the key frame data.
Step S104, inputting the key frame data into a first gesture recognition model, and if the recognition result output by the first gesture recognition model accords with a preset gesture, enabling the intelligent cabin to execute an instruction corresponding to the preset gesture.
In this embodiment, a first gesture recognition model is trained in advance. The key frame data are input into the first gesture recognition model for gesture recognition. If the recognition result matches a preset gesture, an effective interaction gesture is present, and the intelligent cabin is made to execute the instruction corresponding to that gesture, for example playing music, starting navigation, or turning on the in-vehicle lights. Because the target key frame extraction model is generated through reinforcement learning and used to extract key frames that retain the main characteristics of the gesture action, gesture recognition based on those key frames realizes more efficient and accurate gesture interaction in the intelligent cabin environment.
It can be understood that if the recognition result output by the first gesture recognition model does not match any preset gesture, the intelligent cabin does not execute a gesture interaction instruction; the system either continues to detect whether a new gesture video to be recognized exists, or sends a prompt that gesture recognition has failed so that the user can adjust the gesture. In addition, after the target key frame extraction model has been generated, steps S101-S102 are no longer performed, and steps S103-S104 are executed directly for gesture recognition and interaction.
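A minimal sketch of this dispatch step is shown below; the gesture names and command identifiers are hypothetical, since the disclosure only gives playing music, starting navigation, and turning on lights as examples.

```python
# Hypothetical gesture-to-command table for the cabin controller.
GESTURE_COMMANDS = {
    "swipe_left": "play_music",
    "palm_open": "open_navigation",
    "swipe_right": "turn_on_cabin_lights",
}

def dispatch(recognition_result: str) -> str | None:
    """Return the cabin instruction for a preset gesture, or None with a prompt."""
    command = GESTURE_COMMANDS.get(recognition_result)
    if command is None:
        print("Gesture not recognized; please adjust the gesture and try again.")
    return command
```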
In some embodiments of the present application, the first gesture recognition model comprises at least one feature extraction module, a fully connected layer, and a Transformer neural network, connected in sequence. Each feature extraction module comprises a 3D depthwise separable convolution module and an SE module connected in parallel, and the Transformer neural network extracts joint spatial-temporal features from the output of the fully connected layer and outputs the recognition result.
In this embodiment, the SE (Squeeze-and-Excitation) module is a channel attention module that enhances the channel features of the input feature map without changing its size. The key frame data are first input into each feature extraction module, where spatial and temporal features are extracted by the 3D depthwise separable convolution module and the SE module. The extracted features are then input into the fully connected layer, and finally the Transformer neural network extracts the joint spatial-temporal features from the output of the fully connected layer and outputs the recognition result. This improves the model's precision in recognizing gesture actions and thus the accuracy of the recognition result.
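A sketch of this architecture in PyTorch follows. Channel widths, head counts, and layer depths are illustrative assumptions; only the overall layout (parallel 3D depthwise separable convolution and SE branches, a fully connected layer, then a Transformer encoder over the temporal axis) mirrors the description.

```python
import torch
import torch.nn as nn

class SE3D(nn.Module):
    """Squeeze-and-Excitation channel attention; output keeps the input size."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
    def forward(self, x):                        # x: (N, C, T, H, W)
        w = x.mean(dim=(2, 3, 4))                # squeeze over time and space
        w = self.fc(w).view(x.size(0), -1, 1, 1, 1)
        return x * w                             # channel-wise re-weighting

class FeatureBlock(nn.Module):
    """3D depthwise separable convolution in parallel with an SE branch."""
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv3d(channels, channels, 3, padding=1, groups=channels)
        self.pw = nn.Conv3d(channels, channels, 1)  # pointwise half of the separable conv
        self.se = SE3D(channels)
    def forward(self, x):
        return torch.relu(self.pw(self.dw(x)) + self.se(x))

class GestureRecognizer(nn.Module):
    """Feature extraction modules -> fully connected layer -> Transformer encoder."""
    def __init__(self, in_ch=3, dim=64, num_classes=10, frames=8):
        super().__init__()
        self.stem = nn.Conv3d(in_ch, dim, 3, padding=1)
        self.blocks = nn.Sequential(FeatureBlock(dim), FeatureBlock(dim))
        self.pool = nn.AdaptiveAvgPool3d((frames, 1, 1))  # keep the temporal axis
        self.fc = nn.Linear(dim, dim)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(dim, num_classes)
    def forward(self, x):                         # x: (N, C, T, H, W)
        h = self.blocks(self.stem(x))
        h = self.pool(h).squeeze(-1).squeeze(-1).transpose(1, 2)  # (N, T, dim)
        h = self.transformer(self.fc(h))
        return self.head(h.mean(dim=1))           # logits over gesture classes
```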
In some embodiments of the present application, the training is performed on the preset key frame extraction model based on the sample data set, and after the training is completed, a target key frame extraction model is generated, as shown in fig. 2, including the following steps:
Step S21, a video frame sequence of a first preset frame number is obtained from the sample data set, and key frame extraction is carried out on the video frame sequence based on the preset key frame extraction model, so that a key frame sequence of a second preset frame number is obtained.
In this embodiment, a video frame sequence of a first preset frame number from the sample data set is input into the preset key frame extraction model for key frame extraction, and a key frame sequence of a second preset frame number is determined according to the Q values output by the model. The first preset frame number is greater than the second preset frame number; for example, the first preset frame number is 32 and the second is 8.
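Concretely, the selection can be sketched as below, assuming the value network emits one Q value per candidate frame; that per-frame scoring convention is consistent with the description but is not spelled out in it.

```python
import torch

def select_key_frames(q_net, frames, k=8):
    """Keep the k highest-scoring candidate frames, in temporal order.

    frames: tensor of shape (T, C, H, W), e.g. T = 32 candidates.
    """
    with torch.no_grad():
        q_values = q_net(frames)                      # (T,) one score per frame
    idx = torch.topk(q_values, k).indices.sort().values
    return frames[idx], idx
```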
And S22, inputting the key frame sequence into a second gesture recognition model to perform gesture recognition, and generating rewards according to gesture recognition results.
In this embodiment, the second gesture recognition model is a pre-trained gesture recognition model. The extracted key frame sequence is input into the second gesture recognition model for gesture recognition, and a reward is generated according to the gesture recognition result; the reward characterizes the accuracy of the extracted key frame sequence. The gesture recognition result either matches or fails to match the preset gesture.
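One plausible reward shaping is sketched below, under the assumption that a correct classification earns a positive reward and anything else a penalty; the patent does not give the exact reward function.

```python
def compute_reward(logits, true_label, bonus=1.0, penalty=-1.0):
    """Reward the extractor when its key frames let the recognizer classify correctly."""
    predicted = logits.argmax(dim=-1).item()
    return bonus if int(predicted) == int(true_label) else penalty
```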
Alternatively, the model structure of the second gesture recognition model may be the same as or different from the first gesture recognition model. When the model structure of the second gesture recognition model is the same as that of the first gesture recognition model, the first gesture recognition model may be obtained by retraining the second gesture recognition model, or the second gesture recognition model may be used as the first gesture recognition model.
Step S23, updating the value function network in the preset key frame extraction model according to the rewards, and acquiring a new video frame sequence from the sample data set.
In this embodiment, training aims to maximize the reward: the value-function network in the preset key frame extraction model is updated according to the reward, and a new video frame sequence is acquired from the sample data set.
And step S24, carrying out iterative updating on the value function network according to rewards corresponding to the new video frame sequences, and generating the target key frame extraction model when the preset training completion condition is met.
In this embodiment, steps S21 to S23 are repeated, and the value-function network is iteratively updated according to the reward corresponding to each new video frame sequence. Training is complete when a preset completion condition is satisfied, namely reaching a preset number of iterations or convergence of the model, and the target key frame extraction model is generated; this improves the accuracy of the target key frame extraction model.
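Putting steps S21-S24 together, the training loop might look like the sketch below. It reuses compute_reward from above, treats frame selection as a single-step decision (so the next-state bookkeeping of full DQN is omitted for brevity), and assumes a hypothetical dataset.sample interface; none of these simplifications are mandated by the patent.

```python
import torch
import torch.nn.functional as F

def train_extractor(q_net, optimizer, recognizer, dataset, max_iters=10_000):
    """Steps S21-S24: extract key frames, score them with the recognizer, update."""
    for step in range(max_iters):                      # preset iteration budget
        frames, label = dataset.sample(num_frames=32)  # S21: (32, C, H, W) sequence
        q_values = q_net(frames)                       # one Q value per frame
        idx = torch.topk(q_values, k=8).indices.sort().values
        clip = frames[idx].permute(1, 0, 2, 3).unsqueeze(0)   # (1, C, 8, H, W)
        with torch.no_grad():
            reward = compute_reward(recognizer(clip), label)  # S22: reward signal
        # S23: push the Q values of the chosen frames toward the observed reward
        loss = F.mse_loss(q_values[idx], torch.full((8,), float(reward)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return q_net                                       # S24: target extraction model
```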
In some embodiments of the present application, after generating the target keyframe extraction model, the method further comprises:
Extracting a plurality of groups of key frames from the sample data set based on the target key frame extraction model to generate a key frame data set;
Training the second gesture recognition model based on the key frame data set, and generating the first gesture recognition model after training is completed;
And deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
In this embodiment, the second gesture recognition model adopts the same model structure as the first gesture recognition model. After the target key frame extraction model is generated, multiple groups of key frames are extracted from the sample data set by the target key frame extraction model to generate a key frame data set. The second gesture recognition model is then trained on the key frame data set, and the first gesture recognition model is generated once training is complete. Finally, the output of the target key frame extraction model is connected to the input of the first gesture recognition model, the pair is deployed in the intelligent cabin, and subsequent gesture recognition is performed directly with these two models.
After the target key frame extraction model is obtained, a key frame data set is extracted with it, and the first gesture recognition model is trained on that data set. This improves the accuracy of the first gesture recognition model, so that gesture interaction in the intelligent cabin is performed more accurately.
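Once deployed, the inference path is a straight chain from extractor to recognizer. A sketch follows, reusing select_key_frames from earlier and assuming class indices identify the preset gestures.

```python
import torch

def recognize_gesture(extractor, recognizer, video_frames, preset_gestures):
    """Deployed pipeline: key frame extraction feeds directly into recognition."""
    key_frames, _ = select_key_frames(extractor, video_frames, k=8)
    clip = key_frames.permute(1, 0, 2, 3).unsqueeze(0)   # (1, C, 8, H, W)
    with torch.no_grad():
        label = recognizer(clip).argmax(dim=-1).item()
    return label if label in preset_gestures else None   # None: no instruction issued
```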
Alternatively, in some embodiments of the present application, after generating the reward based on the gesture recognition result, the method further comprises:
Updating model parameters of the second gesture recognition model according to the gesture recognition results, carrying out iterative updating on the model parameters according to the gesture recognition results corresponding to the new video frame sequences, and generating the first gesture recognition model when generating the target key frame extraction model;
And deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
In this embodiment, the second gesture recognition model adopts the same model structure as the first gesture recognition model. After the reward is generated from the gesture recognition result, the model parameters of the second gesture recognition model are also updated according to that result and are iteratively refined, so that the first gesture recognition model is produced at the same time as the target key frame extraction model. The output of the generated target key frame extraction model is connected to the input of the first gesture recognition model, the pair is deployed in the intelligent cabin, and subsequent gesture recognition is performed directly with these two models.
The second gesture recognition model is trained synchronously with the preset key frame extraction model, so that the target key frame extraction model and the first gesture recognition model are generated at the same time, which improves training efficiency.
According to the gesture interaction method in the intelligent cabin environment provided by the embodiment, gesture video data in the intelligent cabin are collected and labeled to generate a sample data set; a preset key frame extraction model, which extracts key frames based on a reinforcement learning algorithm, is trained on the sample data set, and a target key frame extraction model is generated once training is complete; if a gesture video to be recognized is detected in the intelligent cabin, key frame data are extracted from it by the target key frame extraction model; and the key frame data are input into a first gesture recognition model, and if the recognition result matches a preset gesture, the intelligent cabin executes the corresponding instruction. The key frames extracted through reinforcement learning retain the main characteristics of the gesture action, so key frames are extracted from the gesture video automatically, the accuracy and speed of gesture recognition are improved, and gesture interaction in the intelligent cabin environment can be performed more efficiently and accurately.
In order to further explain the technical idea of the invention, the technical scheme of the invention is described with specific application scenarios.
The embodiment of the application provides a gesture interaction method in an intelligent cabin environment, which comprises the following steps:
Step S201, gesture video data in the intelligent cabin are collected through a video collecting device.
The gesture video data comprise multiple segments of video of different gesture actions, each action performed by different people under different illumination conditions, the people covering different genders and ages. The gesture video data are labeled according to the actual category of each gesture video, and the sample data set is generated once labeling is complete.
Step S202, a preset key frame extraction model and a second gesture recognition model are constructed.
The preset key frame extraction model is used for extracting key frames based on a reinforcement learning algorithm. The second gesture recognition model is a pre-trained gesture recognition model and is used for gesture recognition of a key frame sequence extracted by a preset key frame extraction model.
As shown in fig. 3, the preset key frame extraction model comprises three convolution layers and two fully connected layers. The second gesture recognition model comprises two feature extraction modules, one fully connected layer, and one Transformer neural network; each feature extraction module comprises a 3D depthwise separable convolution module and an SE module connected in parallel, the feature extraction modules, the fully connected layer, and the Transformer neural network are connected in sequence, and the Transformer neural network extracts joint spatial-temporal features from the output of the fully connected layer and outputs the recognition result.
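The three-convolution, two-fully-connected layout of the preset key frame extraction model might be realized as below; the channel widths and the per-frame scoring convention are assumptions for illustration, not the disclosed design.

```python
import torch.nn as nn

class KeyFrameQNet(nn.Module):
    """Three conv layers then two fully connected layers; one Q value per frame."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1))
    def forward(self, frames):                   # frames: (T, C, H, W)
        return self.fc(self.conv(frames)).squeeze(-1)   # (T,) Q values
```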
Step S203, training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training is completed. For the specific process, refer to steps S21-S24, which are not repeated here.
Step S204, a first gesture recognition model is generated.
Specifically, multiple groups of key frames are extracted from a sample data set through a target key frame extraction model, a key frame data set is generated, then a second gesture recognition model is trained through the key frame data set, and after training is completed, a first gesture recognition model is generated.
Step S205, connecting the output end of the target key frame extraction model with the input end of the first gesture recognition model, and disposing the target key frame extraction model in the intelligent cabin.
In step S206, the video acquisition device in the intelligent cabin acquires video data in real time. If a gesture video to be recognized is detected, the gesture video is input into the target key frame extraction model, the extracted key frames are input into the first gesture recognition model, and if the recognition result output by the first gesture recognition model matches a preset gesture, the intelligent cabin executes the instruction corresponding to that gesture.
Corresponding to the gesture interaction method in the intelligent cabin environment in the embodiment of the present application, the embodiment of the present application further provides a gesture interaction device in the intelligent cabin environment, as shown in fig. 4. The device comprises: the acquisition module 401, used for acquiring gesture video data in the intelligent cabin, labeling the gesture video data, and generating a sample data set; the training module 402, used for training a preset key frame extraction model based on the sample data set and generating a target key frame extraction model after training is completed, wherein the preset key frame extraction model performs key frame extraction based on a reinforcement learning algorithm; the extraction module 403, used for extracting key frame data from the gesture video based on the target key frame extraction model if a gesture video to be recognized in the intelligent cabin is detected; and the recognition module 404, used for inputting the key frame data into a first gesture recognition model and, if the recognition result output by the first gesture recognition model matches a preset gesture, making the intelligent cabin execute the instruction corresponding to that gesture.
In a specific application scenario, the training module 402 is specifically configured to: acquiring a video frame sequence of a first preset frame number from the sample data set, and extracting key frames of the video frame sequence based on the preset key frame extraction model to acquire a key frame sequence of a second preset frame number; inputting the key frame sequence into a second gesture recognition model to perform gesture recognition, and generating rewards according to gesture recognition results; updating a value function network in the preset key frame extraction model according to the rewards, and acquiring a new video frame sequence from the sample data set; and carrying out iterative updating on the value function network according to rewards corresponding to the new video frame sequences, and generating the target key frame extraction model when the preset training completion condition is met.
In a specific application scenario, the training module 402 is further configured to: extracting a plurality of groups of key frames from the sample data set based on the target key frame extraction model to generate a key frame data set; training the second gesture recognition model based on the key frame data set, and generating the first gesture recognition model after training is completed; and deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
In a specific application scenario, the training module 402 is further configured to: updating model parameters of the second gesture recognition model according to the gesture recognition results, carrying out iterative updating on the model parameters according to the gesture recognition results corresponding to the new video frame sequences, and generating the first gesture recognition model when generating the target key frame extraction model; and deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
In a specific application scenario, the first gesture recognition model comprises at least one feature extraction module, a fully connected layer, and a Transformer neural network, connected in sequence. The feature extraction module comprises a 3D depthwise separable convolution module and an SE module connected in parallel, and the Transformer neural network is used for extracting joint spatial-temporal features from the fully connected layer and outputting a recognition result.
The embodiment of the invention also provides an electronic device. As shown in fig. 5, it comprises a processor 501, a communication interface 502, a memory 503, and a communication bus 504, where the processor 501, the communication interface 502, and the memory 503 communicate with each other through the communication bus 504:
the memory 503 is used for storing executable instructions of the processor;
the processor 501 is configured to perform the following via execution of the executable instructions:
Acquiring gesture video data in the intelligent cabin, and labeling the gesture video data to generate a sample data set; training a preset key frame extraction model based on a sample data set, and generating a target key frame extraction model after training, wherein the preset key frame extraction model performs key frame extraction based on a reinforcement learning algorithm; if a gesture video to be identified in the intelligent cabin is detected, extracting key frame data from the gesture video based on a target key frame extraction model; inputting the key frame data into a first gesture recognition model, and if the recognition result output by the first gesture recognition model accords with a preset gesture, enabling the intelligent cabin to execute an instruction corresponding to the preset gesture.
The communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include RAM (Random Access Memory) or non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a CPU (Central Processing Unit), an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present invention, a computer readable storage medium is provided, in which a computer program is stored, which when executed by a processor implements a gesture interaction method in a smart cabin environment as described above.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the gesture interaction method in a smart cockpit environment as described above.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.
Claims (8)
1. A gesture interaction method in an intelligent cabin environment is characterized in that,
The method comprises the following steps:
acquiring gesture video data in an intelligent cabin, and labeling the gesture video data to generate a sample data set;
training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training is completed, wherein the preset key frame extraction model performs key frame extraction based on a reinforcement learning algorithm;
If the gesture video to be recognized in the intelligent cabin is detected, extracting key frame data from the gesture video based on the target key frame extraction model;
inputting the key frame data into a first gesture recognition model, and if the recognition result output by the first gesture recognition model accords with a preset gesture, enabling the intelligent cabin to execute an instruction corresponding to the preset gesture;
The first gesture recognition model comprises at least one feature extraction module, a fully connected layer and a Transformer neural network, wherein each feature extraction module, the fully connected layer and the Transformer neural network are sequentially connected, the feature extraction module comprises a 3D depthwise separable convolution module and an SE module which are connected in parallel, and the Transformer neural network is used for extracting the joint features of space and time from the fully connected layer and outputting a recognition result.
2. The method of claim 1, wherein,
Training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training is completed, wherein the training comprises the following steps:
Acquiring a video frame sequence of a first preset frame number from the sample data set, and extracting key frames of the video frame sequence based on the preset key frame extraction model to acquire a key frame sequence of a second preset frame number;
Inputting the key frame sequence into a second gesture recognition model to perform gesture recognition, and generating rewards according to gesture recognition results;
Updating a value function network in the preset key frame extraction model according to the rewards, and acquiring a new video frame sequence from the sample data set;
And carrying out iterative updating on the value function network according to rewards corresponding to the new video frame sequences, and generating the target key frame extraction model when the preset training completion condition is met.
3. The method of claim 2, wherein,
After generating the target keyframe extraction model, the method further comprises:
Extracting a plurality of groups of key frames from the sample data set based on the target key frame extraction model to generate a key frame data set;
Training the second gesture recognition model based on the key frame data set, and generating the first gesture recognition model after training is completed;
And deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
4. The method of claim 2, wherein,
After generating the reward based on the gesture recognition result, the method further comprises:
Updating model parameters of the second gesture recognition model according to the gesture recognition results, carrying out iterative updating on the model parameters according to the gesture recognition results corresponding to the new video frame sequences, and generating the first gesture recognition model when generating the target key frame extraction model;
And deploying the target key frame extraction model and the first gesture recognition model in the intelligent cabin.
5. A gesture interaction device in an intelligent cabin environment is characterized in that,
The device comprises:
The acquisition module is used for acquiring gesture video data in the intelligent cabin, marking the gesture video data and generating a sample data set;
The training module is used for training a preset key frame extraction model based on the sample data set, and generating a target key frame extraction model after training is completed, wherein the preset key frame extraction model is used for extracting key frames based on a reinforcement learning algorithm;
The extraction module is used for extracting key frame data from the gesture video based on the target key frame extraction model if the gesture video to be identified in the intelligent cabin is detected;
The identification module is used for inputting the key frame data into a first gesture identification model, and if the identification result output by the first gesture identification model accords with a preset gesture, the intelligent cabin executes an instruction corresponding to the preset gesture;
The first gesture recognition model comprises at least one feature extraction module, a fully connected layer and a Transformer neural network, wherein each feature extraction module, the fully connected layer and the Transformer neural network are sequentially connected, the feature extraction module comprises a 3D depthwise separable convolution module and an SE module which are connected in parallel, and the Transformer neural network is used for extracting the joint features of space and time from the fully connected layer and outputting a recognition result.
6. The apparatus of claim 5, wherein,
The training module is specifically configured to:
Acquiring a video frame sequence of a first preset frame number from the sample data set, and extracting key frames of the video frame sequence based on the preset key frame extraction model to acquire a key frame sequence of a second preset frame number;
Inputting the key frame sequence into a second gesture recognition model to perform gesture recognition, and generating rewards according to gesture recognition results;
Updating a value function network in the preset key frame extraction model according to the rewards, and acquiring a new video frame sequence from the sample data set;
And carrying out iterative updating on the value function network according to rewards corresponding to the new video frame sequences, and generating the target key frame extraction model when the preset training completion condition is met.
7. An electronic device, characterized in that,
Comprising the following steps:
A processor;
And
A memory for storing executable instructions of the processor;
Wherein the processor is configured to perform the gesture interaction method in the intelligent cabin environment of any one of claims 1-4 via execution of the executable instructions.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that,
The computer program, when executed by a processor, implements the gesture interaction method in an intelligent cabin environment according to any one of claims 1 to 4.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311336297.4A | 2023-10-16 | 2023-10-16 | Gesture interaction method and related equipment in intelligent cabin environment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN117373121A | 2024-01-09 |
| CN117373121B | 2024-06-18 |
Family
- ID: 89405360
- Family Applications (1): CN202311336297.4A (Active), granted as CN117373121B
- Country Status (1): CN, CN117373121B (en)
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399850A (en) * | 2019-07-30 | 2019-11-01 | 西安工业大学 | A kind of continuous sign language recognition method based on deep neural network |
CN112183217A (en) * | 2020-09-02 | 2021-01-05 | 鹏城实验室 | Gesture recognition method, interaction method based on gesture recognition and mixed reality glasses |
CN114898457A (en) * | 2022-04-11 | 2022-08-12 | 厦门瑞为信息技术有限公司 | Dynamic gesture recognition method and system based on hand key points and transform |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111402203B (en) * | 2020-02-24 | 2024-03-01 | 杭州电子科技大学 | Fabric surface defect detection method based on convolutional neural network |
CN112733823B (en) * | 2021-03-31 | 2021-06-22 | 南昌虚拟现实研究院股份有限公司 | Method and device for extracting key frame for gesture recognition and readable storage medium |
CN113792635A (en) * | 2021-09-07 | 2021-12-14 | 盐城工学院 | Gesture recognition method based on lightweight convolutional neural network |
CN114360067A (en) * | 2022-01-12 | 2022-04-15 | 武汉科技大学 | Dynamic gesture recognition method based on deep learning |
CN116028889A (en) * | 2023-02-02 | 2023-04-28 | 中国科学技术大学 | Multi-mode progressive hierarchical fusion method for natural gesture recognition |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |