CN115424356B - Gesture interaction method and device in cabin - Google Patents

Info

Publication number
CN115424356B
CN115424356B (application CN202211381906.3A)
Authority
CN
China
Prior art keywords
gesture, cabin, detection result, module, result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211381906.3A
Other languages
Chinese (zh)
Other versions
CN115424356A (en)
Inventor
沈锦瑞
林垠
殷保才
胡金水
殷兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202211381906.3A
Publication of CN115424356A
Application granted
Publication of CN115424356B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a gesture interaction method and device for use in a vehicle cabin. The in-cabin gesture interaction method comprises the following steps: receiving a real-time image of the cabin interior; inputting the real-time image into a gesture recognition model to obtain a first gesture category detection result and a first position category detection result output by the gesture recognition model; and controlling equipment in the cabin according to the control instruction corresponding to the first gesture at the first cabin position, where, for different first cabin positions, the control instructions corresponding to the same first gesture are different. Because the gesture category detection result is obtained directly from the gesture recognition model, the load that two-stage gesture recognition places on the in-vehicle head unit is avoided; and because the cabin position of the person performing the gesture is recognized, the same gesture can be assigned different functions at different cabin positions, improving the richness of gesture interaction.

Description

Gesture interaction method and device in cabin
Technical Field
The invention relates to the technical field of computer information processing, in particular to a method and a device for gesture interaction in a cockpit.
Background
Existing in-cabin gesture recognition generally comprises two stages: the first stage performs human body detection or hand detection, and the second stage performs gesture recognition using hand key point information. In the multi-person scenes that are common in a cabin, the position information of the hands must be modeled with complex spatial relationships. For dynamic gestures in particular, both the body and the hands move to some extent, so the correspondence between hands and bodies is hard to establish. This two-stage approach burdens the in-vehicle head unit, lowers the efficiency of real-time gesture recognition, and gives poor real-time recognition of dynamic gestures. In addition, existing in-cabin gesture recognition systems do not locate the person performing the gesture, so the same gesture triggers the same function regardless of cabin position, which greatly limits the richness of gesture interaction.
Existing in-cabin gesture recognition systems also usually perform recognition directly on a single-frame image, whereas gesture commands in the cabin typically last for a certain duration; without fusing temporal information, such systems recognize dynamic gestures poorly.
Furthermore, the camera of an existing in-cabin gesture recognition system is usually installed on the left or right side of the front row. For efficiency and recognition accuracy, only the hand closest to the camera, that is, the gesture of a front-row user, is typically detected, which greatly degrades the gesture interaction experience of rear-row users.
Disclosure of Invention
In view of the above, the present invention aims to provide a method and a device for gesture interaction in a cabin, in which a gesture recognition model directly outputs the gesture category detection result, avoiding the load that two-stage gesture recognition places on the in-vehicle head unit, and in which the cabin position of the person performing the gesture is recognized so that the same gesture can be assigned different functions at different cabin positions, improving the richness of gesture interaction.
The technical scheme adopted by the invention is as follows:
in a first aspect, the invention provides a method for gesture interaction in a cockpit, comprising the following steps:
receiving a real-time image within the cabin;
inputting the real-time image into a gesture recognition model to obtain a first gesture type detection result and a first position type detection result output by the gesture recognition model, wherein the first gesture type detection result indicates the type of a first gesture, and the first position type detection result indicates the position of a first cabin where an action person of the first gesture is located;
controlling equipment in the cabin according to a control command corresponding to a first gesture on the first cabin position; wherein, for different first cockpit positions, the control commands corresponding to the same first gesture are different.
In one possible implementation manner, after obtaining a first gesture class detection result, voting is performed by using the current frame and the gesture class detection results of a first preset number of continuous frames before the current frame, so as to determine whether the first gesture class detection result of the current frame is valid;
and if the first gesture is valid, controlling equipment in the cabin according to a control command corresponding to the first gesture on the first cabin position.
In one possible implementation manner, if the first gesture type detection result of the current frame is valid, voting is performed by using the position type detection results of the current frame and a second preset number of continuous frames before the current frame, and whether the first position type detection result of the current frame is valid is determined;
and if the first position type detection result of the current frame is valid, controlling equipment in the cockpit according to a control instruction corresponding to the first gesture on the first cockpit position.
In one possible implementation manner, the processing, by the gesture recognition model, of the real-time image to obtain a first gesture type detection result and a first position type detection result specifically includes:
processing the real-time image to obtain a first gesture classification result, a first position classification result and a first central point of a first gesture action person;
taking the first gesture classification result as a first gesture class detection result and outputting the first gesture class detection result;
judging whether the first center point is located in the area where the cabin position indicated by the first position classification result is located;
and if so, taking the first position classification result as a first position type detection result and outputting the first position type detection result.
In one possible implementation manner, if the first center point is not located in the area where the cabin location indicated by the first location classification result is located, the first location classification detection result is determined to be an unknown area and output.
In one possible implementation manner, if the first location type detection result is an unknown region, the device in the cabin is not controlled, or the device in the cabin is controlled according to a general instruction corresponding to the type of the first gesture.
In one possible implementation manner, the gesture recognition model includes a third preset number of convolutional layers, and each convolutional layer outputs a timing characteristic and a convolution result;
the input data of the gesture recognition model are all time sequence characteristics of a previous frame of a current frame obtained by the real-time image and the convolution layers with the third preset number.
In one possible implementation manner, after the first position classification result is obtained, voting is carried out on the first position classification result by using the first one-hot modulation vectors of all cabin positions in the cabin, and a second position classification result is determined; the first one-hot modulation vector is generated according to the actual position category label of the real-time image;
and if the first center point is located in the area where the cabin position indicated by the second position classification result is located, outputting the second position classification result as a first position type detection result.
In one possible implementation manner, before receiving the real-time image, the method further includes:
receiving light intensity information in the cabin;
and controlling the camera equipment in the cabin to shoot by using visible light or near infrared light according to the light intensity information.
In one possible implementation, training the gesture recognition model includes:
inputting successive image samples into an initial model;
acquiring a gesture convolution result and a position convolution result obtained after the continuous image samples pass through a third preset number of convolution layers, wherein each convolution layer sequentially comprises a time sequence offset module and a convolution module, and input data of the time sequence offset module is time sequence characteristics output by the previous convolution layer;
inputting the gesture convolution result and the position convolution result into a gesture classifier and a position classifier respectively to obtain a second gesture class detection result and a second position class detection result;
and performing iterative training on the initial model according to the loss functions between the second gesture type detection result and the gesture type label of the continuous image sample and between the second position type detection result and the actual position type label of the continuous image sample to obtain a gesture recognition model.
In one possible implementation, the input data of the position classifier is a dot product of the position convolution result and a second one-hot modulation vector generated by the position modulator, and the second one-hot modulation vector is generated according to an actual position class label of the continuous image sample.
In one possible implementation manner, the second central point of the gesture action person is obtained after the continuous image samples pass through a third preset number of convolution layers;
and the initial model is also subjected to iterative training by using a loss function between a third central point and a second central point of the gesture action person obtained by human body detection of the continuous image samples.
In one of the possible implementations, the in-cabin camera device is provided at a ceiling light or an in-cabin rearview mirror of the cabin.
In a second aspect, the invention provides a gesture interaction device in a cabin, which comprises a real-time image receiving module, a gesture recognition module and a control module;
the real-time image receiving module is used for receiving a real-time image in the cabin;
the gesture recognition module is used for inputting the real-time image into the gesture recognition model to obtain a first gesture type detection result and a first position type detection result which are output by the gesture recognition model, the first gesture type detection result indicates the type of the first gesture, and the first position type detection result indicates the position of a first cabin where an action person of the first gesture is located;
the control module is used for controlling equipment in the cockpit according to a control instruction corresponding to a first gesture on the first cockpit position; wherein, for different first cockpit positions, the control commands corresponding to the same first gesture are different.
In one possible implementation manner, the gesture recognition model comprises a processing module, a judging module and an output module;
the processing module is used for processing the real-time image to obtain a first gesture classification result, a first position classification result and a first central point of a first gesture action person;
the judging module is used for judging whether the first central point is positioned in the area where the cabin position indicated by the first position classification result is positioned;
the output module is used for taking the first position classification result as a first position classification detection result and outputting the first position classification result when the first center point is located in the area where the cabin position indicated by the first position classification result is located, and taking the first gesture classification result as a first gesture classification detection result and outputting the first gesture classification result.
In one possible implementation manner, the processing module includes a timing sequence feature extractor, the timing sequence feature extractor includes an offset feature storage module and a third preset number of convolutional layers, each convolutional layer sequentially includes a timing sequence offset module and a convolution module, input data of the timing sequence offset module is timing sequence features output by a previous convolutional layer, and the offset feature storage module only stores the third preset number of timing sequence features obtained by the same frame;
the input data of the time sequence feature extractor is all time sequence features and real-time images in the offset feature storage module.
In one possible implementation manner, the processing module further comprises a position modulator, a voting module, and a gesture classifier and a position classifier which are connected with the output end of the timing characteristic extractor;
the position classifier outputs a first position classification result, the gesture classifier outputs a first gesture classification result, and the position modulator outputs the third one-hot modulation vectors of all cabin positions in the cabin;
the voting module is used for voting on the first position classification result by using the third one-hot modulation vectors of all the cabin positions in the cabin to determine a second position classification result;
wherein the third one-hot modulation vector is generated according to the actual position class label of the real-time image.
In one possible implementation manner, the processing module further includes a position regressor, input data of the position regressor is output data of the time sequence feature extractor, and output data of the position regressor is the first central point.
In one possible implementation manner, in the training phase, the input data of the position classifier is a dot product of a position convolution result output by the convolution layer and a fourth one-hot modulation vector generated by the position modulator, and the fourth one-hot modulation vector is generated according to an actual position class label of the continuous image sample.
The core idea of the invention is as follows. First, the gesture category detection result is obtained directly from the gesture recognition model, avoiding the load that two-stage gesture recognition places on the in-vehicle head unit, and the cabin position of the person performing the gesture is recognized so that the same gesture can be assigned different functions at different cabin positions, improving the richness of gesture interaction. Second, when processing each frame, the gesture recognition model produces timing features that are reused for the recognition of subsequent frames, yielding gesture recognition based on continuous frames; the temporal information of historical frames is fully used, giving better discrimination of dynamic gestures and higher recognition accuracy. In addition, the position of the camera is chosen so that all cabin positions are imaged essentially equally well, greatly improving gesture recognition for rear-row users and providing a good basis for richer gesture interaction.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a preferred embodiment of a method for gesture interaction within a cockpit provided by the present invention;
FIG. 2 is a flowchart of one embodiment of obtaining the first gesture class detection result and the first location class detection result according to the present invention;
FIG. 3 is a block diagram illustrating an embodiment of a processing module in a training phase;
FIG. 4 is a schematic diagram of an embodiment of a processing module of the inference phase provided by the present invention, wherein a position classifier, a position modulator, and a position regressor are omitted;
FIG. 5 is a flow diagram for one embodiment of training a gesture recognition model provided by the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of an in-cabin gesture interaction apparatus provided by the present invention;
FIG. 7 is a schematic block diagram of one embodiment of a processing module according to the present invention;
fig. 8 is a schematic structural diagram of an embodiment of the gesture interaction device in the cockpit provided in the present invention.
Detailed Description
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
The core idea of the invention is as follows. First, the gesture category detection result is obtained directly from the gesture recognition model, avoiding the load that two-stage gesture recognition places on the in-vehicle head unit, and the cabin position of the person performing the gesture is recognized so that the same gesture can be assigned different functions at different cabin positions, improving the richness of gesture interaction. Second, when processing each frame, the gesture recognition model produces timing features that are reused for the recognition of subsequent frames, yielding gesture recognition based on continuous frames; the temporal information of historical frames is fully used, giving better discrimination of dynamic gestures and higher recognition accuracy. In addition, the position of the camera is chosen so that all cabin positions are imaged essentially equally well, greatly improving gesture recognition for rear-row users and providing a good basis for richer gesture interaction.
In view of the foregoing core concept, the present invention provides at least one embodiment of a method for gesture interaction in a cockpit, as shown in fig. 1, which may include the following steps:
s110: a real-time image within the cabin is received.
Specifically, when a user logs in to the Occupant Monitoring System (OMS) of the cabin, the camera devices located in the cabin are switched on synchronously to acquire images of the cabin interior. At the same time, the in-cabin gesture interaction device of the invention (described below) is started.
The camera device is arranged at the ceiling light of the cabin or at the interior rearview mirror, so that it can clearly capture every cabin position, which facilitates gesture recognition for all cabin positions.
In one possible implementation, the image capture device may be an RGB camera, a near-infrared camera, or the like.
In a preferred implementation, the camera device is a light-adaptive camera: visible-light imaging is used when the light is sufficient, and the camera switches to near-infrared imaging when the light is insufficient, so that the imaging quality in the cabin remains adequate.
When a user performs a static or dynamic gesture in the cabin, the camera device acquires images of the cabin interior and transmits them to the in-cabin gesture interaction device for processing. A static gesture is a gesture that requires no hand movement, such as a thumbs-up or a heart sign. A dynamic gesture is a gesture that requires hand or finger movement, such as rotating a finger or waving a hand to the left.
On the basis, in a preferred implementation manner, before S110, the method further includes receiving light intensity information in the cabin, which is acquired by a light sensor in the cabin, and controlling an in-cabin image capturing device to capture images using visible light or near infrared light according to the light intensity information.
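As a purely illustrative, non-limiting sketch of the light-based switching just described, the following Python fragment chooses between visible-light and near-infrared imaging from a light sensor reading. The threshold value and the camera interface (set_mode) are assumptions introduced for illustration and are not part of the disclosure.

VISIBLE_LIGHT_THRESHOLD_LUX = 50.0  # assumed switching point

def select_camera_mode(light_intensity_lux: float) -> str:
    """Choose visible-light or near-infrared imaging from the light sensor reading."""
    if light_intensity_lux >= VISIBLE_LIGHT_THRESHOLD_LUX:
        return "visible"
    return "near_infrared"

def configure_camera(camera, light_intensity_lux: float) -> None:
    # `camera` is any object exposing a set_mode(mode) method (hypothetical API)
    camera.set_mode(select_camera_mode(light_intensity_lux))
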
S120: inputting the real-time image into a gesture recognition model, and obtaining a first gesture type detection result and a first position type detection result output by the gesture recognition model, wherein the first gesture type detection result indicates the type of a first gesture, and the first position type detection result indicates the position of a first cabin where an action person of the first gesture is located.
It should be noted that the real-time image of the input gesture recognition model is a frame image in a video stream acquired by the camera device, that is, a current frame image.
In a possible implementation manner, after the gesture recognition model obtains the first gesture classification result and the first position classification result through the internal neural network, the first gesture classification result and the first position classification result are directly used as the first gesture class detection result and the first position class detection result respectively and output.
In a preferred implementation manner, as shown in fig. 2, the processing, by the gesture recognition model, the real-time image to obtain a first gesture type detection result and a first position type detection result specifically includes:
s210: and processing the real-time image to obtain a first gesture classification result, a first position classification result and a first central point of the first gesture action person.
S220: and taking the first gesture classification result as a first gesture class detection result and outputting the first gesture class detection result.
S230: and judging whether the first central point is located in the area where the cabin position indicated by the first position classification result is located. If yes, go to S240; otherwise, S250 is executed.
S240: and taking the first position classification result as a first position category detection result and outputting the first position category detection result.
S250: and judging the first position type detection result as an unknown area and outputting the unknown area.
In one possible implementation, if the first location type detection result is an unknown region, no control is performed on the equipment in the cabin.
In another possible implementation manner, if the first position type detection result is an unknown region, the device in the cabin is controlled according to a general instruction corresponding to the type of the first gesture. The general instruction refers to executing a uniform control instruction when the gesture category is recognized no matter where the position of the gesture action person is.
The processing mode for the unknown region may be determined according to the setting of the upper layer application, and the present invention is not limited thereto.
It is understood that S220 and S250 may be executed simultaneously, or S230-S240 may be executed first, and then S220 is executed.
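The post-processing of S210 to S250 can be sketched as follows; this is a minimal, non-limiting Python illustration in which the cabin regions are assumed to be axis-aligned boxes in normalized image coordinates, and the region layout, class names and the "unknown" label are assumptions for the sketch.

CABIN_REGIONS = {
    "driver":          (0.0, 0.0, 0.5, 0.5),   # (x_min, y_min, x_max, y_max), normalized
    "front_passenger": (0.5, 0.0, 1.0, 0.5),
    "rear_left":       (0.0, 0.5, 0.5, 1.0),
    "rear_right":      (0.5, 0.5, 1.0, 1.0),
}
UNKNOWN_REGION = "unknown"

def point_in_region(center, region):
    x, y = center
    x_min, y_min, x_max, y_max = region
    return x_min <= x <= x_max and y_min <= y <= y_max

def postprocess(gesture_cls, position_cls, center_point):
    """Return (gesture category detection result, position category detection result)."""
    gesture_result = gesture_cls                      # S220
    region = CABIN_REGIONS.get(position_cls)
    if region is not None and point_in_region(center_point, region):
        position_result = position_cls                # S240
    else:
        position_result = UNKNOWN_REGION              # S250
    return gesture_result, position_result
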
In a preferred embodiment, before performing S230, the following steps are further performed:
p1: voting the first position classification result by using the first unique modulation vectors of all the cabin positions in the cabin to determine a second position classification result; wherein the first unique modulation vector is generated according to the actual position class label of the real-time image. And the first unique modulation vector is utilized to carry out feature enhancement of class prior, so that the position classification is more robust.
P2: and if the first center point is located in the area where the cabin position indicated by the second position classification result is located, taking the second position classification result as the first position classification detection result and outputting the first position classification result. Otherwise, S250 is performed.
S150: and controlling equipment in the cabin according to the control command corresponding to the first gesture on the first cabin position, and returning to the step S110. Wherein, for different first cockpit positions, the control commands corresponding to the same first gesture are different. For example, when the gesture human at different cabin positions is recognized to perform the action of rotating the fingers, the opening and closing degrees of the windows at different positions are adjusted.
The training process of the gesture recognition model is explained as follows. The gesture recognition model comprises a processing module, and fig. 3 shows a structural schematic diagram of the processing module.
As shown in fig. 3, the processing module includes a timing feature extractor, and a gesture classifier and a position classifier connected to the output of the timing feature extractor. The processing module is based on a lightweight convolutional neural network: the timing feature extractor is a convolutional network with a third preset number of convolutional layers, and the gesture classifier and the position classifier are fully connected networks. On top of the lightweight convolutional network, a timing offset module is added at the front end of the convolution module of each convolutional layer, and the input data of the timing offset module is the timing feature output by the previous convolutional layer, as shown in fig. 4. That is, the output data of each convolutional layer are a timing feature, a position convolution result and a gesture convolution result, and the following convolutional layer takes these as its input data. Each timing offset module shifts some channels between frames along the time dimension so that information is exchanged across frames; in other words, each convolutional layer fuses information from earlier historical frames, so that during training the model can use part of the historical-frame features to predict the current frame.
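A minimal PyTorch-style sketch of such a channel shift along the time dimension is given below, assuming features shaped (T, C, H, W) for T consecutive frames; the shift fraction, layer sizes and the class structure are assumptions introduced for illustration only.

import torch

def temporal_shift(features: torch.Tensor, shift_fraction: float = 0.125) -> torch.Tensor:
    # features: (T, C, H, W); a fraction of channels receives the previous frame's values
    t, c, h, w = features.shape
    n_shift = max(1, int(c * shift_fraction))
    shifted = features.clone()
    shifted[1:, :n_shift] = features[:-1, :n_shift]  # exchange information between frames
    shifted[0, :n_shift] = 0.0                       # no history for the first frame
    return shifted

class ShiftedConvLayer(torch.nn.Module):
    """Timing offset module followed by a convolution module, one per layer."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = torch.nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):  # x: (T, C, H, W), T treated as the batch dimension
        return torch.relu(self.conv(temporal_shift(x)))
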
In a preferred implementation, considering that gesture actions are short, the timing feature extractor uses ten convolutional layers; that is, in the training phase each frame fuses ten frames of historical information (the camera's sampling frame rate is about twelve frames per second), and these ten frames of information are combined to predict the current behavior state.
It should be noted that, in the training stage, each sample includes multiple frames of continuous images, each sample only contains one type of gesture motion at most, and each training sample finally predicts one gesture class (no gesture also belongs to one of the gesture classes). And before the samples are input into the time sequence feature extractor, the samples with different time lengths are uniformly sampled to the same frame number, and the input size of the network is fixed.
In a preferred implementation, to increase diversity in the time dimension, several temporal augmentation strategies are used to form the input data of the timing feature extractor, such as temporal sampling, temporal cropping, and temporal reversal for gestures whose actions are reversible.
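These augmentation strategies can be illustrated, without limitation, by the sketch below; the fixed clip length, crop ranges, probabilities and the set of reversible gestures are assumptions made for the example.

import random

FIXED_NUM_FRAMES = 10
REVERSIBLE_GESTURES = {"wave_left_right", "rotate_finger"}  # assumed reversible actions

def uniform_sample(frames, num_frames=FIXED_NUM_FRAMES):
    """Sample evenly so clips of different duration map to the fixed network input length."""
    step = len(frames) / num_frames
    return [frames[min(int(i * step), len(frames) - 1)] for i in range(num_frames)]

def augment_clip(frames, gesture_label):
    # random temporal crop, keeping at least half of the clip
    if len(frames) > 2 and random.random() < 0.5:
        start = random.randint(0, len(frames) // 2)
        length = random.randint(len(frames) // 2, len(frames) - start)
        frames = frames[start:start + length]
    frames = uniform_sample(frames)
    # temporal reversal only for reversible gestures
    if gesture_label in REVERSIBLE_GESTURES and random.random() < 0.5:
        frames = frames[::-1]
    return frames
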
Based on the above description, as shown in fig. 5, training the gesture recognition model includes:
s510: successive image samples are input into the initial model.
S520: and the gesture convolution result and the position convolution result are obtained after the continuous image samples pass through a third preset number of convolution layers.
S530: and respectively inputting the gesture convolution result and the position convolution result into a gesture classifier and a position classifier to obtain a second gesture class detection result and a second position class detection result.
S540: and performing iterative training on the initial model according to the loss functions between the second gesture type detection result and the gesture type label of the continuous image sample and between the second position type detection result and the actual position type label of the continuous image sample to obtain a gesture recognition model.
In a preferred implementation, as shown in fig. 3, the processing module further comprises a position modulator that outputs the second one-hot modulation vectors for all cabin positions within the cabin. The second one-hot modulation vector is generated from the actual position class label of the continuous image sample (i.e. the cabin position where the person performing the gesture in the continuous image sample is actually located). Specifically, as shown in fig. 3, the position modulator generates a one-hot modulation vector (1×N) from the actual position class label of the continuous image sample, maps it through a learnable feature mapping layer into a modulation feature (1×C) with the same length as the position convolution result, takes the dot product with the position convolution result, and inputs the result into the position classifier; this category-prior feature enhancement makes the position classification more robust.
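A training-time sketch of such a modulator is shown below; the tensor shapes, the element-wise reading of the "dot product", and the module names are assumptions made only to illustrate the idea.

import torch
import torch.nn as nn

class PositionModulator(nn.Module):
    def __init__(self, num_positions: int, feature_dim: int):
        super().__init__()
        self.num_positions = num_positions
        self.mapping = nn.Linear(num_positions, feature_dim)  # learnable feature mapping layer

    def forward(self, position_label: torch.Tensor) -> torch.Tensor:
        one_hot = nn.functional.one_hot(position_label, self.num_positions).float()  # (B, N)
        return self.mapping(one_hot)                                                  # (B, C)

# assumed training-time usage:
#   modulation = modulator(actual_position_label)        # (B, C)
#   modulated  = position_conv_result * modulation       # one reading of the "dot product"
#   position_logits = position_classifier(modulated)
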
Based on the position modulator, in one possible implementation, in the training phase, the input data of the position classifier is a dot product of the position convolution result and a second one-hot modulation vector generated by the position modulator.
It should be noted that, in the training phase, the position modulator uses the second one-hot modulation vector to form the input data of the position classifier, whereas in the use phase of the model (i.e. the inference phase), the first position classification result of the position classifier is voted on using the first one-hot modulation vectors from the position modulator (see step P1 above).
In one possible implementation, the result of the position classifier comprises first position classification results for all cabin positions, and the first one-hot modulation vector of each cabin position is obtained from the real-time image. On this basis, the voting in step P1 specifically computes, for each cabin position, the dot product of the first position classification result and that position's first one-hot modulation vector, and takes the cabin position with the highest score as the second position classification result.
In another possible implementation, the result of the position classifier is the first position classification result of only one cabin position, and the first one-hot modulation vectors of all cabin positions are obtained through real-time images. On this basis, the voting in step P1 is specifically to calculate a weighted sum of the first location classification result and each first unique modulation vector as the second location classification result.
In the inference stage, the position modulator acts on the output of the position classifier, and the number of modulation categories is only the total number of cabin positions in the vehicle, so the additional computation is small and the impact on efficiency is negligible.
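The first voting variant above can be sketched as follows; the use of numpy, the score shapes and the per-position modulation matrix are assumptions made for illustration.

import numpy as np

def vote_position(position_scores: np.ndarray, modulation_vectors: np.ndarray) -> int:
    """
    position_scores:    (N,) first position classification scores for N cabin positions
    modulation_vectors: (N, N) one modulation vector per cabin position
    Returns the index of the second position classification result.
    """
    votes = modulation_vectors @ position_scores   # dot product per cabin position
    return int(np.argmax(votes))
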
In a preferred implementation manner, as shown in fig. 3, the processing module further includes a position regressor, where input data of the position regressor is output data of the time-series feature extractor, and output data of the position regressor is a central point of a gesture action person in the continuous image samples, and is recorded as a second central point. In the training process, the off-line human body detection model is used for detecting the human body of each frame in the continuous image sample to obtain the central point of the gesture action person in each frame, then the central points of all the frames are averaged to obtain the actual central point of the gesture action person in the continuous image sample, and the actual central point is recorded as a third central point, such as the point at the center of the square frame in the image at the lower right in fig. 3. And the third central point is used as a supervision signal for loss calculation.
Based on the position regressor, in a possible implementation manner, in a training stage, after the continuous image samples pass through a third preset number of convolutional layers, a second central point of the gesture action person is obtained; and, the initial model is also iteratively trained using a loss function between the third center point and the second center point. It can be understood that, in the training phase, when no user performs a gesture action (i.e. no gesture) in the continuous image samples, the position detection result output by the position classifier is "no position", and at this time, the position regressor does not participate in the loss calculation.
Therefore, in this preferred implementation, the loss function is divided into two parts: one part is the position classification supervision, and the other part is the regression supervision of the center point of the person performing the gesture.
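A compact sketch of this two-part supervision is given below, assuming cross-entropy for the position classification and a smooth L1 loss between the predicted (second) center point and the offline-detected (third) center point, skipped for "no gesture" samples; the loss choices and weighting are assumptions, and the gesture classification loss of S540 would be added on top.

import torch
import torch.nn.functional as F

def position_losses(position_logits, position_label, pred_center, target_center,
                    has_gesture: torch.Tensor, center_weight: float = 1.0):
    cls_loss = F.cross_entropy(position_logits, position_label)
    # center regression only contributes when a gesture is actually present
    if has_gesture.any():
        reg_loss = F.smooth_l1_loss(pred_center[has_gesture], target_center[has_gesture])
    else:
        reg_loss = pred_center.sum() * 0.0
    return cls_loss + center_weight * reg_loss
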
Based on the above training process, in one possible implementation, as shown in fig. 4, the trained processing module further includes an offset feature storage module. In the inference stage, when the timing feature extractor processes each frame of image, the timing feature output by each convolutional layer is stored in the offset feature storage module. The offset feature storage module only keeps the timing features of one frame at a time: they are input into the timing feature extractor together with the next real-time image frame and then erased, making room for the timing features of the next frame. In this way, in the inference stage the timing feature extractor uses the timing features of the previous frame as historical-frame information for the current frame, so that temporal information is also exploited at inference time.
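This per-frame caching can be illustrated, without limitation, by the sketch below; the class name, method names and the extractor interface are assumptions made for the example.

class OffsetFeatureStore:
    def __init__(self, num_layers: int):
        self.num_layers = num_layers
        self.features = [None] * num_layers   # one timing feature per convolutional layer

    def read(self):
        """Timing features of the previous frame, fed in alongside the new real-time image."""
        return list(self.features)

    def write(self, new_features):
        """Replace the cache with the timing features of the frame just processed."""
        assert len(new_features) == self.num_layers
        self.features = list(new_features)

# assumed per-frame inference loop:
#   history = store.read()
#   gesture_out, position_out, new_timing_features = extractor(frame, history)
#   store.write(new_timing_features)
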
In a preferred implementation, in the inference phase, the in-cabin gesture interaction method further includes:
s130: after obtaining the first gesture type detection result (please see S120), voting is performed by using the gesture type detection results of the current frame and a first preset number of consecutive frames before the current frame, and determining whether the first gesture type detection result of the current frame is valid. If yes, executing S150; otherwise, return to S110.
On the basis of the above preferred implementation, in another preferred implementation, the in-cabin gesture interaction method further includes:
if the first gesture type detection result of the current frame is valid, S140 is executed.
S140: voting by using the current frame and the position type detection results of continuous second preset number frames before the current frame, and determining whether the first position type detection result of the current frame is effective. If the first position type detection result of the current frame is valid, executing S150; otherwise, return to S110.
In one possible implementation, as shown in fig. 4, the trained processing module further includes a timing result storage module, which stores a fourth preset number of consecutive first gesture category detection results and first position category detection results, including those of the current frame (shown as T frames in fig. 4). The timing result storage module is maintained dynamically on a first-in, first-out basis; it occupies little system memory and makes the gesture prediction results more stable and robust.
On the basis, when determining whether the first gesture class detection result of the current frame is valid, the voting process is as follows: and calculating a first average value of the fourth preset number of first gesture class detection results, and judging whether the first average value is greater than a first threshold value. If the first average value is larger than a first threshold value, the first gesture type detection result of the current frame is valid; otherwise, the first gesture class detection result of the current frame is invalid. When determining whether the first position class detection result of the current frame is valid, the voting process is as follows: and calculating a second average value of the fourth preset number of first position type detection results, and judging whether the second average value is greater than a second threshold value. If the second average value is larger than a second threshold value, the first position type detection result of the current frame is valid; otherwise, the first position type detection result of the current frame is invalid.
It will be appreciated that the first gesture category detection result and the first position category detection result may also be voted on in other ways. For example, when the first gesture category detection results obtained from two frames that are close in time are the same, the two detection results are merged, so that the gesture category detection results are more continuous; when the first gesture category detection results obtained from two frames that are close in time differ, the later first gesture category detection result is suppressed, ensuring the accuracy of the output gesture category.
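The averaging vote over the timing result buffer described above can be sketched as follows; the buffer length, thresholds and the interpretation of each per-frame result as a confidence score are assumptions made for the illustration.

from collections import deque

class TimingResultStore:
    def __init__(self, window: int = 8, threshold: float = 0.6):
        self.scores = deque(maxlen=window)   # first-in, first-out buffer of per-frame scores
        self.threshold = threshold

    def push_and_vote(self, frame_score: float) -> bool:
        self.scores.append(frame_score)
        return (sum(self.scores) / len(self.scores)) > self.threshold

# assumed usage, one store per output stream:
#   gesture_valid  = gesture_store.push_and_vote(gesture_score_of_current_frame)
#   position_valid = position_store.push_and_vote(position_score_of_current_frame)
#   if gesture_valid and position_valid:
#       execute(resolve_command(position_result, gesture_result))
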
Corresponding to the above embodiments and preferred solutions, the present invention further provides an embodiment of a gesture interaction apparatus in a cockpit, which specifically includes a real-time image receiving module 610, a gesture recognition module 620, and a control module 630, as shown in fig. 6.
The real-time image receiving module 610 is configured to receive a real-time image within the cabin.
The gesture recognition module 620 is configured to input the real-time image into the gesture recognition model, and obtain a first gesture type detection result and a first position type detection result output by the gesture recognition model, where the first gesture type detection result indicates a type of the first gesture, and the first position type detection result indicates a first cockpit position where an action person of the first gesture is located.
The control module 630 is used for controlling equipment in the cockpit according to a control command corresponding to a first gesture at a first cockpit position; wherein, for different first cockpit positions, the control commands corresponding to the same first gesture are different.
In one possible implementation manner, the gesture recognition module in the gesture recognition module 620 includes a processing module 6201, a determining module 6202, and an output module 6203.
The processing module 6201 is configured to process the real-time image to obtain a first gesture classification result, a first position classification result, and a first central point of the first gesture actor;
the determining module 6202 is configured to determine whether the first center point is located in an area where the cabin location indicated by the first location classification result is located.
The output module 6203 is configured to output the first position classification result as a first position class detection result when the first center point is located in an area where the cabin position indicated by the first position classification result is located, and output the first gesture classification result as a first gesture class detection result.
In one possible implementation manner, as shown in fig. 7, the processing module 6201 includes a timing feature extractor 710, where the timing feature extractor 710 includes an offset feature storage module and a third preset number of convolutional layers, each convolutional layer sequentially includes a timing offset module and a convolution module, input data of the timing offset module is a timing feature output by a previous convolutional layer, and the offset feature storage module only stores the third preset number of timing features obtained in the same frame; the input data of the time sequence feature extractor is all time sequence features and real-time images in the offset feature storage module.
In one possible implementation, the processing module 6201 further includes a position modulator 720, a voting module 730, and a gesture classifier 740 and a position classifier 750 connected to an output of the timing feature extractor 710.
The position classifier 750 outputs a first position classification result, the gesture classifier 740 outputs a first gesture classification result, and the position modulator 720 outputs the third one-hot modulation vectors of all cabin positions within the cabin.
The voting module 730 is configured to vote on the first position classification result by using the third one-hot modulation vectors of all the cabin positions in the cabin to determine the second position classification result. The third one-hot modulation vector is generated according to the actual position class label of the real-time image.
In one possible implementation manner, the processing module 6201 further includes a position regressor 760, the input data of the position regressor 760 is output data of the time-series feature extractor, and the output data of the position regressor 760 is the first central point.
In one possible implementation, in the training phase, the input data of the position classifier 750 is the dot product of the position convolution result and the fourth unique modulation vector generated by the position modulator 720, and the fourth unique modulation vector is generated according to the actual position class label of the continuous image sample.
It should be understood that the division of the components of the gesture interaction device in the cockpit shown in fig. 6-7 above is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these components may all be implemented in software invoked by a processing element; or can be implemented in the form of hardware; and part of the components can be realized in the form of calling by the processing element in software, and part of the components can be realized in the form of hardware. For example, a certain module may be a separate processing element, or may be integrated into a certain chip of the electronic device. Other components are implemented similarly. In addition, all or part of the components can be integrated together or can be independently realized. In implementation, each step of the above method or each component above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, these components may be integrated together and implemented in the form of a System-On-a-Chip (SOC).
In view of the foregoing examples and their preferred embodiments, it will be appreciated by those skilled in the art that, in practice, the invention may be embodied, by way of example, in the following forms:
(1) An in-cabin gesture interaction device, which may comprise:
one or more processors, memory, and one or more computer programs stored in the memory, the one or more computer programs comprising instructions which, when executed by the apparatus, cause the apparatus to perform the steps/functions of the foregoing embodiments or equivalent implementations.
Fig. 8 is a schematic structural diagram of an embodiment of the in-cabin gesture interaction device provided by the present invention, where the device may be an electronic device or a circuit device built into an electronic device. The electronic device may be an unmanned aerial vehicle, an intelligent vehicle (automobile), vehicle-mounted equipment, or the like. This embodiment does not limit the specific form of the in-cabin gesture interaction device.
As shown in particular in fig. 8, the in-cabin gesture interaction device 900 includes a processor 910, a memory 930, a camera 990, and a sensor 901. Wherein, the processor 910 and the memory 930 can communicate with each other and transmit control and/or data signals through the internal connection path, the memory 930 is used for storing computer programs, and the processor 910 is used for calling and running the computer programs from the memory 930. The processor 910 and the memory 930 may be combined into a single processing device, or more generally, separate components, and the processor 910 is configured to execute the program code stored in the memory 930 to implement the functions described above. In particular implementations, the memory 930 may be integrated with the processor 910 or may be separate from the processor 910.
In addition to this, to further improve the functionality of the in-cabin gesture interaction device 900, the device 900 may further comprise one or more of an input unit 960, a display unit 970, an audio circuit 980, which may further comprise a speaker 982, a microphone 984, etc. The display unit 970 may include a display screen, among others.
Further, the above-described in-cabin gesture interaction device 900 may also include a power supply 950 for providing power to various devices or circuits within the device 900.
It should be understood that the in-cabin gesture interaction device 900 shown in fig. 8 is capable of implementing the various processes of the methods provided by the foregoing embodiments. The operations and/or functions of the various components of the apparatus 900 may each be configured to implement the corresponding flow in the above-described method embodiments. In particular, reference may be made to the above description of embodiments of the method and apparatus, and a detailed description is omitted here as appropriate to avoid redundancy.
It should be understood that the processor 910 in the in-cabin gesture interaction device 900 shown in fig. 8 may be a system-on-chip (SOC), and the processor 910 may include a Central Processing Unit (CPU) and may further include other types of processors, such as a Graphics Processing Unit (GPU), as described in more detail below.
In summary, various portions of the processors or processing units within the processor 910 may cooperate to implement the foregoing method flows, and corresponding software programs for the various portions of the processors or processing units may be stored in the memory 930.
(2) A readable storage medium, on which a computer program or the above-mentioned apparatus is stored, which, when executed, causes the computer to perform the steps/functions of the above-mentioned embodiments or equivalent implementations.
In the several embodiments provided by the present invention, any function, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part of it that contributes to the prior art, may be embodied in the form of a software product, as described below.
(3) A computer program product (which may include the above-described apparatus) which, when run on a terminal device, causes the terminal device to perform the in-cabin gesture interaction method of the preceding embodiment or equivalent embodiments.
From the above description of the embodiments, it is clear to those skilled in the art that all or part of the steps of the above methods can be implemented by software plus a necessary general hardware platform. With this understanding, the above computer program products may include, but are not limited to, an APP; and the device/terminal may be a computer device (e.g., a mobile phone, a PC terminal, a cloud platform, a server, a server cluster, or a network communication device such as a media gateway). Moreover, the hardware structure of the computer device may further include: at least one processor, at least one communication interface, at least one memory, and at least one communication bus; the processor, the communication interface and the memory communicate with one another through the communication bus. The processor may be a Central Processing Unit (CPU), a DSP, a microcontroller or a digital signal processor, and may further include a GPU, an embedded Neural-network Processing Unit (NPU) and an Image Signal Processor (ISP); the processor may further include an ASIC, or one or more integrated circuits configured to implement the embodiments of the present invention. The processor has the function of running one or more software programs, which may be stored in a storage medium such as a memory; and the aforementioned memory/storage medium may comprise non-volatile memories such as a non-removable magnetic disk, a USB flash drive, a removable hard disk or an optical disc, as well as Read-Only Memory (ROM), Random Access Memory (RAM), and the like.
In the embodiments of the present invention, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, and means that there may be three relationships, for example, a and/or B, and may mean that a exists alone, a and B exist simultaneously, and B exists alone. Wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" and the like, refer to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, and c may represent: a, b, c, a and b, a and c, b and c or a and b and c, wherein a, b and c can be single or multiple.
Those of skill in the art will appreciate that the various modules, elements, and method steps described in the embodiments disclosed in this specification can be implemented as electronic hardware, combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In addition, the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other. In particular, for embodiments of devices, apparatuses, etc., since they are substantially similar to the method embodiments, reference may be made to some of the descriptions of the method embodiments for relevant points. The above-described embodiments of devices, apparatuses, etc. are merely illustrative, and modules, units, etc. described as separate components may or may not be physically separate, and may be located in one place or distributed in multiple places, for example, on nodes of a system network. Some or all of the modules and units can be selected according to actual needs to achieve the purpose of the above-mentioned embodiment. Can be understood and carried out by those skilled in the art without inventive effort.
The structure, features and effects of the present invention have been described in detail with reference to the embodiments shown in the drawings, but the above embodiments are merely preferred embodiments of the present invention, and it should be understood that technical features related to the above embodiments and preferred modes thereof can be reasonably combined and configured into various equivalent schemes by those skilled in the art without departing from and changing the design idea and technical effects of the present invention; therefore, the invention is not limited to the embodiments shown in the drawings, and all the modifications and equivalent embodiments that can be made according to the idea of the invention are within the scope of the invention as long as they are not beyond the spirit of the description and the drawings.

Claims (17)

1. An in-cabin gesture interaction method, comprising:
receiving a real-time image within the cabin;
inputting the real-time image into a gesture recognition model, and obtaining a first gesture class detection result and a first position class detection result output by the gesture recognition model, wherein the first gesture class detection result indicates the class of a first gesture, and the first position class detection result indicates the first cabin position where the person performing the first gesture is located;
controlling equipment in the cabin according to a control instruction corresponding to the first gesture at the first cabin position, wherein for different first cabin positions the control instructions corresponding to the same first gesture are different;
the processing of the real-time image by the gesture recognition model to obtain the first gesture class detection result and the first position class detection result specifically includes:
processing the real-time image to obtain a first gesture classification result, a first position classification result and a first center point of the person performing the first gesture;
taking the first gesture classification result as the first gesture class detection result and outputting it;
judging whether the first center point is located in the area of the cabin position indicated by the first position classification result;
and if so, taking the first position classification result as the first position class detection result and outputting it.
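For illustration, a minimal Python sketch of the decision logic described in claim 1 follows, assuming a hypothetical model interface that returns a gesture class, a position class and a center point, and an assumed table of cabin-position regions; none of these names come from the patent.

```python
# Illustrative sketch only: the region table, function names and model interface
# are assumptions. It mirrors the check in claim 1: the position classification
# result is output as the position class detection result only when the predicted
# center point falls inside the region of the cabin position it indicates.

SEAT_REGIONS = {                      # hypothetical pixel regions per cabin position
    "driver":    (0, 0, 320, 480),    # (x0, y0, x1, y1)
    "co_driver": (320, 0, 640, 480),
}

def point_in_region(point, region):
    x, y = point
    x0, y0, x1, y1 = region
    return x0 <= x < x1 and y0 <= y < y1

def recognize(model, frame):
    # the model is assumed to return (gesture_class, position_class, center_point)
    gesture_cls, position_cls, center = model(frame)
    if position_cls in SEAT_REGIONS and point_in_region(center, SEAT_REGIONS[position_cls]):
        return gesture_cls, position_cls
    return gesture_cls, "unknown"     # compare claims 4 and 5
```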
2. The in-cabin gesture interaction method according to claim 1, wherein after the first gesture class detection result is obtained, voting is performed using the gesture class detection results of the current frame and a first preset number of consecutive frames before the current frame to determine whether the first gesture class detection result of the current frame is valid;
and if the first gesture class detection result of the current frame is valid, controlling equipment in the cabin according to the control instruction corresponding to the first gesture at the first cabin position.
3. The in-cabin gesture interaction method according to claim 2, wherein if the first gesture class detection result of the current frame is valid, voting is performed using the position class detection results of the current frame and a second preset number of consecutive frames before the current frame to determine whether the first position class detection result of the current frame is valid;
and if the first position class detection result of the current frame is valid, controlling equipment in the cabin according to the control instruction corresponding to the first gesture at the first cabin position.
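A minimal sketch of the frame voting in claims 2 and 3, under the assumption that "voting" means a simple majority over the current frame and the preset number of preceding frames; the window sizes and the validity rule are illustrative assumptions. Claim 3 applies the same kind of vote to the position class detection results only after the gesture vote has passed.

```python
from collections import Counter, deque

class MajorityVoter:
    def __init__(self, preset_number):
        # keep the current frame plus the preset number of previous frames
        self.window = deque(maxlen=preset_number + 1)

    def is_valid(self, detection):
        self.window.append(detection)
        winner, count = Counter(self.window).most_common(1)[0]
        # valid when the current detection wins a strict majority of the window
        return winner == detection and count > len(self.window) // 2

gesture_voter = MajorityVoter(preset_number=5)    # first preset number (assumed value)
position_voter = MajorityVoter(preset_number=3)   # second preset number (assumed value)
```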
4. The in-cabin gesture interaction method according to claim 1, wherein if the first center point is not located in the area of the cabin position indicated by the first position classification result, the first position class detection result is determined to be an unknown area and output.
5. The in-cabin gesture interaction method according to claim 4, wherein if the first position class detection result is an unknown area, no control is performed on the equipment in the cabin, or the equipment in the cabin is controlled according to a general instruction corresponding to the class of the first gesture.
6. The in-cabin gesture interaction method according to claim 1, wherein the gesture recognition model comprises a third preset number of convolution layers, each of which outputs a temporal feature and a convolution result;
and the input data of the gesture recognition model are the real-time image and all the temporal features, obtained by the third preset number of convolution layers, of the frame preceding the current frame.
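One way to read claim 6 is a temporal-shift style layer that, at inference, consumes the temporal feature it cached for the previous frame and caches a new one for the next frame. The PyTorch sketch below is an assumption along those lines; the channel split ratio, shapes and class name are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class ShiftConvLayer(nn.Module):
    def __init__(self, in_channels, out_channels, shift_ratio=0.125):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.n_shift = max(1, int(in_channels * shift_ratio))   # channels carried across frames

    def forward(self, x, cached):
        # cached: temporal feature of this layer from the previous frame, or None
        if cached is None:
            cached = torch.zeros_like(x[:, : self.n_shift])
        new_cache = x[:, : self.n_shift]                         # temporal feature for the next frame
        x = torch.cat([cached, x[:, self.n_shift:]], dim=1)      # mix previous-frame channels in
        return torch.relu(self.conv(x)), new_cache
```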
7. The in-cabin gesture interaction method according to claim 1 or 6, wherein after the first position classification result is obtained, the first position classification result is voted on using first one-hot modulation vectors of all cabin positions in the cabin to determine a second position classification result, wherein the first one-hot modulation vectors are generated according to the actual position class label of the real-time image;
and if the first center point is located in the area of the cabin position indicated by the second position classification result, the second position classification result is output as the first position class detection result.
8. The in-cabin gesture interaction method according to claim 1, further comprising, before receiving the real-time image:
receiving light intensity information in the cabin;
and controlling a camera device in the cabin to capture images using visible light or near-infrared light according to the light intensity information.
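A toy sketch of the light-based switching in claim 8, assuming a scalar light-intensity reading; the threshold value and the mode names are assumptions.

```python
LOW_LIGHT_THRESHOLD = 10.0   # lux, assumed value

def choose_imaging_mode(light_intensity):
    # below the threshold, fall back to near-infrared capture
    return "near_infrared" if light_intensity < LOW_LIGHT_THRESHOLD else "visible"
```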
9. The in-cabin gesture interaction method according to claim 6, wherein training the gesture recognition model comprises:
inputting continuous image samples into an initial model;
acquiring a gesture convolution result and a position convolution result obtained after the continuous image samples pass through the third preset number of convolution layers, wherein each convolution layer comprises, in sequence, a temporal shift module and a convolution module, and the input data of the temporal shift module is the temporal feature output by the previous convolution layer;
inputting the gesture convolution result and the position convolution result into a gesture classifier and a position classifier respectively to obtain a second gesture class detection result and a second position class detection result;
and performing iterative training on the initial model according to a loss function between the second gesture class detection result and the gesture class label of the continuous image samples and a loss function between the second position class detection result and the actual position class label of the continuous image samples, to obtain the gesture recognition model.
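A minimal training sketch for claim 9, assuming the initial model returns a gesture convolution result and a position convolution result and that both heads are trained with cross-entropy; the model, feature size, class counts, data and optimizer are all assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

NUM_GESTURES = 10        # assumed
NUM_POSITIONS = 5        # assumed
gesture_classifier = nn.Linear(256, NUM_GESTURES)     # feature size 256 is assumed
position_classifier = nn.Linear(256, NUM_POSITIONS)
criterion = nn.CrossEntropyLoss()

def train_step(model, optimizer, clip, gesture_label, position_label):
    # clip: a batch of continuous image samples
    gesture_feat, position_feat = model(clip)          # results after the convolution layers
    loss = (criterion(gesture_classifier(gesture_feat), gesture_label)
            + criterion(position_classifier(position_feat), position_label))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```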
10. The in-cabin gesture interaction method according to claim 9, wherein the input data of the position classifier is a dot product of the position convolution result and a second one-hot modulation vector generated by a position modulator, the second one-hot modulation vector being generated according to the actual position class label of the continuous image samples.
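A sketch of one reading of claim 10, assuming the "dot product" is an element-wise modulation of the position convolution result by a one-hot vector over cabin positions; the shapes and the number of cabin positions are assumptions.

```python
import torch.nn.functional as F

NUM_POSITIONS = 5   # assumed number of cabin positions

def modulate_position_feature(position_feature, actual_position_label):
    # position_feature: (batch, NUM_POSITIONS, ...) position convolution result
    one_hot = F.one_hot(actual_position_label, NUM_POSITIONS).float()   # (batch, NUM_POSITIONS)
    while one_hot.dim() < position_feature.dim():
        one_hot = one_hot.unsqueeze(-1)      # broadcast over any remaining dimensions
    return position_feature * one_hot        # input handed to the position classifier
```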
11. The in-cabin gesture interaction method according to claim 9 or 10, wherein a second center point of the person performing the gesture is obtained after the continuous image samples pass through the third preset number of convolution layers;
and performing iterative training on the initial model using a loss function between the second center point and a third center point of the person performing the gesture, the third center point being obtained by human body detection on the continuous image samples.
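A small sketch of the center-point term in claim 11, assuming the second center point is regressed by the model and the third center point comes from a separate human-body detector run on the same samples; smooth L1 is an assumed choice, the claim only requires "a loss function" between the two points.

```python
import torch.nn as nn

center_criterion = nn.SmoothL1Loss()

def center_point_loss(second_center, third_center):
    # second_center, third_center: (batch, 2) normalized (x, y) coordinates
    return center_criterion(second_center, third_center)
```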
12. The in-cabin gesture interaction method according to claim 8, wherein the camera device in the cabin is provided at a dome light or an interior rearview mirror of the cabin.
13. An in-cabin gesture interaction device, comprising a real-time image receiving module, a gesture recognition module and a control module;
the real-time image receiving module is used for receiving a real-time image in the cabin;
the gesture recognition module is used for inputting the real-time image into a gesture recognition model to obtain a first gesture class detection result and a first position class detection result output by the gesture recognition model, wherein the first gesture class detection result indicates the class of a first gesture, and the first position class detection result indicates the first cabin position where the person performing the first gesture is located;
the control module is used for controlling equipment in the cabin according to a control instruction corresponding to the first gesture at the first cabin position, wherein for different first cabin positions the control instructions corresponding to the same first gesture are different;
the gesture recognition model comprises a processing module, a judging module and an output module;
the processing module is used for processing the real-time image to obtain a first gesture classification result, a first position classification result and a first center point of the person performing the first gesture;
the judging module is used for judging whether the first center point is located in the area of the cabin position indicated by the first position classification result;
and the output module is used for taking the first gesture classification result as the first gesture class detection result and outputting it, and, when the first center point is located in the area of the cabin position indicated by the first position classification result, taking the first position classification result as the first position class detection result and outputting it.
14. The in-cabin gesture interaction device according to claim 13, wherein the processing module comprises a temporal feature extractor, the temporal feature extractor comprises an offset feature storage module and a third preset number of convolution layers, each convolution layer comprises, in sequence, a temporal shift module and a convolution module, the input data of the temporal shift module is the temporal feature output by the previous convolution layer, and the offset feature storage module stores only the third preset number of temporal features obtained from the same frame;
and the input data of the temporal feature extractor are all the temporal features in the offset feature storage module and the real-time image.
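A sketch of the offset feature storage module in claim 14: it keeps exactly one temporal feature per convolution layer, all produced by the same (previous) frame, and is overwritten as a whole when the next frame is processed. The class and method names are assumptions.

```python
class OffsetFeatureStore:
    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.features = [None] * num_layers      # one slot per convolution layer

    def read(self, layer_index):
        # temporal feature cached for this layer from the previous frame
        return self.features[layer_index]

    def write_frame(self, frame_features):
        # frame_features: the third preset number of temporal features from one frame
        assert len(frame_features) == self.num_layers
        self.features = list(frame_features)     # replace the whole store at once
```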
15. The in-cabin gesture interaction device according to claim 14, wherein the processing module further comprises a position modulator, a voting module, and a gesture classifier and a position classifier connected to the output of the temporal feature extractor;
the position classifier outputs a first position classification result, the gesture classifier outputs a first gesture classification result, and the position modulator outputs third one-hot modulation vectors of all cabin positions in the cabin;
the voting module is used for voting on the first position classification result using the third one-hot modulation vectors of all the cabin positions in the cabin to determine a second position classification result;
wherein the third one-hot modulation vectors are generated according to the actual position class label of the real-time image.
16. The in-cabin gesture interaction device according to claim 15, wherein the processing module further comprises a position regressor, the input data of the position regressor is the output data of the temporal feature extractor, and the output data of the position regressor is the first center point.
17. The in-cabin gesture interaction device according to claim 16, wherein in a training phase, the input data of the position classifier is a dot product of the position convolution result output by the convolution layers and a fourth one-hot modulation vector generated by a position modulator, and the fourth one-hot modulation vector is generated according to the actual position class labels of continuous image samples.
CN202211381906.3A 2022-11-07 2022-11-07 Gesture interaction method and device in cabin Active CN115424356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211381906.3A CN115424356B (en) 2022-11-07 2022-11-07 Gesture interaction method and device in cabin

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211381906.3A CN115424356B (en) 2022-11-07 2022-11-07 Gesture interaction method and device in cabin

Publications (2)

Publication Number Publication Date
CN115424356A CN115424356A (en) 2022-12-02
CN115424356B true CN115424356B (en) 2023-04-04

Family

ID=84207757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211381906.3A Active CN115424356B (en) 2022-11-07 2022-11-07 Gesture interaction method and device in cabin

Country Status (1)

Country Link
CN (1) CN115424356B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10146320B2 (en) * 2007-10-29 2018-12-04 The Boeing Company Aircraft having gesture-based control for an onboard passenger service unit
US8194921B2 (en) * 2008-06-27 2012-06-05 Nokia Corporation Method, appartaus and computer program product for providing gesture analysis
CN112905003A (en) * 2021-01-21 2021-06-04 浙江吉利控股集团有限公司 Intelligent cockpit gesture control method and device and storage medium
CN115079819A (en) * 2022-05-09 2022-09-20 浙江零跑科技股份有限公司 Intelligent cabin control method and system based on gestures

Also Published As

Publication number Publication date
CN115424356A (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN111857356B (en) Method, device, equipment and storage medium for recognizing interaction gesture
US9501693B2 (en) Real-time multiclass driver action recognition using random forests
US11055516B2 (en) Behavior prediction method, behavior prediction system, and non-transitory recording medium
CN108681688B (en) Gesture recognition assembly and recognition method thereof
KR20190105745A (en) Electronic apparatus and control method thereof
US20230129816A1 (en) Speech instruction control method in vehicle cabin and related device
US20110150301A1 (en) Face Identification Method and System Using Thereof
CN111353451A (en) Battery car detection method and device, computer equipment and storage medium
CN111401196A (en) Method, computer device and computer readable storage medium for self-adaptive face clustering in limited space
CN111860254A (en) Driver abnormal behavior detection method and device, storage medium and equipment
US11804026B2 (en) Device and a method for processing data sequences using a convolutional neural network
CN113780062A (en) Vehicle-mounted intelligent interaction method based on emotion recognition, storage medium and chip
CN110767228B (en) Sound acquisition method, device, equipment and system
CN112487844A (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
CN115424356B (en) Gesture interaction method and device in cabin
CN108810551B (en) Video frame prediction method, terminal and computer storage medium
CN114051116A (en) Video monitoring method, device and system for driving test vehicle
CN111860259A (en) Training and using method, device, equipment and medium of driving detection model
CN115390678B (en) Virtual human interaction method and device, electronic equipment and storage medium
US20220164630A1 (en) 3d separable deep convolutional neural network for moving object detection
CN116363542A (en) Off-duty event detection method, apparatus, device and computer readable storage medium
CN113611318A (en) Audio data enhancement method and related equipment
CN113901895B (en) Door opening action recognition method and device for vehicle and processing equipment
CN115546875B (en) Multitask-based cabin internal behavior detection method, device and equipment
CN114359880B (en) Riding experience enhancement method and device based on intelligent learning model and cloud

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant