CN115509366A - Intelligent cabin multi-modal man-machine interaction control method and device and electronic equipment - Google Patents

Intelligent cabin multi-modal man-machine interaction control method and device and electronic equipment

Info

Publication number
CN115509366A
CN115509366A (application number CN202211465041.9A)
Authority
CN
China
Prior art keywords
instruction
cabin
intention
voice
gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211465041.9A
Other languages
Chinese (zh)
Inventor
宁传光
陈云飞
蒋正中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202211465041.9A
Publication of CN115509366A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F 3/16 Sound input; Sound output
    • G06F 3/162 Interface to dedicated audio devices, e.g. audio drivers, interface to CODECs
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses an intelligent cabin multi-modal human-machine interaction control method and device, and electronic equipment. The method can effectively achieve vehicle-grade intelligent interaction, fills the gap in the multi-modal combination of voice and gesture, provides users with a more humanized, efficient, and accurate human-computer interaction experience in vehicle scenarios, and achieves fine-grained recognition of and response to user intention.

Description

Intelligent cabin multi-modal man-machine interaction control method and device and electronic equipment
Technical Field
The invention relates to the field of intelligent cabins, and in particular to an intelligent cabin multi-modal human-computer interaction control method and device, and electronic equipment.
Background
As products and technologies in the intelligent automobile field continue to advance, users' understanding of intelligent vehicles keeps deepening, and both car makers and users place more precise, concise, and diversified requirements on intelligent in-vehicle interaction scenarios. The intelligent experience is no longer confined to voice interaction: the faces and movements of the people in the car are monitored and analyzed so as to build a convenient, safe, and comfortable intelligent cabin environment for the driver and passengers. Specifically, the intelligent cockpit is a comprehensive system that intensively applies technologies such as computing, modern sensing, information fusion, communication, artificial intelligence, and automatic control, and that integrates functions such as environment perception, planning and decision-making, and multi-level driving assistance; it is a typical synthesis of high and new technologies. With the rapid development of the Internet of Vehicles, car makers attach ever greater importance to human-machine interaction in intelligent cabin scenarios, and voice, as the most convenient interaction entrance, plays a vital role in it. However, although on-board voice capability can now cover all functions that can be manually adjusted in the car and full-scene voice control has matured, from the viewpoint of subdivided scenarios there is no lack of complex scenes in which detailed operations bring inconvenience to in-vehicle control or prolong the time needed to reach the control target; in addition, a single voice interaction mode has the unavoidable defect of being easily disturbed by noise.
Gesture interaction is another interaction mode actively explored in the intelligent cockpit field in recent years. Unlike common interaction modes such as buttons and voice, gesture interaction is comparatively easy to master and apply, and has the advantages of fast response and a short operation path. It is, however, limited by the current state of gesture interaction technology: for example, the fine-grained determination and application of the specific scenarios and control objects of gestures are still imperfect, so a single-gesture interaction scheme still suffers from problems such as inability to control complex scenarios, high deployment cost, and low recognition accuracy, and in-vehicle gesture interaction has therefore not been widely adopted.
Disclosure of Invention
In view of the above, the present invention aims to provide an intelligent cockpit multi-modal human-computer interaction control method, apparatus, and electronic device, so as to solve the problems caused by a single control mode in intelligent cockpit scenarios and to fill the gap in multi-modal fusion cockpit interaction.
The technical scheme adopted by the invention is as follows:
In a first aspect, the invention provides an intelligent cabin multi-modal human-computer interaction control method, which comprises the following steps:
continuously acquiring audio signals in the cabin and video signals containing the passengers in the vehicle;
extracting the passenger's voice from the audio signal and capturing an image of the passenger's hand area from the video signal;
performing voice recognition on the passenger's voice to obtain a first instruction, and performing gesture recognition on the hand area image to obtain a second instruction;
determining the control intention of the passenger based on the first instruction and the second instruction, in combination with a preset cabin model;
and triggering the corresponding controlled object in the cabin to execute an action responding to the first instruction and/or the second instruction according to the control intention.
In at least one possible implementation manner, the performing gesture recognition on the hand area image to obtain a second instruction includes:
detecting the gesture type and judging whether the hand movement is a dynamic gesture or a static gesture;
and, when the gesture type is determined to be a pointing action and a static gesture, acquiring the pointing position information of the finger.
In at least one possible implementation manner, the determining the control intention of the passenger includes:
after the first instruction is obtained, obtaining a second instruction according to the first instruction, and determining the control intention by using the first instruction, the second instruction, and the cabin model; or,
after the second instruction is obtained, obtaining a first instruction according to the second instruction, and determining the control intention by using the first instruction, the second instruction, and the cabin model; or,
obtaining the first instruction and the second instruction synchronously, and determining the control intention in combination with the cabin model.
In at least one possible implementation manner, the modeling manner of the cabin model comprises: calibrating a plurality of preset areas corresponding to the real cabin by means of stereo (3D) modeling.
In at least one possible implementation manner, after the second instruction is obtained, the second instruction is displayed as a preset gesture icon.
In at least one possible implementation manner, after the controlled object executes the action responding to the first instruction and/or the second instruction, a prompt tone is broadcast with different preset sound effects according to how the action completed.
In at least one possible implementation manner, the control method further includes: after the current control intention is determined, broadcasting the guidance voice corresponding to the current control intention, and deciding whether subsequent corresponding guidance voices are broadcast according to the number of times the control function corresponding to the current control intention has been used within a set time period.
In a second aspect, the invention provides an intelligent cabin multi-modal human-computer interaction control device, which comprises:
the audio and video acquisition module is used for continuously acquiring audio signals in the cabin and video signals containing passengers in the vehicle;
the passenger voice and hand image acquisition module is used for extracting passenger voice from the audio signal and capturing passenger hand area images from the video signal;
the multi-mode instruction recognition module is used for carrying out voice recognition on the voice of the passenger and obtaining a first instruction, and carrying out gesture recognition on the hand area image and obtaining a second instruction;
the control intention determining module is used for determining the control intention of the passenger based on the first instruction and the second instruction, in combination with a preset cabin model;
and the intention response module is used for triggering the corresponding controlled object in the cabin to execute an action responding to the first instruction and/or the second instruction according to the control intention.
In at least one possible implementation manner, the multimode instruction identification module includes:
the gesture recognition unit is used for detecting the gesture type and judging whether the hand movement is a dynamic gesture or a static gesture;
and the position information acquisition unit is used for acquiring the pointing position information of the finger when the gesture type is determined to be a pointing action and a static gesture.
In at least one possible implementation manner, the manipulation intention determining module is specifically configured to:
after the first instruction is obtained, obtaining a second instruction according to the first instruction, and determining the control intention by using the first instruction, the second instruction and the cabin model; or,
after the second instruction is obtained, obtaining a first instruction according to the second instruction, and determining the control intention by using the first instruction, the second instruction and the cabin model; or,
obtaining the first instruction and the second instruction synchronously, and determining the control intention in combination with the cabin model.
In at least one possible implementation manner, the modeling manner of the cabin model comprises: calibrating a plurality of preset areas corresponding to the real cabin by means of stereo (3D) modeling.
In at least one possible implementation manner, the control device further includes: a gesture display module, used for displaying the second instruction as a preset gesture icon after the second instruction is obtained.
In at least one possible implementation manner, the control device further includes: a controlled object action execution state prompting module, used for broadcasting a prompt tone with different preset sound effects, according to how the action completed, after the controlled object executes the action responding to the first instruction and/or the second instruction.
In at least one possible implementation manner, the control device further includes: a function guidance voice module, used for broadcasting the guidance voice corresponding to the current control intention after the current control intention is determined, and deciding whether subsequent corresponding guidance voices are broadcast according to the number of times the control function corresponding to the current control intention has been used within a set time period.
In a third aspect, the present invention provides an electronic device, comprising:
one or more processors, a memory (which may employ a non-volatile storage medium), and one or more computer programs stored in the memory, the one or more computer programs comprising instructions which, when executed by the apparatus, cause the apparatus to perform the method as in the first aspect or any possible implementation of the first aspect.
The invention mainly aims to solve the problems, such as a single interaction mode and susceptibility to noise interference, of human-machine interaction forms centered on voice technology in automobile cabin scenarios, and proposes combining voice recognition, gesture perception, and cabin modeling technologies. Specifically, voice instructions are parsed based on acoustic processing technology, gestures are recognized using computer vision technology, and the two instructions are then combined with a cabin model, so that multi-modal control interaction can be realized. The invention can effectively realize vehicle-grade intelligent interaction, fills the gap in the multi-modal combination of voice and gestures, provides users with a more humanized, efficient, and accurate human-computer interaction experience in vehicle scenarios, can more accurately control the opening and closing of windows, the brightness level of each screen in the vehicle, and the like, and realizes fine-grained recognition of and response to user intention.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:
fig. 1 is a flowchart of an embodiment of an intelligent cabin multi-modal human-computer interaction control method provided by the invention;
fig. 2 is a schematic diagram of an embodiment of an intelligent cabin multi-modal human-computer interaction control device provided by the invention;
fig. 3 is a schematic diagram of an embodiment of an electronic device provided in the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
The invention provides at least one embodiment of an intelligent cabin multi-modal human-computer interaction control method, as shown in fig. 1, which specifically includes:
s1, continuously acquiring audio signals in a cabin and video signals including passengers in the vehicle;
s2, extracting voice of the passenger from the audio signal and capturing image of the hand area of the passenger from the video signal;
s3, carrying out voice recognition on the voice of the passenger to obtain a first instruction, and carrying out gesture recognition on the hand area image to obtain a second instruction;
s4, determining the control intention of the passenger based on the first instruction and the second instruction and in combination with a preset cabin model;
and S5, triggering a corresponding controlled object in the cabin to execute an action responding to the first instruction and/or the second instruction according to the control intention.
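
To make the overall flow concrete, the following minimal Python sketch chains steps S1-S5 into one processing loop. Every callable passed in is a hypothetical placeholder for the corresponding recognition or control component; none of these names are interfaces defined by the patent.

```python
from typing import Callable, Iterable

def interaction_loop(
    audio_stream: Iterable,       # S1: continuous in-cabin audio chunks
    video_stream: Iterable,       # S1: continuous in-cabin video frames
    cabin_model,                  # preset cabin model used in S4
    extract_speech: Callable,     # S2: passenger-voice extraction (placeholder)
    crop_hand: Callable,          # S2: hand-region capture (placeholder)
    recognize_speech: Callable,   # S3: yields the first instruction (placeholder)
    recognize_gesture: Callable,  # S3: yields the second instruction (placeholder)
    determine_intent: Callable,   # S4: fuses both instructions with the cabin model
    dispatch_action: Callable,    # S5: triggers the controlled object (placeholder)
) -> None:
    for audio_chunk, frame in zip(audio_stream, video_stream):                # S1
        speech = extract_speech(audio_chunk)                                  # S2
        hand_img = crop_hand(frame)                                           # S2
        first = recognize_speech(speech) if speech is not None else None      # S3
        second = recognize_gesture(hand_img) if hand_img is not None else None  # S3
        if first is not None or second is not None:
            intent = determine_intent(first, second, cabin_model)             # S4
            if intent is not None:
                dispatch_action(intent)                                       # S5
```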
In some preferred embodiments, step S3 may have an extended implementation: that is, the performing gesture recognition on the hand region image to obtain the second instruction may specifically include detecting the gesture type and judging whether the hand movement is a dynamic or a static gesture, and, when the gesture type is determined to be a pointing action and a static gesture, acquiring the pointing position information of the finger. A specific implementation of this embodiment is exemplified later.
It should be noted that, when the control intention of the passenger is determined in step S4, there is no strict time-sequence limitation. The first instruction may be parsed first and the second instruction related to it obtained afterwards, with the current intention then determined in combination with the cockpit model: for example, when the first instruction "open the window" is obtained, the second instruction corresponding to the windowing instruction is determined from a pointing-type gesture and its orientation information, say pointing to the front right, and matching against the cockpit model then yields the current intention of opening the front-passenger-side window. Alternatively, the first and second instructions may be parsed synchronously to determine the current intention: for example, if the first instruction "turn on the light" is obtained synchronously with a second instruction pointing at the roof, the passenger's intention is determined, in combination with the cabin model, to be turning on the cabin dome light.
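
One way the fusion in step S4 could look in code, assuming the voice (first) instruction carries the action and a pointing gesture selects the object. The Gesture and Intent types and the cabin_model.locate() lookup (itself sketched below) are illustrative assumptions, not interfaces from the patent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Gesture:
    kind: str         # e.g. "pointing", "wave"
    origin: tuple     # 3D finger position in cabin coordinates (assumed available)
    direction: tuple  # 3D pointing direction

@dataclass
class Intent:
    action: str       # carried by the voice (first) instruction, e.g. "open_window"
    target: str       # cabin region resolved via the cabin model

def determine_intent(first_action: Optional[str],
                     gesture: Optional[Gesture],
                     cabin_model) -> Optional[Intent]:
    """Fuse the voice action with a pointing gesture: the cabin model maps
    the pointing ray to a named region, which becomes the controlled object."""
    if first_action is None or gesture is None:
        return None
    if gesture.kind == "pointing":
        region = cabin_model.locate(gesture.origin, gesture.direction)
        if region is not None:
            # e.g. "open the window" + pointing front-right
            # -> open the front-passenger-side window
            return Intent(action=first_action, target=region)
    return None
```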
As for step S5, it can be specifically noted that having the controlled object execute an action responding to the first instruction and/or the second instruction means that, according to a preset rule, only the content of the first instruction (the voice instruction), such as "turn down the volume", may be executed, with the second instruction used to assist in specifying the concrete controlled object of "turn down the volume" (or for other assisting purposes); or, according to a preset rule, only the content of the second instruction (the gesture instruction) may be executed, for example a hand-wave gesture switching to the next media item, with the first instruction used to assist in specifying the concrete controlled object being switched (or for other assisting purposes); or a fusion of the first instruction and the second instruction may be executed according to a preset rule, for example opening a window according to the voice and, in combination with the gesture, stopping once the window reaches the opening degree the passenger requires.
The cabin model may be built by calibrating a plurality of preset areas corresponding to the real cabin through stereo (3D) modeling; the areas may include, but are not limited to, the side windows, windshield, sunroof, instrument cluster, central console, and rearview mirrors.
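
A minimal geometric sketch of such a cabin model, assuming the calibrated preset areas are stored as named axis-aligned 3D boxes in a common cabin coordinate frame and a pointing gesture is resolved by intersecting its ray with those boxes (the standard slab method). The region names and coordinates below are made up for illustration.

```python
import numpy as np

class CabinModel:
    """Named, calibrated 3D boxes; locate() returns the nearest region hit
    by a pointing ray, or None if the ray misses every region."""

    def __init__(self, regions: dict):
        # regions: {"front_passenger_window": ((x0, y0, z0), (x1, y1, z1)), ...}
        self.regions = {name: (np.asarray(lo, float), np.asarray(hi, float))
                        for name, (lo, hi) in regions.items()}

    def locate(self, origin, direction):
        o = np.asarray(origin, float)
        d = np.asarray(direction, float)
        d = d / np.linalg.norm(d)
        best, best_t = None, np.inf
        for name, (lo, hi) in self.regions.items():
            t = self._ray_box(o, d, lo, hi)
            if t is not None and t < best_t:
                best, best_t = name, t
        return best

    @staticmethod
    def _ray_box(o, d, lo, hi):
        # Slab method for ray / axis-aligned-box intersection.
        with np.errstate(divide="ignore", invalid="ignore"):
            t1, t2 = (lo - o) / d, (hi - o) / d
        tmin = np.nanmax(np.minimum(t1, t2))
        tmax = np.nanmin(np.maximum(t1, t2))
        t_hit = max(tmin, 0.0)
        return t_hit if tmax >= t_hit else None

# Illustrative calibration and query:
cabin = CabinModel({
    "front_passenger_window": ((0.6, 0.4, 0.3), (0.8, 1.2, 0.9)),
    "sunroof": ((-0.3, -0.5, 1.1), (0.4, 0.8, 1.3)),
})
print(cabin.locate(origin=(0.0, 0.0, 0.6), direction=(1.0, 0.7, 0.1)))
# -> "front_passenger_window"
```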
Combining the above embodiments, in actual operation, after a control voice instruction from a person in the vehicle is recognized and a corresponding gesture instruction is detected to have been made, the two instructions are matched through the cabin model to the current specific control scenario, with different gesture meanings assigned in advance, so that a series of relatively complex and fine cabin operations can be implemented, for example controlling multimedia, screen brightness, window opening degree, instrument indication information, air-conditioner level, and other functions. The following is an example function list:
Function | Gesture + voice instruction
Screen brightness adjustment | Controls all display screens in the vehicle; point at the central control screen and say "dim it a little", and the screen dims.
Window and rearview mirror adjustment | Point at the front-passenger window and say "close", and the front-passenger window closes.
In-vehicle instrument introduction | Point at an indicator light and say "what light is this"; the voice assistant replies "this is the fuel level indicator".
Air-conditioner temperature, wind speed, and wind direction adjustment | Gesture + voice control precisely maintains a suitable temperature and humidity in the cabin.
(1) Screen brightness:
All display screens in the vehicle can be controlled: the HUD, central control screen, instrument screen, and front-passenger screen. Example: point a finger at the central control screen and say "dim it a little", and the screen dims.
(2) Windows and rearview mirrors:
The windows and rearview mirrors can be controlled provided safety is ensured (e.g. only while driving at constant speed or parked). Example: point a finger at the front-passenger window and say "close", and the front-passenger window closes.
(3) Instruments:
The indicator lights in the instrument area are addressed in gesture + voice mode. Example: point a finger at an indicator light and say "what light is this", and a preset voice reply "this is the fuel level indicator" is obtained.
(4) Air conditioner:
Multiple parameters such as the in-cabin environment, temperature, and air quality can be combined under gesture + voice control, so that the required temperature and humidity can be precisely maintained in the cabin.
For the voice recognition and gesture recognition processing in step S3, the following description may be referred to:
(I) Speech recognition
The speech recognition network can adopt an Encoder-Decoder or Connectionist Temporal Classification (CTC) model framework; based on at least four vehicle-mounted microphones, and together with corresponding sound-source localization, blind-source separation, noise-reduction, and other algorithms, the voice commands of different passengers are accurately distinguished and recognized.
For example, in some preferred embodiments, user speech recognition is performed with a CTC model, which introduces a blank symbol to solve the problem of unequal input and output sequence lengths; specifically, during processing, the sum of the probabilities of all possible corresponding sequences is maximized, so the model can be trained from input-output pairs alone, without considering the alignment between speech frames and characters. A CTC-based model has an elegant structure and strong readability, but depends heavily on a pronunciation dictionary and a language model and requires a conditional-independence assumption. As an optimization, an RNN-Transducer model can be adopted to improve on CTC: a language-model prediction network is added and combined with the CTC network through a fully connected layer to produce the new output, which removes the conditional-independence assumption on the CTC outputs, allows information to be accumulated over historical outputs and historical speech features, and makes better use of linguistic information to improve recognition accuracy.
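
A minimal, self-contained PyTorch sketch of CTC-style training for a speech-command recognizer. The feature dimension, encoder size, and vocabulary are illustrative assumptions rather than parameters disclosed by the patent, and the RNN-Transducer refinement described above is not shown.

```python
import torch
import torch.nn as nn

vocab_size = 60   # assumed: command characters plus the CTC blank at index 0
encoder = nn.LSTM(input_size=80, hidden_size=256, num_layers=3, batch_first=True)
classifier = nn.Linear(256, vocab_size)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(8, 200, 80)                  # (batch, frames, log-mel bins)
targets = torch.randint(1, vocab_size, (8, 12))  # label indices, blank excluded
feat_lens = torch.full((8,), 200, dtype=torch.long)
tgt_lens = torch.full((8,), 12, dtype=torch.long)

h, _ = encoder(feats)                            # (batch, frames, hidden)
log_probs = classifier(h).log_softmax(dim=-1)    # per-frame symbol posteriors
loss = ctc_loss(log_probs.transpose(0, 1),       # CTCLoss expects (T, N, C)
                targets, feat_lens, tgt_lens)
loss.backward()                                  # CTC marginalizes over all alignments
```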
(II) Gesture recognition
Gesture perception can be achieved by acquiring video of the occupants in real time through a 3D camera arranged above the center console (such as the in-vehicle occupant monitoring camera), feeding it to the in-vehicle AI intelligent driving-assistance platform for gesture recognition, recognizing common gestures with a pre-trained gesture recognition model based on a YOLO network and determining the pointing direction, and finally mapping the result to the automobile parts corresponding to the vehicle model.
In some preferred embodiments, the gesture recognition is two-dimensional gesture recognition under near-field conditions, more specifically two-dimensional static and dynamic gesture recognition based on a monocular camera and computer vision. The industry generally uses a 3D CNN network for video action detection, but in a vehicle-mounted environment the computing power of the on-board chip is limited; to reduce the computational requirements and improve operating efficiency, the invention preferably adopts a video-understanding network framework based on the Temporal Shift Module (TSM). Its backbone is built on MobileNetV2, and, to make full use of the temporal context of the previous frame to assist feature extraction for the current frame, the first 1/8 of the channels of each block's input features are replaced with the previous frame's features, establishing associations between video frames.
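
The channel-shift trick can be sketched as follows for streaming inference; it follows the published TSM idea of mixing 1/8 of the channels between neighboring frames, and both the uni-directional caching shown here and the wiring into MobileNetV2 blocks are assumptions consistent with the description rather than code from the patent.

```python
import torch
import torch.nn as nn

class OnlineTemporalShift(nn.Module):
    """Replace the first 1/8 of the current frame's channels with the cached
    features of the previous frame, injecting temporal context at essentially
    zero extra computation (intended for streaming inference)."""

    def __init__(self, shift_div: int = 8):
        super().__init__()
        self.shift_div = shift_div
        self.prev_fold = None   # cached channel slice from the previous frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) features of the current frame
        fold = x.shape[1] // self.shift_div
        out = x.clone()
        if self.prev_fold is not None:
            out[:, :fold] = self.prev_fold   # previous-frame context
        self.prev_fold = x[:, :fold].detach()
        return out
```

In a full network, one such module would typically sit in front of the convolution inside each block, so that every block sees one frame of history.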
It will be understood by those skilled in the art that the foregoing two recognition processes are merely illustrative; since how speech and gestures are specifically recognized is not the focus of the present invention, it is not further described or limited herein.
After the second instruction is obtained, it can be displayed as a preset gesture icon; specifically, after any gesture is successfully recognized in a configured scenario, the currently recognized gesture icon can be displayed on the in-vehicle device interface: a static image for a static gesture, and a complete animation for a dynamic gesture.
In addition, after the controlled object executes the action responding to the first instruction and/or the second instruction, a prompt tone is broadcast with different preset sound effects according to how the action completed. Specifically, after the execution of the voice + gesture triggered action finishes (whether successfully or not), a sound-effect prompt can be used: for example, the gesture recognition signal is fed back to the vehicle according to the second instruction, the controlled object is triggered to execute the corresponding action, and the state of the currently executed action is checked; if it succeeded, the corresponding first sound effect is broadcast directly, and otherwise the second sound effect is broadcast.
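
A compact sketch of this feedback step; vehicle_bus and play_sound are injected placeholders, and the state value and sound-effect names are made up for illustration.

```python
def trigger_and_report(intent, vehicle_bus, play_sound) -> None:
    """Trigger the controlled object, check the reported execution state,
    then broadcast one of two preset sound effects."""
    vehicle_bus.send(target=intent.target, action=intent.action)  # trigger action
    succeeded = vehicle_bus.query_state(intent.target) == "done"  # completion check
    play_sound("sfx_success" if succeeded else "sfx_failure")     # first/second effect
```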
Finally, it can be added that, in order to increase the interaction success rate and speed up interaction in specific scenarios, the control method further comprises: after the current control intention is determined, broadcasting the guidance voice corresponding to the current control intention, and deciding whether subsequent corresponding guidance voices are broadcast according to the number of times the control function corresponding to the current control intention has been used within a set time period.
For example, if the passenger says "this song sounds really good" and makes an approving gesture, the head unit can broadcast the corresponding synthesized guidance voice. If the passenger has interacted in this way more than 3 times within one day, the broadcasting is suspended; conversely, if the interval between two uses of this interaction mode exceeds the established standard, the voice guidance can be resumed.
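
A sketch of this guidance-voice policy. The one-day window and the 3-use limit come from the example above; the resume threshold corresponds to the unstated "established standard" and is therefore a made-up default here.

```python
import time
from collections import defaultdict
from typing import Optional

class GuidanceVoicePolicy:
    """Broadcast guidance for a control function until it has been used more
    than max_uses times within window_s seconds; resume once the gap between
    two uses exceeds resume_gap_s."""

    def __init__(self, max_uses: int = 3, window_s: float = 24 * 3600,
                 resume_gap_s: float = 7 * 24 * 3600):  # resume gap is assumed
        self.max_uses = max_uses
        self.window_s = window_s
        self.resume_gap_s = resume_gap_s
        self.uses = defaultdict(list)   # function name -> use timestamps

    def should_broadcast(self, function: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        history = self.uses[function]
        if history and now - history[-1] > self.resume_gap_s:
            history.clear()             # long pause since last use: resume guidance
        recent = [t for t in history if now - t <= self.window_s]
        history.append(now)
        return len(recent) < self.max_uses   # suspended after the 3rd recent use
```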
In summary, aiming at the pain points, such as a single interaction mode and susceptibility to noise interference, of human-machine interaction forms centered on voice technology in automobile cabin scenarios, the main idea of the present invention is to combine voice recognition, gesture perception, and cabin modeling technologies: specifically, voice instruction parsing based on acoustic processing technology is combined with gesture recognition using computer vision technology, and the two instructions are then combined with a cabin model to realize multi-modal control and interaction. The invention can effectively realize vehicle-grade intelligent interaction, fills the gap in the multi-modal combination of voice and gestures, provides users with a more humanized, efficient, and accurate human-computer interaction experience in vehicle scenarios, can more accurately control the opening and closing of windows, the brightness level of each screen in the vehicle, and the like, and realizes fine-grained recognition of and response to user intention.
Corresponding to the above embodiments and preferred solutions, the present invention further provides an embodiment of an intelligent cabin multi-modal human-machine interaction control apparatus, as shown in fig. 2, which specifically includes the following components:
the audio and video acquisition module 1 is used for continuously acquiring audio signals in the cabin and video signals containing passengers in the vehicle;
a passenger voice and hand image acquisition module 2, configured to extract passenger voice from the audio signal and capture a passenger hand area image from the video signal;
the multi-mode instruction recognition module 3 is used for performing voice recognition on the voice of the passenger and obtaining a first instruction, and performing gesture recognition on the hand area image and obtaining a second instruction;
the control intention determining module 4 is used for determining the control intention of the passenger based on the first instruction and the second instruction, in combination with a preset cabin model;
and the intention response module 5 is used for triggering the corresponding controlled object in the cabin to execute an action responding to the first instruction and/or the second instruction according to the control intention.
In at least one possible implementation manner, the multimode instruction identification module includes:
the gesture recognition unit is used for detecting the gesture type and judging whether the hand motion is a dynamic gesture or a static gesture;
and the position information acquisition unit is used for acquiring the pointing position information of the finger when the gesture type is determined to be a pointing action and a static gesture.
In at least one possible implementation manner, the manipulation intention determining module is specifically configured to:
after the first instruction is obtained, obtaining a second instruction according to the first instruction, and determining the control intention by using the first instruction, the second instruction and the cabin model; or,
after the second instruction is obtained, obtaining a first instruction according to the second instruction, and determining the control intention by using the first instruction, the second instruction and the cabin model; or,
obtaining the first instruction and the second instruction synchronously, and determining the control intention in combination with the cabin model.
In at least one possible implementation manner, the modeling manner of the cabin model comprises: calibrating a plurality of preset areas corresponding to the real cabin by means of stereo (3D) modeling.
In at least one possible implementation manner, the control device further includes: a gesture display module, used for displaying the second instruction as a preset gesture icon after the second instruction is obtained.
In at least one possible implementation manner, the control device further includes: a controlled object action execution state prompting module, used for broadcasting a prompt tone with different preset sound effects, according to how the action completed, after the controlled object executes the action responding to the first instruction and/or the second instruction.
In at least one possible implementation manner, the control device further includes: a function guidance voice module, used for broadcasting the guidance voice corresponding to the current control intention after the current control intention is determined, and deciding whether subsequent corresponding guidance voices are broadcast according to the number of times the control function corresponding to the current control intention has been used within a set time period.
It should be understood that the division of the components of the intelligent cabin multi-modal human-computer interaction control device shown in fig. 2 is only a logical division; in actual implementation, they may be wholly or partially integrated into one physical entity or physically separated. All of these components may be implemented as software invoked by a processing element; or entirely as hardware; or partly as software invoked by a processing element and partly as hardware. For example, a given module may be a separately established processing element, or may be integrated into a chip of the electronic device; the other components are implemented similarly. In addition, all or some of the components may be integrated together or implemented independently. During implementation, each step of the above method, or each of the above components, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as one or more Application-Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field-Programmable Gate Arrays (FPGAs). For another example, these components may be integrated together and implemented in the form of a System-on-a-Chip (SoC).
In view of the foregoing embodiments and their preferred solutions, it will be appreciated by those skilled in the art that, in practice, the technical idea underlying the present invention may be applied through various carriers, which are schematically illustrated below:
(1) An electronic device is provided. The device may specifically include: one or more processors, memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions, which when executed by the apparatus, cause the apparatus to perform the steps/functions of the foregoing embodiments or an equivalent implementation.
The electronic device may specifically be a computer-related electronic device, such as but not limited to various interactive terminals and electronic products, a mobile terminal, and the like.
Fig. 3 is a schematic structural diagram of an embodiment of an electronic device provided by the present invention; specifically, the electronic device 900 includes a processor 910 and a memory 930. The processor 910 and the memory 930 can communicate with each other and transmit control and/or data signals through an internal connection path; the memory 930 is used for storing a computer program, and the processor 910 is used for calling and running the computer program from the memory 930. The processor 910 and the memory 930 may be combined into a single processing device or, more commonly, be components independent of each other, with the processor 910 executing the program code stored in the memory 930 to implement the functions described above. In particular implementations, the memory 930 may be integrated within the processor 910 or be independent of the processor 910.
In addition, to further improve the functionality of the electronic device 900, the device 900 may include one or more of an input unit 960, a display unit 970, audio circuitry 980 (which may include a speaker 982 and a microphone 984), a camera 990, a sensor 901, and the like; the display unit 970 may include a display screen.
Further, the device 900 may also include a power supply 950 for providing power to various components or circuits within the device 900.
It should be understood that the operations and/or functions of the various components of the apparatus 900 may be referred to in detail in the foregoing description of the embodiments of the method, system, etc., and the detailed description is omitted here where appropriate to avoid repetition.
It should be understood that the processor 910 in the electronic device 900 shown in fig. 3 may be a system-on-chip (SoC); the processor 910 may include a Central Processing Unit (CPU), and may further include other types of processors, such as a Graphics Processing Unit (GPU).
In general, various parts of the processors or processing units within the processor 910 may cooperate to implement the previous method flow, and corresponding software programs for the various parts of the processors or processing units may be stored in the memory 930.
(2) A computer data storage medium is provided, on which a computer program (or the above apparatus) is stored; when executed, the program causes a computer to perform the steps/functions of the foregoing embodiments or equivalent implementations.
In the several embodiments provided by the present invention, any of the functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-accessible data storage medium. Based on this understanding, those aspects of the present invention (or parts thereof) that essentially contribute over the prior art may be embodied in the form of a software product as described below.
In particular, it should be noted that the storage medium may refer to a server or a similar computer device, and specifically, the aforementioned computer program or the aforementioned apparatus is stored in a storage device in the server or the similar computer device.
(3) A computer program product (which may include the above-mentioned apparatus) is provided, which, when run on a terminal device, causes the terminal device to execute the intelligent cabin multi-modal human-machine interaction control method of the foregoing embodiment or an equivalent implementation.
From the above description of the embodiments, it is clear to those skilled in the art that all or part of the steps of the above implementation method can be implemented by software plus a necessary general hardware platform. With this understanding, the above-described computer program product may include, but is not limited to, an app.
In the foregoing, the above device/terminal may be a computer device, and the hardware structure of the computer device may further include: at least one processor, at least one communication interface, at least one memory, and at least one communication bus; the processor, the communication interface, and the memory communicate with one another through the communication bus. The processor may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or a microcontroller, and may further include a Graphics Processing Unit (GPU), an embedded Neural-network Processing Unit (NPU), and an Image Signal Processor (ISP); it may further include an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The processor may be capable of running one or more software programs, which may be stored in a storage medium such as the memory; the aforementioned memory/storage medium may comprise non-volatile memories such as a non-removable magnetic disk, a USB flash drive, a removable hard disk, or an optical disc, as well as Read-Only Memory (ROM), Random Access Memory (RAM), and the like.
In the embodiments of the present invention, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, and means that there may be three relationships, for example, a and/or B, and may mean that a exists alone, a and B exist simultaneously, and B exists alone. Wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" and similar expressions refer to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, and c may represent: a, b, c, a and b, a and c, b and c or a and b and c, wherein a, b and c can be single or multiple.
Those of skill in the art will appreciate that the various modules, elements, and method steps described in the embodiments disclosed in this specification can be implemented as electronic hardware, combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Moreover, the modules and units described herein as separate components may or may not be physically separate; they may be located in one place or distributed over multiple places, such as the nodes of a system network. Some or all of the modules and units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments, which those skilled in the art can understand and carry out without inventive effort.
The structure, features, and effects of the present invention have been described in detail above with reference to the embodiments shown in the drawings. The above embodiments, however, are merely preferred embodiments of the present invention, and the technical features of the above embodiments and their preferred modes can, without departing from or changing the design idea and technical effects of the present invention, be reasonably combined and configured into various equivalent schemes by those skilled in the art. Therefore, the invention is not limited to the embodiments shown in the drawings; all modifications and equivalent embodiments conceivable from the idea of the invention fall within the scope of the invention, so long as they do not go beyond the spirit of the description and the drawings.

Claims (10)

1. An intelligent cabin multi-modal human-computer interaction control method is characterized by comprising the following steps:
continuously acquiring audio signals in the cabin and video signals containing the passengers in the vehicle;
extracting the passenger's voice from the audio signal and capturing an image of the passenger's hand area from the video signal;
performing voice recognition on the passenger's voice to obtain a first instruction, and performing gesture recognition on the hand area image to obtain a second instruction;
determining the control intention of the passenger based on the first instruction and the second instruction, in combination with a preset cabin model;
and triggering the corresponding controlled object in the cabin to execute the action responding to the first instruction and/or the second instruction according to the control intention.
2. The intelligent cabin multi-modal human-computer interaction control method according to claim 1, wherein the performing gesture recognition on the hand area image to obtain a second instruction comprises:
detecting the gesture type and judging whether the hand movement is a dynamic gesture or a static gesture;
and, when the gesture type is determined to be a pointing action and a static gesture, acquiring the pointing position information of the finger.
3. The intelligent cabin multi-modal human-computer interaction control method according to claim 1, wherein the determining the control intention of the passenger comprises:
after the first instruction is obtained, obtaining a second instruction according to the first instruction, and determining the control intention by using the first instruction, the second instruction and the cabin model; or,
after the second instruction is obtained, obtaining a first instruction according to the second instruction, and determining the control intention by using the first instruction, the second instruction and the cabin model; or,
obtaining the first instruction and the second instruction synchronously, and determining the control intention in combination with the cabin model.
4. The intelligent cockpit multi-modal human-computer interaction control method according to claim 1, wherein the modeling manner of the cabin model comprises: calibrating a plurality of preset areas corresponding to the real cabin by means of stereo (3D) modeling.
5. The intelligent cabin multi-modal human-computer interaction control method according to claim 1, wherein, after the second instruction is obtained, the second instruction is displayed as a preset gesture icon.
6. The intelligent cabin multi-modal human-computer interaction control method according to claim 1, wherein, after the controlled object executes the action responding to the first instruction and/or the second instruction, a prompt tone is broadcast with different preset sound effects according to how the action completed.
7. The intelligent cabin multi-modal human-computer interaction control method according to any one of claims 1 to 6, further comprising: after the current control intention is determined, broadcasting the guidance voice corresponding to the current control intention, and deciding whether subsequent corresponding guidance voices are broadcast according to the number of times the control function corresponding to the current control intention has been used within a set time period.
8. An intelligent cabin multi-modal human-computer interaction control device is characterized by comprising:
the audio and video acquisition module is used for continuously acquiring audio signals in the cabin and video signals containing passengers in the vehicle;
the passenger voice and hand image acquisition module is used for extracting passenger voice from the audio signal and capturing passenger hand area images from the video signal;
the multi-mode instruction recognition module is used for carrying out voice recognition on the voice of the passenger and obtaining a first instruction, and carrying out gesture recognition on the hand area image and obtaining a second instruction;
the control intention determining module is used for determining the control intention of the passenger based on the first instruction and the second instruction, in combination with a preset cabin model;
and the intention response module is used for triggering the corresponding controlled object in the cabin to execute an action responding to the first instruction and/or the second instruction according to the control intention.
9. An electronic device, comprising:
one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions which, when executed by the electronic device, cause the electronic device to perform the intelligent cockpit multimodal human machine interaction control method of any one of claims 1 to 7.
10. A computer data storage medium, wherein a computer program is stored in the computer data storage medium, and when the computer program runs on a computer, the computer program causes the computer to execute the intelligent cockpit multi-modal human-computer interaction control method according to any one of claims 1 to 7.
CN202211465041.9A 2022-11-21 2022-11-21 Intelligent cabin multi-modal man-machine interaction control method and device and electronic equipment Pending CN115509366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211465041.9A CN115509366A (en) 2022-11-21 2022-11-21 Intelligent cabin multi-modal man-machine interaction control method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115509366A true CN115509366A (en) 2022-12-23

Family

ID=84514076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211465041.9A Pending CN115509366A (en) 2022-11-21 2022-11-21 Intelligent cabin multi-modal man-machine interaction control method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115509366A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109643166A (en) * 2016-09-21 2019-04-16 苹果公司 The control based on gesture of autonomous vehicle
CN113306491A (en) * 2021-06-17 2021-08-27 深圳普捷利科技有限公司 Intelligent cabin system based on real-time streaming media
CN113591659A (en) * 2021-07-23 2021-11-02 重庆长安汽车股份有限公司 Gesture control intention recognition method and system based on multi-modal input
CN113646736A (en) * 2021-07-17 2021-11-12 华为技术有限公司 Gesture recognition method, device and system and vehicle
CN114443888A (en) * 2021-12-14 2022-05-06 苏州安智汽车零部件有限公司 Multi-mode man-machine interaction system for automatic driving intelligent cabin
CN114840090A (en) * 2022-05-23 2022-08-02 阿里巴巴(中国)有限公司 Virtual character driving method, system and equipment based on multi-modal data
CN115158197A (en) * 2022-07-21 2022-10-11 重庆蓝鲸智联科技有限公司 Control system of on-vehicle intelligent passenger cabin amusement based on sound localization
CN115195637A (en) * 2022-08-17 2022-10-18 苏州市日深智能科技有限公司 Intelligent cabin system based on multimode interaction and virtual reality technology
CN115205729A (en) * 2022-06-08 2022-10-18 智己汽车科技有限公司 Behavior recognition method and system based on multi-mode feature fusion
CN115329059A (en) * 2022-08-09 2022-11-11 海信集团控股股份有限公司 Electronic specification retrieval method and device and vehicle machine

Similar Documents

Publication Publication Date Title
US10908677B2 (en) Vehicle system for providing driver feedback in response to an occupant's emotion
US11042766B2 (en) Artificial intelligence apparatus and method for determining inattention of driver
CN109941231B (en) Vehicle-mounted terminal equipment, vehicle-mounted interaction system and interaction method
CN105654753A (en) Intelligent vehicle-mounted safe driving assistance method and system
US10994612B2 (en) Agent system, agent control method, and storage medium
US11176948B2 (en) Agent device, agent presentation method, and storage medium
US9275274B2 (en) System and method for identifying handwriting gestures in an in-vehicle information system
JP6977004B2 (en) In-vehicle devices, methods and programs for processing vocalizations
JP2017007652A (en) Method for recognizing a speech context for speech control, method for determining a speech control signal for speech control, and apparatus for executing the method
US10901503B2 (en) Agent apparatus, agent control method, and storage medium
CN110936797B (en) Automobile skylight control method and electronic equipment
US20230129816A1 (en) Speech instruction control method in vehicle cabin and related device
CN113591659B (en) Gesture control intention recognition method and system based on multi-mode input
CN112959945B (en) Vehicle window control method and device, vehicle and storage medium
CN115205729A (en) Behavior recognition method and system based on multi-mode feature fusion
CN112083795A (en) Object control method and device, storage medium and electronic equipment
CN118197315A (en) Cabin voice interaction method, system and computer readable medium
EP4354426A1 (en) Human-computer interaction method and apparatus, device, and vehicle
GB2578766A (en) Apparatus and method for controlling vehicle system operation
CN115509366A (en) Intelligent cabin multi-modal man-machine interaction control method and device and electronic equipment
EP4029716A1 (en) Vehicle interactive system and method, storage medium, and vehicle
CN116486383A (en) Smoking behavior recognition method, smoking detection model, device, vehicle, and medium
CN115101070A (en) Vehicle control method and device, vehicle and electronic equipment
CN111696548A (en) Method and device for displaying driving prompt information, electronic equipment and storage medium
CN114201225B (en) Method and device for waking up functions of vehicle and machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20221223)