CN115205729A - Behavior recognition method and system based on multi-modal feature fusion - Google Patents

Behavior recognition method and system based on multi-modal feature fusion

Info

Publication number
CN115205729A
Authority
CN
China
Prior art keywords
information
voice
fusion
behavior
characteristic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210641293.6A
Other languages
Chinese (zh)
Inventor
张伟捷
姚劲
高瑞
任昶伟
李波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhiji Automobile Technology Co Ltd
Original Assignee
Zhiji Automobile Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhiji Automobile Technology Co Ltd filed Critical Zhiji Automobile Technology Co Ltd
Priority to CN202210641293.6A
Publication of CN115205729A
Legal status: Pending (current)

Classifications

    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 20/56 Context or environment of the image exterior to a vehicle, by using sensors mounted on the vehicle
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/25 Speech recognition using non-acoustical features: position of the lips, movement of the lips, or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a behavior recognition method and system based on multi-modal feature fusion. The method comprises the following steps: acquiring voice information, video information and vehicle running state data, wherein the video information comprises face data, road condition and environment data, and limb data; processing the voice information and the video information respectively to obtain voice characteristic information, text characteristic information and image characteristic information; inputting the voice characteristic information, the text characteristic information and the image characteristic information into corresponding classifiers respectively, and fusing the results output by the classifiers to form fusion information; and performing behavior recognition based on the fusion information, and matching and detecting the vehicle running state data based on the behavior recognition result.

Description

Behavior recognition method and system based on multi-modal feature fusion
Technical Field
The invention relates to the field of intelligent driving, in particular to a behavior recognition method and system based on multi-modal feature fusion.
Background
With the continuous development and application of artificial intelligence technology, human behavior recognition has become a popular research direction in computer vision and pattern recognition, and is widely applied in intelligent video surveillance, behavior analysis, human-machine interaction, virtual reality and other fields. Existing deep-learning-based methods fall into three types: recurrent neural network models (RNN), convolutional neural network models (CNN), and graph convolutional network models (GCN), in which skeleton joint information is represented as a vector sequence, a pseudo-image, or a graph, respectively.
At present, most approaches extract single-modality behavior features and simply map them to robot instructions during interaction. Single-modality behavior recognition cannot meet users' demands on intelligent vehicles in complex driving scenarios, and its recognition performance is not ideal.
The prior art is therefore still in need of further development.
Disclosure of Invention
In view of the above technical problem, the invention provides a behavior recognition method and system based on multi-modal feature fusion.
The invention provides a behavior recognition method based on multi-modal feature fusion, which is applied to a vehicle and comprises the following steps:
acquiring voice information, video information and vehicle running state data, wherein the video information comprises face data, road condition and environment data and limb data;
processing the voice information and the video information respectively to obtain voice characteristic information, text characteristic information and image characteristic information; respectively inputting the voice characteristic information, the text characteristic information and the image characteristic information into corresponding classifiers, and fusing results output by the classifiers to form fused information;
and performing behavior recognition based on the fusion information, and matching and detecting the running state data of the vehicle based on the behavior recognition result.
Optionally, the processing the voice information and the video information respectively to obtain voice feature information, text feature information, and image feature information includes:
processing the voice information by using a voice recognition model, and at least acquiring identity information, emotional state information, voiceprint characteristics and voice text information of a driver and a passenger;
processing the video information by using a video recognition model, and at least acquiring facial information, age information, identity information, gender information, emotional state information, body posture information, gesture information, lip information, sight line information and road condition and environment information of a driver and a passenger;
and judging whether the voice information contains a control instruction according to the voice text information, and if it does not, giving high priority to the video information processing result.
Optionally, the fusing the results output by the plurality of classifiers to form fused information includes:
and fusing, by a multi-modal recognition model, the voice characteristic information, the text characteristic information and the image characteristic information, with an image sequence, an acoustic characteristic or a spectrogram characteristic as the fusion direction, to obtain the fusion information.
Optionally, the performing behavior recognition based on the fusion information includes:
the fusion information comprises at least one behavior data, and each behavior data is obtained at least according to one of the voice characteristic information, the text characteristic information and the image characteristic information; and if the fusion information comprises a plurality of behavior data, determining the behavior corresponding to the fusion information according to the identification probability of each behavior data and the behavior parameter quantity associated with the behavior data.
In a second aspect of the present invention, there is provided a behavior recognition system based on multi-modal feature fusion, applied to a vehicle, including:
an acquisition module, used for acquiring voice information, video information and vehicle running state data, wherein the video information comprises face data, road condition and environment data, and limb data;
the fusion module is used for respectively processing the voice information and the video information to obtain voice characteristic information, text characteristic information and image characteristic information; respectively inputting the voice characteristic information, the text characteristic information and the image characteristic information into corresponding classifiers, and fusing results output by the plurality of classifiers to form fused information;
and the identification module is used for performing behavior identification based on the fusion information and matching and detecting the running state data of the vehicle based on the behavior identification result.
Optionally, the fusion module comprises:
the voice recognition module is used for processing the voice information and at least acquiring identity information, emotional state information, voiceprint characteristics and voice text information of a driver and a passenger;
the video recognition module is used for processing the video information and at least acquiring facial information, age information, identity information, gender information, emotional state information, body posture information, gesture information, lip information, sight line information and road condition and environment information of the driver and passengers;
and judging whether the voice information contains a control instruction according to the voice text information, and if it does not, giving high priority to the video information processing result.
Optionally, the fusion module further comprises:
and the fusion subunit is used for fusing, by the multi-modal recognition model, the voice characteristic information, the text characteristic information and the image characteristic information, with an image sequence, an acoustic characteristic or a spectrogram characteristic as the fusion direction, to obtain the fusion information.
Optionally, the identification module comprises:
the identification subunit is used for acquiring the fusion information which comprises at least one behavior data, wherein each behavior data is obtained at least according to one of the voice characteristic information, the text characteristic information and the image characteristic information; and if the fusion information comprises a plurality of behavior data, determining the behavior corresponding to the fusion information according to the identification probability of each behavior data and the behavior parameter quantity associated with the behavior data.
In a third aspect of the present invention, a vehicle control method based on multi-modal feature fusion is provided, which is applied to a vehicle, and includes:
according to the behavior recognition method based on the multi-modal feature fusion, a behavior recognition result is obtained;
analyzing the recognition result, and determining a driving scene matched with the recognition result, wherein different driving scenes are associated with different functions of the vehicle;
starting the driving scene, controlling the vehicle functions related to the driving scene to start and closing the vehicle functions not related to the driving scene; the driving scene includes at least one scene mode.
In a fourth aspect of the present invention, there is provided a vehicle control system based on multi-modal feature fusion, applied to a vehicle, including:
the behavior recognition module is used for obtaining a behavior recognition result according to the behavior recognition method based on the multi-modal feature fusion in the first aspect of the invention;
the scene determining module is used for analyzing the recognition result and determining a driving scene matched with the recognition result, wherein different driving scenes are associated with different functions of the vehicle;
and the control module is used for starting the driving scene, controlling the vehicle function related to the driving scene to start and closing the vehicle function not related to the driving scene.
In a fifth aspect of the invention, there is provided a vehicle comprising a processor, a memory and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method provided in the first aspect or the third aspect of the invention.
A sixth aspect of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method provided in the first aspect or the third aspect of the invention.
According to the technical scheme provided by the invention, voice characteristic information, text characteristic information and image characteristic information are obtained from the voice information and the video information, and control instructions can be identified in time from the text characteristic information to improve the response speed. The voice, text and image characteristic information undergo multi-modal feature fusion, and behavior recognition is performed on the fusion information, which improves response speed, accuracy and system fault tolerance. Behavior recognition and control are realized by one multi-modal recognition model, which reduces development and maintenance costs compared with multiple single-modality recognition models.
Drawings
FIG. 1 is a schematic flow chart of a behavior recognition method based on multi-modal feature fusion according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-modal fusion behavior recognition algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a behavior recognition system based on multi-modal feature fusion according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a vehicle control method based on multi-modal feature fusion according to an embodiment of the present invention;
FIG. 5 is a block diagram of a vehicle control system based on multi-modal feature fusion according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Deep learning learns the intrinsic laws and representation hierarchies of sample data, and the information obtained in the learning process is very helpful for interpreting data such as text, images and sound. Its ultimate aim is to give machines human-like analysis and learning capability, able to recognize data such as text, images and sound. Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization, and other related fields. It enables a machine to imitate human activities such as seeing, hearing and thinking, and solves many complex pattern recognition problems.
Modality refers to the way in which something occurs or exists, and multimodal refers to a combination of two or more modalities. Each source or form of information can be called a modality; current research mainly deals with three modalities: image, text and voice. Modalities are fused because different modalities express things differently and view them from different angles, so there is intersection (information redundancy) and complementarity (better than any single feature), and possibly several different information interactions between modalities; if multi-modal information can be processed reasonably, rich feature information can be obtained. In summary, multimodality has two distinct characteristics: redundancy and complementarity. Through complementary fusion of different feature sets, the latent information shared by the modalities' data is learned jointly, improving the effectiveness of data tasks.
During vehicle control, analysis can be carried out on the behavior states of the driver and passengers, and from the analysis result, the functions suitable for controlling the vehicle in the current driving process can be identified. An occupant's behavior state may be a stationary state, a sustained behavior, facial expressions, gaze direction, and the like. As a scene example, a driver may doze or squint during fatigued driving; the vehicle can then be controlled to give the driver a reminder or an audio stimulus.
Referring to fig. 1, the present invention provides a behavior recognition method based on multi-modal feature fusion, including:
step 101: and acquiring voice information, video information and running state data of the vehicle, wherein the video information comprises face data, road condition and environment data and limb data.
The voice information and the video information can be obtained through a vehicle-mounted microphone and vehicle-mounted cameras, where the cameras include cameras arranged in the cabin and cameras arranged outside the vehicle, which respectively capture video images inside and outside the vehicle; the captured data undergo preprocessing such as environmental noise filtering (de-distortion, ISP (image signal processing), noise reduction, and the like). From these data, the face data and limb data of the driver and passengers can be obtained; the road condition and environment data include, but are not limited to, recognition of surrounding vehicles, surrounding buildings, and specific surroundings along the driving road (such as a sea of flowers or cherry trees).
The vehicle running state data comprise the running conditions of the vehicle's functions, such as the air-conditioner state, fan blowing level, multimedia state, door and window state, ambient-light state, vehicle speed, and the like. Once the corresponding running states are known, the vehicle's functions and driving behavior can be controlled accordingly; in particular, automatic control can be realized in the driving-assistance state.
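As an illustrative sketch (not part of the original patent text), the running state data listed above could be carried in a structure like the following, in Python; all field names here are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class VehicleState:
        """Hypothetical container for the running state data named above."""
        ac_on: bool             # air-conditioner running state
        fan_level: int          # fan blowing amount
        media_playing: bool     # multimedia state
        windows_open: bool      # door and window state
        ambient_light_on: bool  # atmosphere-lamp state
        speed_kmh: float        # running speed of the vehicle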
Step 102: processing the voice information and the video information respectively to obtain voice characteristic information, text characteristic information and image characteristic information; and respectively inputting the voice characteristic information, the text characteristic information and the image characteristic information into corresponding classifiers, and fusing results output by a plurality of classifiers to form fused information.
Specifically, the voice information can be processed by speech recognition models, obtaining at least the identity information, emotional state information, voiceprint characteristics and voice text information of the driver and passengers. The speech recognition models can include several recognition models, such as a speech-to-text model, a voiceprint recognition model, and models for emotion recognition and identity recognition from voice. The output of the speech-to-text model is also used for vehicle control: if the recognized text is a control instruction, it is forwarded to the central computing platform (control system) for execution. For example, if the transcribed text is 'air conditioner on', this control instruction needs to be executed; at the same time, the 'air conditioner on' recognition result also participates in multi-modal feature fusion.
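A minimal sketch of this routing, assuming a hypothetical command table and CCP interface (the patent does not fix these names):

    # Hypothetical mapping from transcribed text to CCP control instructions.
    COMMANDS = {"air conditioner on": "AC_ON", "open window": "WINDOW_OPEN"}

    def route_transcript(text: str, ccp, fusion_inputs: list) -> None:
        """Execute a recognized control instruction via the CCP, and always
        forward the transcript to multi-modal feature fusion as well."""
        instruction = COMMANDS.get(text.strip().lower())
        if instruction is not None:
            ccp.execute(instruction)  # assumed CCP interface
        fusion_inputs.append(text)    # the result also joins the fusion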
The video information is processed by video recognition models; accordingly, several video recognition models can be used. Together, these models acquire at least the facial information, age information, identity information, gender information, emotional state information, body posture information, gesture information, lip shape information, gaze information, and road condition and environment information of the driver and passengers. The facial information may include facial expressions, pupils, eye states, and so on; the occupants' emotional state can be recognized through micro-expression recognition; based on a posture recognition model, the body posture can be used to judge whether the occupant's current posture is normal, for example whether the driver is leaning, and, combined with the facial information, whether the user is dozing.
Referring to fig. 2, fig. 2 is a schematic diagram of the multi-modal fusion behavior recognition algorithm. The input devices collect voice information and video information; the voice information is split into speech and text as input data, features are extracted by the corresponding feature extraction models, and the features are classified by classifiers. The classified speech, text and video results are feature-fused by the multi-modal recognition model, which computes and outputs the recognition result.
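The pipeline of fig. 2 can be pictured as the following sketch, where the feature extractors, classifiers and fusion model are stand-ins (the patent does not specify their architectures):

    def recognize(audio, frames, asr, extractors, classifiers, fusion_model):
        """Sketch of fig. 2: split speech into voice and text inputs,
        extract features per modality, classify, then fuse."""
        inputs = {
            "voice": audio,      # raw/spectral speech signal
            "text": asr(audio),  # speech-to-text transcript
            "image": frames,     # video key frames
        }
        scores = {m: classifiers[m](extractors[m](x)) for m, x in inputs.items()}
        return fusion_model(scores)  # fused multi-modal recognition result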
Those skilled in the art should understand that although the multi-modal recognition model handles information redundancy, this does not remove the drawback that redundant information can interfere with the recognition result; redundancy can, however, also supplement single-modality recognition. For speech-to-text voice control commands, for example, redundant information can cause interference.
Thus, the multi-modal recognition model can be designed to output several kinds of recognition results: speech recognition 1, speech recognition 2 (text recognition), image recognition, speech-text fusion recognition, speech-image fusion recognition, and speech-text-image three-modality fusion recognition.
The multi-modal recognition model fuses the voice characteristic information, the text characteristic information and the image characteristic information, with an image sequence, an acoustic characteristic or a spectrogram characteristic as the fusion direction, to obtain the fusion information.
The recognition models extract key frames from the video data and perform preprocessing such as image segmentation, facial feature alignment and data enhancement; the audio data are sliced, denoised and augmented based on spectral features; feature extraction is then performed on the voice/image data by pre-trained deep learning models to obtain the voice/image recognition results. When voice information (i.e., useful information) is recognized, fusion is performed with the acoustic characteristic or spectrogram characteristic as the fusion direction; when no voice information is recognized, fusion is performed with the image sequence as the fusion direction. The fused recognition result is obtained through a pre-trained multi-modal behavior recognition model.
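The fusion-direction rule just described can be read as a simple switch; the confidence threshold below is an assumption, not a value given in the patent:

    def choose_fusion_direction(voice_confidence: float,
                                threshold: float = 0.5) -> str:
        """Fuse along acoustic/spectrogram features when useful voice
        information was recognized, otherwise along the image sequence."""
        if voice_confidence >= threshold:  # voice information recognized
            return "acoustic_or_spectrogram"
        return "image_sequence"            # no usable voice information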
Step 103: and performing behavior recognition based on the fusion information, and matching and detecting the running state data of the vehicle based on the behavior recognition result.
From the recognition result of step 102, the fusion information includes at least one piece of behavior data; there may be only one, or two or more, in which case only one is useful and the others are redundant. For example, if it is recognized that the driver's emotional state is poor while the passengers' emotions are normal, the recognition result targets only the driver, and the vehicle can be controlled accordingly to play soothing music.
Detecting the vehicle's running state based on the recognition result can provide a basis for vehicle control decisions. For example, a driver's dozing behavior affects driving and is obviously dangerous at high speed; the states of the vehicle's air conditioner, windows, entertainment functions and fragrance system are detected so that autonomous control or reminders can be issued in time.
Each piece of behavior data is obtained from at least one of the voice characteristic information, the text characteristic information and the image characteristic information. If the fusion information includes several pieces of behavior data, the behavior corresponding to the fusion information is determined from the recognition probability of each piece of behavior data and the number of behavior parameters associated with it. The recognition model may be constrained with a cross-entropy loss function, a consistency-constraint loss function, or the like.
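One plausible reading of "recognition probability and number of associated behavior parameters" is a weighted score over the candidates; the weighting below is an assumption, not the patent's formula:

    def pick_behavior(candidates):
        """candidates: list of (behavior, probability, n_params) tuples,
        where n_params counts the behavior parameters associated with
        that piece of behavior data.  The scoring is illustrative only."""
        return max(candidates, key=lambda c: c[1] * c[2])[0]

    # e.g. pick_behavior([("dozing", 0.8, 3), ("talking", 0.6, 1)]) -> "dozing"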
Given the overlap among these technologies, voice control is widely used in intelligent vehicles, so speech recognition is particularly important. To avoid reducing the response speed and accuracy of speech recognition, whether the voice information contains a control instruction can be judged from the voice text information; if it does not, the video information processing result is given high priority.
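This priority rule can be sketched as a small arbitration step (the predicate contains_instruction is assumed to wrap the control-instruction check described above):

    def arbitrate(voice_text: str, voice_result, video_result,
                  contains_instruction):
        """Keep the voice path authoritative when the transcript holds a
        control instruction (fast response); otherwise give the video
        information processing result high priority."""
        if contains_instruction(voice_text):
            return voice_result
        return video_result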
Obtaining the voice characteristic information, the text characteristic information and the image characteristic information from the voice information and the video information allows control instructions to be identified in time from the text characteristic information, increasing the response speed. Performing multi-modal feature fusion on the voice, text and image characteristic information and recognizing behavior from the fusion information improves response speed, accuracy and system fault tolerance. Realizing behavior recognition and control with one multi-modal recognition model reduces development and maintenance costs: training a single multi-modal fusion recognition model compares favorably with training several single-modality recognition models.
Correspondingly, as shown in fig. 3, the present invention further provides a behavior recognition system based on multi-modal feature fusion, including:
the acquiring module 31 is configured to acquire voice information, video information and data of a running state of a vehicle, where the video information includes face data, road condition and environment data, and limb data.
The fusion module 32 is configured to process the voice information and the video information respectively to obtain voice feature information, text feature information, and image feature information; and respectively inputting the voice characteristic information, the text characteristic information and the image characteristic information into corresponding classifiers, and fusing results output by the plurality of classifiers to form fused information.
And the identification module 33 is used for performing behavior identification based on the fusion information and matching and detecting the running state data of the vehicle based on the behavior identification result.
Further, the fusion module 32 includes:
the voice recognition module is used for processing the voice information and at least acquiring identity information, emotional state information, voiceprint characteristics and voice text information of a driver and a passenger;
the video recognition module is used for processing the video information and at least acquiring facial information, age information, identity information, gender information, emotional state information, body posture information, gesture information, lip information, sight line information and road condition and environment information of the driver and passengers;
and judging whether the voice information contains a control instruction according to the voice text information, and if it does not, giving high priority to the video information processing result.
The fusion module 32 further comprises:
and the fusion subunit is used for fusing, by the multi-modal recognition model, the voice characteristic information, the text characteristic information and the image characteristic information, with an image sequence, an acoustic characteristic or a spectrogram characteristic as the fusion direction, to obtain the fusion information.
The identification module 33 includes:
the identification subunit is used for acquiring the fusion information which comprises at least one behavior data, wherein each behavior data is obtained at least according to one of the voice characteristic information, the text characteristic information and the image characteristic information; and if the fusion information comprises a plurality of behavior data, determining the behavior corresponding to the fusion information according to the identification probability of each behavior data and the behavior parameter quantity associated with the behavior data.
The functions of the modules may refer to fig. 1 and the description of the related embodiments, and are not described again.
As shown in fig. 4, the present invention further provides a vehicle control method based on multi-modal feature fusion, applied to a vehicle, including:
step 401: according to the behavior recognition method based on the multi-modal feature fusion, the behavior recognition result is obtained.
In this embodiment, the control of the vehicle is not limited to the data described with fig. 1 and the related embodiments; it may also draw on the vehicle's geographical position, travel route, traffic state of the navigation route, and the like. Since the vehicle control system can acquire many kinds of vehicle data, its data dimensions are broader, and the vehicle can be controlled better on that basis. For the rest, reference may be made to fig. 1 and the description of the related embodiments.
Step 402: and analyzing the recognition result, and determining a driving scene matched with the recognition result, wherein different driving scenes are associated with different functions of the vehicle.
As a common vehicle control mode, several vehicle functions are controlled in linkage with a driving scene. For example, turning on a certain driving scene activates the associated vehicle functions, driving mode, and so on.
Step 403: starting the driving scene, controlling the vehicle functions related to the driving scene to start and closing the vehicle functions not related to the driving scene; the driving scene includes at least one scene mode.
A scene mode is a preset vehicle running state or a preset control state of a vehicle function. A driving scene is a vehicle control scheme formulated according to the user's needs; when a driving scene is started, any prior vehicle state that cannot meet the user's needs can be shut off. Each driving scene is associated with one or more vehicle function states; activating the scene triggers detection of the corresponding function states, and if an associated function is already active, no corresponding control instruction is sent.
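A sketch of this activation logic, with the state representation and CCP call assumed for illustration:

    def start_scene(scene_functions: dict, vehicle_state: dict, ccp) -> None:
        """scene_functions maps each function name to the on/off state the
        driving scene wants (functions not associated with the scene are
        mapped to off).  Instructions are only sent for functions not
        already in the desired state."""
        for func, wanted in scene_functions.items():
            if vehicle_state.get(func) != wanted:
                ccp.execute(func, wanted)  # assumed CCP control call
            # otherwise the function is already in the desired state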
The method is applied to vehicles in which, for example, a central computing platform (CCP), sensors and actuators form the core hardware units, respectively providing: the business logic capability for intelligent driving, infotainment and body control; the collection of target-object feature data and its conversion into CCP-recognizable signals; and the execution of control instructions issued by the CCP. Control/data flows are established between the sensors and the CCP over Ethernet/LVDS, carrying sensor control instructions and target-object data (picture/video/text). The intelligent-driving, entertainment and body-control domains establish control/data flows over the inter-chip bus and exchange the computation results their domains depend on (subject to the specific algorithm deployment strategy). Control/data flows between the CCP and the actuators are established over CAN/LIN/LVDS/I2S/I2C, carrying commands and data such as picture display, multimedia playback and body control.
The scheme is based on the signal collector, signal processor and target execution controller integrated in the central computing platform, and forms executable scenes and controller instructions by acquiring, post-processing and fusing driver and passenger behavior information. Common scenes include a driver fatigue scene, a scene fusing the intelligent cabin with automatic driving, and an owner travel scene.
When the CCP obtains the driver's facial expression data from the in-cabin vision detection camera, it processes the data and checks the judgment conditions: no blinking for a long time, no eyeball movement, and no obvious displacement of the facial contour. Fusing the sound data, it continuously monitors the data slices for intervals with no active signal. Once the conditions for recognizing driver fatigue are met, the CCP obtains the states of the air conditioner, windows, wipers, entertainment screen and ambient-light controller, confirms that air-conditioner ventilation, the windows, entertainment music and the ambient-light controller are executable, and also fuses the wiper rain sensing (a no-rain state is detected). When the window-opening condition is met, the sunroof and windows are opened, the air conditioner blows, music plays, and the ambient light flashes. Meanwhile, with the automatic driving mode confirmed executable, the user is asked whether to switch to it.
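The fatigue condition just described can be sketched as a predicate; all thresholds and field names below are assumptions for illustration:

    def fatigue_detected(face, audio_slices) -> bool:
        """Fatigue check per the scene above: a long interval without
        blinking or eyeball movement, a stable facial contour, and no
        active voice signal in the fused audio slices."""
        visual = (face.seconds_since_blink > 30     # assumed threshold
                  and not face.eyeball_moving
                  and not face.contour_shifted)
        silent = all(s.active_signal is None for s in audio_slices)
        return visual and silent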
When the CCP obtains in-cabin MIC and camera data and, after processing and analysis, the audio and video data match, satisfying conditions such as multiple voices and multiple faces in the vehicle (for example, co-driver or rear-seat data matching the opposite sex, the elderly or children), the CCP obtains the entertainment-screen data. If the driving destination and route are not in a common pattern (for example, not a commuting or homebound section), it asks the user whether to enter a long-distance travel mode and sets the corresponding mode controller to an executable state.
Corresponding to fig. 4, as shown in fig. 5, the present invention further provides a vehicle control system based on multi-modal feature fusion, applied to a vehicle, including:
the behavior recognition module 51 is used for obtaining a behavior recognition result according to the behavior recognition method based on the multi-modal feature fusion provided by the invention;
a scene determining module 52, configured to analyze the recognition result, and determine a driving scene matching the recognition result, where different driving scenes are associated with different functions of the vehicle;
and the control module 53 is configured to start the driving scene, control the vehicle functions associated with the driving scene to start, and close the vehicle functions not associated with the driving scene.
The invention also provides a vehicle, including an automobile, which may be a new-energy vehicle or a fuel vehicle and is not limited to passenger cars; transport vehicles and the like may also be included. The vehicle comprises a processor, a memory and a computer program stored on the memory and executable on the processor; when executed by the processor, the computer program implements the steps of the multi-modal feature fusion vehicle control method and/or the multi-modal feature fusion behavior recognition method described above.
The present invention also provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the multimodal feature fusion vehicle control method and/or the multimodal feature fusion behavior recognition method as described above.
It is understood that the computer program includes computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), a software distribution medium, and the like.
In some embodiments of the present invention, the apparatus may include a controller, the controller being a single chip integrating a processor, a memory, a communication module, and the like. The processor may refer to the processor included in the controller, and may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. Alternative implementations are included within the scope of the preferred embodiments of the present invention, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both; to illustrate the interchangeability of hardware and software clearly, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A behavior recognition method based on multi-modal feature fusion is applied to a vehicle, and is characterized by comprising the following steps:
acquiring voice information, video information and vehicle running state data, wherein the video information comprises face data, road condition and environment data and limb data;
processing the voice information and the video information respectively to obtain voice characteristic information, text characteristic information and image characteristic information; respectively inputting the voice characteristic information, the text characteristic information and the image characteristic information into corresponding classifiers, and fusing results output by the plurality of classifiers to form fused information;
and performing behavior recognition based on the fusion information, and matching and detecting the running state data of the vehicle based on the behavior recognition result.
2. The behavior recognition method according to claim 1, wherein the processing the voice information and the video information to obtain voice feature information, text feature information, and image feature information respectively comprises:
processing the voice information by using a voice recognition model, and at least acquiring identity information, emotional state information, voiceprint characteristics and voice text information of the driver and passengers;
processing the video information by using a video recognition model, and at least acquiring facial information, age information, identity information, gender information, emotional state information, body posture information, gesture information, lip information, sight line information and road condition and environment information of a driver and a passenger;
and judging whether the voice information contains a control instruction according to the voice text information, and if it does not, giving high priority to the video information processing result.
3. The behavior recognition method according to claim 2, wherein the fusing the results of the plurality of classifier outputs to form fused information includes:
and fusing, by a multi-modal recognition model, the voice characteristic information, the text characteristic information and the image characteristic information, with an image sequence, an acoustic characteristic or a spectrogram characteristic as the fusion direction, to obtain the fusion information.
4. The behavior recognition method according to claim 1, wherein the performing behavior recognition based on the fusion information includes:
the fusion information comprises at least one behavior data, and each behavior data is obtained at least according to one of the voice characteristic information, the text characteristic information and the image characteristic information; and if the fusion information comprises a plurality of behavior data, determining the behavior corresponding to the fusion information according to the identification probability of each behavior data and the behavior parameter quantity associated with the behavior data.
5. A behavior recognition system based on multi-modal feature fusion is applied to a vehicle and is characterized by comprising:
an acquisition module, used for acquiring voice information, video information and vehicle running state data, wherein the video information comprises face data, road condition and environment data, and limb data;
the fusion module is used for respectively processing the voice information and the video information to obtain voice characteristic information, text characteristic information and image characteristic information; respectively inputting the voice characteristic information, the text characteristic information and the image characteristic information into corresponding classifiers, and fusing results output by the plurality of classifiers to form fused information;
and the identification module is used for performing behavior identification based on the fusion information and matching and detecting the running state data of the vehicle based on a behavior identification result.
6. The behavior recognition system of claim 5, wherein the fusion module comprises:
the voice recognition module is used for processing the voice information and at least acquiring identity information, emotional state information, voiceprint characteristics and voice text information of the driver and passengers;
the video recognition module is used for processing the video information and at least acquiring facial information, age information, identity information, gender information, emotional state information, body posture information, gesture information, lip information, sight line information and road condition and environment information of the driver and passengers;
and judging whether the voice information contains a control instruction according to the voice text information, and if it does not, giving high priority to the video information processing result.
7. The behavior recognition system of claim 5, wherein the fusion module further comprises:
and the fusion subunit is used for fusing, by the multi-modal recognition model, the voice characteristic information, the text characteristic information and the image characteristic information, with an image sequence, an acoustic characteristic or a spectrogram characteristic as the fusion direction, to obtain the fusion information.
8. The behavior recognition system of claim 5, wherein the recognition module comprises:
the identification subunit is used for acquiring the fusion information which comprises at least one behavior data, wherein each behavior data is obtained at least according to one of the voice characteristic information, the text characteristic information and the image characteristic information; and if the fusion information comprises a plurality of behavior data, determining the behavior corresponding to the fusion information according to the identification probability of each behavior data and the behavior parameter quantity associated with the behavior data.
9. A vehicle control method based on multi-modal feature fusion is applied to a vehicle, and is characterized by comprising the following steps:
obtaining a behavior recognition result according to the behavior recognition method based on multi-modal feature fusion of any one of claims 1 to 4;
analyzing the recognition result, and determining a driving scene matched with the recognition result, wherein different driving scenes are associated with different functions of the vehicle;
starting the driving scene, controlling the vehicle functions related to the driving scene to start and closing the vehicle functions not related to the driving scene; the driving scene includes at least one scene mode.
10. A vehicle control system based on multi-modal feature fusion is applied to a vehicle and is characterized by comprising:
a behavior recognition module, configured to obtain a behavior recognition result according to the behavior recognition method based on multi-modal feature fusion of any one of claims 1 to 4;
the scene determining module is used for analyzing the recognition result and determining a driving scene matched with the recognition result, wherein different driving scenes are associated with different functions of the vehicle;
and the control module is used for starting the driving scene, controlling the vehicle functions related to the driving scene to start and closing the vehicle functions which are not related to the driving scene.
11. A vehicle comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method of any one of claims 1 to 4 or claim 9.
12. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 4 or claim 9.
CN202210641293.6A 2022-06-08 2022-06-08 Behavior recognition method and system based on multi-modal feature fusion Pending CN115205729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210641293.6A CN115205729A (en) Behavior recognition method and system based on multi-modal feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210641293.6A CN115205729A (en) Behavior recognition method and system based on multi-modal feature fusion

Publications (1)

Publication Number Publication Date
CN115205729A true CN115205729A (en) 2022-10-18

Family

ID=83575884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210641293.6A Pending CN115205729A (en) 2022-06-08 2022-06-08 Behavior recognition method and system based on multi-mode feature fusion

Country Status (1)

Country Link
CN (1) CN115205729A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115509366A (en) * 2022-11-21 2022-12-23 科大讯飞股份有限公司 Intelligent cabin multi-modal man-machine interaction control method and device and electronic equipment
CN116258466A (en) * 2023-05-15 2023-06-13 国网山东省电力公司菏泽供电公司 Multi-mode power scene operation specification detection method, system, equipment and medium
CN116258466B (en) * 2023-05-15 2023-10-27 国网山东省电力公司菏泽供电公司 Multi-mode power scene operation specification detection method, system, equipment and medium
CN116976821A (en) * 2023-08-03 2023-10-31 广东企企通科技有限公司 Enterprise problem feedback information processing method, device, equipment and medium
CN116976821B (en) * 2023-08-03 2024-02-13 广东企企通科技有限公司 Enterprise problem feedback information processing method, device, equipment and medium
CN117351257A (en) * 2023-08-24 2024-01-05 长江水上交通监测与应急处置中心 Multi-mode information-based shipping data extraction method and system
CN117351257B (en) * 2023-08-24 2024-04-02 长江水上交通监测与应急处置中心 Multi-mode information-based shipping data extraction method and system

Similar Documents

Publication Publication Date Title
CN115205729A (en) Behavior recognition method and system based on multi-modal feature fusion
Alkinani et al. Detecting human driver inattentive and aggressive driving behavior using deep learning: Recent advances, requirements and open challenges
US10908677B2 (en) Vehicle system for providing driver feedback in response to an occupant's emotion
CN108995654B (en) Driver state identification method and system
US11042766B2 (en) Artificial intelligence apparatus and method for determining inattention of driver
US20200057487A1 (en) Methods and systems for using artificial intelligence to evaluate, correct, and monitor user attentiveness
US11107464B2 (en) Feeling estimation device, feeling estimation method, and storage medium
JP2022547479A (en) In-vehicle digital human-based interaction
CN110765807A (en) Driving behavior analysis method, driving behavior processing method, driving behavior analysis device, driving behavior processing device and storage medium
CN113723528B (en) Vehicle-mounted language-vision fusion multi-mode interaction method and system, equipment and storage medium
WO2019044427A1 (en) Assist method, and assist system and assist device using same
US20230077245A1 (en) Artificial intelligence apparatus
Rong et al. Artificial intelligence methods in in-cabin use cases: a survey
WO2021067380A1 (en) Methods and systems for using artificial intelligence to evaluate, correct, and monitor user attentiveness
US11468247B2 (en) Artificial intelligence apparatus for learning natural language understanding models
CN111797755A (en) Automobile passenger emotion recognition method and electronic equipment
CN115195637A (en) Intelligent cabin system based on multimode interaction and virtual reality technology
CN113386775B (en) Driver intention identification method considering human-vehicle-road characteristics
Meng et al. Application and development of AI technology in automobile intelligent cockpit
WO2023159536A1 (en) Human-computer interaction method and apparatus, and terminal device
WO2021196751A1 (en) Digital human-based vehicle cabin interaction method, apparatus and vehicle
Karas et al. Audiovisual Affect Recognition for Autonomous Vehicles: Applications and Future Agendas
US20230290342A1 (en) Dialogue system and control method thereof
CN114327079B (en) Test driving effect presentation device, method and storage medium
CN114296680B (en) Virtual test driving device, method and storage medium based on facial image recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination