WO2023273063A1

WO2023273063A1 - Passenger speaking detection method and apparatus, and electronic device and storage medium

Info

Publication number: WO2023273063A1
Application number: PCT/CN2021/127096
Authority: WO
Inventors: 王飞; 钱晨
Original assignee: 上海商汤临港智能科技有限公司
Priority date: 2021-06-30
Filing date: 2021-10-28
Publication date: 2023-01-05
Also published as: CN113488043B; CN113488043A; JP2024505968A

Abstract

A passenger speaking detection method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring a video stream and a sound signal in a vehicle cabin (S11); performing facial detection on the video stream, and determining a facial area, in the video stream, of at least one passenger in the vehicle cabin (S12); and according to the facial area of the at least one passenger and the sound signal, determining a target passenger in the vehicle cabin that produces the sound signal (S13).

Description

Passenger speech detection method and device, electronic equipment and storage medium

This disclosure claims priority to Chinese patent application No. 202110738677.5, titled "Occupant Speech Detection Method and Device, Electronic Equipment and Storage Medium", filed with the China Patent Office on June 30, 2021, the entire contents of which are incorporated into this disclosure by reference.

Technical field

The present disclosure relates to the field of computer technology, and in particular to a method and device for detecting occupant speech, electronic equipment, and a storage medium.

Background art

Cabin intelligence includes multi-mode interaction, personalized service, safety perception, etc., which is an important direction for the current development of the automotive industry. The multi-mode interaction in the cabin is intended to provide passengers with a comfortable interactive experience. The means of multi-mode interaction include voice recognition, gesture recognition, etc. Among them, speech recognition occupies a significant market share in the field of vehicle interaction.

However, there are multiple sound sources in the cabin, such as audio playback, driving noise, and noise from outside the cabin, all of which strongly interfere with speech recognition.

Summary of the invention

The present disclosure proposes a technical solution for occupant speech detection.

According to an aspect of the present disclosure, a method for detecting occupant speech is provided, including: acquiring a video stream and a sound signal in the cabin; performing face detection on the video stream, and determining the face area, in the video stream, of at least one occupant in the cabin; and determining, according to the face area of each occupant and the sound signal, the target occupant in the cabin who sends out the sound signal.

In a possible implementation manner, the method further includes: performing content recognition on the sound signal, and determining the voice content corresponding to the sound signal; and, when the voice content includes a preset voice command, executing the control function corresponding to the voice command.

In a possible implementation manner, the executing, when the voice content includes a preset voice command, the control function corresponding to the voice command includes: when the voice command corresponds to a plurality of directional control functions, determining the gaze direction of the target occupant according to the face area of the target occupant; determining a target control function from the plurality of control functions according to the gaze direction of the target occupant; and executing the target control function.
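By way of illustration only, the selection of a target control function from the gaze direction might be sketched as follows. The representation of the gaze direction as a yaw angle in degrees (negative meaning left), the dead zone, and the names `WINDOW_FUNCTIONS` and `select_target_function` are assumptions made for this sketch and are not specified by the present disclosure.

```python
# Hypothetical mapping for one voice command ("open the window") that
# corresponds to several directional control functions.
WINDOW_FUNCTIONS = {
    "left": "open_left_window",
    "right": "open_right_window",
}


def select_target_function(gaze_yaw_deg, functions, dead_zone_deg=10.0):
    """Pick one control function from `functions` using the gaze direction.

    gaze_yaw_deg: assumed gaze yaw angle; negative = looking left,
                  positive = looking right.
    Returns None when the gaze is too close to straight ahead to decide.
    """
    if gaze_yaw_deg <= -dead_zone_deg:
        return functions["left"]
    if gaze_yaw_deg >= dead_zone_deg:
        return functions["right"]
    return None


if __name__ == "__main__":
    print(select_target_function(-25.0, WINDOW_FUNCTIONS))
```

Returning `None` for an ambiguous gaze leaves it to the caller to fall back to another strategy, such as asking the occupant to confirm.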

In a possible implementation manner, the video stream includes a first video stream of the driver's area; the determining the face area of at least one occupant in the cabin in the video stream includes: determining the face area of the driver in the cabin in the first video stream; and the determining, according to the face area of each occupant and the sound signal, the target occupant who sends out the sound signal in the cabin includes: determining, according to the face area of the driver and the sound signal, whether the target occupant who sends out the sound signal in the cabin is the driver.

In a possible implementation manner, the video stream includes a second video stream of the occupant area; and the determining, according to the face area of each occupant and the sound signal, the target occupant who sends out the sound signal in the cabin includes: for the face area of each occupant, determining, according to the face area and the sound signal, whether the target occupant who sends out the sound signal in the cabin is the occupant corresponding to the face area.

In a possible implementation manner, the method further includes: determining the seating area of the target occupant according to the video stream; performing content recognition on the sound signal, and determining the voice content corresponding to the sound signal; when the voice content includes a preset voice command, determining, according to the seating area of the target occupant, an area control function corresponding to the voice command; and executing the area control function.
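As a non-limiting sketch, the determination of an area control function from the recognized voice content and the seating area of the target occupant could be expressed as a table lookup. The command text, the seating-area labels, the function identifiers and the left-hand-drive layout below are all hypothetical values chosen for illustration.

```python
# Hypothetical table: (preset voice command, seating area) -> area control function.
# A left-hand-drive cabin is assumed, so the driver sits front-left.
AREA_CONTROL_FUNCTIONS = {
    ("open the window", "driver seat"): "open_front_left_window",
    ("open the window", "front passenger seat"): "open_front_right_window",
}
PRESET_COMMANDS = {command for command, _ in AREA_CONTROL_FUNCTIONS}


def area_control_function(voice_content, seating_area):
    """Return the area control function for the speaker's seat, or None.

    None is returned when the voice content contains no preset command or
    when no function is registered for the given seating area.
    """
    for command in PRESET_COMMANDS:
        if command in voice_content:
            return AREA_CONTROL_FUNCTIONS.get((command, seating_area))
    return None


if __name__ == "__main__":
    print(area_control_function("please open the window", "driver seat"))
```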

In a possible implementation manner, the determining, according to the face area of each occupant and the sound signal, the target occupant who sends out the sound signal in the cabin includes: determining a video frame sequence in the video stream corresponding to the time period of the sound signal; for the face area of each occupant, performing feature extraction on the occupant's face area in the video frame sequence to obtain the occupant's facial features; determining the fusion feature of the occupant according to the facial features and the voice features extracted from the sound signal; determining the speech detection result of the occupant according to the fusion feature; and determining, according to the speech detection result of each occupant, the target occupant who sends out the sound signal.

In a possible implementation manner, the performing feature extraction on the occupant's face area in the video frame sequence includes: performing feature extraction on the occupant's face area in each of the N video frames of the video frame sequence to obtain N facial features of the occupant; and the voice features are extracted in the following manner: segmenting the sound signal and extracting speech features according to the acquisition times of the N video frames, to obtain N voice features respectively corresponding to the N video frames.

In a possible implementation manner, the segmenting the sound signal and extracting speech features according to the acquisition times of the N video frames, to obtain N voice features respectively corresponding to the N video frames, includes: segmenting the sound signal according to the acquisition times of the N video frames to obtain N speech frames respectively corresponding to the N video frames, where the acquisition time of the nth video frame among the N video frames lies within the time period corresponding to the nth speech frame, n is an integer, and 1≤n≤N; and performing speech feature extraction on the N speech frames respectively to obtain the N voice features.

In a possible implementation manner, the segmenting the sound signal according to the acquisition times of the N video frames to obtain N speech frames respectively corresponding to the N video frames includes: determining, according to the acquisition times of the N video frames, the time window length and the moving step of the time window used to segment the sound signal, the moving step being less than the time window length; for the nth speech frame, moving the time window according to the moving step and determining the time period corresponding to the nth speech frame; and segmenting the nth speech frame from the sound signal according to the time period corresponding to the nth speech frame.
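A minimal sketch of such a sliding-window segmentation is given below for illustration. It assumes evenly spaced video frames, takes the moving step to be the interval between consecutive video frames, and centres each window on the corresponding acquisition time; these choices, and the name `segment_sound_signal`, are assumptions of the sketch rather than requirements of the present disclosure.

```python
def segment_sound_signal(samples, sample_rate, frame_times, window_len):
    """Cut `samples` into one speech frame per video frame.

    samples:     audio samples; samples[0] is taken as time 0 s
    sample_rate: samples per second
    frame_times: acquisition times (s) of the N video frames, evenly spaced
    window_len:  time-window length (s); must exceed the moving step
    Returns a list of (first_index, end_index, chunk), one per video frame.
    """
    n_frames = len(frame_times)
    if n_frames < 2:
        raise ValueError("at least two video frames are required")
    # Moving step of the window = interval between consecutive video frames.
    step = (frame_times[-1] - frame_times[0]) / (n_frames - 1)
    if step >= window_len:
        raise ValueError("the moving step must be less than the window length")
    speech_frames = []
    for n in range(n_frames):
        # Move the window by n steps so that it is centred on the nth frame.
        centre = frame_times[0] + n * step
        lo = max(0, round((centre - window_len / 2) * sample_rate))
        hi = min(len(samples), round((centre + window_len / 2) * sample_rate))
        speech_frames.append((lo, hi, samples[lo:hi]))
    return speech_frames


if __name__ == "__main__":
    audio = [0] * 32000  # 2 s of silence at 16 kHz
    times = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9]
    for lo, hi, _ in segment_sound_signal(audio, 16000, times, 0.4):
        print(lo, hi)
```

Because the step is shorter than the window, adjacent speech frames overlap, and each video frame's acquisition time falls inside its own speech frame.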

In a possible implementation manner, the determining the fusion feature of the occupant according to the facial features and the voice features includes: fusing the N facial features with the N voice features in one-to-one correspondence to obtain N sub-fusion features; and splicing the N sub-fusion features to obtain the fusion feature of the occupant.
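For illustration, the one-to-one fusion followed by splicing might look like the sketch below. The disclosure does not fix the per-pair fusion operator at this point; concatenation is used here purely as an assumption, and `fuse_features` is a hypothetical name.

```python
def fuse_features(face_features, voice_features):
    """Fuse N facial features with N voice features, then splice the results.

    Each feature is a sequence of numbers. The nth facial feature is fused
    with the nth voice feature (here: by concatenation, an assumption), and
    the N sub-fusion features are spliced in order into one flat vector.
    """
    if len(face_features) != len(voice_features):
        raise ValueError("facial and voice features must both have length N")
    sub_fusion = [list(f) + list(v) for f, v in zip(face_features, voice_features)]
    fused = []
    for sub in sub_fusion:
        fused.extend(sub)
    return fused


if __name__ == "__main__":
    print(fuse_features([[1, 2], [3, 4]], [[10], [20]]))
```

Keeping the sub-fusion features in frame order preserves the temporal alignment between mouth movement and sound that a downstream classifier would rely on.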

According to an aspect of the present disclosure, there is provided an occupant speaking detection device, including: a signal acquisition module, configured to acquire a video stream and a sound signal in the cabin; a face detection module, configured to perform face detection on the video stream and determine the face area of at least one occupant in the cabin in the video stream; and an occupant determination module, configured to determine, according to the face area of each occupant and the sound signal, the target occupant in the cabin who sends out the sound signal.

In a possible implementation manner, the device further includes: a first recognition module, configured to perform content recognition on the sound signal and determine the voice content corresponding to the sound signal; and a function execution module, configured to execute, when the voice content includes a preset voice command, the control function corresponding to the voice command.

In a possible implementation manner, the function execution module is configured to: when the voice command corresponds to a plurality of directional control functions, determine the gaze direction of the target occupant according to the face area of the target occupant; determine a target control function from the plurality of control functions according to the gaze direction of the target occupant; and execute the target control function.

In a possible implementation manner, the video stream includes a first video stream of the driver's area; the face detection module is configured to: determine the face area of the driver in the cabin in the first video stream; and the occupant determination module is configured to: determine, according to the driver's face area and the sound signal, whether the target occupant who sends out the sound signal in the cabin is the driver.

In a possible implementation manner, the video stream includes a second video stream of the occupant area; and the occupant determination module is configured to: for the face area of each occupant, determine, according to the face area and the sound signal, whether the target occupant who sends out the sound signal in the cabin is the occupant corresponding to the face area.

In a possible implementation manner, the device further includes: a seating area determination module, configured to determine the seating area of the target occupant according to the video stream; a second recognition module, configured to perform content recognition on the sound signal and determine the voice content corresponding to the sound signal; a function determination module, configured to determine, when the voice content includes a preset voice command, an area control function corresponding to the voice command according to the seating area of the target occupant; and an area control module, configured to execute the area control function.

In a possible implementation manner, the occupant determination module is configured to: determine the video frame sequence in the video stream corresponding to the time period of the sound signal; for the face area of each occupant, perform feature extraction on the occupant's face area in the video frame sequence to obtain the occupant's facial features; determine the fusion feature of the occupant according to the facial features and the voice features extracted from the sound signal; determine the occupant's speech detection result according to the fusion feature; and determine, according to the speech detection result of each occupant, the target occupant who sends out the sound signal.

In a possible implementation manner, the feature extraction performed by the occupant determination module on the occupant's face area in the video frame sequence includes: performing feature extraction on the occupant's face area in each of the N video frames of the video frame sequence to obtain N facial features of the occupant; and the voice features are extracted by the occupant determination module in the following manner: segmenting the sound signal and extracting speech features according to the acquisition times of the N video frames, to obtain N voice features respectively corresponding to the N video frames.

In a possible implementation manner, the segmentation and speech feature extraction performed by the occupant determination module on the sound signal according to the acquisition times of the N video frames, to obtain N voice features respectively corresponding to the N video frames, includes: segmenting the sound signal according to the acquisition times of the N video frames to obtain N speech frames respectively corresponding to the N video frames, where the acquisition time of the nth video frame among the N video frames lies within the time period corresponding to the nth speech frame, n is an integer, and 1≤n≤N; and performing speech feature extraction on the N speech frames respectively to obtain the N voice features.

In a possible implementation manner, the segmentation performed by the occupant determination module on the sound signal according to the acquisition times of the N video frames, to obtain N speech frames respectively corresponding to the N video frames, includes: determining, according to the acquisition times of the N video frames, the time window length and the moving step of the time window used to segment the sound signal, the moving step being less than the time window length; for the nth speech frame, moving the time window according to the moving step and determining the time period corresponding to the nth speech frame; and segmenting the nth speech frame from the sound signal according to the time period corresponding to the nth speech frame.

In a possible implementation manner, the determination by the occupant determination module of the fusion feature of the occupant according to the facial features and the voice features includes: fusing the N facial features with the N voice features in one-to-one correspondence to obtain N sub-fusion features; and splicing the N sub-fusion features to obtain the fusion feature of the occupant.

According to an aspect of the present disclosure, there is provided an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to call the instructions stored in the memory to execute the above-mentioned method.

According to one aspect of the present disclosure, there is provided a computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the above method is implemented.

According to one aspect of the present disclosure, a computer program is provided, including computer readable codes, and when the computer readable codes are run in an electronic device, a processor in the electronic device executes the above method.

In the embodiments of the present disclosure, a video stream and a sound signal in the cabin can be acquired; face detection is performed on the video stream to determine the face area of at least one occupant in the cabin in the video stream; and the target occupant who sends out the sound signal is determined from among the occupants according to the face area of each occupant and the sound signal. Judging whether an occupant is speaking according to both the face area and the sound signal can improve the accuracy of occupant speech detection and reduce the false alarm rate of speech recognition.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Brief description of the drawings

The accompanying drawings here are incorporated into the description and constitute a part of the present description. These drawings show embodiments consistent with the present disclosure, and are used together with the description to explain the technical solution of the present disclosure.

Fig. 1 shows a flowchart of a method for detecting occupant speaking according to an embodiment of the present disclosure.

Fig. 2 shows a schematic diagram of a speaking detection process of an embodiment of the present disclosure.

Fig. 3 shows a block diagram of an occupant speaking detection device according to an embodiment of the present disclosure.

Fig. 4 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed description

Various exemplary embodiments, features, and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numbers in the figures indicate functionally identical or similar elements. While various aspects of the embodiments are shown in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as superior or better than other embodiments.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean three cases: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of multiple items or any combination of at least two of them. For example, "including at least one of A, B, and C" can mean including any one or more elements selected from the set formed by A, B, and C.

In addition, in order to better illustrate the present disclosure, numerous specific details are given in the following specific implementation manners. It will be understood by those skilled in the art that the present disclosure may be practiced without some of the specific details. In some instances, methods, means, components and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the gist of the present disclosure.

In vehicle voice interaction, the voice detection function usually runs in real time in the vehicle, and the false alarm rate of the voice detection function needs to be kept at a very low level. In related technologies, a signal detection method based on pure voice is usually used, and it is difficult to suppress voice false alarms, resulting in a high false alarm rate and poor user interaction experience.

According to the occupant speech detection method of the embodiments of the present disclosure, the video images and the sound signal can be fused multimodally to identify the occupant in the cabin who is in the speaking state, thereby improving the accuracy of occupant speech detection, reducing the false alarm rate of speech recognition, and improving the user interaction experience.

The occupant speaking detection method according to an embodiment of the present disclosure may be performed by an electronic device such as a terminal device or a server. The terminal device may be a vehicle-mounted device, user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a wearable device, etc. The method can be implemented by a processor calling computer-readable instructions stored in a memory.

The vehicle-mounted device may be the head unit, a domain controller or a processor in the cabin, or may be a device host in a DMS (Driver Monitoring System) or an OMS (Occupant Monitoring System) that performs data processing operations such as image processing.

Fig. 1 shows a flowchart of a method for detecting occupant speaking according to an embodiment of the present disclosure. As shown in Fig. 1, the method for detecting occupant speaking includes:

In step S11, the video stream and sound signal in the cabin are obtained;

In step S12, face detection is performed on the video stream to determine the face area of at least one occupant in the cabin in the video stream;

In step S13, according to the face area of each occupant and the sound signal, the target occupant in the cabin who sends out the sound signal is determined.

For example, embodiments of the present disclosure may be applied to any type of vehicle, such as passenger cars, taxis, shared cars, buses, freight vehicles, subways, trains, and the like.

In a possible implementation manner, in step S11, the video stream in the vehicle cabin may be collected through the vehicle camera, and the sound signal may be collected through the vehicle microphone. Wherein, the vehicle-mounted camera can be any camera installed in the vehicle, the number can be one or more, and the type can be DMS camera, OMS camera, common camera, etc. The vehicle-mounted microphone can also be arranged at any position in the vehicle, and the number can be one or more. The present disclosure does not limit the location, quantity and type of the vehicle-mounted camera and the vehicle-mounted microphone.

In a possible implementation manner, in step S12, face detection may be performed on the video stream. Face detection can be performed directly on the video frame sequence of the video stream to determine the face frame in each video frame; alternatively, the video frame sequence can be sampled and face detection performed on the sampled video frames to determine the face frame in each sampled video frame. The present disclosure does not limit the specific processing manner.

In a possible implementation, the face frame in each video frame can be tracked to determine the face frames belonging to the same occupant, so as to determine the face area of at least one occupant in the cabin in the video stream.

The method of face detection can be, for example, facial key point recognition, face contour detection, etc. Those skilled in the art should understand that face detection and tracking can be implemented in any manner in the related art, which is not limited in the present disclosure.
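Since the tracking manner is not limited, one simple possibility is sketched below for illustration: greedy association of face frames between consecutive video frames by intersection-over-union (IoU). This is a swapped-in generic technique, not a method stated in the present disclosure; the box format `(x1, y1, x2, y2)`, the threshold and the names `iou` and `track_faces` are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track_faces(frames_boxes, iou_threshold=0.3):
    """Group per-frame face boxes into tracks, one track per occupant.

    frames_boxes: list over video frames, each a list of face boxes.
    Returns {track_id: [(frame_index, box), ...]}.
    """
    tracks = {}
    last_box = {}  # most recent box of every track
    next_id = 0
    for frame_index, boxes in enumerate(frames_boxes):
        unmatched = set(last_box)  # tracks not yet extended in this frame
        for box in boxes:
            best_id, best_iou = None, iou_threshold
            for track_id in unmatched:
                score = iou(last_box[track_id], box)
                if score >= best_iou:
                    best_id, best_iou = track_id, score
            if best_id is None:
                # No sufficiently overlapping track: start a new one.
                best_id = next_id
                next_id += 1
                tracks[best_id] = []
            else:
                unmatched.discard(best_id)
            tracks[best_id].append((frame_index, box))
            last_box[best_id] = box
    return tracks


if __name__ == "__main__":
    boxes = [[(0, 0, 10, 10), (50, 50, 60, 60)], [(1, 1, 11, 11), (51, 50, 61, 60)]]
    print(track_faces(boxes))
```

Greedy IoU matching is adequate for seated occupants whose faces move little between frames; a production system would likely use a more robust tracker.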

In a possible implementation, there may be faces of one or more occupants (such as the driver and/or passengers) in the video frame of the video stream, and after the processing in step S12, the face area of each occupant can be obtained. In step S13, each occupant can be analyzed separately to determine whether the occupant is talking.

In a possible implementation manner, for any occupant to be analyzed, the face area of the occupant in N video frames of the video stream may be determined, where N is an integer greater than 1. That is, N video frames corresponding to a certain duration (for example, 2s) are selected from the video stream. In the case of real-time detection, the N video frames may be the latest N video frames collected in the video stream. N may be, for example, 10, 15, 20, etc., which is not limited in the present disclosure.

In a possible implementation, the sound signal of the time period corresponding to N video frames can be determined, for example, the time period corresponding to N video frames is the latest 2s (2s ago-now), and the sound signal is also the most recent 2s sound signal.
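As an illustrative sketch of this alignment, the latest N video frames and the sound signal of the same time period could be selected as follows. It assumes that video timestamps and audio samples share one clock starting at 0, that timestamps are integer milliseconds, and that the name `latest_window` is hypothetical.

```python
def latest_window(video_frames, samples, sample_rate, duration_ms, now_ms):
    """Select the video frames and audio samples of the most recent period.

    video_frames: list of (timestamp_ms, frame), sorted by timestamp
    samples:      audio samples; samples[0] was captured at 0 ms
    Returns (frames in (now - duration, now], audio of the same period).
    """
    start_ms = max(0, now_ms - duration_ms)
    frames = [(t, f) for t, f in video_frames if start_ms < t <= now_ms]
    lo = start_ms * sample_rate // 1000
    hi = min(len(samples), now_ms * sample_rate // 1000)
    return frames, samples[lo:hi]


if __name__ == "__main__":
    video = [(i * 200, f"frame{i}") for i in range(26)]  # 5 frames per second
    audio = list(range(5 * 8000))  # 5 s at 8 kHz
    frames, sound = latest_window(video, audio, 8000, 2000, 5000)
    print(len(frames), len(sound))
```

With 5 frames per second and a 2 s duration, the selection yields N = 10 video frames and the matching 2 s of audio.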

In a possible implementation, the images of the occupant's face area in the N video frames and the sound signal can be directly input into a preset speech detection network for processing, and the occupant's speech detection result is output, that is, whether the occupant is speaking or not speaking.

In a possible implementation manner, feature extraction can also be performed on the images of the occupant's face area in the N video frames to obtain face features, and sound feature extraction can be performed on the sound signal to obtain sound features; the face features and the sound features are then input into the preset speech detection network for processing, and the speech detection result of the occupant is output. The present disclosure does not limit the specific processing manner.

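In a possible implementation manner, in step S13, speaking detection can be performed on each occupant separately to determine the speech detection result of each occupant; the target occupant who sends out the sound signal is then determined according to the speech detection results of the occupants.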

According to the embodiments of the present disclosure, a video stream and a sound signal in the cabin can be acquired; face detection is performed on the video stream to determine the face area of at least one occupant in the cabin in the video stream; and the target occupant who sends out the sound signal is determined from among the occupants according to the face area of each occupant and the sound signal. Judging whether an occupant is speaking according to both the face area and the sound signal can improve the accuracy of occupant speech detection and reduce the false alarm rate of speech recognition.

The following is an expanded description of the occupant speech detection method of the embodiment of the present disclosure.

As mentioned above, in step S11, the video stream in the cabin collected by the vehicle camera and the sound signal collected by the vehicle microphone can be obtained.

In a possible implementation manner, the vehicle-mounted camera may include a DMS camera for a driver monitoring system, and/or an OMS camera for an occupant monitoring system. The video stream collected by the DMS camera is the video stream of the driver's area (called the first video stream), and the video stream collected by the OMS camera is the video stream of the occupant area in the cabin (called the second video stream). In this way, the video stream acquired in step S11 may include the first video stream and/or the second video stream.

In a possible implementation, the video stream includes a first video stream of the driver's area; in step S12, determining the face area of at least one occupant in the cabin in the video stream includes:

Determining the face area of the driver in the vehicle cabin in the first video stream.

Wherein, step S13 may include: according to the face area of the driver and the sound signal, determining whether the target occupant who sends out the sound signal in the cabin is the driver.

For example, the first video stream corresponds to the driver's area, which includes only the driver. In this case, a plurality of video frames (referred to as first video frames) can be obtained from the first video stream, and face detection and tracking are performed on each of the first video frames to obtain the driver's face area in each first video frame.

In a possible implementation, speech detection can be performed on the driver according to the driver's face area and the sound signal to determine whether the driver is speaking, and thus whether the target occupant who sends out the sound signal in the cabin is the driver. That is, if it is determined that the driver is speaking, it can be determined that the target occupant who sends out the sound signal is the driver; otherwise, if it is determined that the driver is not speaking, it can be determined that the target occupant who sends out the sound signal is not the driver.

In a possible implementation manner, subsequent processing may be performed according to whether the target occupant who sends out the sound signal in the vehicle cabin is the driver. For example, if the target occupant who sends out the sound signal is the driver, the voice recognition function can be activated to respond to the sound signal; otherwise, if the target occupant who sends out the sound signal is not the driver, then the sound signal can not be responded to. The present disclosure does not limit the way of subsequent processing.

In this way, whether the driver is speaking can be determined according to the first video stream of the driver's area and the sound signal, so as to determine whether the target occupant who sends out the sound signal is the driver, thereby reducing the false alarm rate of speech recognition and improving convenience for the user.

In a possible implementation manner, the video stream includes a second video stream of the occupant area. Wherein, step S13 may include:

For each face area of the occupant, according to the face area and the sound signal, it is determined whether the target occupant who sends out the sound signal in the cabin is the occupant corresponding to the face area.

For example, the second video stream corresponds to the occupant area in the cabin, which includes the driver and/or passengers. In this case, in step S12, a plurality of video frames (referred to as second video frames) can be obtained from the second video stream, and face detection and tracking are performed on each of the second video frames to obtain the face area of each occupant in the cabin in each second video frame.

For example, in the case where the driver's area is at the left front of the cabin, the face area at the lower right position in the second video frame can be determined as the driver's face area, and the face area at the lower left position in the second video frame can be determined as the face area of the front passenger. The present disclosure does not limit the specific manner of determining each occupant.
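The position-based assignment in the preceding example can be sketched as below, for illustration only. The sketch assumes a left-hand-drive cabin, image coordinates with the origin at the top left, and a simple split of the frame into halves; treating the upper half as the rear row is an additional assumption, and `seat_of` is a hypothetical name.

```python
def seat_of(face_box, frame_width, frame_height):
    """Guess the occupant's seat from where the face appears in the frame.

    face_box: (x1, y1, x2, y2) in pixels, origin at the top-left corner.
    Lower right of the image -> driver; lower left -> front passenger.
    Faces in the upper half are assumed (not stated in the source text)
    to belong to rear passengers.
    """
    x1, y1, x2, y2 = face_box
    centre_x = (x1 + x2) / 2
    centre_y = (y1 + y2) / 2
    if centre_y < frame_height / 2:
        return "rear passenger"
    if centre_x >= frame_width / 2:
        return "driver"
    return "front passenger"


if __name__ == "__main__":
    print(seat_of((900, 500, 1100, 700), 1280, 720))
```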

In a possible implementation, for the face area of each occupant, speech detection can be performed on the occupant according to the occupant's face area and the sound signal to determine whether the occupant is speaking, and thus whether the target occupant who sends out the sound signal is this occupant. That is, if it is determined that the occupant is speaking, it can be determined that the target occupant who sends out the sound signal is the occupant corresponding to the face area; otherwise, if it is determined that the occupant is not speaking, it can be determined that the target occupant who sends out the sound signal is not the occupant corresponding to the face area.

In a possible implementation manner, subsequent processing may be performed according to the identity of the target occupant who sends out the sound signal in the cabin. For example, if the target occupant is the driver, the voice recognition function can be activated to respond to the sound signal; if the target occupant is a passenger who has no control authority, the sound signal need not be responded to; and if the target occupant is a passenger who has control authority, the voice recognition function can also be activated to respond to the sound signal. The present disclosure does not limit the way of subsequent processing.
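The example policy above can be written, purely as a sketch, in a few lines. The role labels `"driver"` and `"passenger"` and the function name `should_respond` are hypothetical; how control authority is granted is outside the scope of this sketch.

```python
def should_respond(target_role, has_control_authority=False):
    """Decide whether to activate voice recognition for the sound signal.

    target_role: "driver", "passenger", or None when no occupant was
                 detected as speaking (for example, the sound came from
                 audio playback or from outside the cabin).
    """
    if target_role == "driver":
        return True
    if target_role == "passenger":
        return has_control_authority
    return False


if __name__ == "__main__":
    print(should_respond("passenger", has_control_authority=True))
```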

In this way, it is possible to determine whether each occupant is speaking according to the second video stream and sound signal in the occupant area, thereby determining which occupant is the target occupant who sends out the sound signal, reducing the false alarm rate of voice recognition, and improving occupant speech detection accuracy and make subsequent responses more targeted.

In a possible implementation manner, the occupant's speech detection may be performed in step S13. Wherein, step S13 may include:

determining a video frame sequence corresponding to a time period of the sound signal in the video stream;

For the face area of each occupant,

performing feature extraction on the occupant's face area in the video frame sequence to obtain the occupant's facial features;

determining the fusion feature of the occupant according to the face feature and the speech feature extracted from the sound signal;

determining a speech detection result of the occupant according to the fusion feature;

According to the speech detection results of each occupant, the target occupant who sends out the sound signal is determined.

For example, a certain duration may be preset, and speaking detection is performed within the duration. The duration can be set as 1s, 2s or 3s, for example, which is not limited in the present disclosure.

In a possible implementation manner, feature extraction may be performed on the sound signal to obtain speech features, and then the facial features of each occupant detected from the video stream are fused with the speech features to obtain fusion features.

In a possible implementation manner, the sound signal of the duration may be selected from the sound signals collected by the vehicle-mounted microphone, and the video frame sequence corresponding to the time period of the sound signal is determined from the video stream. In the case of real-time processing, the time period of the sound signal is, for example, the latest 2s (2s ago-now), and the video frame sequence also includes multiple video frames of the latest 2s (set as N video frames, N>1).
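As a rough illustration of picking the video frame sequence that matches the latest stretch of sound, the sketch below assumes that every video frame carries an acquisition timestamp in integer milliseconds and that the timestamps are sorted in ascending order. The function name `frames_in_window` and its parameters are hypothetical.

```python
import bisect
from typing import List, Tuple


def frames_in_window(frame_times_ms: List[int], now_ms: int,
                     duration_ms: int = 2000) -> Tuple[int, int]:
    """Return (start, end) so that frame_times_ms[start:end] covers
    the period [now_ms - duration_ms, now_ms]."""
    start = bisect.bisect_left(frame_times_ms, now_ms - duration_ms)
    end = bisect.bisect_right(frame_times_ms, now_ms)
    return start, end


# 20 frames/s for 5 s: one timestamp every 50 ms.
timestamps = list(range(0, 5000, 50))
start, end = frames_in_window(timestamps, now_ms=4950)
n_video_frames = end - start   # N video frames of the latest 2 s
```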

In a possible implementation manner, for the face area of each occupant, the image of the occupant's face area in the video frame sequence may be determined, and feature extraction is performed on the images of each face area to obtain N facial features of the occupant. The manner of feature extraction may be, for example, face key point extraction, face contour extraction, etc., which is not limited in the present disclosure.

In a possible implementation, for the detected face area of each occupant, the N video frames in which the face area appears in the video stream can be determined, together with the speech in the time period corresponding to the N video frames. In this case, feature extraction can be performed on the occupant's face area in the video frame sequence in the following manner to obtain the occupant's facial features: feature extraction is performed on the occupant's face area in each of the N video frames of the video frame sequence to obtain N face features of the occupant. In this way, facial features and speech features can be "aligned" in time, thereby improving the accuracy of speech detection results.

For example, for the N video frames I1, I2, ..., IN of the video frame sequence at time T~T+k in the video stream, the face frame sequences of the M occupants in the cabin (M≥1) can be obtained through face detection and tracking, that is, each occupant corresponds to one face frame sequence. Wherein, T is an arbitrary moment, and k is, for example, 1s, 2s, or 3s; the value of k is not limited in the present disclosure.

In a possible implementation, for any occupant (denoted as the i-th occupant, where i is an integer and 1≤i≤M) and for any one of the N video frames (denoted as the nth video frame, where n is an integer and 1≤n≤N), the occupant's face area is denoted as In-face-i. The face area In-face-i can be input into the face feature extraction network MFaceNet for feature extraction to obtain the feature map In-Featuremap-i, which is the nth face feature of the i-th occupant. The feature dimension of the face feature is (c, h, w), where c, h, and w represent the number of channels, height, and width, respectively.

In a possible implementation manner, the face feature extraction network MFaceNet may be a convolutional neural network, for example, the face feature extraction network MFaceNet is obtained by removing the key point head (head) part from the face key point detection model. The present disclosure does not limit the network structure of the face feature extraction network.

In this way, features are extracted from the face area of each of the N video frames to obtain N face features of the occupant.

In a possible implementation manner, the step of performing speech feature extraction on the sound signal to obtain the speech feature may include: performing segmentation and speech feature extraction on the sound signal according to the acquisition time of the N video frames, N speech features respectively corresponding to the N video frames are obtained.

That is to say, the audio signal can be segmented to obtain N speech frames respectively corresponding to the N video frames; and then speech feature extraction is performed on each of the N speech frames to obtain N speech features.

In a possible implementation manner, the step of segmenting the sound signal and extracting speech features according to the acquisition time of the N video frames, to obtain the N speech features respectively corresponding to the N video frames, may include:

According to the acquisition moments of the N video frames, the sound signal is segmented to obtain N speech frames respectively corresponding to the N video frames, where the acquisition moment of the nth video frame among the N video frames is within the time period corresponding to the nth speech frame, 1≤n≤N;

Perform speech feature extraction on the N speech frames respectively to obtain N speech features.

For example, for the sound signal Audio obtained by the microphones at time T~T+k, the silence at the beginning and end may be cut off to reduce interference. Then the sound signal is divided into frames, that is, the sound is divided into small segments, and each segment is called a speech frame. In order to ensure the timing alignment of the speech frames and the video frames, the time period of each speech frame corresponds to the acquisition time of a video frame, that is, the acquisition time of the nth video frame is within the time period corresponding to the nth speech frame.
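Cutting off the silence at the beginning and end might look like the sketch below. The disclosure does not say how silence is detected; the per-sample amplitude threshold used here is an assumption chosen for brevity, and the name `trim_silence` is hypothetical.

```python
import numpy as np


def trim_silence(signal: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Drop leading and trailing samples whose absolute amplitude
    stays below the threshold (assumed silence criterion)."""
    active = np.flatnonzero(np.abs(signal) >= threshold)
    if active.size == 0:
        return signal[:0]          # the whole signal is silent
    return signal[active[0]:active[-1] + 1]


signal = np.concatenate([np.zeros(5), np.array([0.5, 0.0, -0.3]), np.zeros(4)])
trimmed = trim_silence(signal)
```

Quiet samples between the first and last active sample are kept, so only the two ends are shortened. If trimming is applied, the start time of the kept segment has to be tracked so that the speech frames can still be aligned with the video frames.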

In a possible implementation manner, the step of segmenting the sound signal according to the acquisition time of the N video frames to obtain N speech frames respectively corresponding to the N video frames includes:

According to the acquisition moment of the N video frames, determine a time window length and a moving step for dividing the time window of the sound signal, and the moving step is smaller than the time window;

For the nth speech frame, according to the moving step, move the time window, and determine the time period corresponding to the nth speech frame;

Segmenting the nth speech frame from the sound signal according to the time period corresponding to the nth speech frame.

For example, there may be an overlap between the time periods of the speech frames to reduce the occurrence of sound distortion. The segmentation of the sound signal can be realized by moving the window function.

In a possible implementation manner, according to the acquisition time of N video frames, the time window length and the moving step of the time window of the moving window function may be determined, wherein the moving step is smaller than the time window. For example, if the time interval between the acquisition moments of adjacent video frames in N video frames is 50ms (that is, the frame rate of video frames is 20 frames/s), then the moving step can be set to 50ms, and the time window length can be set to 60ms, in this case, the overlap between adjacent speech frames is 10ms. The present disclosure does not limit the specific values of the time window length and the moving step size.

In a possible implementation, for the first speech frame, starting from time T, the time period corresponding to the time window can be used as the time period corresponding to the first speech frame, for example, T~T+60ms; for the second speech frame, the time window can be moved according to the moving step, and the time period corresponding to the moved time window can be used as the time period corresponding to the second speech frame, for example, T+50ms~T+110ms; for the nth speech frame, the time period corresponding to the nth speech frame can likewise be determined by moving the time window according to the moving step. In this way, the time periods corresponding to the N speech frames can be respectively determined.

In a possible implementation manner, according to the time period corresponding to the nth speech frame, the nth speech frame may be segmented from the sound signal. After dividing according to the time segments of the N speech frames, N speech frames can be obtained, which are denoted as A1, A2, . . . , AN.
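The moving-window segmentation can be sketched as follows, using the example values from the text (moving step 50 ms, time window 60 ms). The function names, the use of integer milliseconds, and the 16 kHz sample rate in the usage lines are assumptions made for illustration.

```python
import numpy as np


def speech_frame_periods(n_frames, step_ms=50, window_ms=60, start_ms=0):
    """Return the (begin, end) time period in ms of each speech frame."""
    return [(start_ms + n * step_ms, start_ms + n * step_ms + window_ms)
            for n in range(n_frames)]


def split_speech_frames(signal, sample_rate, periods, start_ms=0):
    """Cut one speech frame per period out of a 1-D signal whose
    first sample was recorded at start_ms."""
    frames = []
    for begin, end in periods:
        lo = (begin - start_ms) * sample_rate // 1000
        hi = (end - start_ms) * sample_rate // 1000
        frames.append(signal[lo:hi])
    return frames


periods = speech_frame_periods(3)
frames = split_speech_frames(np.arange(16000), 16000, periods)
```

With these values, adjacent speech frames share 10 ms of signal, which corresponds to 160 samples at the assumed sample rate.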

In this way, the speech segmentation process can be realized and the subsequent processing effect can be improved.

In a possible implementation, speech feature extraction can be performed on each speech frame. For example, by means of an MFCC (Mel-Frequency Cepstral Coefficients) transformation, the speech frame can be transformed into a c-dimensional vector containing sound information, and the c-dimensional vector is used as the speech feature, which is recorded as An-feature. The length c of the speech feature is the same as the number of channels of the face feature.

In this way, N voice features can be obtained by processing N voice frames respectively. It should be understood that other methods may also be used to extract speech features from speech frames, which is not limited in the present disclosure.
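A compact MFCC computation for a single speech frame is sketched below with NumPy only. It follows the common textbook recipe (Hamming window, power spectrum, triangular mel filterbank, logarithm, DCT-II). The filter count, FFT size, and default c = 13 are assumptions, not values from the disclosure, and the frame is assumed to be no longer than `n_fft` samples.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale,
    shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling edge
            bank[m - 1, k] = (right - k) / (right - center)
    return bank


def mfcc(frame, sample_rate, c=13, n_filters=26, n_fft=1024):
    """Return a c-dimensional MFCC vector for one speech frame."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2 / n_fft
    log_energy = np.log(mel_filterbank(n_filters, n_fft, sample_rate) @ power + 1e-10)
    # DCT-II of the log filterbank energies; keep the first c coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(c), 2 * n + 1) / (2 * n_filters))
    return basis @ log_energy


# A 60 ms frame of a 440 Hz tone sampled at 16 kHz (synthetic test input).
frame = np.sin(2 * np.pi * 440 * np.arange(960) / 16000)
feature = mfcc(frame, 16000)
```

To match the face feature, c would be set to the channel count of the feature map, with `n_filters` chosen at least as large as c.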

In a possible implementation manner, after the N facial features and N voice features of the occupant are obtained, the facial features and voice features may be fused. Wherein, according to the facial feature and the voice feature, the step of determining the fusion feature of the occupant may include:

The N facial features and the N voice features are fused one-to-one to obtain N sub-fusion features;

The N sub-fusion features are spliced to obtain the fusion features of the occupant.

That is to say, the nth face feature In-Featuremap-i of the occupant i can be fused with the nth voice feature An-feature. For example, the voice feature (a c-dimensional vector) is dot-multiplied with each position of the face feature (feature dimension (c, h, w)) to obtain the nth sub-fusion feature, denoted as Fusionfeature-n, whose dimension is also (c, h, w). In this way, N sub-fusion features can be obtained by one-to-one fusion of N facial features and N voice features.

In a possible implementation, the N sub-fusion features can be spliced to obtain the fusion feature of the occupant i, which is recorded as video-fusionfeature.
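One way to read the fusion step is as a channel-wise multiplication followed by concatenation, sketched below with NumPy broadcasting. Interpreting the dot multiplication as scaling channel j at every spatial position by the j-th speech coefficient, and splicing along the channel axis, are both assumptions; the disclosure does not fix either detail.

```python
import numpy as np


def fuse_occupant_features(face_feats: np.ndarray,
                           speech_feats: np.ndarray) -> np.ndarray:
    """face_feats: (N, c, h, w); speech_feats: (N, c).
    Returns the spliced fusion feature of shape (N * c, h, w)."""
    n, c, h, w = face_feats.shape
    if speech_feats.shape != (n, c):
        raise ValueError("speech features must have shape (N, c)")
    # Sub-fusion features: scale each channel by the matching speech coefficient.
    sub_fusion = face_feats * speech_feats[:, :, None, None]   # (N, c, h, w)
    # Splice the N sub-fusion features along the channel axis (assumed axis).
    return sub_fusion.reshape(n * c, h, w)


face = np.ones((2, 3, 4, 4))
speech = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
fused = fuse_occupant_features(face, speech)
```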

In this way, the multi-modal fusion of face features and voice features can be realized. Fusing the two at the neural network level can significantly reduce the false positive rate of speech detection; and, compared with logical fusion at the upper layer, fusion at the neural network level can improve the robustness of speech detection.

In a possible implementation manner, according to the fusion feature, the speech detection result of the occupant i may be determined. A speech detection network may be preset, and the fusion feature is input into the speech detection network for processing, and the speech detection result of the occupant i is output.

Wherein, the speaking detection network may be, for example, a convolutional neural network, including multiple fully connected layers (for example, three layers of fully connected layers), a softmax layer, etc., for performing binary classification on fusion features. The fusion feature is input into the fully connected layer of the speaking detection network, and two-dimensional output can be obtained, corresponding to the speaking state and other states; after being processed by the softmax layer, a normalized score (score) or confidence degree is obtained.

In a possible implementation manner, a preset threshold (for example, set to 0.8) may be set for the score or confidence level of the speaking state. If the preset threshold is exceeded, it is determined that the occupant i is in a speaking state; otherwise, it is determined that the occupant i is in a non-speaking state. The present disclosure does not limit the network structure, training method and specific value of the preset threshold of the speaking detection network.
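The classification head and the threshold test can be sketched as below. The weights are passed in by the caller because the disclosure does not specify training; the ReLU between layers and the ordering of the two outputs (index 0 for the speaking state) are assumptions.

```python
import numpy as np


def softmax(logits):
    exp = np.exp(logits - np.max(logits))   # shift for numerical stability
    return exp / exp.sum()


def detect_speaking(fusion_feature, weights, biases, threshold=0.8):
    """Pass the flattened fusion feature through fully connected layers
    and return (is_speaking, speaking_score)."""
    x = np.asarray(fusion_feature, dtype=float).reshape(-1)
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = w @ x + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)          # ReLU between layers (assumption)
    score = softmax(x)[0]                   # index 0: speaking state (assumed order)
    return bool(score > threshold), float(score)


# Three toy identity layers stand in for trained weights.
weights = [np.eye(2)] * 3
biases = [np.zeros(2)] * 3
speaking, score = detect_speaking(np.array([3.0, 0.0]), weights, biases)
```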

FIG. 2 shows a schematic diagram of a speaking detection process according to an embodiment of the present disclosure.

As shown in Figure 2, for the N video frames to be processed (video frame 1, video frame 2, ..., video frame N), face detection can be performed on the N video frames respectively to determine the face area of the occupant i in the N video frames; feature extraction is then performed on the face areas of the occupant i in the N video frames respectively to obtain N face features. For the N voice frames to be processed (voice frame 1, voice frame 2, ..., voice frame N), MFCC transformation can be performed on the N voice frames respectively to extract N voice features. The N face features and N voice features can be fused one-to-one by dot multiplication to obtain N sub-fusion features: sub-fusion feature 1, sub-fusion feature 2, ..., sub-fusion feature N. The N sub-fusion features are spliced to obtain the fusion feature of the occupant i; the fusion feature is input into the speech detection network for processing, and the speech detection result of the occupant i is output, that is, the occupant i is speaking or not speaking.

In this way, based on the multi-modal fusion feature of image features and voice features, it can be judged whether the occupant in the cabin is speaking, thereby improving the accuracy of speech detection.

In a possible implementation manner, the above processing is performed on each occupant to obtain the speech detection result of each occupant; furthermore, the target occupant who sends out the sound signal can be determined according to the speech detection results of the occupants, that is, it can be determined which occupant sent out the sound signal, thereby improving the accuracy of occupant speech detection.
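Combining the per-occupant results into a decision could be as simple as the sketch below. It assumes each occupant's result is the speaking score described earlier and reuses the example threshold of 0.8; the function name and the seat labels are hypothetical.

```python
from typing import Dict, List


def find_target_occupants(scores: Dict[str, float],
                          threshold: float = 0.8) -> List[str]:
    """Return the occupants whose speaking score exceeds the threshold,
    ordered from highest to lowest score."""
    speaking = [(score, name) for name, score in scores.items() if score > threshold]
    return [name for score, name in sorted(speaking, reverse=True)]


targets = find_target_occupants({"driver": 0.95, "co_pilot": 0.10, "rear_left": 0.85})
```

An empty list means that no occupant is judged to be speaking, in which case the sound signal can be treated as not coming from any occupant.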

In a possible implementation manner, the occupant speaking detection method according to the embodiment of the present disclosure may further include:

performing content recognition on the sound signal, and determining the speech content corresponding to the sound signal;

If the voice content includes a preset voice command, a control function corresponding to the voice command is executed.

For example, if the target occupant who sent out the sound signal has been determined in step S13, the voice recognition function can be activated to identify the content of the sound signal and determine the voice content corresponding to the sound signal. The present disclosure does not limit the implementation of voice content recognition.

In a possible implementation manner, various voice commands may be preset. If the recognized voice content includes a preset voice command, the control function corresponding to the voice command can be executed. For example, if it is recognized that the voice content includes the voice command "play music", the car's music player can be controlled to play music; if it is recognized that the voice content includes the voice command "open the left window", the left window can be controlled to open.
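Matching recognized content against preset voice commands can be sketched as a lookup table of phrases and control functions. The substring matching, the `PRESET_COMMANDS` table, and the strings returned by the stand-in control functions are illustrative assumptions only.

```python
from typing import Callable, Dict, List


def dispatch_commands(speech_content: str,
                      commands: Dict[str, Callable[[], str]]) -> List[str]:
    """Execute the control function of every preset command found in the content."""
    text = speech_content.lower()
    return [action() for phrase, action in commands.items() if phrase in text]


# Stand-in control functions that only report what they would do.
PRESET_COMMANDS = {
    "play music": lambda: "music player: play",
    "open the left window": lambda: "left window: open",
}

executed = dispatch_commands("Please play music now", PRESET_COMMANDS)
```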

In this way, the voice interaction with the occupants in the vehicle can be realized, so that the user can realize various control functions through voice, which improves the convenience of the user and improves the user experience.

In a possible implementation manner, when the voice content includes a preset voice command, the step of executing the control function corresponding to the voice command may include:

In the case where the voice command corresponds to multiple control functions with directional properties, determine the gaze direction of the target occupant according to the face area of the target occupant;

determining a target control function from the plurality of control functions based on the gaze direction of the target occupant;

Execute the target control function.

For example, a voice command may correspond to multiple control functions with directionality. For example, the voice command "open the window" may correspond to the windows in the left and right directions, in which case the multiple control functions include "open the left window" and "open the right window"; it may also correspond to the windows in the four directions of left front, left rear, right front and right rear, in which case the multiple control functions include "open the left front window", "open the right front window", "open the left rear window" and "open the right rear window". In this case, the corresponding control function can be determined in conjunction with image recognition.

In a possible implementation, when the voice command corresponds to multiple directional control functions, the gaze direction of the target occupant may be determined according to the face areas of the target occupant in N video frames.

In a possible implementation, feature extraction can be performed on the images of the face areas of the target occupant in the N video frames respectively to obtain the face features of the target occupant in the N video frames; the N face features are fused to obtain the face fusion feature of the target occupant; and the face fusion feature is input into a preset gaze direction recognition network for processing to obtain the gaze direction of the target occupant, that is, the direction of sight of the eyes of the target occupant.

Wherein, the gaze direction recognition network may be, for example, a convolutional neural network, including a convolutional layer, a fully connected layer, a softmax layer, and the like. The disclosure does not limit the network structure and training method of the gaze direction recognition network.

In a possible implementation manner, the target control function may be determined from multiple control functions according to the gaze direction of the target occupant. For example, if the voice command is "open the window", and it is determined that the gaze direction of the target occupant is toward the right, then the target control function may be determined as "open the right window". The target control function can then be executed, for example, by controlling the right window to open.
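Selecting the target control function from the gaze direction can be sketched as a two-level lookup. The table `DIRECTIONAL_FUNCTIONS` and the direction labels "left" and "right" are hypothetical; the gaze direction itself is assumed to come from the gaze direction recognition network.

```python
from typing import Dict, Optional

# Voice command -> gaze direction -> target control function (illustrative table).
DIRECTIONAL_FUNCTIONS: Dict[str, Dict[str, str]] = {
    "open the window": {
        "left": "open the left window",
        "right": "open the right window",
    },
}


def resolve_target_function(command: str, gaze_direction: str) -> Optional[str]:
    """Pick the control function that matches the occupant's gaze direction."""
    options = DIRECTIONAL_FUNCTIONS.get(command)
    if options is None:
        return None                 # the command has no directional variants
    return options.get(gaze_direction)


target_function = resolve_target_function("open the window", "right")
```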

In this way, the accuracy of voice interaction can be improved, and the convenience for users can be further improved.

In a possible implementation, the identities of the occupants may not be distinguished, that is, if it is determined that there is a target occupant speaking, voice recognition is activated and a corresponding control function is executed. It is also possible to distinguish the identity of the target occupant. For example, the system may respond only to the driver's voice, performing voice recognition when it is judged that the driver is speaking and not responding to a passenger's voice; or, according to the seat area where a passenger is located, the system may perform voice recognition when it is judged that the passenger is speaking and execute an area control function for the passenger's seat area.

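In a possible implementation manner, the occupant speaking detection method according to the embodiment of the present disclosure may further include:

determining the seating area of the target occupant according to the video stream;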

In the case where the voice content includes a preset voice command, according to the seating area of the target occupant, determine the area control function corresponding to the voice command;

Execute the zone control functions.

For example, the video stream includes a first video stream of the driver area, and/or a second video stream of the occupant area in the cabin, and the target occupants may include the driver and/or occupants.

In a possible implementation, for the first video stream, if the target occupant who sends out the sound signal has been determined in step S13, the target occupant can be directly determined to be the driver, and the seat area of the target occupant is the driver area.

In a possible implementation, for the second video stream, if the target occupant who sends out the sound signal has been determined in step S13, the seating area of the target occupant, such as the co-pilot area, the left rear seat area, or the right rear seat area, can be determined according to the position of the face area of the target occupant in the video frame of the second video stream.

For example, in the case where the driver's area is at the left front of the cabin, if the face area of the target occupant is at the lower left position in the video frame, it can be determined that the seat area of the target occupant is the co-pilot area.
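The position-based mapping can be sketched by testing which quadrant of the video frame contains the center of the face box. The lower-right (driver) and lower-left (co-pilot) cases follow the examples in the text for a cabin with the driver at the left front; assigning the upper quadrants to the rear seats is an assumption, as are the function name and the pixel coordinate convention.

```python
def seat_area_from_face(box, frame_width, frame_height):
    """box = (x_min, y_min, x_max, y_max) in pixels, origin at the top-left corner."""
    center_x = (box[0] + box[2]) / 2
    center_y = (box[1] + box[3]) / 2
    in_right_half = center_x >= frame_width / 2
    in_lower_half = center_y >= frame_height / 2
    if in_lower_half:
        return "driver" if in_right_half else "co-pilot"
    # Upper half of the frame: rear seats (assumption, mirrored like the front row).
    return "left rear" if in_right_half else "right rear"


area = seat_area_from_face((100, 500, 300, 700), 1280, 720)
```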

In a possible implementation, if the target occupant who sends out the sound signal has been determined in step S13, the speech recognition function can be activated to perform content recognition on the sound signal to determine the speech content corresponding to the sound signal. The present disclosure does not limit the implementation manner of the content recognition.

In a possible implementation manner, various voice commands may be preset. If the recognized voice content includes a preset voice command, the area control function corresponding to the voice command may be determined according to the seating area of the target occupant. For example, if it is recognized that the voice content includes the voice command "open the window", and the seat area of the target occupant is the left rear seat area, then it can be determined that the corresponding area control function is "open the left rear window". In turn, this area control function can be performed, for example controlling the opening of the left rear side window.
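The area control lookup can be sketched as a table keyed by voice command and seating area. The table contents and names below are illustrative assumptions, with the left rear entry taken from the example in the text.

```python
from typing import Dict, Optional, Tuple

# (voice command, seating area) -> area control function (illustrative table).
AREA_FUNCTIONS: Dict[Tuple[str, str], str] = {
    ("open the window", "driver"): "open the left front window",
    ("open the window", "co-pilot"): "open the right front window",
    ("open the window", "left rear"): "open the left rear window",
    ("open the window", "right rear"): "open the right rear window",
}


def area_control_function(command: str, seat_area: str) -> Optional[str]:
    """Return the area control function for the command issued from seat_area."""
    return AREA_FUNCTIONS.get((command, seat_area))


selected = area_control_function("open the window", "left rear")
```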

In this way, the corresponding area control function can be executed, further improving user convenience.

According to the occupant speech detection method of the embodiment of the present disclosure, the video stream and sound signal in the cabin can be obtained; face detection is performed on the video stream to determine the face area of at least one occupant in the cabin in the video stream; and according to the face area of each occupant and the sound signal, the target occupant who emits the sound signal is determined from among the occupants. Judging whether the occupant is speaking according to the face area and the sound signal can improve the accuracy of occupant speech detection and reduce the false alarm rate of speech recognition.

According to the occupant speech detection method of the embodiment of the present disclosure, the multi-modal fusion of video images and sound signals is performed at the neural network level, which can greatly reduce the sound interference caused by non-human voice sources and significantly reduce the false alarm rate of speech detection; and, compared with logical fusion at the upper layer, fusion at the neural network level can improve the robustness of speech detection.

The occupant speech detection method according to the embodiments of the present disclosure can be applied to an intelligent cabin perception system, effectively avoiding false alarms caused by purely relying on voice signals, ensuring that voice recognition can be normally triggered, and improving user interaction experience.

It can be understood that the above-mentioned method embodiments mentioned in this disclosure can all be combined with each other to form combined embodiments without violating the principle and logic. Due to space limitations, this disclosure will not repeat them. Those skilled in the art can understand that, in the above methods of the specific implementation manners, the specific execution order of each step should be determined according to its function and possible internal logic.

In addition, the present disclosure also provides an occupant speech detection device, electronic equipment, a computer-readable storage medium, and a program, all of which can be used to implement any of the occupant speech detection methods provided in the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding records in the method section, which are not repeated here.

Fig. 3 shows a block diagram of an occupant speaking detection device according to an embodiment of the present disclosure. As shown in Fig. 3, the device includes:

Signal acquiring module 31, for acquiring video stream and sound signal in the cabin;

A face detection module 32, configured to perform face detection on the video stream, to determine the face area of at least one occupant in the cabin in the video stream;

The occupant determining module 33 is configured to determine the target occupant in the cabin who sends out the sound signal according to the face area of each occupant and the sound signal.

In a possible implementation manner, the video stream includes a first video stream of the driver's area;

The face detection module is used to: determine the face area of the driver in the cabin in the first video stream;

The occupant determination module is configured to: determine whether the target occupant in the vehicle cabin who sends out the sound signal is the driver according to the face area of the driver and the sound signal.

In a possible implementation manner, the video stream includes a second video stream of the occupant area;

The occupant determining module is used for: for each occupant's face area, determining, according to the face area and the sound signal, whether the target occupant who sends out the sound signal in the cabin is the occupant corresponding to the face area.

In a possible implementation manner, the device further includes:

The seat area determination module is used to determine the seat area of the target occupant according to the video stream; the second identification module is used to perform content identification on the sound signal and determine the voice content corresponding to the sound signal; the function determining module is configured to determine an area control function corresponding to the voice instruction according to the seating area of the target occupant when the voice content includes a preset voice instruction; and the area control module is configured to execute the area control function.

In a possible implementation manner, the occupant determination module is used for:

For the face area of each occupant, performing feature extraction on the occupant's face area in the video frame sequence to obtain the occupant's facial features; determining the fusion feature of the occupant according to the facial features and the speech feature extracted from the sound signal; and determining the speech detection result of the occupant according to the fusion feature.

In a possible implementation manner, the feature extraction performed by the occupant determination module on the face area of the occupant in the video frame sequence includes: performing feature extraction on the face area of the occupant in each of the N video frames in the video frame sequence to obtain N face features of the occupant;

The voice feature is extracted by the occupant determination module in the following manner: according to the acquisition time of the N video frames, the sound signal is segmented and voice features are extracted to obtain N voice features respectively corresponding to the N video frames.

In some embodiments, the functions or modules included in the device provided by the embodiments of the present disclosure can be used to execute the methods described in the method embodiments above, and for their specific implementation, reference can be made to the description of the method embodiments above. For brevity, details are not repeated here.

Embodiments of the present disclosure also provide a computer-readable storage medium, on which computer program instructions are stored, and the above-mentioned method is implemented when the computer program instructions are executed by a processor. Computer readable storage media may be volatile or nonvolatile computer readable storage media.

An embodiment of the present disclosure also proposes an electronic device, including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to execute the above method.

An embodiment of the present disclosure also provides a computer program product, including computer-readable codes, or a non-volatile computer-readable storage medium carrying computer-readable codes; when the computer-readable codes are run in a processor of an electronic device, the processor in the electronic device executes the above method.

An embodiment of the present disclosure also provides a computer program, including computer readable codes, and when the computer readable codes are run in an electronic device, a processor in the electronic device executes the above method.

Electronic devices may be provided as terminals, servers, or other forms of devices.

FIG. 4 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.

Referring to FIG. 4, the electronic device 800 may include one or more of the following components: processing component 802, memory 804, power supply component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls the overall operations of the electronic device 800, such as those associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method. Additionally, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.

The power supply component 806 provides power to various components of the electronic device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode. Received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.

The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, such as the display and the keypad of the electronic device 800. The sensor component 814 can also detect a change in position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include an optical sensor, such as a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the methods described above.

In an exemplary embodiment, there is also provided a non-volatile computer-readable storage medium, such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to implement the above method.

FIG. 5 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server. Referring to FIG. 5, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions to perform the above method.

The electronic device 1900 may also include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Microsoft's server operating system (Windows Server™), Apple's graphical-user-interface-based operating system (Mac OS X™), the multi-user, multi-process operating system Unix™, the free and open-source Unix-like operating system Linux™, the open-source Unix-like operating system FreeBSD™, or the like.

In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to implement the above method.

The present disclosure can be a system, method and/or computer program product. A computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.

A computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. A computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVDs), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the above. As used herein, computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or electrical signals transmitted through wires.

Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.

Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be customized by utilizing state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.

Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions can also be stored in a computer-readable storage medium, where they cause computers, programmable data processing devices, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions constitutes an article of manufacture comprising instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

It is also possible to load computer-readable program instructions into a computer, other programmable data processing device, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing device, or other equipment to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing device, or other equipment implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified function or action, or may be implemented by a combination of dedicated hardware and computer instructions.

The computer program product can be specifically realized by means of hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK) and so on.

Various embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and alterations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application or improvement over technology in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

A method for detecting occupant speech, characterized by comprising:

acquiring a video stream and a sound signal in a vehicle cabin;

performing face detection on the video stream, and determining a face area, in the video stream, of at least one occupant in the cabin;

determining, according to the face area of the at least one occupant and the sound signal, a target occupant in the cabin who emits the sound signal.
The method according to claim 1, further comprising:

performing content recognition on the sound signal, and determining the speech content corresponding to the sound signal;

executing, in a case where the speech content includes a preset voice command, a control function corresponding to the voice command.
The method according to claim 2, wherein executing, in a case where the speech content includes a preset voice command, the control function corresponding to the voice command comprises:

in a case where the voice command corresponds to a plurality of directional control functions, determining a gaze direction of the target occupant according to the face area of the target occupant;

determining a target control function from the plurality of control functions according to the gaze direction of the target occupant;

executing the target control function.
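
By way of illustration only, the selection of a target control function from the gaze direction might be sketched as follows. The representation of the gaze direction as a single yaw angle in degrees, the function name `select_target_function`, and the candidate names in the usage note are all assumptions made for this sketch and are not part of the claimed method.

```python
from typing import Dict


def select_target_function(candidates: Dict[str, float], gaze_yaw_deg: float) -> str:
    """Return the candidate whose direction is angularly closest to the gaze.

    `candidates` maps a hypothetical control-function name to the direction
    (yaw, in degrees) of the object it acts on, as seen from the occupant.
    """
    if not candidates:
        raise ValueError("at least one candidate control function is required")

    def angular_gap(direction_deg: float) -> float:
        # Wrap the difference into [-180, 180) before taking its magnitude.
        diff = (direction_deg - gaze_yaw_deg + 180.0) % 360.0 - 180.0
        return abs(diff)

    return min(candidates, key=lambda name: angular_gap(candidates[name]))
```

For example, with the hypothetical candidates `{"open_left_window": -90.0, "open_right_window": 90.0}`, a gaze yaw of 60 degrees selects `open_right_window`.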
The method according to any one of claims 1-3, wherein the video stream comprises a first video stream of the driver's area;

determining the face area, in the video stream, of at least one occupant in the cabin comprises:

determining a face area of the driver in the cabin in the first video stream;

determining, according to the face area of the at least one occupant and the sound signal, the target occupant in the cabin who emits the sound signal comprises:

determining, according to the face area of the driver and the sound signal, whether the target occupant in the cabin who emits the sound signal is the driver.
The method according to any one of claims 1-4, wherein the video stream comprises a second video stream of the occupant area;

determining, according to the face area of the at least one occupant and the sound signal, the target occupant in the cabin who emits the sound signal comprises:

for the face area of each occupant, determining, according to the face area and the sound signal, whether the target occupant in the cabin who emits the sound signal is the occupant corresponding to the face area.
The method according to any one of claims 1-5, wherein the method further comprises:

determining the seating area of the target occupant according to the video stream;

performing content recognition on the sound signal, and determining the speech content corresponding to the sound signal;

in a case where the speech content includes a preset voice command, determining an area control function corresponding to the voice command according to the seating area of the target occupant;

executing the area control function.
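
As a purely illustrative sketch, the mapping from a voice command and a seating area to an area control function could be held in a lookup table. The table contents, the seating-area labels, and the function names below are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table: (voice command, seating area) -> area control function.
AREA_FUNCTIONS: Dict[Tuple[str, str], str] = {
    ("open the window", "driver"): "open_front_left_window",
    ("open the window", "front_passenger"): "open_front_right_window",
    ("open the window", "rear_left"): "open_rear_left_window",
}


def area_control_function(command: str, seating_area: str) -> Optional[str]:
    """Return the area control function for a command spoken from a seating
    area, or None when the table has no entry for that combination."""
    return AREA_FUNCTIONS.get((command, seating_area))
```

With this table, the same command resolves to a different function depending on where the target occupant sits, and an unknown combination yields `None` so that the caller can ignore the command.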
The method according to any one of claims 1-6, wherein determining, according to the face area of the at least one occupant and the sound signal, the target occupant in the cabin who emits the sound signal comprises:

determining a video frame sequence corresponding to a time period of the sound signal in the video stream;

For the face area of any occupant,

performing feature extraction on the occupant's face area in the video frame sequence to obtain the occupant's facial features;

determining the fusion feature of the occupant according to the face feature and the speech feature extracted from the sound signal;

determining a speech detection result of the occupant according to the fusion feature;

determining the target occupant who emits the sound signal according to the speech detection result of the at least one occupant.
The method according to claim 7, wherein performing feature extraction on the occupant's face area in the video frame sequence comprises:

performing feature extraction on the face area of the occupant in at least one frame of the N video frames of the video frame sequence to obtain N facial features of the occupant;

wherein the speech features are extracted as follows:

according to the acquisition moments of the N video frames, segmenting the sound signal and extracting speech features to obtain N speech features respectively corresponding to the N video frames.
The method according to claim 8, wherein segmenting the sound signal and extracting speech features according to the acquisition moments of the N video frames, to obtain N speech features respectively corresponding to the N video frames, comprises:

segmenting the sound signal according to the acquisition moments of the N video frames to obtain N speech frames respectively corresponding to the N video frames, wherein the acquisition moment of the nth video frame among the N video frames falls within the time period corresponding to the nth speech frame, n being an integer and 1≤n≤N;

performing speech feature extraction on the N speech frames respectively to obtain the N speech features.
The method according to claim 9, wherein segmenting the sound signal according to the acquisition moments of the N video frames to obtain N speech frames respectively corresponding to the N video frames comprises:

determining, according to the acquisition moments of the N video frames, a time window length and a moving step for segmenting the sound signal with a time window, the moving step being smaller than the time window length;

for the nth speech frame, moving the time window according to the moving step, and determining the time period corresponding to the nth speech frame;

segmenting the nth speech frame from the sound signal according to the time period corresponding to the nth speech frame.
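
The windowed segmentation recited above can be illustrated with a minimal sketch. It assumes, beyond what the claims state, that the video frames are evenly spaced, that the moving step equals the frame interval, and that each window is centred on the acquisition moment of its video frame; the function names and parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def speech_frame_periods(frame_times: Sequence[float],
                         window: float) -> List[Tuple[float, float]]:
    """Return one (start, end) time period per video frame, in seconds."""
    if len(frame_times) < 2:
        raise ValueError("need at least two video frames to derive a step")
    # Assumption: evenly spaced frames, so the step is the mean frame interval.
    step = (frame_times[-1] - frame_times[0]) / (len(frame_times) - 1)
    if not step < window:
        raise ValueError("moving step must be smaller than the time window")
    # Assumption: the first window is centred on the first acquisition moment.
    start0 = frame_times[0] - window / 2
    return [(start0 + n * step, start0 + n * step + window)
            for n in range(len(frame_times))]


def segment_sound(samples: Sequence[float], sample_rate: int,
                  periods: Sequence[Tuple[float, float]]) -> List[Sequence[float]]:
    """Cut one speech frame out of the sound signal for each time period."""
    frames = []
    for start, end in periods:
        # Clamp to the signal bounds; edge windows may be shorter.
        lo = max(0, round(start * sample_rate))
        hi = min(len(samples), round(end * sample_rate))
        frames.append(samples[lo:hi])
    return frames
```

Because the step is smaller than the window length, adjacent speech frames overlap, and the acquisition moment of the nth video frame lies inside the period of the nth speech frame.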
The method according to any one of claims 8-10, wherein determining the fusion feature of the occupant according to the facial feature and the speech feature comprises:

fusing the N facial features and the N speech features one-to-one to obtain N sub-fusion features;

splicing the N sub-fusion features to obtain the fusion feature of the occupant.
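
The one-to-one fusion followed by splicing can be sketched as follows. The claims do not fix the operator used to fuse each pair, so this sketch assumes per-pair concatenation; the function name and the use of plain lists as feature vectors are likewise assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def fuse_features(face_feats: Sequence[Vector],
                  speech_feats: Sequence[Vector]) -> List[float]:
    """Fuse N facial and N speech features pairwise, then splice the results."""
    if len(face_feats) != len(speech_feats):
        raise ValueError("expected the same number of facial and speech features")
    # Assumption: each pair is fused by concatenation (operator not specified).
    sub_fusions = [list(face) + list(speech)
                   for face, speech in zip(face_feats, speech_feats)]
    # Splice the N sub-fusion features in frame order into one vector.
    fused: List[float] = []
    for sub in sub_fusions:
        fused.extend(sub)
    return fused
```

The resulting vector keeps the facial and speech features of the same frame adjacent, which preserves the time alignment established during segmentation.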
An occupant speech detection device, characterized by comprising:

a signal acquisition module, configured to acquire a video stream and a sound signal in a vehicle cabin;

a face detection module, configured to perform face detection on the video stream and determine a face area, in the video stream, of at least one occupant in the cabin;

an occupant determination module, configured to determine, according to the face area of the at least one occupant and the sound signal, a target occupant in the cabin who emits the sound signal.
An electronic device, characterized by comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to invoke the instructions stored in the memory to execute the method according to any one of claims 1-11.
A computer-readable storage medium, on which computer program instructions are stored, wherein, when the computer program instructions are executed by a processor, the method according to any one of claims 1 to 11 is implemented.
A computer program, comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes instructions for implementing the method according to any one of claims 1 to 11.