WO2023273064A1 - Object speaking detection method and apparatus, electronic device, and storage medium - Google Patents

Object speaking detection method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023273064A1
Authority
WO
WIPO (PCT)
Prior art keywords
target object
face
area
sound signal
driver
Prior art date
Application number
PCT/CN2021/127097
Other languages
English (en)
French (fr)
Inventor
王飞
钱晨
Original Assignee
上海商汤临港智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤临港智能科技有限公司
Publication of WO2023273064A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Definitions

  • the present disclosure relates to the technical field of smart car cabins, and in particular to a method and device for detecting object speech, electronic equipment, and a storage medium.
  • Cabin intelligence includes multi-mode interaction, personalized service, safety perception, etc., which is an important direction for the current development of the automotive industry.
  • the multi-mode interaction in the cabin is intended to provide passengers with a comfortable interactive experience.
  • the means of multi-mode interaction include voice recognition, gesture recognition, etc. Among them, speech recognition occupies a significant market share in the field of vehicle interaction.
  • the present disclosure proposes a technical solution for object speech detection.
  • A method for detecting object speaking, comprising: acquiring a video stream in the vehicle cabin and a sound signal collected by a vehicle-mounted microphone; performing face detection on each of a plurality of video frames of the video stream, and determining the face area of a target object in the vehicle in each video frame; determining a lip movement recognition result of the target object according to the face area of the target object in N video frames among the plurality of video frames, where N is an integer greater than 1; and determining a speaking detection result of the target object according to the lip movement recognition result and a first sound signal, wherein the first sound signal includes the sound signal in the time period corresponding to the N video frames, and the speaking detection result includes that the target object is in a speaking state or in a non-speaking state.
  • Determining the speaking detection result of the target object according to the lip movement recognition result and the first sound signal includes: when the lip movement recognition result indicates lip movement and the first sound signal includes voice, determining that the target object is in a speaking state.
  • The method further includes: when the target object is in a speaking state, performing content recognition on the first sound signal and determining the voice content corresponding to the first sound signal; and when the voice content includes a preset voice command, executing a control function corresponding to the voice command.
  • The target object includes a driver, and executing the control function corresponding to the voice command includes: when the voice command corresponds to a plurality of directional control functions, determining the gaze direction of the target object according to the face area of the target object in the N video frames; determining a target control function from the plurality of control functions according to the gaze direction of the target object; and executing the target control function.
  • The video stream includes a first video stream of the driver area and/or a second video stream of the occupant area in the cabin; performing face detection on each of the plurality of video frames of the video stream includes: detecting the driver's face based on each of a plurality of first video frames of the first video stream; and/or detecting faces in the cabin based on each of a plurality of second video frames of the second video stream, and determining the driver's face in each second video frame according to the position of the detected faces in the cabin.
  • Obtaining the video stream in the cabin includes: obtaining the first video stream of the driver area collected by the camera of the driver detection system (DMS); and/or obtaining the second video stream of the occupant area in the cabin collected by the camera of the occupant detection system (OMS).
  • The method further includes: determining a first seat area of the target object according to each of the plurality of video frames; wherein determining the speaking detection result of the target object according to the lip movement recognition result and the first sound signal includes: when the first sound signal includes voice, performing sound zone positioning on the first sound signal and determining a second seat area corresponding to the first sound signal; and when the lip movement recognition result indicates lip movement and the first seat area is the same as the second seat area, determining that the target object is in a speaking state.
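  • The seat-area check described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name, the string labels for seat areas, and the use of `None` to mean "the first sound signal contains no voice" are all assumptions made for the example.

```python
from typing import Optional


def detect_speaking_with_zone(
    lip_movement: bool,
    first_seat_area: str,
    second_seat_area: Optional[str],
) -> bool:
    """Return True when the target object is judged to be speaking.

    first_seat_area:  seat area of the target object, derived from the video frames.
    second_seat_area: seat area obtained by sound zone positioning of the first
                      sound signal, or None if the signal contains no voice
                      (hypothetical convention for this sketch).
    """
    if second_seat_area is None:
        return False
    return lip_movement and first_seat_area == second_seat_area


# Hypothetical labels: lips move in the driver seat, but the voice is localized elsewhere.
print(detect_speaking_with_zone(True, "driver", "front_passenger"))
print(detect_speaking_with_zone(True, "driver", "driver"))
```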
  • The video stream includes a first video stream of the driver area, the target object includes the driver, and determining the first seat area of the target object according to each of the plurality of video frames includes: in response to detecting a face in the driver area according to each video frame, determining that the first seat area of the target object is the driver area; and/or the video stream includes a second video stream of the occupant area in the cabin, the target object includes the driver and/or passengers, and determining the first seat area of the target object according to each of the plurality of video frames includes: performing face detection on each of the plurality of video frames; and determining the first seat area of the target object according to the detected face position.
  • The method further includes: when the target object is in a speaking state, performing content recognition on the first sound signal and determining the voice content corresponding to the first sound signal; when the voice content includes a preset voice command, determining an area control function corresponding to the voice command according to the first seat area of the target object; and executing the area control function.
  • The video stream includes a first video stream of the driver area, the target object includes the driver, and determining the speaking detection result of the target object according to the lip movement recognition result and the first sound signal includes: when sound zone positioning of the first sound signal determines that the first sound signal comes from the driver area, determining whether the driver has lip movement according to the lip movement recognition result; and in response to the driver having no lip movement, determining that the speaking detection result of the target object is that the driver is in a non-speaking state.
  • Determining the lip movement recognition result of the target object according to the face areas of the target object in the N video frames includes: performing feature extraction on the face areas of the target object in the N video frames respectively to obtain N face features of the target object; fusing the N face features to obtain a face fusion feature of the target object; and performing lip movement recognition on the face fusion feature to obtain the lip movement recognition result of the target object.
  • Performing feature extraction on the face areas of the target object in the N video frames respectively to obtain N face features of the target object includes: for the i-th video frame among the N video frames, extracting face key points from the face area of the target object in the i-th video frame to obtain a plurality of face key points, where 1 ≤ i ≤ N; and normalizing the plurality of face key points to obtain the i-th face feature of the target object.
  • An object speaking detection device, including: a signal acquisition module, used to acquire the video stream in the cabin and the sound signal collected by the vehicle-mounted microphone; a face detection module, used to perform face detection on each of a plurality of video frames of the video stream and determine the face area of the target object in the vehicle in each video frame; a lip movement recognition module, used to determine the lip movement recognition result of the target object according to the face area of the target object in N video frames among the plurality of video frames, where N is an integer greater than 1; and a speaking detection module, used to determine a speaking detection result of the target object according to the lip movement recognition result and the first sound signal, wherein the first sound signal includes the sound signal in the time period corresponding to the N video frames, and the speaking detection result includes that the target object is in a speaking state or in a non-speaking state.
  • the speaking detection module is configured to: determine that the target object is in a speaking state when the lip movement recognition result is lip movement and the first sound signal includes voice .
  • The device further includes: a content identification module, configured to perform content identification on the first sound signal when the target object is in a speaking state and determine the voice content corresponding to the first sound signal; and a function execution module, configured to execute a control function corresponding to the voice command when the voice content includes a preset voice command.
  • The target object includes a driver, wherein the function execution module is configured to: when the voice command corresponds to a plurality of directional control functions, determine the gaze direction of the target object according to the face area of the target object in the N video frames; determine a target control function from the plurality of control functions according to the gaze direction of the target object; and execute the target control function.
  • The video stream includes a first video stream of the driver area and/or a second video stream of the occupant area in the cabin; the face detection module is configured to: detect the driver's face based on each of a plurality of first video frames of the first video stream; and/or detect faces in the cabin based on each of a plurality of second video frames of the second video stream, and determine the driver's face in each second video frame according to the position of the detected faces in the cabin.
  • The signal acquisition module is configured to: acquire the first video stream of the driver area captured by the camera of the driver detection system (DMS); and/or acquire the second video stream of the occupant area in the cabin captured by the camera of the occupant detection system (OMS).
  • The device further includes: a seat area determining module, configured to determine a first seat area of the target object according to each of the plurality of video frames; wherein the speaking detection module is used to: when the first sound signal includes voice, perform sound zone positioning on the first sound signal and determine a second seat area corresponding to the first sound signal; and when the lip movement recognition result indicates lip movement and the first seat area is the same as the second seat area, determine that the target object is in a speaking state.
  • The video stream includes a first video stream of the driver area, the target object includes the driver, and the seat area determining module is configured to: in response to detecting a face in the driver area according to each video frame, determine that the first seat area of the target object is the driver area; and/or the video stream includes a second video stream of the occupant area in the cabin, the target object includes the driver and/or passengers, and the seat area determining module is configured to: perform face detection on each of the plurality of video frames; and determine the first seat area of the target object according to the detected face position.
  • The device further includes: a content identification module, configured to perform content identification on the first sound signal when the target object is in a speaking state and determine the voice content corresponding to the first sound signal; a function determination module, used to determine the area control function corresponding to the voice command according to the first seat area of the target object when the voice content includes a preset voice command; and an area function execution module, configured to execute the area control function.
  • The video stream includes a first video stream of the driver area, the target object includes the driver, and the speaking detection module is configured to: when it is determined that the first sound signal comes from the driver area, determine whether the driver has lip movement according to the lip movement recognition result; and in response to the driver having no lip movement, determine that the speaking detection result of the target object is that the driver is in a non-speaking state.
  • the lip movement recognition module is configured to: perform feature extraction on the face regions of the target object in the N video frames respectively, to obtain N face features of the target object ; Fusing the N face features to obtain the face fusion features of the target object; performing lip movement recognition on the face fusion features to obtain a lip movement recognition result of the target object.
  • Performing feature extraction on the face areas of the target object in the N video frames respectively to obtain N face features of the target object includes: for the i-th video frame among the N video frames, extracting face key points from the face area of the target object in the i-th video frame to obtain a plurality of face key points, where 1 ≤ i ≤ N; and normalizing the plurality of face key points to obtain the i-th face feature of the target object.
  • an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to call the instructions stored in the memory to execute the above-mentioned method.
  • a computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the above method is implemented.
  • a computer program including computer readable codes, and when the computer readable codes are run in an electronic device, a processor in the electronic device executes the above method.
  • The video stream and sound signal in the cabin can be acquired; face detection is performed on each of multiple video frames of the video stream to determine the face area of the object; the lip movement recognition result is determined from the face area in the video frames; and whether the object is speaking is judged from the lip movement recognition result together with the sound signal, thereby improving the accuracy of object speaking detection and reducing the false positive rate of speech recognition.
  • Fig. 1 shows a flow chart of a method for detecting a speaking of an object according to an embodiment of the present disclosure.
  • FIG. 2 shows a schematic diagram of lip movement recognition in the object speaking detection method according to an embodiment of the present disclosure.
  • Fig. 3 shows a block diagram of an object speaking detection device according to an embodiment of the present disclosure.
  • Fig. 4 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
  • Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
  • In in-vehicle voice interaction, the voice detection function usually runs in real time in the vehicle, and its false alarm rate needs to be kept at a very low level.
  • a signal detection method based on pure voice is usually used, and it is difficult to suppress voice false alarms, resulting in a high false alarm rate and poor user interaction experience.
  • Computer vision technology can be used to process the video stream in the cabin, determine the object in the video frames, and identify the lip movement state of the object; the lip movement state and the sound signal are then used jointly to determine whether someone is speaking, thereby improving the accuracy of object speaking detection, reducing the false positive rate of speech recognition, and improving user experience.
  • The object speaking detection method may be performed by electronic equipment such as a terminal device or a server. The terminal device may be a vehicle-mounted device, user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a wearable device, etc.
  • the method can be implemented by calling a computer-readable instruction stored in a memory by a processor.
  • The vehicle-mounted device may be the in-vehicle head unit, domain controller, or processor in the cabin, or a device host in the DMS (Driver Monitor System, driver detection system) or OMS (Occupant Monitoring System, occupant detection system) that performs data processing operations such as image processing.
  • Fig. 1 shows a flow chart of a method for detecting a subject's speaking according to an embodiment of the present disclosure. As shown in Fig. 1, the method for detecting a subject's speaking includes:
  • In step S11, the video stream in the cabin and the sound signal collected by the vehicle-mounted microphone are obtained;
  • In step S12, face detection is performed on each of the multiple video frames of the video stream, and the face area of the target object in the vehicle is determined in each video frame;
  • In step S13, the lip movement recognition result of the target object is determined according to the face area of the target object in N video frames among the plurality of video frames, where N is an integer greater than 1;
  • In step S14, the speaking detection result of the target object is determined according to the lip movement recognition result and the first sound signal, wherein the first sound signal includes the sound signal in the time period corresponding to the N video frames, and the speaking detection result includes that the target object is in a speaking state or in a non-speaking state.
  • embodiments of the present disclosure may be applied to any type of vehicle, such as passenger cars, taxis, shared cars, buses, freight vehicles, subways, trains, and the like.
  • the video stream in the vehicle cabin may be collected through the vehicle camera, and the sound signal may be collected through the vehicle microphone.
  • the vehicle-mounted camera can be any camera installed in the vehicle, the number can be one or more, and the type can be DMS camera, OMS camera, common camera, etc.
  • the vehicle-mounted microphone can also be arranged at any position in the vehicle, and the number can be one or more. The present disclosure does not limit the location, quantity and type of the vehicle-mounted camera and the vehicle-mounted microphone.
  • multiple video frames may be obtained from the video stream, and the multiple video frames may be a group of continuous video frames in the video stream; they may also be obtained by sampling a video frame sequence of the video stream.
  • In step S12, face detection can be performed on each of the plurality of video frames to determine the face frames (bounding boxes) in each video frame; the face frames are then tracked across video frames to determine which face frames belong to the same identity, so as to determine the face frame of each object in the vehicle in each video frame.
  • the way of face detection can be, for example, face key point recognition, face contour detection, etc.
  • Face tracking can, for example, determine objects belonging to the same identity according to the intersection over union (IoU) of face frames in adjacent video frames.
  • face detection and tracking can be implemented in any manner in the related art, which is not limited in the present disclosure.
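  • As one possible illustration of tracking by overlap between face frames in adjacent video frames, the sketch below computes the intersection over union of two boxes and greedily carries identities forward. The box layout `(x1, y1, x2, y2)`, the greedy matching strategy, the function names, and the 0.5 threshold are assumptions for the example; the disclosure does not prescribe them.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels, assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_faces(prev: Dict[int, Box], current: List[Box],
                threshold: float = 0.5) -> Dict[int, Box]:
    """Give each current box the identity of the best-overlapping previous box.

    Boxes with no previous box at or above the (hypothetical) threshold
    receive a fresh identity.
    """
    assigned: Dict[int, Box] = {}
    next_id = max(prev, default=-1) + 1
    for box in current:
        best_id, best_iou = None, threshold
        for track_id, prev_box in prev.items():
            if track_id in assigned:
                continue  # identity already taken in this frame
            score = iou(prev_box, box)
            if score >= best_iou:
                best_id, best_iou = track_id, score
        if best_id is None:
            best_id = next_id
            next_id += 1
        assigned[best_id] = box
    return assigned
```

  • Greedy matching is enough for a small, mostly static set of cabin occupants; a production tracker might use an optimal assignment instead.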
  • the face area of each object may be obtained.
  • the face area of the target object in N video frames among the plurality of video frames may be determined, where N is an integer greater than 1. That is, N video frames corresponding to a certain duration (for example, 2s) are selected from multiple video frames for subsequent lip movement detection.
  • the N video frames may be the latest N video frames collected in the video stream. N may be, for example, 10, 15, 20, etc., which is not limited in the present disclosure.
  • If the shape of a person's lips changes within a certain period of time (for example, 2 s), lip movement can be considered to have occurred; this period of time is the duration corresponding to the N video frames.
  • the duration can be set as 1s, 2s or 3s, for example, which is not limited in the present disclosure.
  • the lip movement recognition result of the target object's lips may be determined according to the face area of the target object in the N video frames.
  • the lip movement recognition result includes lip movement of the target object or no lip movement.
  • the area images of the face area of the target object in the N video frames may be directly input into a preset lip movement recognition network for processing, and a lip movement recognition result is output. It is also possible to perform feature extraction on the regional images of the face area of the target object in N video frames to obtain face features; input the face features into the preset lip movement recognition network for processing, and output the lip movement recognition results.
  • the present disclosure does not limit the specific processing manner.
  • the lip movement recognition network may be, for example, a convolutional neural network, and the present disclosure does not limit the specific network structure of the lip movement recognition network.
  • the speech detection result of the target object may be jointly determined according to the lip movement recognition result of the target object and the first sound signal of the time period corresponding to the N video frames.
  • The first sound signal includes the sound signal of the time period corresponding to the N video frames. For example, if the time period corresponding to the N video frames is the latest 2 s (from 2 s ago until now), the first sound signal is likewise the sound signal of the latest 2 s.
  • Voice detection is performed on the first sound signal to determine whether the first sound signal includes voice, and the present disclosure does not limit the implementation of voice detection.
  • The speaking detection result of the target object includes that the target object is in a speaking state or in a non-speaking state. For example, if the lip movement recognition result indicates lip movement and the first sound signal includes voice, the target object is considered to be speaking; if the first sound signal includes voice but the lip movement recognition result indicates no lip movement, the target object is considered not to be speaking; if the lip movement recognition result indicates lip movement but the first sound signal does not include voice, the target object is likewise considered not to be speaking.
  • the present disclosure does not limit the specific determination method.
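  • The example rule above reduces to a logical AND of the two cues. A minimal sketch follows; the enum and function names are hypothetical, and the two boolean inputs are assumed to come from a lip movement recognizer and a voice detector that are not shown here.

```python
from enum import Enum


class SpeakingState(Enum):
    SPEAKING = "speaking"
    NOT_SPEAKING = "not speaking"


def detect_speaking(lip_movement: bool, contains_voice: bool) -> SpeakingState:
    """Combine the visual cue and the audio cue for one time window.

    lip_movement:   lip movement recognition result for the N video frames.
    contains_voice: whether the first sound signal (same time window) includes voice.
    """
    if lip_movement and contains_voice:
        return SpeakingState.SPEAKING
    # Voice without lip movement, or lip movement without voice: not speaking.
    return SpeakingState.NOT_SPEAKING
```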
  • In this way, the video stream and sound signal in the cabin can be obtained; face detection is performed on each of multiple video frames of the video stream to determine the face area of the object; the lip movement recognition result is determined from the face area in the video frames; and whether the object is speaking is judged from the lip movement recognition result together with the sound signal, thereby improving the accuracy of object speaking detection and reducing the false positive rate of speech recognition.
  • step S11 the video stream in the cabin collected by the vehicle camera and the sound signal collected by the vehicle microphone can be obtained.
  • step S11 may include:
  • the on-board camera may include, for example, a driver detection system DMS camera, and/or an occupant detection system OMS camera.
  • the video stream collected by the DMS camera is the video stream of the driver's area (called the first video stream), and the video stream collected by the OMS camera is the video stream of the occupant area in the cabin (called the second video stream).
  • the video stream acquired in step S11 may include the first video stream and/or the second video stream.
  • The first video stream of the driver area and the second video stream of the occupant area in the cabin may also be captured by a camera installed in the cabin that is not dedicated to driver detection or occupant detection.
  • the above-mentioned first video stream may be obtained by intercepting the video information of the driver area in the first video stream.
  • the video streams of different areas in the cabin can be obtained for subsequent processing respectively, thereby improving the comprehensiveness of object speech detection.
  • performing face detection on each of the multiple video frames of the video stream in step S12 may include:
  • Based on each of a plurality of second video frames of the second video stream, faces in the vehicle cabin are detected, and the driver's face in each second video frame is determined according to the position of the detected faces in the cabin.
  • the first video stream and the second video stream may be processed respectively.
  • A plurality of first video frames may be obtained from the first video stream; they may be a group of consecutive video frames, or a group of video frames obtained by sampling the video frame sequence of the first video stream.
  • the first video frame corresponds to a driver area, and the area includes only the driver.
  • face detection and tracking may be performed on each of the multiple first video frames to obtain the driver's face area in each first video frame.
  • A plurality of second video frames may be obtained from the second video stream; they may be a group of consecutive video frames, or a group of video frames obtained by sampling the video frame sequence of the second video stream.
  • the second video frame corresponds to the occupant area in the vehicle cabin, including the driver and/or passengers.
  • face detection and tracking can be performed on each second video frame in a plurality of second video frames to obtain the face area of each occupant in the cabin in each second video frame; and according to The position of each occupant's face area in each second video frame determines the driver's face and each passenger's face in each second video frame.
  • For example, if the camera capturing the second video stream is mounted at the top front of the cabin (such as the top of the front windshield) and faces into the cabin, the face at the lower right position in the second video frame can be determined to be the driver's face.
  • the present disclosure does not limit the specific manner of determining each occupant.
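  • For the camera placement in the example above, picking the driver's face could look like the sketch below. The rule used here (keep faces whose center lies in the lower-right quadrant, then take the one nearest the lower-right corner) is an assumed heuristic for illustration only, as are the function name and the `(x1, y1, x2, y2)` box layout with the origin at the top-left of the frame.

```python
import math
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), origin at top-left


def pick_driver_face(boxes: List[Box], width: float, height: float) -> Optional[Box]:
    """Return the face box nearest the lower-right corner of the frame, if any."""
    best, best_dist = None, math.inf
    for box in boxes:
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        if cx < width / 2 or cy < height / 2:
            continue  # center is outside the lower-right quadrant
        dist = math.hypot(width - cx, height - cy)
        if dist < best_dist:
            best, best_dist = box, dist
    return best
```

  • A different camera position or a left-hand versus right-hand drive vehicle would change which region corresponds to the driver, so the region test would need to be configured per installation.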
  • the face of the driver in the first video stream can be determined, and the faces of different occupants (driver and/or passenger) in the second video stream can be determined, so that lip movement recognition and speech detection can be performed respectively and corresponding responses, further improving the accuracy of speech detection and making subsequent responses more targeted.
  • step S13 may include:
  • Lip movement recognition is performed on the face fusion feature to obtain a lip movement recognition result of the target object.
  • feature extraction may be performed on images of face regions of the target object in N video frames to obtain face features of the target object in N video frames.
  • the manner of feature extraction may be, for example, human face key point extraction.
  • the step of respectively performing feature extraction on the face regions of the target object in the N video frames may include:
  • Face key points can be extracted from the face area of the target object in the i-th video frame to obtain multiple face key points (landmarks-i), wherein the number of face key points can be, for example, 106, which is not limited in the present disclosure.
  • The mean mean-i and standard deviation std-i of the multiple face key points landmarks-i can be computed, and the key points can then be normalized by formula (1) to obtain the normalized face key points Landmarks-i-normalize:
  • $$\text{Landmarks-}i\text{-normalize} = \frac{\text{landmarks-}i - \text{mean-}i}{\text{std-}i} \tag{1}$$
  • The normalized face key points Landmarks-i-normalize can be used directly as the i-th face feature of the target object; alternatively, a subset of the key points in Landmarks-i-normalize can be selected as the i-th face feature, for example the key points in the lower half of the face or the mouth key points. The present disclosure does not limit this.
  • Normalizing in this way makes the face features more standardized, which can improve the accuracy of subsequent lip movement recognition.
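  • As an illustrative sketch of formula (1), the following Python snippet normalizes the key points of one frame and optionally keeps a subset of them. The function names, the (x, y) tuple representation, and the choice to compute the mean and standard deviation separately per coordinate axis are assumptions of this sketch; the disclosure does not specify them.

```python
from statistics import fmean, pstdev


def normalize_landmarks(landmarks):
    """Apply formula (1): subtract the mean and divide by the standard deviation.

    `landmarks` is a list of (x, y) key points for one frame. Statistics are
    computed per coordinate axis, which is an assumption of this sketch.
    """
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    mean_x, mean_y = fmean(xs), fmean(ys)
    # Fall back to 1.0 when all points coincide, to avoid dividing by zero.
    std_x = pstdev(xs) or 1.0
    std_y = pstdev(ys) or 1.0
    return [((x - mean_x) / std_x, (y - mean_y) / std_y) for x, y in landmarks]


def select_subset(normalized, indices):
    """Keep only some key points, e.g. hypothetical mouth indices."""
    return [normalized[i] for i in indices]


frame_landmarks = [(0.0, 0.0), (2.0, 4.0), (4.0, 8.0)]
feature = normalize_landmarks(frame_landmarks)
mouth_feature = select_subset(feature, [1, 2])
```

  • After normalization the key points of a frame have zero mean and unit spread on each axis, so the feature no longer depends on where the face sits in the image or how large it appears.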
  • N video frames are respectively processed to obtain N facial features of the target object.
  • The first N-1 video frames among the N video frames may already have been processed in a previous round of lip movement recognition.
  • In this case, feature extraction can be performed only on the face area of the target object in the latest (N-th) video frame to obtain the N-th face feature, and the N-1 face features already obtained in previous processing can be read, so as to obtain the N face features of the target object. In this way, the amount of calculation can be reduced and the processing efficiency improved.
  • The N face features may be fused to obtain the face fusion feature of the target object. For example, the face features can be sorted in frame order as landmarks-1-normalize, landmarks-2-normalize, ..., landmarks-N-normalize, and the N face features can be spliced or superimposed to obtain the face fusion feature, denoted as face-video-feature.
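  • The reuse of earlier per-frame features and the splicing step can be sketched as a sliding window. The class name FeatureWindow and the flat-list representation of a feature are hypothetical; splicing is shown here as simple concatenation in frame order.

```python
from collections import deque


class FeatureWindow:
    """Keeps the most recent N per-frame face features (hypothetical helper)."""

    def __init__(self, n):
        self.n = n
        # With maxlen set, appending to a full deque drops the oldest feature.
        self.features = deque(maxlen=n)

    def push(self, feature):
        """Add the feature of the newest frame; earlier features are reused."""
        self.features.append(feature)

    def fused(self):
        """Concatenate the N features in frame order, or None if not yet full."""
        if len(self.features) < self.n:
            return None
        return [value for feature in self.features for value in feature]


window = FeatureWindow(n=3)
for frame_feature in ([0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]):
    window.push(frame_feature)
face_video_feature = window.fused()
```

  • Only the newest frame needs feature extraction on each step; the other N-1 features are already in the window.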
  • lip movement recognition may be performed on face fusion features to obtain a lip movement recognition result of the target object.
  • A lip movement recognition neural network can be preset; the face fusion feature is input into the lip movement recognition neural network for processing, and the network outputs the lip movement recognition result of the target object.
  • The lip movement recognition neural network can be, for example, a convolutional neural network that includes multiple fully connected layers, a softmax layer, etc., and performs binary classification of the face fusion feature.
  • The face fusion feature is input into the fully connected layers of the lip movement recognition neural network to obtain a two-dimensional output, whose two components correspond to the occurrence and the absence of lip movement respectively; after processing by the softmax layer, a normalized score or confidence is obtained.
  • a preset threshold (for example, set to 0.8) may be set for the score or confidence of lip movement. If the preset threshold is exceeded, it is determined that the target object is in a speaking state; otherwise, it is determined that the target object is in a non-speaking state.
  • the present disclosure does not limit the network structure, training method and specific value of the preset threshold of the lip movement recognition neural network.
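  • The final softmax and threshold step can be sketched as follows. The two raw outputs are assumed to be ordered as [lip movement, no lip movement]; this ordering, the function names, and the example values are assumptions of this sketch, and the threshold 0.8 is only the example value mentioned above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def lip_movement_decision(logits, threshold=0.8):
    """Return (exceeds_threshold, score) for a two-dimensional network output.

    logits[0] is assumed to be the raw output for "lip movement" and
    logits[1] the raw output for "no lip movement".
    """
    score = softmax(logits)[0]
    return score > threshold, score


moving, score = lip_movement_decision([2.0, 0.0])
still, low_score = lip_movement_decision([1.0, 0.0])
```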
  • FIG. 2 shows a schematic diagram of lip movement recognition in the object speaking detection method according to an embodiment of the present disclosure.
  • As shown in FIG. 2, face detection can be performed on the N video frames respectively to determine the face area of the target object in each of the N video frames; feature extraction is performed on the face areas of the target object in the N video frames respectively to obtain N face features; the N face features are fused to obtain the face fusion feature of the target object; and the face fusion feature is input into the lip movement recognition neural network for processing, which outputs the lip movement recognition result of the target object, namely lip movement or no lip movement.
  • In this way, lip movement recognition of the target object can be realized, which assists in judging whether the object is speaking and improves the accuracy of speech detection.
  • step S14 may include:
  • in the case that the lip movement recognition result is lip movement and the first sound signal includes speech, determining that the target object is in a speaking state.
  • That is, if the lip movement recognition result is lip movement and the first sound signal includes speech, the target object is considered to be in a speaking state. The target object is judged to be speaking only when both conditions, lip movement and detected voice, are met; if only one condition is satisfied, or neither is satisfied, the target object is judged not to be speaking.
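  • The decision rule is a logical AND of the two modalities. A minimal sketch, with hypothetical function and parameter names:

```python
def is_speaking(lip_movement, voice_detected):
    """Speaking only if lip movement occurs AND the sound signal contains voice."""
    return lip_movement and voice_detected


# Enumerate the four combinations of the two conditions.
truth_table = {
    (lip, voice): is_speaking(lip, voice)
    for lip in (True, False)
    for voice in (True, False)
}
```

  • A voice with no lip movement (for example, audio from a loudspeaker) and lip movement with no voice (for example, chewing) both map to the non-speaking state.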
  • the object speaking detection method according to the embodiment of the present disclosure may further include:
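  • in the case that the target object is in a speaking state, performing content recognition on the first sound signal to determine the voice content corresponding to the first sound signal;
  • in the case that the voice content includes a preset voice command, executing a control function corresponding to the voice command.
  • For example, if the target object is in a speaking state, the speech recognition function can be activated to perform content recognition on the first sound signal and determine the voice content corresponding to the first sound signal.
  • The present disclosure does not limit the implementation of speech recognition.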
  • various voice commands may be preset. If the recognized voice content includes a preset voice command, the control function corresponding to the voice command can be executed. For example, if it recognizes that the voice content includes the voice command "play music", it can control the car music player to play music; if it recognizes the voice content includes the voice command "open the left window", it can control the left window to open.
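  • Matching recognized voice content against preset commands can be sketched as a lookup table. The table contents, the action identifiers, and the substring-matching rule are assumptions of this sketch; a real system would call its own vehicle control interfaces.

```python
# Hypothetical mapping from preset voice commands to control-function names.
PRESET_COMMANDS = {
    "play music": "music_player.play",
    "open the left window": "window.left.open",
}


def find_control_function(voice_content, commands=PRESET_COMMANDS):
    """Return the control function for the first preset command found, else None."""
    text = voice_content.lower()
    for phrase, action in commands.items():
        if phrase in text:
            return action
    return None


action = find_control_function("Please play music now")
no_action = find_control_function("what time is it")
```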
  • In this way, voice interaction with the occupants in the vehicle can be realized, so that users can invoke various control functions by voice, which improves convenience and the user experience.
  • the identity of the target object may not be distinguished, that is, if the target object is judged to be speaking, speech recognition is started and a corresponding control function is executed.
  • The identity of the target object can also be distinguished. For example, the system may respond only to the driver's voice, performing voice recognition when the driver is judged to be speaking and not responding to a passenger's voice; or, according to the seat area where a passenger is located, it may perform voice recognition when that passenger is judged to be speaking and execute an area control function for the passenger's seat area.
  • the video stream includes the first video stream of the driver's area, and the target object includes the driver.
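  • Step S14 may include:
  • in the case that sound zone positioning determines that the first sound signal comes from the driver's area, determining whether the driver has lip movement according to the lip movement recognition result;
  • in response to the driver not having lip movement, determining that the speaking detection result of the target object is: the driver is in a non-speaking state.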
  • the first video frame of the first video stream is an image of the driver's area.
  • Face detection and tracking are performed in step S12 to obtain the driver's face area in each first video frame.
  • lip movement recognition can be performed in step S13 to obtain the driver's lip movement recognition result, that is, lip movement occurs or does not occur.
  • sound zone positioning may be performed on the first sound signal to determine an area corresponding to the first sound signal, for example, a driver area or a passenger area.
  • the present disclosure does not limit the implementation of sound zone positioning.
  • If sound zone positioning on the first sound signal determines that the first sound signal comes from the driver's area, whether the driver has lip movement can be determined according to the lip movement recognition result. If the driver has no lip movement, it can be judged that the driver is not speaking, that is, the driver is in a non-speaking state. In this case, the voice may be left without a response.
  • Conversely, if the driver has lip movement, it can be judged that the driver is in a speaking state, and the voice recognition function can be activated to respond to the voice.
  • the video stream includes a first video stream of a driver area, and/or a second video stream of a passenger area in a vehicle cabin, and the target object includes a driver.
  • In step S12, face detection and tracking may be performed on the first video frames of the first video stream to obtain the face area of the driver in each first video frame.
  • Face detection and tracking can also be performed on the second video frames of the second video stream to determine the face areas, and the driver's face in each second video frame can be determined according to the positions of the face areas.
  • Lip movement recognition can then be performed in step S13 to obtain the driver's lip movement recognition result, that is, lip movement occurs or does not occur, and the driver's speaking state is determined in step S14. If the driver is speaking, the voice recognition function is started to determine the voice content corresponding to the first sound signal.
  • the step of executing the control function corresponding to the voice command may include:
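  • in the case that the voice command corresponds to a plurality of directional control functions, determining the gaze direction of the target object according to the face area of the target object in the N video frames;
  • determining a target control function from the plurality of control functions according to the gaze direction of the target object;
  • executing the target control function.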
  • A voice command may correspond to multiple directional control functions. For example, the voice command "open the window" may correspond to windows in the left and right directions, in which case the multiple control functions include "open the left window" and "open the right window"; it may also correspond to windows in the four directions of left front, left rear, right front and right rear, in which case the multiple control functions include "open the left front window", "open the right front window", "open the left rear window" and "open the right rear window".
  • the corresponding control function can be determined in conjunction with image recognition.
  • the gaze direction of the target object may be determined according to the face areas of the target object in N video frames.
  • For example, feature extraction can be performed on the images of the face areas of the target object in the N video frames respectively to obtain the face features of the target object in the N video frames; the N face features are fused to obtain the face fusion feature of the target object; and the face fusion feature is input into a preset gaze direction recognition network for processing to obtain the gaze direction of the target object, that is, the direction in which the target object's eyes are looking.
  • the gaze direction recognition network may be, for example, a convolutional neural network, including a convolutional layer, a fully connected layer, a softmax layer, and the like.
  • the disclosure does not limit the network structure and training method of the gaze direction recognition network.
  • The target control function may be determined from the multiple control functions according to the gaze direction of the target object. For example, if the voice command is "open the car window" and the gaze direction of the target object is toward the right, the target control function can be determined to be "open the car window on the right". The target control function can then be executed, for example opening the right-hand window.
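  • Selecting among directional control functions by gaze direction can be sketched as a nested lookup. The table, the direction labels "left" and "right", and the function name are assumptions of this sketch; the gaze direction itself would come from the gaze direction recognition network.

```python
# Hypothetical table: voice command -> gaze direction -> directional control function.
DIRECTIONAL_FUNCTIONS = {
    "open the window": {
        "left": "open the left window",
        "right": "open the right window",
    },
}


def select_target_function(voice_command, gaze_direction):
    """Pick the directional control function matching the gaze direction.

    Returns None when the command is not directional or the gaze direction
    does not match any candidate.
    """
    candidates = DIRECTIONAL_FUNCTIONS.get(voice_command)
    if candidates is None:
        return None
    return candidates.get(gaze_direction)


target = select_target_function("open the window", "right")
```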
  • the video stream includes a first video stream of the driver area, and/or a second video stream of the passenger area in the vehicle cabin, and the target objects include the driver and/or the passenger.
  • the object speaking detection method according to the embodiment of the present disclosure may further include:
  • a first seating area of the target object is determined according to each of the plurality of video frames.
  • The seat area of each occupant (referred to as the first seat area) can be determined according to the position of that occupant's face area in each of the plurality of video frames.
  • The first seat area includes the driver area or a passenger area. The passenger area may include the co-pilot area, the left rear seat area, the right rear seat area, etc. The present disclosure does not limit the division of seat areas.
  • The video stream includes a first video stream of the driver area, the target object includes the driver, and determining the first seat area of the target object according to each of the plurality of video frames includes:
  • in response to detecting a human face in the driver area according to each video frame, determining that the first seat area of the target object is the driver area.
  • The video stream includes a second video stream of the occupant area in the vehicle cabin, the target object includes the driver and/or passengers, and determining the first seat area of the target object according to each of the plurality of video frames includes:
  • performing face detection on each of the plurality of video frames;
  • determining the first seat area of the target object according to the detected face position.
  • In the case where the video stream includes the first video stream of the driver's area, if a human face is detected in a video frame, it can be directly determined that the first seat area of the target object is the driver area, which completes the process of determining the first seat area.
  • In the case where the video stream includes the second video stream, face detection and tracking can be performed on each of the multiple video frames to obtain the face area of each occupant in each video frame, and the first seat area of the target object can be determined according to the detected face position. For example, the face at the lower right position in the video frame can be determined to be the driver's face, and the first seat area of that target object is the driver area; the face at the lower left position in the video frame can be determined to be the face of the co-pilot passenger, and the first seat area of that target object is the co-pilot area.
  • the present disclosure does not limit the specific manner of determining each occupant.
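  • The example rule above (lower right is the driver, lower left is the co-pilot) can be sketched from the center of a detected face box. The box format, the use of the frame midlines as boundaries, and the "unknown" result for the upper half are assumptions of this sketch; the actual layout depends on camera placement.

```python
def seat_area_from_face(box, frame_width, frame_height):
    """Map a face box to a seat area using the example rule from the text.

    `box` is (x_min, y_min, x_max, y_max) in pixels, with y growing downward.
    """
    center_x = (box[0] + box[2]) / 2
    center_y = (box[1] + box[3]) / 2
    if center_y < frame_height / 2:
        # The text gives no rule for the upper half of the frame.
        return "unknown"
    return "driver" if center_x >= frame_width / 2 else "co-pilot"


driver_area = seat_area_from_face((900, 500, 1100, 700), 1280, 720)
copilot_area = seat_area_from_face((100, 500, 300, 700), 1280, 720)
```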
  • the speaking state of the target object can be jointly judged based on the first seating area in step S14.
  • step S14 may include:
  • in the case that the first sound signal includes speech, performing sound zone positioning on the first sound signal to determine a second seat area corresponding to the first sound signal;
  • in the case that the lip movement recognition result is lip movement and the first seat area is the same as the second seat area, determining that the target object is in a speaking state.
  • For example, if the first sound signal includes speech, sound zone positioning can be performed on the first sound signal to determine the seat area corresponding to the first sound signal (referred to as the second seat area), such as the driver area or a passenger area.
  • the present disclosure does not limit the implementation of sound zone positioning.
  • If the lip movement recognition result of the target object is lip movement and the first seat area is the same as the second seat area (that is, the seat area located through the image is consistent with the seat area located through the sound zone), it can be determined that the target object is in a speaking state; the voice recognition function can then be activated to respond to the voice.
  • In this way, the object is judged to be speaking only when the seat area located through the image is consistent with the seat area located through the sound zone and lip movement occurs, which further improves the accuracy of speaking detection; the speaking object can also be identified so that targeted control functions can be performed, further improving user convenience.
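  • Combining the image-based seat area, the sound zone, and the lip movement result can be sketched as below. The dictionary of per-seat lip movement results and the seat labels are hypothetical; sound zone positioning and voice detection are assumed to be provided by other components.

```python
def speaking_seat(voice_detected, sound_zone, lip_movement_by_seat):
    """Return the seat area of the speaking occupant, or None.

    `lip_movement_by_seat` maps each first seat area (located from the image)
    to that occupant's lip movement result; `sound_zone` is the second seat
    area located from the sound signal.
    """
    if not voice_detected:
        return None
    # Speaking requires lip movement in the same seat area the sound came from.
    if lip_movement_by_seat.get(sound_zone, False):
        return sound_zone
    return None


lips = {"driver": False, "left rear": True}
speaker = speaking_seat(True, "left rear", lips)
```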
  • the object speaking detection method according to the embodiment of the present disclosure may further include:
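  • in the case that the target object is in a speaking state, performing content recognition on the first sound signal to determine the voice content corresponding to the first sound signal;
  • in the case that the voice content includes a preset voice command, determining the area control function corresponding to the voice command according to the first seat area of the target object;
  • executing the area control function.
  • For example, if the target object is in a speaking state, the speech recognition function can be activated to perform content recognition on the first sound signal and determine the voice content corresponding to the first sound signal.
  • The present disclosure does not limit the implementation of speech recognition.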
  • the area control function corresponding to the voice command may be determined according to the first seating area of the target object. For example, if it is recognized that the voice content includes the voice command "open the window", and the first seat area of the target object is the left rear seat area, then it can be determined that the corresponding area control function is "open the left rear window". In turn, this area control function can be performed, for example controlling the opening of the left rear side window.
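  • Resolving a voice command to an area control function by seat area can be sketched as a lookup keyed on both values. The table entries beyond the "left rear" example in the text, and the function name, are assumptions of this sketch.

```python
# Hypothetical table: (voice command, first seat area) -> area control function.
AREA_FUNCTIONS = {
    ("open the window", "left rear"): "open the left rear window",
    ("open the window", "right rear"): "open the right rear window",
}


def area_control_function(voice_command, first_seat_area):
    """Return the area control function for this speaker's seat area, else None."""
    return AREA_FUNCTIONS.get((voice_command, first_seat_area))


function_name = area_control_function("open the window", "left rear")
```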
  • According to the embodiments of the present disclosure, the video stream and the sound signal in the cabin can be obtained; face detection is performed on each of multiple video frames of the video stream to determine the face area of the object; the lip movement recognition result is determined according to the face area in N video frames; and whether the object is speaking is judged based on the lip movement recognition result and the sound signal, thereby improving the accuracy of speaking detection and reducing the false positive rate of speech recognition.
  • The object speaking detection method can be applied to an intelligent cabin perception system. It can analyze the lip movement status of the occupants in the cabin with an intelligent video analysis algorithm and combine this with the voice signal to judge whether someone in the cabin is speaking, so as to effectively avoid false alarms caused by relying solely on voice signals, ensure that voice recognition can be triggered normally, and improve the user interaction experience.
  • The present disclosure also provides an object speaking detection device, electronic equipment, a computer-readable storage medium, and a program, all of which can be used to implement any object speaking detection method provided in the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding records in the method section, which are not repeated here.
  • FIG. 3 shows a block diagram of an object speaking detection device according to an embodiment of the present disclosure. As shown in FIG. 3 , the device includes:
  • a signal acquisition module 31, configured to acquire the video stream in the cabin and the sound signal collected by the vehicle-mounted microphone;
  • a face detection module 32, configured to perform face detection on each of a plurality of video frames of the video stream and determine the face area of the target object in the cabin in each video frame;
  • a lip movement recognition module 33, configured to determine the lip movement recognition result of the target object according to the face area of the target object in N video frames among the plurality of video frames, where N is an integer greater than 1;
  • a speaking detection module 34, configured to determine the speaking detection result of the target object according to the lip movement recognition result and the first sound signal, wherein the first sound signal includes the sound signal of the time period corresponding to the N video frames, and the speaking detection result includes that the target object is in a speaking state or in a non-speaking state.
  • The speaking detection module is configured to: determine that the target object is in a speaking state when the lip movement recognition result is lip movement and the first sound signal includes voice.
  • the device further includes:
  • a content recognition module configured to perform content recognition on the first sound signal when the target object is speaking, and determine the speech content corresponding to the first sound signal;
  • a function executing module configured to execute a control function corresponding to the voice command when the voice content includes a preset voice command.
  • The target object includes a driver, and the function execution module is configured to: in the case that the voice command corresponds to multiple directional control functions, determine the gaze direction of the target object according to the face area of the target object in the N video frames; determine the target control function from the multiple control functions according to the gaze direction of the target object; and execute the target control function.
  • The video stream includes a first video stream of the driver area and/or a second video stream of the passenger area in the cabin, and the face detection module is configured to: detect the driver's face based on each of a plurality of first video frames of the first video stream; and/or detect human faces in the cabin based on each of a plurality of second video frames of the second video stream, and determine the driver's face in each second video frame according to the positions of the detected faces.
  • The signal acquisition module is configured to: acquire the first video stream of the driver's area captured by the camera of the driver monitoring system (DMS); and/or acquire the second video stream of the occupant area in the cabin captured by the camera of the occupant monitoring system (OMS).
  • the device further includes: a seat area determining module, configured to determine a first seat area of the target object according to each video frame in the plurality of video frames;
  • the speaking detection module is used for: in the case that the first sound signal includes speech, perform sound region positioning on the first sound signal, and determine the second seat area corresponding to the first sound signal; If the lip movement recognition result is lip movement and the first seating area is the same as the second seating area, it is determined that the target object is in a speaking state.
  • The video stream includes a first video stream of the driver area, the target object includes the driver, and the seat area determination module is configured to: in response to detecting a human face in the driver area according to each video frame, determine that the first seat area of the target object is the driver area.
  • The video stream includes a second video stream of the occupant area in the vehicle cabin, the target object includes the driver and/or a passenger, and the seat area determination module is configured to: perform face detection on each of the plurality of video frames; and determine the first seat area of the target object according to the detected face position.
  • The device further includes: a content recognition module, configured to perform content recognition on the first sound signal when the target object is in a speaking state and determine the voice content corresponding to the first sound signal; a function determination module, configured to determine the area control function corresponding to the voice command according to the first seat area of the target object when the voice content includes a preset voice command; and an area function execution module, configured to execute the area control function.
  • The video stream includes a first video stream of the driver's area, the target object includes the driver, and the speaking detection module is configured to: in the case that sound zone positioning determines that the first sound signal comes from the driver's area, determine whether the driver has lip movement according to the lip movement recognition result; and in response to the driver not having lip movement, determine that the speaking detection result of the target object is: the driver is in a non-speaking state.
  • The lip movement recognition module is configured to: perform feature extraction on the face regions of the target object in the N video frames respectively to obtain N face features of the target object; fuse the N face features to obtain the face fusion feature of the target object; and perform lip movement recognition on the face fusion feature to obtain the lip movement recognition result of the target object.
  • Performing feature extraction on the face regions of the target object in the N video frames respectively to obtain N face features of the target object includes: for the i-th video frame among the N video frames, performing face key point extraction on the face area of the target object in the i-th video frame to obtain a plurality of face key points, where 1 ≤ i ≤ N; and normalizing the plurality of face key points to obtain the i-th face feature of the target object.
  • The functions or modules included in the device provided by the embodiments of the present disclosure can be used to execute the methods described in the method embodiments above. For specific implementations, refer to the description of the method embodiments above; for brevity, details are not repeated here.
  • Embodiments of the present disclosure also provide a computer-readable storage medium, on which computer program instructions are stored, and the above-mentioned method is implemented when the computer program instructions are executed by a processor.
  • Computer readable storage media may be volatile or nonvolatile computer readable storage media.
  • An embodiment of the present disclosure also proposes an electronic device, including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to execute the above method.
  • An embodiment of the present disclosure also provides a computer program product, including computer-readable codes, or a non-volatile computer-readable storage medium carrying computer-readable codes; when the computer-readable codes run in a processor of an electronic device, the processor in the electronic device executes the above method.
  • An embodiment of the present disclosure also provides a computer program, including computer readable codes, and when the computer readable codes are run in an electronic device, a processor in the electronic device executes the above method.
  • Electronic devices may be provided as terminals, servers, or other forms of devices.
  • FIG. 4 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure.
  • the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
  • The electronic device 800 may include one or more of the following components: processing component 802, memory 804, power supply component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
  • the processing component 802 generally controls the overall operations of the electronic device 800, such as those associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802 .
  • the memory 804 is configured to store various types of data to support operations at the electronic device 800 . Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like.
  • The memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
  • the power supply component 806 provides power to various components of the electronic device 800 .
  • Power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 800 .
  • the multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action.
  • the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.
  • the audio component 810 is configured to output and/or input audio signals.
  • the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in operation modes, such as call mode, recording mode and voice recognition mode. Received audio signals may be further stored in memory 804 or sent via communication component 816 .
  • the audio component 810 also includes a speaker for outputting audio signals.
  • the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.
  • Sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of electronic device 800 .
  • For example, the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, such as the display and the keypad of the electronic device 800. The sensor component 814 can also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes in the electronic device 800.
  • Sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • Sensor assembly 814 may also include an optical sensor, such as a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications.
  • the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
  • the communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices.
  • the electronic device 800 can access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof.
  • the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology and other technologies.
  • In an exemplary embodiment, the electronic device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the methods described above.
  • In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to implement the above method.
  • FIG. 5 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure.
  • electronic device 1900 may be provided as a server.
  • electronic device 1900 includes processing component 1922 , which further includes one or more processors, and a memory resource represented by memory 1932 for storing instructions executable by processing component 1922 , such as application programs.
  • the application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions.
  • the processing component 1922 is configured to execute instructions to perform the above method.
  • The electronic device 1900 may also include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958.
  • The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Microsoft's server operating system (Windows Server™), Apple's graphical-user-interface-based operating system (Mac OS X™), the multi-user, multi-process operating system Unix™, the free and open-source Unix-like operating system Linux™, the open-source Unix-like operating system FreeBSD™, or the like.
  • In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to implement the above method.
  • the present disclosure can be a system, method and/or computer program product.
  • a computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove with instructions stored thereon, and any suitable combination of the above.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
  • Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • In cases involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
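  • In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), is personalized using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.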
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create an apparatus for implementing the functions/actions specified in one or more blocks of the flowchart and/or block diagram.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium; they cause computers, programmable data processing apparatuses, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions constitutes an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowchart and/or block diagram.
  • Each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical function.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified function or action, or by a combination of dedicated hardware and computer instructions.
  • the computer program product can be specifically realized by means of hardware, software or a combination thereof.
  • In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Image Analysis

Abstract

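The present disclosure relates to an object speech detection method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a video stream of a vehicle cabin and a sound signal collected by an in-vehicle microphone; performing face detection on each of a plurality of video frames of the video stream to determine a face region of a target object in the vehicle in each video frame; determining a lip movement recognition result for the lips of the target object according to the face regions of the target object in N of the video frames, N being an integer greater than 1; and determining a speech detection result of the target object according to the lip movement recognition result and a first sound signal, where the first sound signal includes the sound signal in the time period corresponding to the N video frames, and the speech detection result indicates that the target object is in a speaking state or in a non-speaking state.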

Description

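Object speech detection method and apparatus, electronic device, and storage medium
The present disclosure claims priority to Chinese patent application No. 202110735963.6, entitled "Object speech detection method and apparatus, electronic device, and storage medium", filed with the China Patent Office on June 30, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of intelligent vehicle cabins, and in particular to an object speech detection method and apparatus, an electronic device, and a storage medium.
Background
Cabin intelligence, which covers multimodal interaction, personalized services, safety perception, and the like, is an important direction in the current development of the automotive industry. Multimodal interaction in the cabin aims to provide occupants with a comfortable interactive experience; its means include speech recognition, gesture recognition, and so on. Among these, speech recognition holds a significant market share in the field of in-vehicle interaction.
However, there are multiple sound sources in the cabin, such as the audio system, noise generated by driving, and noise from outside the cabin, which interfere strongly with speech recognition.
Summary
The present disclosure proposes a technical solution for object speech detection.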
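According to one aspect of the present disclosure, an object speech detection method is provided, including: acquiring a video stream of a vehicle cabin and a sound signal collected by an in-vehicle microphone; performing face detection on each of a plurality of video frames of the video stream to determine a face region of a target object in the vehicle in each video frame; determining a lip movement recognition result for the lips of the target object according to the face regions of the target object in N of the video frames, N being an integer greater than 1; and determining a speech detection result of the target object according to the lip movement recognition result and a first sound signal, where the first sound signal includes the sound signal in the time period corresponding to the N video frames, and the speech detection result indicates that the target object is in a speaking state or in a non-speaking state.
In a possible implementation, determining the speech detection result of the target object according to the lip movement recognition result and the first sound signal includes: determining that the target object is in a speaking state when the lip movement recognition result indicates that lip movement has occurred and the first sound signal includes speech.
In a possible implementation, the method further includes: when the target object is in a speaking state, performing content recognition on the first sound signal to determine speech content corresponding to the first sound signal; and when the speech content includes a preset voice command, executing a control function corresponding to the voice command.
In a possible implementation, the target object includes a driver, and executing the control function corresponding to the voice command when the speech content includes a preset voice command includes: when the voice command corresponds to a plurality of directional control functions, determining a gaze direction of the target object according to the face regions of the target object in the N video frames; determining a target control function from the plurality of control functions according to the gaze direction of the target object; and executing the target control function.
In a possible implementation, the video stream includes a first video stream of a driver area and/or a second video stream of an occupant area in the cabin; and performing face detection on each of the plurality of video frames of the video stream includes: detecting the driver's face based on each of a plurality of first video frames of the first video stream; and/or detecting faces in the cabin based on each of a plurality of second video frames of the second video stream, and determining the driver's face in each second video frame according to the positions of the detected faces in the cabin.
In a possible implementation, acquiring the video stream of the cabin includes: acquiring the first video stream of the driver area collected by a driver monitoring system (DMS) camera; and/or acquiring the second video stream of the occupant area in the cabin collected by an occupant monitoring system (OMS) camera.
In a possible implementation, the method further includes: determining a first seat area of the target object according to each of the plurality of video frames; where determining the speech detection result of the target object according to the lip movement recognition result and the first sound signal includes: when the first sound signal includes speech, performing sound zone localization on the first sound signal to determine a second seat area corresponding to the first sound signal; and determining that the target object is in a speaking state when the lip movement recognition result indicates that lip movement has occurred and the first seat area is the same as the second seat area.
In a possible implementation, the video stream includes a first video stream of a driver area, the target object includes a driver, and determining the first seat area of the target object according to each of the plurality of video frames includes: in response to a face being detected in the driver area according to each video frame, determining that the first seat area of the target object is the driver area; and/or the video stream includes a second video stream of an occupant area in the cabin, the target object includes a driver and/or a passenger, and determining the first seat area of the target object according to each of the plurality of video frames includes: performing face detection on each of the plurality of video frames; and determining the first seat area of the target object according to the position of the detected face.
In a possible implementation, the method further includes: when the target object is in a speaking state, performing content recognition on the first sound signal to determine speech content corresponding to the first sound signal; when the speech content includes a preset voice command, determining an area control function corresponding to the voice command according to the first seat area of the target object; and executing the area control function.
In a possible implementation, the video stream includes a first video stream of a driver area, and the target object includes a driver; determining the speech detection result of the target object according to the lip movement recognition result and the first sound signal includes: when sound zone localization performed on the first sound signal determines that the first sound signal comes from the driver area, determining whether lip movement of the driver has occurred according to the lip movement recognition result; and in response to no lip movement of the driver having occurred, determining that the speech detection result of the target object is that the driver is in a non-speaking state.
In a possible implementation, determining the lip movement recognition result for the lips of the target object according to the face regions of the target object in N of the video frames includes: performing feature extraction separately on the face regions of the target object in the N video frames to obtain N face features of the target object; fusing the N face features to obtain a fused face feature of the target object; and performing lip movement recognition on the fused face feature to obtain the lip movement recognition result of the target object.
In a possible implementation, performing feature extraction separately on the face regions of the target object in the N video frames to obtain the N face features of the target object includes: for the i-th video frame of the N video frames, extracting face keypoints from the face region of the target object in the i-th video frame to obtain a plurality of face keypoints, 1≤i≤N; and normalizing the plurality of face keypoints to obtain the i-th face feature of the target object.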
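According to one aspect of the present disclosure, an object speech detection apparatus is provided, including: a signal acquisition module configured to acquire a video stream of a vehicle cabin and a sound signal collected by an in-vehicle microphone; a face detection module configured to perform face detection on each of a plurality of video frames of the video stream to determine a face region of a target object in the vehicle in each video frame; a lip movement recognition module configured to determine a lip movement recognition result for the lips of the target object according to the face regions of the target object in N of the video frames, N being an integer greater than 1; and a speech detection module configured to determine a speech detection result of the target object according to the lip movement recognition result and a first sound signal, where the first sound signal includes the sound signal in the time period corresponding to the N video frames, and the speech detection result indicates that the target object is in a speaking state or in a non-speaking state.
In a possible implementation, the speech detection module is configured to: determine that the target object is in a speaking state when the lip movement recognition result indicates that lip movement has occurred and the first sound signal includes speech.
In a possible implementation, the apparatus further includes: a content recognition module configured to, when the target object is in a speaking state, perform content recognition on the first sound signal to determine speech content corresponding to the first sound signal; and a function execution module configured to, when the speech content includes a preset voice command, execute a control function corresponding to the voice command.
In a possible implementation, the target object includes a driver, and the function execution module is configured to: when the voice command corresponds to a plurality of directional control functions, determine a gaze direction of the target object according to the face regions of the target object in the N video frames; determine a target control function from the plurality of control functions according to the gaze direction of the target object; and execute the target control function.
In a possible implementation, the video stream includes a first video stream of a driver area and/or a second video stream of an occupant area in the cabin; the face detection module is configured to: detect the driver's face based on each of a plurality of first video frames of the first video stream; and/or detect faces in the cabin based on each of a plurality of second video frames of the second video stream, and determine the driver's face in each second video frame according to the positions of the detected faces in the cabin.
In a possible implementation, the signal acquisition module is configured to: acquire the first video stream of the driver area collected by a driver monitoring system (DMS) camera; and/or acquire the second video stream of the occupant area in the cabin collected by an occupant monitoring system (OMS) camera.
In a possible implementation, the apparatus further includes: a seat area determination module configured to determine a first seat area of the target object according to each of the plurality of video frames; where the speech detection module is configured to: when the first sound signal includes speech, perform sound zone localization on the first sound signal to determine a second seat area corresponding to the first sound signal; and determine that the target object is in a speaking state when the lip movement recognition result indicates that lip movement has occurred and the first seat area is the same as the second seat area.
In a possible implementation, the video stream includes a first video stream of a driver area, the target object includes a driver, and the seat area determination module is configured to: in response to a face being detected in the driver area according to each video frame, determine that the first seat area of the target object is the driver area; and/or the video stream includes a second video stream of an occupant area in the cabin, the target object includes a driver and/or a passenger, and the seat area determination module is configured to: perform face detection on each of the plurality of video frames; and determine the first seat area of the target object according to the position of the detected face.
In a possible implementation, the apparatus further includes: a content recognition module configured to, when the target object is in a speaking state, perform content recognition on the first sound signal to determine speech content corresponding to the first sound signal; a function determination module configured to, when the speech content includes a preset voice command, determine an area control function corresponding to the voice command according to the first seat area of the target object; and an area function execution module configured to execute the area control function.
In a possible implementation, the video stream includes a first video stream of a driver area, and the target object includes a driver; the speech detection module is configured to: when sound zone localization performed on the first sound signal determines that the first sound signal comes from the driver area, determine whether lip movement of the driver has occurred according to the lip movement recognition result; and in response to no lip movement of the driver having occurred, determine that the speech detection result of the target object is that the driver is in a non-speaking state.
In a possible implementation, the lip movement recognition module is configured to: perform feature extraction separately on the face regions of the target object in the N video frames to obtain N face features of the target object; fuse the N face features to obtain a fused face feature of the target object; and perform lip movement recognition on the fused face feature to obtain the lip movement recognition result of the target object.
In a possible implementation, performing feature extraction separately on the face regions of the target object in the N video frames to obtain the N face features of the target object includes: for the i-th video frame of the N video frames, extracting face keypoints from the face region of the target object in the i-th video frame to obtain a plurality of face keypoints, 1≤i≤N; and normalizing the plurality of face keypoints to obtain the i-th face feature of the target object.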
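According to one aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing processor-executable instructions; where the processor is configured to invoke the instructions stored in the memory to perform the above method.
According to one aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, the computer program instructions implementing the above method when executed by a processor.
According to one aspect of the present disclosure, a computer program is provided, including computer-readable code, where when the computer-readable code runs in an electronic device, a processor in the electronic device performs the above method.
In the embodiments of the present disclosure, a video stream and a sound signal of the cabin can be acquired; face detection is performed on each of a plurality of video frames of the video stream to determine the face region of an object; a lip movement recognition result is determined according to the face regions of the object in N video frames; and whether the object is speaking is judged jointly from the lip movement recognition result and the sound signal, thereby improving the accuracy of object speech detection and reducing the false alarm rate of speech recognition.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.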
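Brief Description of the Drawings
The accompanying drawings here are incorporated into and constitute a part of this specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the technical solutions of the present disclosure.
FIG. 1 shows a flowchart of an object speech detection method according to an embodiment of the present disclosure.
FIG. 2 shows a schematic diagram of lip movement recognition in an object speech detection method according to an embodiment of the present disclosure.
FIG. 3 shows a block diagram of an object speech detection apparatus according to an embodiment of the present disclosure.
FIG. 4 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
FIG. 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description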
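Various exemplary embodiments, features, and aspects of the present disclosure are described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" as used specifically herein means "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as superior to or better than other embodiments.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
In addition, numerous specific details are given in the following detailed description to better explain the present disclosure. Those skilled in the art should understand that the present disclosure can be practiced without certain specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
In in-vehicle voice interaction, the voice detection function usually runs in real time on the head unit, and its false alarm rate needs to be kept at a very low level. The related art usually relies on signal detection based purely on audio, with which false alarms are difficult to suppress; the resulting false alarm rate is high and the user's interactive experience is poor.
The object speech detection method according to embodiments of the present disclosure can use computer vision to analyze the video stream of the cabin, determine the objects in the video frames, recognize the lip movement state of an object, and judge jointly from the lip movement state and the sound signal whether someone is speaking, thereby improving the accuracy of object speech detection, reducing the false alarm rate of speech recognition, and improving the user experience.
The object speech detection method according to embodiments of the present disclosure may be executed by an electronic device such as a terminal device or a server. The terminal device may be a vehicle-mounted device, user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a wearable device, or the like. The method may be implemented by a processor invoking computer-readable instructions stored in a memory.
The vehicle-mounted device may be a head unit, domain controller, or processor in the cabin, or may be a device host in a DMS (Driver Monitor System) or OMS (Occupant Monitoring System) that performs data processing operations on images and the like.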
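FIG. 1 shows a flowchart of an object speech detection method according to an embodiment of the present disclosure. As shown in FIG. 1, the object speech detection method includes:
In step S11, a video stream of the vehicle cabin and a sound signal collected by an in-vehicle microphone are acquired;
In step S12, face detection is performed on each of a plurality of video frames of the video stream to determine the face region of a target object in the vehicle in each video frame;
In step S13, a lip movement recognition result for the lips of the target object is determined according to the face regions of the target object in N video frames among the plurality of video frames, N being an integer greater than 1;
In step S14, a speech detection result of the target object is determined according to the lip movement recognition result and a first sound signal, where the first sound signal includes the sound signal in the time period corresponding to the N video frames, and the speech detection result indicates that the target object is in a speaking state or in a non-speaking state.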
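For example, embodiments of the present disclosure can be applied to any type of vehicle, such as passenger cars, taxis, shared cars, buses, freight vehicles, subways, and trains.
In a possible implementation, in step S11, the video stream of the cabin may be collected by an in-vehicle camera, and the sound signal may be collected by an in-vehicle microphone. The in-vehicle camera may be any camera installed in the vehicle; there may be one or more, and the type may be a DMS camera, an OMS camera, an ordinary camera, or the like. The in-vehicle microphone may likewise be installed at any position in the vehicle, and there may be one or more. The present disclosure does not limit the installation position, number, or type of the in-vehicle cameras and microphones.
In a possible implementation, a plurality of video frames may be obtained from the video stream. The plurality of video frames may be a group of consecutive video frames in the video stream, or a group of video frames obtained by sampling the video frame sequence of the video stream. The present disclosure does not limit this.
In a possible implementation, in step S12, face detection may be performed on each of the plurality of video frames to determine the face boxes in each video frame; the face boxes in the video frames are then tracked to determine the face boxes belonging to an object of the same identity, thereby determining the face box of each object in the vehicle in each video frame.
Face detection may be implemented by, for example, face keypoint recognition or face contour detection; face tracking may be implemented by, for example, determining objects of the same identity according to the intersection-over-union (IoU) of face boxes in adjacent video frames. Those skilled in the art should understand that face detection and tracking may be implemented in any manner known in the related art, and the present disclosure does not limit this.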
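The IoU-based tracking mentioned above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `(x1, y1, x2, y2)` box format, the greedy matching strategy, the 0.5 threshold, and the names `iou` and `match_tracks` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_tracks(prev: Dict[int, Box], current: List[Box],
                 threshold: float = 0.5) -> Dict[int, Box]:
    """Greedily give each current box the id of the best-overlapping previous box.

    Boxes that overlap no unused previous box by at least `threshold`
    start a new identity.
    """
    assigned: Dict[int, Box] = {}
    next_id = max(prev, default=-1) + 1
    for box in current:
        best_id, best_iou = None, threshold
        for track_id, prev_box in prev.items():
            if track_id in assigned:
                continue  # each previous identity is used at most once
            score = iou(prev_box, box)
            if score >= best_iou:
                best_id, best_iou = track_id, score
        if best_id is None:
            best_id = next_id
            next_id += 1
        assigned[best_id] = box
    return assigned
```

Calling `match_tracks` once per frame with the previous frame's result keeps a stable identity per face, which is what lets the later steps collect N face regions for the same target object.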
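In a possible implementation, a video frame may contain the faces of one or more objects, and the processing of step S12 yields the face region of each object. For any object to be analyzed (which may be called the target object), the face regions of the target object in N video frames among the plurality of video frames may be determined, N being an integer greater than 1. That is, N video frames corresponding to a certain duration (for example, 2 s) are selected from the plurality of video frames for subsequent lip movement detection. In the case of real-time detection, the N video frames may be the N most recently captured video frames of the video stream. N may take a value of, for example, 10, 15, or 20; the present disclosure does not limit this.
If the shape of a person's lips changes within a certain duration (for example, 2 s), lip movement can be considered to have occurred; this duration is the duration corresponding to the N video frames. The duration may be set to, for example, 1 s, 2 s, or 3 s; the present disclosure does not limit this.
In a possible implementation, in step S13, the lip movement recognition result for the lips of the target object may be determined according to the face regions of the target object in the N video frames. The lip movement recognition result indicates that lip movement of the target object has or has not occurred.
In a possible implementation, the region images of the face regions of the target object in the N video frames may be input directly into a preset lip movement recognition network, which outputs the lip movement recognition result. Alternatively, feature extraction may first be performed on those region images to obtain face features, which are then input into the preset lip movement recognition network to obtain the lip movement recognition result. The present disclosure does not limit the specific processing manner.
The lip movement recognition network may be, for example, a convolutional neural network; the present disclosure does not limit its specific network structure.
In a possible implementation, in step S14, the speech detection result of the target object may be determined jointly from the lip movement recognition result of the target object and the first sound signal in the time period corresponding to the N video frames.
In a possible implementation, the first sound signal includes the sound signal in the time period corresponding to the N video frames. For example, if the N video frames correspond to the most recent 2 s (from 2 s ago until now), the first sound signal is likewise the sound signal of the most recent 2 s. Voice detection may be performed on the first sound signal to determine whether it includes speech; the present disclosure does not limit how voice detection is implemented.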
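One way to obtain the first sound signal is to cut the audio buffer to the time span covered by the N video frames. The sketch below assumes that audio samples and frame timestamps share one clock starting at the beginning of the buffer; the function name `audio_window` and its parameters are hypothetical.

```python
from typing import Sequence


def audio_window(samples: Sequence[float], sample_rate: int,
                 first_frame_time: float, last_frame_time: float) -> Sequence[float]:
    """Return the samples recorded between the first and last of the N frames.

    Timestamps are seconds since the start of `samples` (assumption: the
    camera and the microphone are synchronized to the same clock).
    """
    start = max(0, int(round(first_frame_time * sample_rate)))
    end = min(len(samples), int(round(last_frame_time * sample_rate)))
    return samples[start:end]
```

The returned slice would then be passed to whatever voice detection is in use; the source leaves that component open.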
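In a possible implementation, the speech detection result of the target object indicates that the target object is in a speaking state or in a non-speaking state. For example, if the lip movement recognition result indicates that lip movement has occurred and the first sound signal includes speech, the target object is considered to be in a speaking state; if the first sound signal includes speech but the lip movement recognition result indicates no lip movement, the target object is considered to be in a non-speaking state; and if lip movement has occurred but the first sound signal does not include speech, the target object is considered to be in a non-speaking state. The present disclosure does not limit the specific determination manner.
According to the embodiments of the present disclosure, a video stream and a sound signal of the cabin can be acquired; face detection is performed on each of a plurality of video frames of the video stream to determine the face region of an object; a lip movement recognition result is determined according to the face regions of the object in N video frames; and whether the object is speaking is judged jointly from the lip movement recognition result and the sound signal, thereby improving the accuracy of object speech detection and reducing the false alarm rate of speech recognition.
The object speech detection method of the embodiments of the present disclosure is described in more detail below.
As described above, in step S11, the video stream of the cabin collected by the in-vehicle camera and the sound signal collected by the in-vehicle microphone may be acquired.
In a possible implementation, step S11 may include:
acquiring a first video stream of the driver area collected by a driver monitoring system (DMS) camera; and/or
acquiring a second video stream of the occupant area in the cabin collected by an occupant monitoring system (OMS) camera.
That is, the in-vehicle camera may include, for example, a DMS camera and/or an OMS camera. The video stream collected by the DMS camera is a video stream of the driver area (called the first video stream), and the video stream collected by the OMS camera is a video stream of the occupant area in the cabin (called the second video stream). The video stream acquired in step S11 may thus include the first video stream and/or the second video stream.
In another possible implementation, the first video stream of the driver area and the second video stream of the occupant area in the cabin may also be collected by cameras installed in the cabin that are not dedicated to driver monitoring or occupant monitoring. Alternatively, when the second video stream contains video information of the driver area, the first video stream may be obtained by cropping the driver-area video information from the second video stream.
In this way, video streams of different areas in the cabin can be acquired and processed separately, improving the comprehensiveness of object speech detection.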
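In a possible implementation, performing face detection on each of the plurality of video frames of the video stream in step S12 may include:
detecting the driver's face based on each of a plurality of first video frames of the first video stream; and/or
detecting faces in the cabin based on each of a plurality of second video frames of the second video stream, and determining the driver's face in each second video frame according to the positions of the detected faces in the cabin.
That is, when the video stream acquired in step S11 includes the first video stream and/or the second video stream, the first video stream and the second video stream may be processed separately.
In a possible implementation, for the first video stream, a plurality of video frames (called first video frames) may be obtained from the first video stream. The plurality of first video frames may be a group of consecutive video frames in the first video stream, or a group of video frames obtained by sampling its video frame sequence.
In a possible implementation, the first video frames correspond to the driver area, which includes only the driver. In this case, face detection and tracking may be performed on each of the plurality of first video frames to obtain the driver's face region in each first video frame.
In a possible implementation, for the second video stream, a plurality of video frames (called second video frames) may be obtained from the second video stream. The plurality of second video frames may be a group of consecutive video frames in the second video stream, or a group of video frames obtained by sampling its video frame sequence.
In a possible implementation, the second video frames correspond to the occupant area in the cabin, which includes the driver and/or passengers. In this case, face detection and tracking may be performed on each of the plurality of second video frames to obtain the face region of each occupant in the cabin in each second video frame; the driver's face and each passenger's face in each second video frame are then determined according to the positions of the occupants' face regions in that frame. For example, when the driver area is in the front-left part of the cabin and the camera collecting the second video stream is installed at the top front of the cabin (for example, at the top of the front windshield) facing into the cabin, the face located in the lower-right part of the second video frame may be determined to be the driver's face. The present disclosure does not limit the specific manner of determining each occupant.
In this way, the driver's face can be determined in the first video stream and the faces of different occupants (the driver and/or passengers) can be determined in the second video stream, so that lip movement recognition, speech detection, and the corresponding responses can be performed separately for each, further improving the accuracy of speech detection and making subsequent responses more targeted.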
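In a possible implementation, lip movement recognition may be performed in step S13. Step S13 may include:
performing feature extraction separately on the face regions of the target object in the N video frames to obtain N face features of the target object;
fusing the N face features to obtain a fused face feature of the target object;
performing lip movement recognition on the fused face feature to obtain the lip movement recognition result of the target object.
For example, feature extraction may be performed separately on the images of the face regions of the target object in the N video frames to obtain the face features of the target object in the N video frames. Feature extraction may be implemented by, for example, face keypoint extraction.
In a possible implementation, the step of performing feature extraction separately on the face regions of the target object in the N video frames may include:
for the i-th video frame of the N video frames, extracting face keypoints from the face region of the target object in the i-th video frame to obtain a plurality of face keypoints, 1≤i≤N;
normalizing the plurality of face keypoints to obtain the i-th face feature of the target object.
That is, for any one of the N video frames (called the i-th video frame, frame-i, 1≤i≤N), face keypoints may be extracted from the face region of the target object in the i-th video frame to obtain a plurality of face keypoints (landmarks-i). The number of face keypoints may be, for example, 106; the present disclosure does not limit this.
In a possible implementation, the mean mean-i and the standard deviation std-i of the plurality of face keypoints (landmarks-i) may be computed, and the keypoints landmarks-i are then normalized by formula (1) below to obtain the normalized face keypoints Landmarks-i-normalize:
Landmarks-i-normalize = (landmarks-i − mean-i) / std-i        (1)
In a possible implementation, the normalized face keypoints Landmarks-i-normalize may be used directly as the i-th face feature of the target object; alternatively, a subset of the keypoints may be selected from Landmarks-i-normalize, for example the keypoints of the lower half of the face or the mouth keypoints, as the i-th face feature of the target object. The present disclosure does not limit this.
Keypoint extraction and normalization make the face features more regular and improve the accuracy of subsequent lip movement recognition.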
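Formula (1) can be illustrated with a few lines of Python. The source does not say whether the mean and standard deviation are taken per coordinate axis or over all values; the sketch below assumes per-axis statistics over 2D keypoints, and adds a small `eps` to avoid division by zero. Both choices, and the name `normalize_landmarks`, are assumptions.

```python
from math import sqrt
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) of one face keypoint


def normalize_landmarks(landmarks: List[Point], eps: float = 1e-8) -> List[Point]:
    """Apply (landmarks - mean) / std separately to the x and y coordinates."""
    n = len(landmarks)
    mean_x = sum(x for x, _ in landmarks) / n
    mean_y = sum(y for _, y in landmarks) / n
    std_x = sqrt(sum((x - mean_x) ** 2 for x, _ in landmarks) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for _, y in landmarks) / n)
    # eps is an addition for numerical safety; it is not part of formula (1).
    return [((x - mean_x) / (std_x + eps), (y - mean_y) / (std_y + eps))
            for x, y in landmarks]
```

Because the mean is subtracted and the spread is divided out, the result does not depend on where the face sits in the frame or how large it appears, which is presumably why the normalized keypoints make a more regular input for lip movement recognition.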
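In a possible implementation, processing the N video frames separately yields the N face features of the target object.
In a possible implementation, in the case of real-time detection, the first N-1 of the N video frames may already have been processed in a previous round of lip movement recognition. In the current round, feature extraction may be performed on the face region of the target object in the newest, N-th video frame to obtain the N-th face feature, and the N-1 face features already produced by earlier processing are read back, yielding the N face features of the target object. This reduces the amount of computation and improves processing efficiency.
In a possible implementation, after the N face features of the target object are obtained, they may be fused to obtain the fused face feature of the target object. The features may be ordered by frame order as landmarks-1-normalize, landmarks-2-normalize, …, landmarks-N-normalize, and concatenated or stacked to obtain the fused face feature, denoted face-video-feature.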
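The reuse of the previous N-1 features and the concatenation-based fusion can be sketched with a fixed-length buffer. The class name `FeatureWindow` and its methods are hypothetical, and concatenation is only one of the fusion options the text allows.

```python
from collections import deque
from typing import Deque, List, Sequence


class FeatureWindow:
    """Keeps the face features of the most recent N frames for one target object."""

    def __init__(self, n: int) -> None:
        # deque(maxlen=n) silently drops the oldest feature when a new one arrives.
        self._buffer: Deque[List[float]] = deque(maxlen=n)

    def push(self, feature: Sequence[float]) -> None:
        """Add the feature of the newest frame; older features are reused as-is."""
        self._buffer.append(list(feature))

    def ready(self) -> bool:
        """True once N features have been collected."""
        return len(self._buffer) == self._buffer.maxlen

    def fused(self) -> List[float]:
        """Concatenate the buffered features in frame order (face-video-feature)."""
        return [value for feature in self._buffer for value in feature]
```

With this arrangement only the newest frame needs keypoint extraction on each round; the other N-1 features are read from the buffer.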
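In a possible implementation, lip movement recognition may be performed on the fused face feature to obtain the lip movement recognition result of the target object. A lip movement recognition neural network may be preset; the fused face feature is input into this network for processing, and the network outputs the lip movement recognition result of the target object.
The lip movement recognition neural network may be, for example, a convolutional neural network including multiple fully connected layers, a softmax layer, and so on, used to perform binary classification on the fused face feature. Feeding the fused face feature through the fully connected layers of the network yields a two-dimensional output whose components correspond to lip movement and no lip movement, respectively; after the softmax layer, a normalized score or confidence is obtained.
In a possible implementation, a preset threshold (for example, 0.8) may be set for the score or confidence of lip movement. If the score exceeds the preset threshold, the target object is determined to be in a speaking state; otherwise, the target object is determined to be in a non-speaking state. The present disclosure does not limit the network structure or training method of the lip movement recognition neural network, or the specific value of the preset threshold.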
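The softmax and threshold step can be sketched as below. The two-element `logits` stands in for the output of the fully connected layers, whose structure the source leaves open. The source applies the threshold to the lip movement score; treating the outcome as the lip movement flag that is later combined with the audio check in step S14 is an assumption of this sketch, as are the function names.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lip_moved(logits: Sequence[float], threshold: float = 0.8) -> bool:
    """logits[0] scores 'lip movement', logits[1] scores 'no lip movement'."""
    return softmax(logits)[0] > threshold
```

For instance, logits of `(3.0, 0.0)` give a lip movement score of about 0.95, above the example threshold of 0.8, while `(1.0, 0.0)` gives about 0.73 and falls below it.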
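FIG. 2 shows a schematic diagram of lip movement recognition in the object speech detection method of an embodiment of the present disclosure. As shown in FIG. 2, for the N video frames to be processed (video frame 1, video frame 2, …, video frame N), face detection may be performed on each of the N video frames to determine the face regions of the target object in the N video frames; face features are extracted separately from those face regions to obtain N face features; the N face features are fused to obtain the fused face feature of the target object; and the fused face feature is input into the lip movement recognition neural network, which outputs the lip movement recognition result of the target object, namely lip movement or no lip movement.
In this way, lip movement recognition for the lips of the target object can be achieved based on computer vision, helping to judge whether the object is speaking and improving the accuracy of speech detection.
In a possible implementation, after the lip movement recognition result is determined, speech detection may be performed in step S14. Step S14 may include:
determining that the target object is in a speaking state when the lip movement recognition result indicates that lip movement has occurred and the first sound signal includes speech.
That is, whether the object is speaking is judged jointly from the lip movement recognition result and the first sound signal of the corresponding time period. If lip movement has occurred and the first sound signal includes speech, the target object is considered to be in a speaking state. In other words, the target object is judged to be speaking only when both conditions, lip movement and detected speech, are satisfied; if only one condition or neither is satisfied, the target object is judged not to be speaking.
In this way, false voice triggers can be effectively suppressed, reducing the false alarm rate of speech recognition and improving the user experience.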
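The joint decision of step S14 reduces to a logical AND of the two conditions. A minimal sketch, with hypothetical names `SpeechState` and `detect_speech`:

```python
from enum import Enum


class SpeechState(Enum):
    SPEAKING = "speaking"
    NOT_SPEAKING = "not speaking"


def detect_speech(lip_moved: bool, has_voice: bool) -> SpeechState:
    """Speaking only if lip movement occurred AND the first sound signal contains speech."""
    if lip_moved and has_voice:
        return SpeechState.SPEAKING
    return SpeechState.NOT_SPEAKING
```

Requiring both conditions is what suppresses false triggers: speech from the radio without lip movement, or silent lip movement such as chewing, both map to the non-speaking state.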
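In a possible implementation, after step S14, the object speech detection method according to embodiments of the present disclosure may further include:
when the target object is in a speaking state, performing content recognition on the first sound signal to determine the speech content corresponding to the first sound signal;
when the speech content includes a preset voice command, executing the control function corresponding to the voice command.
For example, if the target object is judged in step S14 to be in a speaking state, the speech recognition function may be started to perform content recognition on the first sound signal and determine the corresponding speech content; the present disclosure does not limit how speech content recognition is implemented.
In a possible implementation, various voice commands may be preset. If the recognized speech content includes a preset voice command, the control function corresponding to that voice command may be executed. For example, if the speech content is recognized to include the voice command "play music", the in-vehicle music player may be controlled to play music; if it includes the voice command "open the left window", the left window may be controlled to open.
In this way, voice interaction with the occupants of the vehicle can be achieved, allowing users to carry out various control functions by voice, which improves convenience and the user experience.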
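Matching recognized speech content against preset voice commands can be sketched as a lookup table of callables. Substring matching, the function name `run_commands`, and the example actions are assumptions; a real system would call into the vehicle's control interfaces, which the source does not describe.

```python
from typing import Callable, Dict, List


def run_commands(speech_text: str, commands: Dict[str, Callable[[], None]]) -> List[str]:
    """Execute the control function of every preset command found in the text.

    Returns the list of command phrases that were triggered.
    """
    executed: List[str] = []
    for phrase, action in commands.items():
        if phrase in speech_text:
            action()
            executed.append(phrase)
    return executed
```

A usage example: with `commands = {"play music": start_player}` (where `start_player` is a placeholder callable), the text "please play music now" triggers `start_player` once and returns `["play music"]`.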
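In a possible implementation, the identity of the target object need not be distinguished; that is, whenever the target object is judged to be speaking, speech recognition is started and the corresponding control function is executed. Alternatively, the identity of the target object may be distinguished: for example, only the driver's voice is responded to, with speech recognition performed when the driver is judged to be speaking and passengers' voices ignored; or, according to the seat area where a passenger is located, speech recognition is performed when that passenger is judged to be speaking and the area control function of the passenger's seat area is executed.
These cases are described separately below.
In a possible implementation, the video stream includes a first video stream of the driver area, and the target object includes the driver. Step S14 may include:
when sound zone localization performed on the first sound signal determines that the first sound signal comes from the driver area, determining whether lip movement of the driver has occurred according to the lip movement recognition result;
in response to no lip movement of the driver having occurred, determining that the speech detection result of the target object is that the driver is in a non-speaking state.
For example, the first video frames of the first video stream are images of the driver area. Face detection and tracking in step S12 yield the driver's face region in each first video frame. Lip movement recognition can then be performed in step S13 to obtain the driver's lip movement recognition result, namely lip movement or no lip movement.
In a possible implementation, sound zone localization may be performed on the first sound signal to determine the area corresponding to the first sound signal, for example the driver area or an occupant area. The present disclosure does not limit how sound zone localization is implemented.
In a possible implementation, if sound zone localization determines that the first sound signal comes from the driver area, whether lip movement of the driver has occurred may be determined from the lip movement recognition result; if not, the driver may be judged not to have spoken, that is, the driver is in a non-speaking state. In this case, the speech need not be responded to.
In a possible implementation, if sound zone localization determines that the first sound signal comes from the driver area, and the lip movement recognition result indicates that lip movement of the driver has occurred, the driver may be judged to be speaking, that is, the driver is in a speaking state. In this case, the speech recognition function may be started to respond to the speech.
In this way, the false alarm rate of speech recognition can be further reduced.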
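In a possible implementation, the video stream includes a first video stream of the driver area and/or a second video stream of the occupant area in the cabin, and the target object includes the driver.
As described above, in step S12, face detection and tracking may be performed on the first video frames of the first video stream to obtain the driver's face region in each first video frame. In step S12, face detection and tracking may also be performed on the second video frames of the second video stream to determine face regions, and the driver's face in each second video frame is determined according to the positions of the face regions. Lip movement recognition may then be performed in step S13 to obtain the driver's lip movement recognition result, namely lip movement or no lip movement, and the driver's speaking state is determined in step S14; if the driver is speaking, the speech recognition function is started to determine the speech content corresponding to the first sound signal.
In a possible implementation, the step of executing the control function corresponding to the voice command when the speech content includes a preset voice command may include:
when the voice command corresponds to a plurality of directional control functions, determining the gaze direction of the target object according to the face regions of the target object in the N video frames;
determining a target control function from the plurality of control functions according to the gaze direction of the target object;
executing the target control function.
For example, a voice command may correspond to a plurality of directional control functions. The voice command "open the window" may correspond to windows in two directions, left and right, in which case the control functions include "open the left window" and "open the right window"; it may also correspond to windows in four directions (front left, rear left, front right, rear right), in which case the control functions include "open the front-left window", "open the front-right window", "open the rear-left window", and "open the rear-right window". In this case, the corresponding control function may be determined with the help of image recognition.
In a possible implementation, when the voice command corresponds to a plurality of directional control functions, the gaze direction of the target object may be determined according to the face regions of the target object in the N video frames.
In a possible implementation, feature extraction may be performed separately on the images of the face regions of the target object in the N video frames to obtain the face features of the target object in the N video frames; the N face features are fused to obtain the fused face feature of the target object; and the fused face feature is input into a preset gaze direction recognition network to obtain the gaze direction of the target object, that is, the line-of-sight direction of the target object's eyes.
The gaze direction recognition network may be, for example, a convolutional neural network including convolutional layers, fully connected layers, a softmax layer, and so on. The present disclosure does not limit its network structure or training method.
In a possible implementation, the target control function may be determined from the plurality of control functions according to the gaze direction of the target object. For example, if the voice command is "open the window" and the gaze direction of the target object is determined to be toward the right, the target control function may be determined to be "open the right window". The target control function may then be executed, for example by opening the right window.
In this way, the accuracy of voice interaction can be improved, further improving convenience for the user.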
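Selecting the target control function from the gaze direction can be sketched as a two-level lookup. The table contents follow the "open the window" example in the text; the gaze labels `"left"` and `"right"`, the table name, and the function name are assumptions, and the gaze direction itself would come from the gaze direction recognition network.

```python
from typing import Dict, Optional

# Hypothetical table: voice command -> gaze direction -> target control function.
DIRECTIONAL_FUNCTIONS: Dict[str, Dict[str, str]] = {
    "open the window": {
        "left": "open the left window",
        "right": "open the right window",
    },
}


def select_target_function(command: str, gaze: str) -> Optional[str]:
    """Return the directional control function for this command and gaze, if any."""
    return DIRECTIONAL_FUNCTIONS.get(command, {}).get(gaze)
```

Returning `None` for an unknown command or gaze direction leaves it to the caller to decide on a fallback, for example asking the user which window is meant.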
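In a possible implementation, the video stream includes a first video stream of the driver area and/or a second video stream of the occupant area in the cabin, and the target object includes the driver and/or an occupant.
In a possible implementation, the object speech detection method according to embodiments of the present disclosure may further include:
determining a first seat area of the target object according to each of the plurality of video frames.
That is, when the target object includes the driver and/or an occupant, the seat area of each occupant (called the first seat area) may be determined according to the position of that occupant's face region in each of the plurality of video frames. The first seat area is the driver area or a passenger area; taking a sedan as an example, the passenger areas may include the front passenger area, the rear-left seat area, the rear-right seat area, and so on. The present disclosure does not limit how seat areas are divided.
In a possible implementation, the video stream includes a first video stream of the driver area, the target object includes the driver, and determining the first seat area of the target object according to each of the plurality of video frames includes:
in response to a face being detected in the driver area according to each video frame, determining that the first seat area of the target object is the driver area;
and/or
the video stream includes a second video stream of the occupant area in the cabin, the target object includes the driver and/or a passenger, and determining the first seat area of the target object according to each of the plurality of video frames includes:
performing face detection on each of the plurality of video frames;
determining the first seat area of the target object according to the position of the detected face.
For example, when the video stream includes the first video stream of the driver area, in response to a face being detected in the driver area according to each video frame, the first seat area of the target object may be determined directly to be the driver area, which completes the determination of the first seat area.
In a possible implementation, when the video stream includes the second video stream of the occupant area in the cabin, face detection and tracking may be performed on each of the plurality of video frames to obtain the face region of each occupant in each video frame; the first seat area of the target object may then be determined according to the position of the detected face.
For example, when the driver area is in the front-left part of the cabin, a face located in the lower-right part of the video frame may be determined to be the driver's face, and the first seat area of that target object is the driver area; a face located in the lower-left part of the video frame may be determined to be the face of the front passenger, and the first seat area of that target object is the front passenger area. The present disclosure does not limit the specific manner of determining each occupant.
In this way, the seat area of each occupant can be determined.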
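The position-based seat assignment can be sketched by splitting the frame into quadrants. The lower-right and lower-left rules come from the example in the text (driver seat at the front left, forward-mounted camera facing into the cabin). The mapping of the upper quadrants to the rear seats, the image coordinate convention (y grows downward), and the names used are assumptions of this sketch.

```python
from typing import Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels, origin at the top-left.
Box = Tuple[float, float, float, float]


def seat_area(face: Box, frame_width: float, frame_height: float) -> str:
    """Map a face box to a seat area by the quadrant containing its center."""
    cx = (face[0] + face[2]) / 2
    cy = (face[1] + face[3]) / 2
    on_right = cx >= frame_width / 2
    in_lower_half = cy >= frame_height / 2
    if in_lower_half:
        # Front-row occupants are closer to the camera and appear lower in the frame.
        return "driver" if on_right else "front_passenger"
    # Assumption: rear seats appear in the upper half, mirrored like the front row.
    return "rear_left" if on_right else "rear_right"
```

A production system would more likely use calibrated regions per camera installation than fixed quadrants; the quadrant rule is only the simplest rendering of the example.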
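After the first seat area of the target object is determined, it may be taken into account in step S14 when judging the speaking state of the target object.
In a possible implementation, step S14 may include:
when the first sound signal includes speech, performing sound zone localization on the first sound signal to determine a second seat area corresponding to the first sound signal;
determining that the target object is in a speaking state when the lip movement recognition result indicates that lip movement has occurred and the first seat area is the same as the second seat area.
For example, if the first sound signal includes speech, sound zone localization may be performed on it to determine the seat area corresponding to the first sound signal (called the second seat area), for example the driver area or an occupant area. The present disclosure does not limit how sound zone localization is implemented.
In a possible implementation, for any target object, if the lip movement recognition result of that target object indicates that lip movement has occurred and the first seat area is the same as the second seat area (that is, the seat area located from the images agrees with the seat area located from the sound zones), the target object may be determined to be in a speaking state; the speech recognition function may then be started to respond to the speech.
In this way, an object is judged to be speaking only when the seat area located from the images agrees with the seat area located from the sound zones and lip movement has occurred, which further improves the accuracy of object speech detection. Moreover, the speaking object can be identified so that targeted control functions can be executed, further improving convenience for the user.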
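The agreement check between the image-located seat area and the sound-zone-located seat area can be sketched as a filter over the occupants. The input format (a mapping from first seat area to lip movement flag) and the name `speaking_occupants` are assumptions; `audio_area` stands for the second seat area, or `None` when the first sound signal contains no speech.

```python
from typing import Dict, List, Optional


def speaking_occupants(lip_moved_by_area: Dict[str, bool],
                       audio_area: Optional[str]) -> List[str]:
    """Return the seat areas whose occupant is judged to be speaking.

    An occupant is speaking only if lip movement occurred and the seat area
    located from the images equals the seat area located from the audio.
    """
    if audio_area is None:
        return []
    return [area for area, moved in lip_moved_by_area.items()
            if moved and area == audio_area]
```

For example, if both the driver and the front passenger show lip movement but the audio is localized to the driver area, only the driver is reported as speaking.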
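In a possible implementation, the object speech detection method according to embodiments of the present disclosure may further include:
when the target object is in a speaking state, performing content recognition on the first sound signal to determine the speech content corresponding to the first sound signal;
when the speech content includes a preset voice command, determining the area control function corresponding to the voice command according to the first seat area of the target object;
executing the area control function.
For example, if the target object is determined in step S14 to be in a speaking state, the speech recognition function may be started to perform content recognition on the first sound signal and determine the corresponding speech content; the present disclosure does not limit how speech content recognition is implemented.
In a possible implementation, if the recognized speech content includes a preset voice command, the area control function corresponding to the voice command may be determined according to the first seat area of the target object. For example, if the speech content is recognized to include the voice command "open the window" and the first seat area of the target object is the rear-left seat area, the corresponding area control function may be determined to be "open the rear-left window". The area control function may then be executed, for example by controlling the rear-left window to open.
In this way, the corresponding area control function can be executed, further improving convenience for the user.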
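Resolving a voice command to an area control function follows the same pattern as the gaze-based selection, keyed by the speaker's first seat area instead. The seat area labels and table entries below are illustrative assumptions built around the "open the rear-left window" example in the text.

```python
from typing import Dict, Optional

# Hypothetical table: voice command -> first seat area -> area control function.
AREA_FUNCTIONS: Dict[str, Dict[str, str]] = {
    "open the window": {
        "driver": "open the front-left window",
        "front_passenger": "open the front-right window",
        "rear_left": "open the rear-left window",
        "rear_right": "open the rear-right window",
    },
}


def area_control_function(command: str, first_seat_area: str) -> Optional[str]:
    """Return the area control function for the speaker's seat area, if defined."""
    return AREA_FUNCTIONS.get(command, {}).get(first_seat_area)
```

The driver and front-passenger entries assume a left-hand-drive layout, consistent with the earlier example in which the driver area is at the front left of the cabin.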
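According to the object speech detection method of the embodiments of the present disclosure, a video stream and a sound signal of the cabin can be acquired; face detection is performed on each of a plurality of video frames of the video stream to determine the face region of an object; a lip movement recognition result is determined according to the face regions of the object in N video frames; and whether the object is speaking is judged jointly from the lip movement recognition result and the sound signal, thereby improving the accuracy of object speech detection and reducing the false alarm rate of speech recognition.
The object speech detection method according to embodiments of the present disclosure can be applied in an intelligent cabin perception system. Based on intelligent video analysis algorithms, it analyzes the lip movement state of the occupants in the cabin and combines this with the voice signal to judge whether someone in the cabin is speaking, thereby effectively avoiding the false alarms that arise from relying on the voice signal alone, ensuring that speech recognition is triggered properly, and improving the user's interactive experience.
It can be understood that the method embodiments mentioned in the present disclosure may be combined with one another to form combined embodiments without violating their principles and logic; for reasons of space, the details are not repeated here. Those skilled in the art can understand that, in the above methods of the detailed description, the specific order in which the steps are executed should be determined by their functions and possible internal logic.
In addition, the present disclosure further provides an object speech detection apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any of the object speech detection methods provided by the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding passages in the method section; they are not repeated here.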
图3示出根据本公开实施例的对象说话检测装置的框图,如图3所示,所述装置包括:
信号获取模块31,用于获取车舱内的视频流,以及车载麦克风采集的声音信号;
人脸检测模块32,用于对所述视频流的多个视频帧中的每一个视频帧进行人脸检测,确定车内的目标对象在所述每一个视频帧中的人脸区域;
唇动识别模块33,用于根据所述目标对象在所述多个视频帧中的N个视频帧中的人脸区域,确定所述目标对象嘴唇的唇动识别结果,N为大于1的整数;
说话检测模块34,用于根据所述唇动识别结果以及第一声音信号,确定所述目标对象的说话检测结果,其中,所述第一声音信号包括与所述N个视频帧对应的时间段的所述声音信号,所述说话检测结果包括所述目标对象处于说话状态或处于未说话状态。
在一种可能的实现方式中,所述说话检测模块用于:在所述唇动识别结果为发生唇动,且所述第一声音信号包括语音的情况下,确定所述目标对象处于说话状态。
在一种可能的实现方式中,所述装置还包括:
内容识别模块,用于在所述目标对象处于说话状态的情况下,对所述第一声音信号进行内容识别, 确定与所述第一声音信号对应的语音内容;
功能执行模块,用于在所述语音内容包括预设的语音指令的情况下,执行与所述语音指令对应的控制功能。
在一种可能的实现方式中,所述目标对象包括驾驶员,其中,所述功能执行模块用于:在所述语音指令对应具有方向性的多个控制功能的情况下,根据所述目标对象在所述N个视频帧中的人脸区域,确定所述目标对象的注视方向;根据所述目标对象的注视方向,从所述多个控制功能中确定出目标控制功能;执行所述目标控制功能。
在一种可能的实现方式中,所述视频流包括驾驶员区域的第一视频流,和/或车舱内乘员区域的第二视频流;所述人脸检测模块,用于:基于所述第一视频流的多个第一视频帧中的每一个第一视频帧检测驾驶员的人脸;和/或
基于所述第二视频流的多个第二视频帧中的每一个第二视频帧检测车舱内的人脸,并根据检测到的车舱内的人脸的位置确定所述每一个第二视频帧中的驾驶员的人脸。
在一种可能的实现方式中,所述信号获取模块用于:获取驾驶员检测系统DMS摄像头采集的驾驶员区域的第一视频流;和/或获取乘员检测系统OMS摄像头采集的车舱内乘员区域的第二视频流。
在一种可能的实现方式中,所述装置还包括:座位区域确定模块,用于根据所述多个视频帧中的每一个视频帧,确定所述目标对象的第一座位区域;
其中,所述说话检测模块用于:在所述第一声音信号包括语音的情况下,对所述第一声音信号进行音区定位,确定与所述第一声音信号对应的第二座位区域;在所述唇动识别结果为发生唇动,且所述第一座位区域与所述第二座位区域相同的情况下,确定所述目标对象处于说话状态。
在一种可能的实现方式中,
所述视频流包括驾驶员区域的第一视频流,所述目标对象包括驾驶员,以及所述座位区域确定模块用于:响应于根据所述每一个视频帧在驾驶员区域检测到人脸,确定所述目标对象的第一座位区域为驾驶员区域;和/或
所述视频流包括车舱内乘员区域的第二视频流,所述目标对象包括驾驶员和/或乘客,以及所述座位区域确定模块用于:对所述多个视频帧中的每一个视频帧进行人脸检测;根据检测到的人脸位置确定所述目标对象的第一座位区域。
在一种可能的实现方式中,所述装置还包括:内容识别模块,用于在所述目标对象处于说话状态的情况下,对所述第一声音信号进行内容识别,确定与所述第一声音信号对应的语音内容;功能确定模块,用于在所述语音内容包括预设的语音指令的情况下,根据所述目标对象的第一座位区域,确定与所述语音指令对应的区域控制功能;区域功能执行模块,用于执行所述区域控制功能。
在一种可能的实现方式中,所述视频流包括驾驶员区域的第一视频流,所述目标对象包括驾驶员;所述说话检测模块,用于:在根据所述第一声音信号进行音区定位确定所述第一声音信号来自驾驶员区域的情况下,根据所述唇动识别结果确定所述驾驶员是否发生唇动;响应于所述驾驶员未发生唇动,确定所述目标对象的说话检测结果为:所述驾驶员处于未说话状态。
在一种可能的实现方式中,所述唇动识别模块用于:对所述目标对象在所述N个视频帧中的人脸区域分别进行特征提取,得到所述目标对象的N个人脸特征;对所述N个人脸特征进行融合,得到所 述目标对象的人脸融合特征;对所述人脸融合特征进行唇动识别,得到所述目标对象的唇动识别结果。
在一种可能的实现方式中,所述对所述目标对象在所述N个视频帧中的人脸区域分别进行特征提取,得到所述目标对象的N个人脸特征,包括:针对所述N个视频帧中的第i个视频帧,对所述目标对象在第i个视频帧中的人脸区域进行人脸关键点提取,得到多个人脸关键点,1≤i≤N;对所述多个人脸关键点进行归一化处理,得到所述目标对象的第i个人脸特征。
在一些实施例中,本公开实施例提供的装置具有的功能或包含的模块可以用于执行上文方法实施例描述的方法,其具体实现可以参照上文方法实施例的描述,为了简洁,这里不再赘述。
本公开实施例还提出一种计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令被处理器执行时实现上述方法。计算机可读存储介质可以是易失性或非易失性计算机可读存储介质。
本公开实施例还提出一种电子设备,包括:处理器;用于存储处理器可执行指令的存储器;其中,所述处理器被配置为调用所述存储器存储的指令,以执行上述方法。
本公开实施例还提供了一种计算机程序产品,包括计算机可读代码,或者承载有计算机可读代码的非易失性计算机可读存储介质,当所述计算机可读代码在电子设备的处理器中运行时,所述电子设备中的处理器执行上述方法。
本公开实施例还提供了一种计算机程序,包括计算机可读代码,当所述计算机可读代码在电子设备中运行时,所述电子设备中的处理器执行上述方法。
电子设备可以被提供为终端、服务器或其它形态的设备。
图4示出根据本公开实施例的一种电子设备800的框图。例如,电子设备800可以是移动电话,计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等终端。
参照图4,电子设备800可以包括以下一个或多个组件:处理组件802,存储器804,电源组件806,多媒体组件808,音频组件810,输入/输出(I/O)的接口812,传感器组件814,以及通信组件816。
处理组件802通常控制电子设备800的整体操作,诸如与显示,电话呼叫,数据通信,相机操作和记录操作相关联的操作。处理组件802可以包括一个或多个处理器820来执行指令,以完成上述的方法的全部或部分步骤。此外,处理组件802可以包括一个或多个模块,便于处理组件802和其他组件之间的交互。例如,处理组件802可以包括多媒体模块,以方便多媒体组件808和处理组件802之间的交互。
存储器804被配置为存储各种类型的数据以支持在电子设备800的操作。这些数据的示例包括用于在电子设备800上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。存储器804可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。
电源组件806为电子设备800的各种组件提供电力。电源组件806可以包括电源管理系统,一个或多个电源,及其他与为电子设备800生成、管理和分配电力相关联的组件。
多媒体组件808包括在所述电子设备800和用户之间的提供一个输出接口的屏幕。在一些实施例中,屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触 摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。在一些实施例中,多媒体组件808包括一个前置摄像头和/或后置摄像头。当电子设备800处于操作模式,如拍摄模式或视频模式时,前置摄像头和/或后置摄像头可以接收外部的多媒体数据。每个前置摄像头和后置摄像头可以是一个固定的光学透镜系统或具有焦距和光学变焦能力。
音频组件810被配置为输出和/或输入音频信号。例如,音频组件810包括一个麦克风(MIC),当电子设备800处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器804或经由通信组件816发送。在一些实施例中,音频组件810还包括一个扬声器,用于输出音频信号。
I/O接口812为处理组件802和外围接口模块之间提供接口,上述外围接口模块可以是键盘,点击轮,按钮等。这些按钮可包括但不限于:主页按钮、音量按钮、启动按钮和锁定按钮。
传感器组件814包括一个或多个传感器,用于为电子设备800提供各个方面的状态评估。例如,传感器组件814可以检测到电子设备800的打开/关闭状态,组件的相对定位,例如所述组件为电子设备800的显示器和小键盘,传感器组件814还可以检测电子设备800或电子设备800一个组件的位置改变,用户与电子设备800接触的存在或不存在,电子设备800方位或加速/减速和电子设备800的温度变化。传感器组件814可以包括接近传感器,被配置用来在没有任何的物理接触时检测附近物体的存在。传感器组件814还可以包括光传感器,如互补金属氧化物半导体(CMOS)或电荷耦合装置(CCD)图像传感器,用于在成像应用中使用。在一些实施例中,该传感器组件814还可以包括加速度传感器,陀螺仪传感器,磁传感器,压力传感器或温度传感器。
通信组件816被配置为便于电子设备800和其他设备之间有线或无线方式的通信。电子设备800可以接入基于通信标准的无线网络,如无线网络(WiFi),第二代移动通信技术(2G)或第三代移动通信技术(3G),或它们的组合。在一个示例性实施例中,通信组件816经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,所述通信组件816还包括近场通信(NFC)模块,以促进短程通信。例如,在NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。
在示例性实施例中,电子设备800可以被一个或多个应用专用集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理设备(DSPD)、可编程逻辑器件(PLD)、现场可编程门阵列(FPGA)、控制器、微控制器、微处理器或其他电子元件实现,用于执行上述方法。
在示例性实施例中,还提供了一种非易失性计算机可读存储介质,例如包括计算机程序指令的存储器804,上述计算机程序指令可由电子设备800的处理器820执行以完成上述方法。
图5示出根据本公开实施例的一种电子设备1900的框图。例如,电子设备1900可以被提供为一服务器。参照图5,电子设备1900包括处理组件1922,其进一步包括一个或多个处理器,以及由存储器1932所代表的存储器资源,用于存储可由处理组件1922的执行的指令,例如应用程序。存储器1932中存储的应用程序可以包括一个或一个以上的每一个对应于一组指令的模块。此外,处理组件1922被配置为执行指令,以执行上述方法。
电子设备1900还可以包括一个电源组件1926被配置为执行电子设备1900的电源管理,一个有线或无线网络接口1950被配置为将电子设备1900连接到网络,和一个输入输出(I/O)接口1958。电子设备1900可以操作基于存储在存储器1932的操作系统,例如微软服务器操作系统(Windows Server TM),苹果公司推出的基于图形用户界面操作系统(Mac OS X TM),多用户多进程的计算机操作系统(Unix TM),自由和开放原代码的类Unix操作系统(Linux TM),开放原代码的类Unix操作系统(FreeBSD TM)或类似。
在示例性实施例中,还提供了一种非易失性计算机可读存储介质,例如包括计算机程序指令的存储器1932,上述计算机程序指令可由电子设备1900的处理组件1922执行以完成上述方法。
本公开可以是系统、方法和/或计算机程序产品。计算机程序产品可以包括计算机可读存储介质,其上载有用于使处理器实现本公开的各个方面的计算机可读程序指令。
计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是(但不限于)电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、静态随机存取存储器(SRAM)、便携式压缩盘只读存储器(CD-ROM)、数字多功能盘(DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。这里所使用的计算机可读存储介质不被解释为瞬时信号本身,诸如无线电波或者其他自由传播的电磁波、通过波导或其他传输媒介传播的电磁波(例如,通过光纤电缆的光脉冲)、或者通过电线传输的电信号。
这里所描述的计算机可读程序指令可以从计算机可读存储介质下载到各个计算/处理设备,或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令,并转发该计算机可读程序指令,以供存储在各个计算/处理设备中的计算机可读存储介质中。
用于执行本公开操作的计算机程序指令可以是汇编指令、指令集架构(ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码,所述编程语言包括面向对象的编程语言—诸如Smalltalk、C++等,以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络—包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。在一些实施例中,通过利用计算机可读程序指令的状态信息来个性化定制电子电路,例如可编程逻辑电路、现场可编程门阵列(FPGA)或可编程逻辑阵列(PLA),该电子电路可以执行计算机可读程序指令,从而实现本公开的各个方面。
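Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.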
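The flowcharts and block diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions.
The computer program product may be implemented in hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
The embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.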

Claims (16)

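  1. An object speaking detection method, characterized by comprising:
    acquiring a video stream of a vehicle cabin and a sound signal collected by an in-vehicle microphone;
    performing face detection on each of a plurality of video frames of the video stream to determine a face area of a target object in the vehicle in each video frame;
    determining a lip movement recognition result for the lips of the target object according to the face areas of the target object in N of the video frames, N being an integer greater than 1;
    determining a speaking detection result of the target object according to the lip movement recognition result and a first sound signal, wherein the first sound signal includes the sound signal of the time period corresponding to the N video frames, and the speaking detection result indicates that the target object is in a speaking state or in a non-speaking state.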
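  2. The method according to claim 1, characterized in that the determining a speaking detection result of the target object according to the lip movement recognition result and the first sound signal comprises:
    when the lip movement recognition result indicates that lip movement occurs and the first sound signal includes speech, determining that the target object is in the speaking state.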
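  3. The method according to claim 1 or 2, characterized in that the method further comprises:
    when the target object is in the speaking state, performing content recognition on the first sound signal to determine speech content corresponding to the first sound signal;
    when the speech content includes a preset voice instruction, executing a control function corresponding to the voice instruction.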
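  4. The method according to claim 3, characterized in that the target object includes a driver,
    wherein the executing, when the speech content includes a preset voice instruction, a control function corresponding to the voice instruction comprises:
    when the voice instruction corresponds to a plurality of directional control functions, determining a gaze direction of the target object according to the face areas of the target object in the N video frames;
    determining a target control function from the plurality of control functions according to the gaze direction of the target object;
    executing the target control function.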
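  5. The method according to any one of claims 1-4, characterized in that the video stream includes a first video stream of a driver area and/or a second video stream of an occupant area in the cabin;
    the performing face detection on each of a plurality of video frames of the video stream comprises:
    detecting the driver's face based on each of a plurality of first video frames of the first video stream; and/or
    detecting faces in the cabin based on each of a plurality of second video frames of the second video stream, and
    determining the driver's face in each second video frame according to the positions of the detected faces in the cabin.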
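  6. The method according to any one of claims 1-5, characterized in that the acquiring a video stream of a vehicle cabin comprises:
    acquiring the first video stream of the driver area collected by a driver monitoring system (DMS) camera; and/or
    acquiring the second video stream of the occupant area in the cabin collected by an occupant monitoring system (OMS) camera.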
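  7. The method according to any one of claims 1-6, characterized in that the method further comprises:
    determining a first seat area of the target object according to each of the plurality of video frames;
    wherein the determining a speaking detection result of the target object according to the lip movement recognition result and the first sound signal comprises:
    when the first sound signal includes speech, performing sound zone localization on the first sound signal to determine a second seat area corresponding to the first sound signal;
    when the lip movement recognition result indicates that lip movement occurs and the first seat area is the same as the second seat area, determining that the target object is in the speaking state.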
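  8. The method according to claim 7, characterized in that
    the video stream includes a first video stream of a driver area, the target object includes a driver, and the determining a first seat area of the target object according to each of the plurality of video frames comprises:
    in response to detecting a face in the driver area according to each video frame, determining that the first seat area of the target object is the driver area;
    and/or
    the video stream includes a second video stream of an occupant area in the cabin, the target object includes a driver and/or a passenger, and the determining a first seat area of the target object according to each of the plurality of video frames comprises:
    performing face detection on each of the plurality of video frames;
    determining the first seat area of the target object according to the detected face position.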
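  9. The method according to claim 7 or 8, characterized in that the method further comprises:
    when the target object is in the speaking state, performing content recognition on the first sound signal to determine speech content corresponding to the first sound signal;
    when the speech content includes a preset voice instruction, determining an area control function corresponding to the voice instruction according to the first seat area of the target object;
    executing the area control function.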
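  10. The method according to any one of claims 1-6, characterized in that the video stream includes a first video stream of a driver area, and the target object includes a driver;
    the determining a speaking detection result of the target object according to the lip movement recognition result and the first sound signal comprises: when sound zone localization performed on the first sound signal determines that the first sound signal comes from the driver area, determining whether lip movement of the driver occurs according to the lip movement recognition result;
    in response to no lip movement of the driver occurring, determining that the speaking detection result of the target object is that the driver is in the non-speaking state.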
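  11. The method according to any one of claims 1-10, characterized in that the determining a lip movement recognition result for the lips of the target object according to the face areas of the target object in N of the video frames comprises:
    performing feature extraction separately on the face areas of the target object in the N video frames to obtain N face features of the target object;
    fusing the N face features to obtain a fused face feature of the target object;
    performing lip movement recognition on the fused face feature to obtain the lip movement recognition result of the target object.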
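  12. The method according to claim 11, characterized in that the performing feature extraction separately on the face areas of the target object in the N video frames to obtain N face features of the target object comprises:
    for the i-th video frame among the N video frames, performing face keypoint extraction on the face area of the target object in the i-th video frame to obtain a plurality of face keypoints, where 1≤i≤N;
    normalizing the plurality of face keypoints to obtain the i-th face feature of the target object.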
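  13. An object speaking detection apparatus, characterized by comprising:
    a signal acquisition module, configured to acquire a video stream of a vehicle cabin and a sound signal collected by an in-vehicle microphone;
    a face detection module, configured to perform face detection on each of a plurality of video frames of the video stream to determine a face area of a target object in the vehicle in each video frame;
    a lip movement recognition module, configured to determine a lip movement recognition result for the lips of the target object according to the face areas of the target object in N video frames among the plurality of video frames, N being an integer greater than 1;
    a speaking detection module, configured to determine a speaking detection result of the target object according to the lip movement recognition result and a first sound signal, wherein the first sound signal includes the sound signal of the time period corresponding to the N video frames, and the speaking detection result indicates that the target object is in a speaking state or in a non-speaking state.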
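  14. An electronic device, characterized by comprising:
    a processor;
    a memory for storing processor-executable instructions;
    wherein the processor is configured to invoke the instructions stored in the memory to perform the method according to any one of claims 1 to 12.
  15. A computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 12.
  16. A computer program, comprising computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device executes instructions for implementing the method according to any one of claims 1 to 12.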
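
As an informal illustration only (not part of the claims), the decision logic of claims 2, 7 and 10 and the feature steps of claims 11 and 12 can be sketched in Python. Everything named here is hypothetical: `Observation`, `is_speaking`, `normalize_keypoints` and `fuse_features` do not appear in the disclosure. The claims do not say how keypoints are normalized or how features are fused, so bounding-box scaling and simple concatenation are assumptions chosen for brevity. The lip movement classifier, the speech detector and the sound zone localizer are assumed to exist elsewhere and are represented only by their outputs.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def normalize_keypoints(points: Sequence[Point]) -> List[float]:
    """Claim 12, one possible reading: scale keypoints into their bounding box.

    Bounding-box scaling is an assumption; the claims only say "normalizing".
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    w = max(max(xs) - x0, 1e-6)  # guard against a degenerate (zero-width) box
    h = max(max(ys) - y0, 1e-6)
    feature: List[float] = []
    for x, y in points:
        feature.extend(((x - x0) / w, (y - y0) / h))
    return feature


def fuse_features(features: Sequence[Sequence[float]]) -> List[float]:
    """Claim 11: fuse N per-frame features; concatenation is an assumed choice."""
    return [value for feature in features for value in feature]


@dataclass
class Observation:
    lip_moving: bool                  # lip movement recognition result over N frames
    has_speech: bool                  # whether the first sound signal contains speech
    face_zone: Optional[str] = None   # first seat area, derived from the video
    sound_zone: Optional[str] = None  # second seat area, from sound zone localization


def is_speaking(obs: Observation) -> bool:
    """Claims 2, 7 and 10: audio and lip movement must agree, and so must the
    seat areas whenever both are available."""
    if not (obs.lip_moving and obs.has_speech):
        return False
    if obs.face_zone is not None and obs.sound_zone is not None:
        return obs.face_zone == obs.sound_zone
    return True
```

In this sketch, a voice heard from the driver area while the driver's lips are still yields "not speaking", which matches the case in claim 10; the zone labels are free-form strings and would in practice come from whatever seat layout the system uses.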
PCT/CN2021/127097 2021-06-30 2021-10-28 Object speaking detection method and apparatus, electronic device, and storage medium WO2023273064A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110735963.6A CN113486760A (zh) 2021-06-30 2021-06-30 Object speaking detection method and apparatus, electronic device, and storage medium
CN202110735963.6 2021-06-30

Publications (1)

Publication Number Publication Date
WO2023273064A1 (zh) 2023-01-05

Family

ID=77937073

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127097 WO2023273064A1 (zh) 2021-06-30 2021-10-28 Object speaking detection method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN113486760A (zh)
WO (1) WO2023273064A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
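CN113486760A (zh) * 2021-06-30 2021-10-08 上海商汤临港智能科技有限公司 Object speaking detection method and apparatus, electronic device, and storage medium
CN114299944B (zh) * 2021-12-08 2023-03-24 天翼爱音乐文化科技有限公司 Video processing method, system, apparatus, and storage medium
CN114615534A (zh) * 2022-01-27 2022-06-10 海信视像科技股份有限公司 Display device and audio processing method
CN115410566A (zh) * 2022-03-10 2022-11-29 北京罗克维尔斯科技有限公司 Vehicle control method, apparatus, device, and storage medium
CN114734942A (zh) * 2022-04-01 2022-07-12 深圳地平线机器人科技有限公司 Method and apparatus for adjusting the sound effect of an in-vehicle audio system
CN115063867A (zh) * 2022-06-30 2022-09-16 上海商汤临港智能科技有限公司 Speaking state recognition method, model training method, apparatus, vehicle, and medium
CN115880744B (zh) * 2022-08-01 2023-10-20 北京中关村科金技术有限公司 Lip-movement-based video character recognition method, apparatus, and storage medium
CN116259100B (zh) * 2022-09-28 2024-09-03 北京中关村科金技术有限公司 Recognition method based on lip movement tracking, apparatus, storage medium, and electronic device
CN118571219B (zh) * 2024-08-02 2024-10-15 成都赛力斯科技有限公司 Method, apparatus, device, and storage medium for enhancing conversation among cabin occupants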

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110071830A1 (en) * 2009-09-22 2011-03-24 Hyundai Motor Company Combined lip reading and voice recognition multimodal interface system
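CN105700676A (zh) * 2014-12-11 2016-06-22 现代自动车株式会社 Wearable glasses, control method thereof, and vehicle control system
CN109410957A (zh) * 2018-11-30 2019-03-01 福建实达电脑设备有限公司 Computer-vision-assisted speech recognition method and system for frontal human-computer interaction
CN110544491A (zh) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Method and apparatus for associating a speaker with their speech recognition result in real time
CN110750152A (zh) * 2019-09-11 2020-02-04 云知声智能科技股份有限公司 Human-computer interaction method and system based on lip movements
CN113486760A (zh) * 2021-06-30 2021-10-08 上海商汤临港智能科技有限公司 Object speaking detection method and apparatus, electronic device, and storage medium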

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
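CN108320739B (zh) * 2017-12-22 2022-03-01 景晖 Method and apparatus for assisting voice command recognition using position information
CN108831462A (zh) * 2018-06-26 2018-11-16 北京奇虎科技有限公司 In-vehicle speech recognition method and apparatus
CN110857067B (zh) * 2018-08-24 2023-04-07 上海汽车集团股份有限公司 Human-vehicle interaction apparatus and human-vehicle interaction method
CN109814448A (zh) * 2019-01-16 2019-05-28 北京七鑫易维信息技术有限公司 In-vehicle multimodal control method and system
CN110545396A (zh) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Speech recognition method and apparatus based on localization denoising
CN110503957A (zh) * 2019-08-30 2019-11-26 上海依图信息技术有限公司 Speech recognition method and apparatus based on image denoising
CN111240477A (zh) * 2020-01-07 2020-06-05 北京汽车研究总院有限公司 In-vehicle human-computer interaction method and system, and vehicle having the system
CN111341350A (zh) * 2020-01-18 2020-06-26 南京奥拓电子科技有限公司 Human-computer interaction control method and system, intelligent robot, and storage medium
CN112655000B (zh) * 2020-04-30 2022-10-25 华为技术有限公司 In-vehicle user positioning method, in-vehicle interaction method, in-vehicle apparatus, and vehicle
CN112102546A (zh) * 2020-08-07 2020-12-18 浙江大华技术股份有限公司 Human-computer interaction control method, intercom call method, and related apparatus
CN112026790B (zh) * 2020-09-03 2022-04-15 上海商汤临港智能科技有限公司 Control method and apparatus for an in-vehicle robot, vehicle, electronic device, and medium
CN112397065A (zh) * 2020-11-04 2021-02-23 深圳地平线机器人科技有限公司 Voice interaction method and apparatus, computer-readable storage medium, and electronic device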

Also Published As

Publication number Publication date
CN113486760A (zh) 2021-10-08

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21947991

Country of ref document: EP

Kind code of ref document: A1