WO2024032159A1 - Speaking object detection in a multi-person human-computer interaction scenario - Google Patents
Speaking object detection in a multi-person human-computer interaction scenario
- Publication number
- WO2024032159A1 WO2024032159A1 PCT/CN2023/101635 CN2023101635W WO2024032159A1 WO 2024032159 A1 WO2024032159 A1 WO 2024032159A1 CN 2023101635 W CN2023101635 W CN 2023101635W WO 2024032159 A1 WO2024032159 A1 WO 2024032159A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- features
- audio
- speaker
- frame data
- person
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/76—Television signal recording
- H04N5/91—Television signal processing therefor
- H04N5/92—Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback
- H04N5/9201—Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback involving the multiplexing of an additional signal and the video signal
- H04N5/9202—Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback involving the multiplexing of an additional signal and the video signal the additional signal being a sound signal
Definitions
- the invention belongs to the field of computer technology, and particularly relates to speaking object detection in a multi-person human-computer interaction scenario.
- in the process of verbal interaction, one party is always the speaker and the other party is the speaking object, that is, the party from whom the speaker expects a response. In particular, during human-computer interaction, the robot replies after receiving a voice message.
- embodiments of the present invention provide a device and method for speaking object detection in a multi-person human-computer interaction scenario.
- a device for speaking object detection in a multi-person human-computer interaction scenario, the interaction involving a crowd of multiple people and at least one robot.
- the device includes: an audio and video collection module for collecting time-stamped video frame data and time-stamped audio frame data in real time, wherein the multiple video frames included in the video frame data and the multiple audio frames included in the audio frame data are synchronized according to the timestamps; a text generation module that generates time-stamped text information based on the audio frame data; a face processing module that detects, by machine vision, the faces in each video frame of the video frame data and tracks the same person across multiple video frames to obtain face sequence data; a text feature extraction module for extracting text semantic features from the time-stamped text information through machine learning or deep learning methods; an audio feature extraction module for extracting human voice audio features from the audio frame data through machine learning or deep learning methods; a face feature extraction module for extracting facial features of persons from the face sequence data through machine learning or deep learning methods, the facial features including temporal features and spatial features of the person's face; a speaker detection module that uses machine learning or deep learning methods to identify the speaker at the current moment in the crowd based on the facial features in the face sequence data and the human voice audio features, so as to obtain the speaker information at the current moment; and a speaking object recognition module that uses machine learning or deep learning methods to identify the speaking object of the speaker at the current moment in the crowd based on scene features, the text semantic features, the human voice audio features, and the facial features in the face sequence data, so as to detect whether the speaking object of the speaker at the current moment is a robot.
- the scene features include speaker information and speaking object information at the previous moment.
- the scene features can be stored in a scene database for retrieval by the speaking object recognition module.
- the audio and video collection module includes: a video collection module for collecting time-stamped video frame data in real time using a camera; and an audio collection module for collecting time-stamped audio frame data using a microphone.
- the video frame data is stored in a video frame database in chronological order; the audio frame data is stored in an audio frame database in chronological order.
- the face processing module includes: a face detection module, which uses a deep learning method to detect the faces in the video frames included in the video frame data and assigns a unique, fixed identifier to the same face detected in two or more video frames to represent that person; and a face tracking module for tracking the same person across multiple video frames based on the detection results output by the face detection module, so as to obtain time-stamped face sequence data.
- the face sequence data with timestamps is stored in the face database.
- the speaker detection module includes: a first multi-modal fusion module for fusing the facial features of the person and the human voice audio features into a first multi-modal feature by timestamp based on the face sequence data; and a speaking status detection module for inputting the first multi-modal feature into a deep learning network to predict, one by one, the speaking status of each person in the crowd at the current moment, thereby determining the speaker at the current moment and the corresponding speaker information.
- the speaker information at the current moment is stored in the speaker database.
- the speaker database may store the speaker information by timestamp.
- the speaking object recognition module includes: a second multi-modal fusion module for fusing the above-mentioned facial features of the person, the vocal audio features, the text semantic features, and the scene features into a second multi-modal feature by timestamp based on the face sequence data; and a speaking object detection module for inputting the second multi-modal feature into a deep learning network to predict, one by one, whether each person in the crowd and each robot is the speaking object of the speaker at the current moment, and to determine the speaking object information at the current moment accordingly.
- the speaking object information at the current moment is stored in a speaking object database for calling by other modules, or is output as a result.
- the speaking object database may store the speaking object information by timestamp.
- the text generation module includes a speech recognition module configured to generate time-stamped text information corresponding to multiple levels based on the audio frame data.
- the multiple levels include the word level, the sentence level, the conversation topic level, and so on.
- a text database is used to store the text information in chronological order and by level.
- a method for speaking object detection in a multi-person human-computer interaction scenario, the interaction involving a crowd of multiple people and at least one robot.
- the method includes the following steps: in step S1, the audio and video collection module collects time-stamped video frame data in real time, for example using a camera, and collects time-stamped audio frame data, for example using a microphone, wherein the multiple video frames included in the video frame data and the multiple audio frames included in the audio frame data are synchronized according to the timestamps; in step S2, the text generation module generates time-stamped text information at different levels, such as the word level, the sentence level and the conversation topic level, by performing speech recognition on the audio frame data in real time, and the text feature extraction module extracts text semantic features from the time-stamped text information.
- in step S3, the face processing module detects, by machine vision, the faces in each video frame of the video frame data and tracks the same person across multiple video frames to obtain face sequence data; the facial feature extraction module extracts the facial features of persons from the face sequence data, and the audio feature extraction module extracts human voice audio features from the audio frame data.
- in step S4, the speaker detection module uses machine learning or deep learning methods to identify the speaker at the current moment in the crowd based on the facial features of the person and the human voice audio features, so as to obtain the speaker information at the current moment; in step S5, the speaking object recognition module uses machine learning or deep learning methods to identify the speaking object of the speaker at the current moment in the crowd based on the scene features, the text semantic features, the human voice audio features and the facial features of the person, so as to detect whether the speaking object of the speaker at the current moment is a robot.
- the scene features include speaker information and speaking object information at the previous moment.
- the video frame data can be published as a Robot Operating System (ROS) topic, so that the video frame data can be obtained in real time by subscribing to the image topic; the audio frame data can likewise be published as a ROS topic, so that the audio frame data can be obtained in real time by subscribing to the audio topic.
- ROS Robot Operating System
- YOLO You Only Look Once
- Deep SORT Deep Simple Online Realtime Tracking
- the tracking result is that each person is assigned an ID, and throughout the process, each person's ID is unique and fixed.
- step S4 may include the following specific steps: performing fusion coding on the facial features of the person and the human voice audio features by timestamp based on the face sequence data to obtain the first multi-modal feature; and using a deep learning method to predict the speaker at the current moment in the crowd based on the first multi-modal feature.
- step S5 may include the following specific steps: performing fusion coding on the scene features, the text semantic features, the vocal audio features and the facial features of the person by timestamp based on the face sequence data, that is, performing multi-modal feature fusion, to obtain the second multi-modal feature; and using a deep learning method to predict, one by one based on the second multi-modal feature, the probability that each person in the crowd is the speaking object of the speaker at the current moment.
- the Transformer method is used to perform the encoding and decoding.
- the speaking object can be predicted in a multi-person human-computer interaction scenario in which the number of people changes at any time.
- by using a multi-modal fusion module to associate feature information of different dimensions, information useful for judging the speaking object can be extracted.
- the prediction efficiency during use can be effectively improved.
- Figure 1 is a schematic diagram of a scene in which multiple people interact with a robot according to an embodiment of the present invention
- Figure 2 is a schematic module diagram of a speaking object detection device in a multi-person human-computer interaction scenario according to an embodiment of the present invention
- Figure 3 is a flow chart of a speaking object detection method in a multi-person human-computer interaction scenario according to an embodiment of the present invention
- Figure 4 is a schematic diagram of an optional model architecture of the speaking object recognition module according to an embodiment of the present invention.
- Figure 1 shows a schematic diagram of an example of an interaction scenario between multiple people and a robot.
- squares represent objects in the scene; isosceles triangles represent people in the scene, with the apex indicating the orientation of the person; and the circle marked R represents the robot.
- the human-computer interaction in this scene involves four people and one robot. Those skilled in the art should understand that Figure 1 is only an example of a multi-person human-computer interaction scenario, and that the number of people and robots actually participating in the interaction is not limited to this and can change at any time.
- Figure 2 shows a functional module diagram of a device for speaking object detection in a multi-person human-computer interaction scenario according to an embodiment of the present invention.
- the device includes an audio and video collection module 110, a text generation module 120, a face processing module 130, a text feature extraction module 140, an audio feature extraction module 150, a face feature extraction module 160, a speaker detection module 170, and a speaking object recognition module 180.
- the audio and video collection module 110 can collect time-stamped video frame data (where the video frame data includes video frames such as color images) in real time, for example, using a camera, and collect time-stamped audio frame data, for example, using a microphone.
- video frame data and audio frame data can be stored in the video frame database 101 or the audio frame database 102 respectively in time sequence.
- a plurality of video frames included in the video frame data and a plurality of audio frames included in the audio frame data are synchronized according to the time stamp. In other words, video and audio captured at the same moment should be synchronized based on timestamps.
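As a rough illustration only (not part of the patent), the sketch below shows one way such timestamp-based alignment could be implemented: each video frame is paired with the audio frames whose timestamps fall inside a small tolerance window. The `(timestamp, payload)` layout and the tolerance value are assumptions introduced here.

```python
# Illustrative sketch: align audio frames to video frames purely by timestamp.
from bisect import bisect_left

def sync_audio_to_video(video_frames, audio_frames, tol=0.04):
    """video_frames / audio_frames: lists of (timestamp_sec, payload), sorted by
    timestamp. Returns a list of (video_payload, [audio_payloads]) pairs."""
    audio_ts = [t for t, _ in audio_frames]
    pairs = []
    for v_ts, v_data in video_frames:
        lo = bisect_left(audio_ts, v_ts - tol)   # first audio frame in the window
        hi = bisect_left(audio_ts, v_ts + tol)   # first audio frame past the window
        pairs.append((v_data, [a for _, a in audio_frames[lo:hi]]))
    return pairs
```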
- the text generation module 120 can generate time-stamped text information at different levels, such as the word level, the sentence level and the conversation topic level, based on the audio frame data, for example through speech recognition. In some embodiments, as shown in Figure 2, the above text information can be stored in the text database 104.
- the face processing module 130 can detect human faces in video frames such as color images through machine vision methods, and track the same person in multiple video frames to obtain face sequence data.
- face sequence data can be stored in the face database 103.
- the plurality of video frames may be a plurality of consecutive video frames, for example, they may be a plurality of video frames continuously captured by a camera within a specific length of time.
- the multiple video frames may also be multiple discontinuous video frames. In this way, even if someone exits the scene and comes back again, person tracking can still be effectively implemented.
- the text feature extraction module 140 can extract time-stamped text semantic features by inputting time-stamped text information corresponding to different levels into a natural language deep learning network.
- the text can be viewed as a word sequence and encoded using a word encoder such as GloVe to obtain a text semantic feature vector of a specific length (for example, 128 dimensions).
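The patent does not spell out the text pipeline beyond "a word encoder such as GloVe"; as a hedged sketch, the snippet below mean-pools pretrained GloVe vectors over the recognized words and projects the result to a 128-dimensional text semantic feature. The file name `glove.6B.300d.txt`, the mean pooling, and the linear projection are illustrative assumptions.

```python
# Hypothetical text-encoding sketch: GloVe lookup + mean pooling + 128-d projection.
import numpy as np
import torch
import torch.nn as nn

def load_glove(path="glove.6B.300d.txt"):
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split(" ")
            vecs[word] = np.asarray(vals, dtype=np.float32)
    return vecs

glove = load_glove()
project = nn.Linear(300, 128)   # maps the pooled embedding to the 128-d feature above

def encode_text(sentence: str) -> torch.Tensor:
    tokens = sentence.lower().split()
    embs = [glove[t] for t in tokens if t in glove]
    pooled = np.mean(embs, axis=0) if embs else np.zeros(300, dtype=np.float32)
    return project(torch.from_numpy(pooled))     # 128-d text semantic feature vector
```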
- the audio feature extraction module 150 can extract time-stamped human voice audio features by inputting the time-stamped audio frame data into the deep learning network.
- the audio frame data can be first divided into overlapping audio segments, and then feature extraction is performed on the audio segments to obtain Mel-Frequency Cepstral Coefficients (MFCC) as input for further audio feature extraction.
- MFCC Mel-Frequency Cepstral Coefficients
- the MFCC can be input into a deep learning network, and a vocal audio feature vector of a specific length (eg, 128 dimensions) is generated based on the input MFCC.
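A minimal sketch of this audio branch, assuming librosa for MFCC extraction over overlapping windows and a small GRU as the deep learning network that summarizes the MFCC sequence into a 128-dimensional vocal feature; the window sizes and the GRU are illustrative choices, not taken from the patent.

```python
# Hedged sketch: overlapping-window MFCCs -> GRU -> 128-d human voice audio feature.
import librosa
import torch
import torch.nn as nn

def extract_mfcc(wav_path, sr=16000, n_mfcc=40):
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms analysis windows with a 10 ms hop, i.e. overlapping audio segments
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return torch.from_numpy(mfcc).T.float()          # shape: (time, n_mfcc)

class AudioEncoder(nn.Module):
    def __init__(self, n_mfcc=40, dim=128):
        super().__init__()
        self.gru = nn.GRU(n_mfcc, dim, batch_first=True)

    def forward(self, mfcc):                          # mfcc: (batch, time, n_mfcc)
        _, h = self.gru(mfcc)
        return h[-1]                                  # (batch, 128) vocal feature

# usage: feat = AudioEncoder()(extract_mfcc("utterance.wav").unsqueeze(0))
```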
- the facial feature extraction module 160 can extract time-stamped facial features by inputting face sequence data into the deep learning network.
- the person's facial features may include temporal and spatial features of the person's face. For example, the face sequence data of each person can be viewed as a sequence of image patches; the patch sequence is converted into a visual feature encoding by a deep learning network, and the visual feature encoding is then added to a positional encoding to obtain the corresponding facial features of the person.
- the facial features of a person can be characterized as a feature vector of a specific length (for example, 128 dimensions).
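The exact facial network is not disclosed; the following is a hedged sketch of the idea described above: each face crop in the sequence is split into patches, linearly embedded into a visual feature encoding, summed with a learned positional encoding, and pooled into a 128-dimensional facial feature. Patch size, input resolution, and mean pooling are assumptions for illustration.

```python
# Rough sketch of the facial branch: patch embedding + positional encoding -> 128-d.
import torch
import torch.nn as nn

class FaceSequenceEncoder(nn.Module):
    def __init__(self, patch=16, in_ch=3, dim=128, max_len=512):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(patch * patch * in_ch, dim)       # visual feature encoding
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))    # learned position encoding

    def forward(self, faces):        # faces: (batch, seq, 3, 112, 112) aligned face crops
        b, s, c, h, w = faces.shape
        p = self.patch
        x = faces.reshape(b, s, c, h // p, p, w // p, p)
        x = x.permute(0, 1, 3, 5, 2, 4, 6).reshape(b, s * (h // p) * (w // p), -1)
        x = self.embed(x) + self.pos[:, : x.size(1)]             # add positional encoding
        return x.mean(dim=1)                                     # (batch, 128) facial feature

# usage: feat = FaceSequenceEncoder()(torch.randn(1, 4, 3, 112, 112))
```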
- the speaker detection module 170 can identify, through machine learning or deep learning methods, the speaker at the current moment in the crowd based on the facial features of the person in the face sequence data and the vocal audio features, so as to obtain the speaker information at the current moment.
- the speaker information at the current moment can be stored in the speaker database 105 .
- speaker database 105 may store speaker information by timestamp.
- the speaking object recognition module 180 can identify, through machine learning or deep learning methods, the speaking object of the speaker at the current moment in the crowd based on scene features, the text semantic features, the human voice audio features, and the facial features of the people in the face sequence data, so as to detect whether the speaking object of the speaker at the current moment is a robot.
- the speaking object information may be stored in the speaking object database 106 .
- the audio and video collection module 110 may include a video collection module 111 and an audio collection module 112.
- the video capture module 111 can capture time-stamped video frames, such as color images, in real time, for example, using a camera.
- the audio collection module 112 can collect time-stamped audio frame data, for example, using a microphone.
- the video frame database 101 can be used to store time-stamped video frame data in chronological order for retrieval by other modules, such as the face processing module 130; the audio frame database 102 can likewise be used to store time-stamped audio frame data in chronological order for retrieval by other modules, such as the text generation module 120 and the audio feature extraction module 150.
- the face processing module 130 may include a face detection module 131 and a face tracking module 132 .
- the face detection module 131 can use a deep learning method to detect faces in the video frames included in the video frame data and assign a unique, fixed identifier to the same face detected in two or more video frames to represent that person; the face tracking module 132 can track the same person across multiple video frames based on the detection results output by the face detection module 131 to obtain time-stamped face sequence data. By giving the same face a unique and fixed identifier, even if the person disappears from the scene and reappears, the original ID can still be used to represent that person.
- the face database 103 can be used to store face sequence data with timestamps for calls by other modules, such as the face feature extraction module 160 .
- the speaker detection module 170 may include a first multi-modal fusion module 171 and a speaking state detection module 172 .
- the first multi-modal fusion module 171 can fuse the above-mentioned facial features and human voice audio features into the first multi-modal feature by timestamp based on the face sequence data;
- the speaking state detection module 172 can input the above-mentioned first multi-modal feature into the deep learning network and predict, one by one, the speaking status of each person in the crowd at the current moment, thereby determining the speaker at the current moment and the corresponding speaker information.
- the speaker database 105 can be used to store the speaker information at the current moment for calls by other modules, such as the speaking object identification module 180 .
- a splicing (concatenation) method can be used to fuse the facial features and the vocal audio features into the first multi-modal feature.
- for example, when the facial feature and the vocal audio feature are both 128-dimensional vectors, the first multi-modal feature obtained through feature splicing will be a 256-dimensional vector.
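A minimal illustration of this splicing (concatenation) step; the tensors below are random placeholders standing in for one person's aligned features.

```python
# Concatenation fusion of a 128-d facial feature and a 128-d vocal feature.
import torch

face_feat  = torch.randn(1, 128)   # facial feature of one person at one timestamp
voice_feat = torch.randn(1, 128)   # vocal audio feature aligned to the same timestamp

first_multimodal = torch.cat([face_feat, voice_feat], dim=-1)
print(first_multimodal.shape)      # torch.Size([1, 256])
```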
- the speaking object recognition module 180 may include a second multi-modal fusion module 181 and a speaking object detection module 182 .
- the second multi-modal fusion module 181 can fuse the above-mentioned facial features, human voice audio features, text semantic features, and the scene features from the scene database 107 into the second multi-modal feature by timestamp based on the face sequence data.
- the speaking object detection module 182 can input the above-mentioned second multi-modal feature into the deep learning network to predict, one by one, whether each person in the crowd and each robot is the speaking object of the speaker at the current moment, and determine the speaking object information at the current moment accordingly.
- the speaking object database 106 can be used to store the speaking object information at the current moment for retrieval by other modules, such as the scene database 107.
- alternatively, the speaking object information at the current moment can also be directly output as the result.
- the scene database 107 can store speaker information and speaking object information at the previous moment for use by the speaking object recognition module 180 .
- the text generation module 120 may include a speech recognition module 121 .
- the speech recognition module 121 can generate time-stamped text information corresponding to different levels such as word level, sentence level, dialogue topic level, etc. by performing speech recognition based on audio frame data.
- the text database 104 can be used to store the above time-stamped text information in chronological order and hierarchy for calling by other modules, such as the text feature extraction module 140 .
- Figure 3 shows a schematic flowchart of a method for speaking object detection in a multi-person human-computer interaction scenario according to an embodiment of the present invention. As shown in Figure 3, the method may include the following steps S1 to S5.
- in step S1, the audio and video collection module 110 collects time-stamped video frame data in real time, for example using a camera, and collects time-stamped audio frame data, for example using a microphone.
- the multiple video frames included in the video frame data and the multiple audio frames included in the audio frame data may be stored in a video frame database or an audio frame database in time sequence. In this way, the video and audio collected at the same time can be synchronized based on the timestamp.
- the video frame at the current moment may refer to a color image obtained in real time during actual operation.
- the color images collected by the monocular camera are published in the form of ROS topics, so that color images can be obtained in real time by subscribing to the image topic.
- the audio information collected by the array microphone can also be published as a ROS topic, so that the audio information can be obtained in real time by subscribing to the audio topic.
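As a hedged ROS 1 sketch of this acquisition path: a node subscribes to an image topic and an audio topic and records each message with a timestamp. The topic names `/camera/color/image_raw` and `/audio/audio`, and the use of `audio_common_msgs/AudioData`, are assumptions that depend on the actual camera and microphone drivers.

```python
# Illustrative ROS 1 node (not from the patent) that receives video and audio topics.
import rospy
from sensor_msgs.msg import Image
from audio_common_msgs.msg import AudioData
from cv_bridge import CvBridge

bridge = CvBridge()

def on_image(msg):
    stamp = msg.header.stamp.to_sec()            # timestamp of the video frame
    frame = bridge.imgmsg_to_cv2(msg, "bgr8")    # color image as a numpy array
    # ... store (stamp, frame) in the video frame buffer/database ...

def on_audio(msg):
    stamp = rospy.get_rostime().to_sec()         # AudioData carries no header of its own
    # ... store (stamp, bytes(msg.data)) in the audio frame buffer/database ...

rospy.init_node("av_collection_module")
rospy.Subscriber("/camera/color/image_raw", Image, on_image)
rospy.Subscriber("/audio/audio", AudioData, on_audio)
rospy.spin()
```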
- step S2 the text generation module 120 performs speech recognition on the audio frame data in real time to generate text information with time stamps at different levels such as word level, sentence level, conversation topic level, etc., and the text feature extraction module 140 generates text information from Extract text semantic features from text information with timestamps.
- the above text information can be stored in the text database 104.
- step S3 the face processing module 130 detects faces in the video frame data through machine vision and tracks the same person in multiple video frames to obtain face sequence data; and the face feature extraction module 160 extracts human facial features from the face sequence data, and uses the audio feature extraction module 150 to extract vocal audio features from the audio frame data.
- YOLO can be used for face detection
- Deep SORT's model can be used for multi-target tracking.
- the result of tracking is that each person is assigned an ID, and throughout the entire process, each person's ID is unique and fixed.
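One possible concrete realization, under the assumption of off-the-shelf libraries (ultralytics YOLO and the deep-sort-realtime package); the weights file `yolov8n-face.pt` is a hypothetical face-detection checkpoint and not something named by the patent.

```python
# Hedged sketch: YOLO face detection + Deep SORT tracking with fixed per-person IDs.
import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

detector = YOLO("yolov8n-face.pt")   # hypothetical face-detection weights
tracker = DeepSort(max_age=30)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = detector(frame, verbose=False)[0]
    detections = []
    for box, conf in zip(result.boxes.xyxy.tolist(), result.boxes.conf.tolist()):
        x1, y1, x2, y2 = box
        detections.append(([x1, y1, x2 - x1, y2 - y1], conf, "face"))  # ltwh format
    for track in tracker.update_tracks(detections, frame=frame):
        if track.is_confirmed():
            person_id = track.track_id           # unique, fixed ID for this person
            l, t, r, b = track.to_ltrb()
            # ... append (timestamp, person_id, face crop) to the face sequence data ...
```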
- in step S4, the speaker detection module 170 identifies, through a machine learning or deep learning method, the speaker at the current moment in the crowd based on the facial features of the person and the audio features of the human voice, so as to obtain the speaker information at the current moment.
- step S4 may further include: performing fusion coding on the facial features of the person and the audio features of the human voice by timestamp based on the face sequence data, that is, performing multi-modal feature fusion, to obtain the first multi-modal feature; and, using a deep learning method, predicting the speaker at the current moment in the crowd based on the first multi-modal feature.
- in step S5, the speaking object recognition module 180 uses machine learning or deep learning methods to identify the speaking object of the speaker at the current moment in the crowd based on scene features, the text semantic features, the human voice audio features, and the person's facial features, so as to detect whether the speaking object of the speaker at the current moment is a robot.
- step S5 may further include: performing fusion coding on the scene features, the text semantic features, the vocal audio features and the facial features of the person by timestamp based on the face sequence data, that is, performing multi-modal feature fusion, to obtain the second multi-modal feature; and, using a deep learning method, predicting one by one, based on the second multi-modal feature, the probability that each person in the crowd is the speaking object of the speaker at the current moment.
- a deep learning method for prediction based on the first/second multi-modal features may be performed using a Transformer model that is well known to those skilled in the art.
- the Transformer model includes input, encoder, decoder and output.
- the input of the Transformer model is the encoded sequence.
- for video frame data, the frame images are generally divided into blocks and arranged into an image sequence, and the acquisition time of each frame image serves as an element of the image sequence.
- for text information, a piece of text is first tokenized into a word sequence, and word encoding is then performed on each token in the word sequence to generate a text encoding sequence.
- audio frame data likewise needs to be encoded into an audio sequence before it can be used as input to the Transformer model.
- the encoder in the Transformer model mainly consists of 6 layers of encoding modules.
- each coding module mainly includes a multi-head self-attention mechanism layer and a fully connected feed-forward layer, both of which employ residual connections and normalization.
- the multi-head self-attention mechanism layer takes the sequence encoding of the previous layer as input and generates the q, k, and v values of the query-key-value triplet (query, key, value) through fully connected layers.
- the q, k, and v values may all be feature vectors with a length of 64. Across the sequences, each q performs attention over each k.
- the calculation is the scaled dot-product attention: Attention(q, k, v) = softmax(q·kᵀ / √d_k) · v,
- where d_k represents the length of the feature vector, which is equal to 64.
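The same formula, rendered as a minimal single-head PyTorch function with the 64-dimensional q, k, v vectors mentioned above (no masking or multi-head splitting shown).

```python
# Scaled dot-product attention with d_k = 64, matching the formula above.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, d_k=64):
    # q, k, v: (seq_len, d_k) feature vectors produced by the fully connected layers
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(10, 64)
k = torch.randn(10, 64)
v = torch.randn(10, 64)
out = scaled_dot_product_attention(q, k, v)   # shape: (10, 64)
```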
- the decoder in the Transformer model mainly consists of 6 layers of decoding modules.
- Each decoding module includes 2 multi-head self-attention mechanism layers and a fully connected forward propagation layer.
- the input to the decoder includes the output of the encoder and the previous output of the decoder.
- the output of the decoder is the output of the Transformer model.
- the application of the Transformer model in the embodiments of the present application is outlined below, taking the prediction of the speaking object based on the second multi-modal feature as an example.
- the input data includes the speaker's face image sequence, the face image sequence of other people, the audio frame data of the corresponding time period, and the text information of the corresponding time period.
- first, features are extracted separately from the image information, the audio information and the text information to obtain the corresponding facial feature vectors, human voice audio feature vectors and text semantic feature vectors; next, in the multimodal fusion module, all the feature vectors are spliced to achieve multi-modal fusion, thereby obtaining the second multi-modal features corresponding to the speaker and to each other person; then, the fused second multi-modal features are encoded by the Transformer encoder to obtain the second multi-modal encoding feature vector of the speaker and of each other person; finally, the second multi-modal encoding feature vectors are passed into the Transformer decoder to predict the probability that each other person is the speaking object of the speaker.
- the prediction made by the Transformer decoder can be sequential prediction. For example, you can first predict the probability of the robot being the talking target, and then predict the probability of each other character being the talking target.
- the result of speaking object prediction for the previous character can be re-entered into the Transformer decoder and used as input when the Transformer decoder predicts the speaking object for the next character.
- the Transformer decoder predicts the characters in the crowd except the speaker one by one.
- the first output result of the Transformer decoder is the probability that the robot is the speaking object, and the subsequent output results are, in turn, the probabilities that each other person is the speaking object.
- when the probability represented by an output result of the Transformer decoder is greater than the preset threshold, the corresponding robot or person is considered to be the speaking object.
- for example, when the probability represented by the first output result is greater than the preset threshold, it indicates that the robot is the speaking object of the speaker at the current moment.
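A hedged sketch of such a sequential decoding loop with feedback of the previous prediction; the 256-dimensional model size, the zero start token, the sigmoid output head, and the 0.5 threshold are assumptions introduced for illustration, not values given in the patent.

```python
# Sequential speaking-object prediction: robot first, then each other person,
# feeding the previous prediction back into the Transformer decoder.
import torch
import torch.nn as nn

d_model, threshold = 256, 0.5
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
to_prob = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
prev_embed = nn.Linear(1, d_model)            # re-embeds the previous prediction

def predict_speaking_objects(memory, num_candidates):
    """memory: (1, src_len, d_model) encoder output for the fused features.
    Candidate 0 is the robot; the rest are the other people in the crowd."""
    tgt = torch.zeros(1, 1, d_model)          # start token
    probs = []
    for _ in range(num_candidates):
        out = decoder(tgt, memory)            # (1, cur_len, d_model)
        p = to_prob(out[:, -1])               # probability for this candidate
        probs.append(p.item())
        tgt = torch.cat([tgt, prev_embed(p).unsqueeze(1)], dim=1)  # feed back
    return [p > threshold for p in probs], probs
```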
Abstract
Disclosed are a device and a method for speaking object detection in a multi-person human-computer interaction scenario. According to an example of the method, after time-stamped video frame data and time-stamped audio frame data are collected in real time, corresponding text semantic features, human voice audio features and facial features of persons can be obtained through speech recognition, text feature extraction, audio feature extraction and facial feature extraction. Then, the speaker at the current moment in the crowd can be identified on the basis of a first multi-modal feature obtained by fusing the facial features and the human voice audio features; and, on the basis of a second multi-modal feature obtained by fusing scene features, the text semantic features, the facial features and the human voice audio features, the speaking object of the speaker at the current moment in the crowd can be identified and it can be determined whether that speaking object is a robot, thereby effectively improving the robot's performance during human-computer interaction.
Description
The present invention belongs to the field of computer technology, and in particular relates to speaking object detection in a multi-person human-computer interaction scenario.
In a verbal interaction there is always one party who is the speaker and another party who is the speaking object, that is, the party from whom the speaker expects a response. In particular, during human-computer interaction, a robot replies after receiving a voice message.
For example, when a single person interacts with a robot, the robot is necessarily the speaking object whenever the person speaks. The robot can therefore directly process the received voice message and then reply. Such a function is already used in some smart terminals, with good results.
However, interaction between a crowd and a robot is more complex than interaction between a single person and a robot. Because person-to-person and person-to-robot interactions take place at the same time, the robot cannot judge whether the person who is speaking is addressing it, and can only mechanically reply to every sentence it receives, which severely disrupts the conversation and the experience of the users. In such situations, people can only conduct multi-round dialogues with the robot by repeatedly using wake words, which reduces the efficiency of the dialogue.
Summary of the Invention
To solve the above technical problem, embodiments of the present invention provide a device and a method for speaking object detection in a multi-person human-computer interaction scenario.
According to an embodiment of the present invention, a device for speaking object detection in a multi-person human-computer interaction scenario is provided, the multi-person human-computer interaction involving a crowd of multiple persons and at least one robot. The device includes: an audio and video collection module for collecting time-stamped video frame data and time-stamped audio frame data in real time, wherein the multiple video frames included in the video frame data and the multiple audio frames included in the audio frame data are synchronized according to the timestamps; a text generation module for generating time-stamped text information based on the audio frame data; a face processing module for detecting, by machine vision, the faces in each video frame of the video frame data and tracking the same person across multiple video frames to obtain face sequence data; a text feature extraction module for extracting text semantic features from the time-stamped text information by machine learning or deep learning methods; an audio feature extraction module for extracting human voice audio features from the audio frame data by machine learning or deep learning methods; a face feature extraction module for extracting facial features of persons from the face sequence data by machine learning or deep learning methods, the facial features including temporal features and spatial features of the person's face; a speaker detection module that uses machine learning or deep learning methods to identify the speaker at the current moment in the crowd based on the facial features in the face sequence data and the human voice audio features, so as to obtain the speaker information at the current moment; and a speaking object recognition module that uses machine learning or deep learning methods to identify the speaking object of the speaker at the current moment in the crowd based on scene features, the text semantic features, the human voice audio features and the facial features in the face sequence data, so as to detect whether the speaking object of the speaker at the current moment is a robot. The scene features include the speaker information and the speaking object information at the previous moment. In addition, the scene features may be stored in a scene database for retrieval by the speaking object recognition module.
Further, the audio and video collection module includes: a video collection module for collecting the time-stamped video frame data in real time using a camera; and an audio collection module for collecting the time-stamped audio frame data using a microphone. Optionally, the video frame data is stored in a video frame database in chronological order, and the audio frame data is stored in an audio frame database in chronological order.
Further, the face processing module includes: a face detection module that uses a deep learning method to detect the faces in the video frames included in the video frame data and assigns a unique, fixed identifier to the same face detected in two or more video frames to represent that person; and a face tracking module for tracking the same person across multiple video frames based on the detection results output by the face detection module, so as to obtain time-stamped face sequence data. By assigning the same face a unique, fixed identifier, the original ID can still be used to represent a person even if that person disappears from the field of view and later reappears. Optionally, the time-stamped face sequence data is stored in a face database.
Further, the speaker detection module includes: a first multi-modal fusion module for fusing the facial features and the human voice audio features into a first multi-modal feature by timestamp based on the face sequence data; and a speaking state detection module for inputting the first multi-modal feature into a deep learning network to predict, one by one, the speaking state of each person in the crowd at the current moment, thereby determining the speaker at the current moment and the corresponding speaker information. Optionally, the speaker information at the current moment is stored in a speaker database. For example, the speaker database may store the speaker information by timestamp.
Further, the speaking object recognition module includes: a second multi-modal fusion module for fusing the above facial features, the human voice audio features, the text semantic features and the scene features into a second multi-modal feature by timestamp based on the face sequence data; and a speaking object detection module for inputting the second multi-modal feature into a deep learning network to predict, one by one, whether each person in the crowd and each robot is the speaking object of the speaker at the current moment, and to determine the speaking object information at the current moment accordingly. Optionally, the speaking object information at the current moment is stored in a speaking object database for retrieval by other modules, or is output as the result. For example, the speaking object database may store the speaking object information by timestamp.
Further, the text generation module includes a speech recognition module for generating, based on the audio frame data, time-stamped text information respectively corresponding to multiple levels, wherein the multiple levels include the word level, the sentence level, the conversation topic level, and so on. Optionally, a text database is used to store the text information in chronological order and by level.
According to another embodiment of the present invention, a method for speaking object detection in a multi-person human-computer interaction scenario is provided, the multi-person human-computer interaction involving a crowd of multiple persons and at least one robot. The method includes the following steps. In step S1, an audio and video collection module collects time-stamped video frame data in real time, for example using a camera, and collects time-stamped audio frame data, for example using a microphone, wherein the multiple video frames included in the video frame data and the multiple audio frames included in the audio frame data are synchronized according to the timestamps. In step S2, a text generation module generates time-stamped text information at different levels, such as the word level, the sentence level and the conversation topic level, by performing speech recognition on the audio frame data in real time, and a text feature extraction module extracts text semantic features from the time-stamped text information. In step S3, a face processing module detects, by machine vision, the faces in each video frame of the video frame data and tracks the same person across multiple video frames to obtain face sequence data; a face feature extraction module extracts facial features of persons from the face sequence data, and an audio feature extraction module extracts human voice audio features from the audio frame data. In step S4, a speaker detection module identifies, through machine learning or deep learning methods, the speaker at the current moment in the crowd based on the facial features and the human voice audio features, so as to obtain the speaker information at the current moment. In step S5, a speaking object recognition module identifies, through machine learning or deep learning methods, the speaking object of the speaker at the current moment in the crowd based on scene features, the text semantic features, the human voice audio features and the facial features, so as to detect whether the speaking object of the speaker at the current moment is a robot. The scene features include the speaker information and the speaking object information at the previous moment.
Further, in step S1, the video frame data may be published as a Robot Operating System (ROS) topic, so that the video frame data is obtained in real time by subscribing to the image topic; the audio frame data may likewise be published as a ROS topic, so that the audio frame data is obtained in real time by subscribing to the audio topic. In step S2, YOLO (You Only Look Once) may be used for face detection, and a Deep SORT (Deep Simple Online Realtime Tracking) model may be used for multi-target tracking; the result of the tracking is that each person is assigned an ID, and throughout the entire process each person's ID is unique and fixed.
Further, step S4 may include the following specific steps: performing fusion coding on the facial features and the human voice audio features by timestamp based on the face sequence data to obtain a first multi-modal feature; and using a deep learning method to predict the speaker at the current moment in the crowd based on the first multi-modal feature.
Further, step S5 may include the following specific steps: performing fusion coding on the scene features, the text semantic features, the human voice audio features and the facial features by timestamp based on the face sequence data, that is, performing multi-modal feature fusion, to obtain a second multi-modal feature; and using a deep learning method to predict, one by one based on the second multi-modal feature, the probability that each person in the crowd is the speaking object of the speaker at the current moment.
Optionally, a Transformer method is used to perform the encoding and the decoding.
The device and method for speaking object detection in a multi-person human-computer interaction scenario according to the embodiments of the present invention can predict the speaking object in a multi-person human-computer interaction scenario in which the number of people changes at any time. Specifically, by using the multi-modal fusion modules to associate feature information of different dimensions, information useful for judging the speaking object can be extracted. Moreover, because prediction is performed with deep learning methods and no complicated handcrafted feature extraction is required, the prediction efficiency during use can be effectively improved.
Figure 1 is a schematic diagram of a scene in which multiple people interact with a robot according to an embodiment of the present invention;
Figure 2 is a schematic module diagram of a speaking object detection device in a multi-person human-computer interaction scenario according to an embodiment of the present invention;
Figure 3 is a flow chart of a speaking object detection method in a multi-person human-computer interaction scenario according to an embodiment of the present invention;
Figure 4 is a schematic diagram of an optional model architecture of the speaking object recognition module according to an embodiment of the present invention.
For a better understanding of the objects, structure and functions of the present invention, the device and method for speaking object detection in a multi-person human-computer interaction scenario according to embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Figure 1 is a schematic diagram of an example of a scene in which multiple people interact with a robot. In Figure 1, the squares represent objects in the scene; the isosceles triangles represent people in the scene, with the apex indicating the orientation of the person; and the circle marked R represents the robot. As shown in Figure 1, the human-computer interaction in this scene involves four people and one robot. Those skilled in the art should understand that Figure 1 is only an example of a multi-person human-computer interaction scenario, and that the number of people and robots actually taking part in the interaction is not limited to this and may change at any time.
Figure 2 is a functional module diagram of a device for speaking object detection in a multi-person human-computer interaction scenario according to an embodiment of the present invention. As shown in Figure 2, the device includes an audio and video collection module 110, a text generation module 120, a face processing module 130, a text feature extraction module 140, an audio feature extraction module 150, a face feature extraction module 160, a speaker detection module 170 and a speaking object recognition module 180.
The audio and video collection module 110 can collect time-stamped video frame data in real time, for example using a camera (the video frame data including video frames such as color images), and collect time-stamped audio frame data, for example using a microphone. In some embodiments, as shown in Figure 2, the video frame data and the audio frame data can be stored in chronological order in the video frame database 101 and the audio frame database 102, respectively. In addition, the multiple video frames included in the video frame data and the multiple audio frames included in the audio frame data are synchronized according to the timestamps; in other words, video and audio captured at the same moment should be aligned based on their timestamps.
The text generation module 120 can generate time-stamped text information at different levels, such as the word level, the sentence level and the conversation topic level, based on the audio frame data, for example through speech recognition. In some embodiments, as shown in Figure 2, this text information can be stored in the text database 104.
The face processing module 130 can detect faces in video frames such as color images by machine vision and track the same person across multiple video frames to obtain face sequence data. In some embodiments, as shown in Figure 2, the face sequence data can be stored in the face database 103. The multiple video frames may be consecutive video frames, for example video frames captured continuously by the camera within a certain length of time. However, the multiple video frames may also be non-consecutive, so that person tracking remains effective even if someone leaves the scene and later comes back.
The text feature extraction module 140 can extract time-stamped text semantic features by inputting the time-stamped text information of the different levels into a natural-language deep learning network. In some embodiments, after the text information is obtained, the text can be regarded as a word sequence and encoded with a word encoder such as GloVe to obtain a text semantic feature vector of a specific length (for example, 128 dimensions).
The audio feature extraction module 150 can extract time-stamped human voice audio features by inputting the time-stamped audio frame data into a deep learning network. For example, the audio frame data can first be split into overlapping audio segments, and features can then be extracted from the segments to obtain Mel-Frequency Cepstral Coefficients (MFCC) as input for further audio feature extraction. For example, the MFCC can be fed into a deep learning network, which generates a human voice audio feature vector of a specific length (for example, 128 dimensions) from the input MFCC.
The face feature extraction module 160 can extract time-stamped facial features by inputting the face sequence data into a deep learning network. The facial features may include temporal and spatial features of a person's face. For example, the face sequence data of each person can be regarded as a sequence of image patches; the patch sequence is converted into a visual feature encoding by a deep learning network, and the visual feature encoding is then added to a positional encoding to obtain the corresponding facial features. The facial features can be represented as a feature vector of a specific length (for example, 128 dimensions).
The speaker detection module 170 can identify, through machine learning or deep learning methods, the speaker at the current moment in the crowd based on the facial features in the face sequence data and the human voice audio features, so as to obtain the speaker information at the current moment. In some embodiments, as shown in Figure 2, the speaker information at the current moment can be stored in the speaker database 105. For example, the speaker database 105 may store the speaker information by timestamp.
The speaking object recognition module 180 can identify, through machine learning or deep learning methods, the speaking object of the speaker at the current moment in the crowd based on scene features, the text semantic features, the human voice audio features and the facial features in the face sequence data, so as to detect whether the speaking object of the speaker at the current moment is a robot. In some embodiments, as shown in Figure 2, the speaking object information can be stored in the speaking object database 106.
Specifically, as shown in Figure 2, the audio and video collection module 110 may include a video collection module 111 and an audio collection module 112. The video collection module 111 can capture time-stamped video frames, such as color images, in real time, for example using a camera; the audio collection module 112 can collect time-stamped audio frame data, for example using a microphone. In addition, the video frame database 101 may store the time-stamped video frame data in chronological order for retrieval by other modules such as the face processing module 130, and the audio frame database 102 may store the time-stamped audio frame data in chronological order for retrieval by other modules such as the text generation module 120 and the audio feature extraction module 150.
Specifically, as shown in Figure 2, the face processing module 130 may include a face detection module 131 and a face tracking module 132. The face detection module 131 can use a deep learning method to detect the faces in the video frames included in the video frame data and assign a unique, fixed identifier to the same face detected in two or more video frames to represent that person; the face tracking module 132 can track the same person across multiple video frames based on the detection results output by the face detection module 131, so as to obtain time-stamped face sequence data. By assigning the same face a unique, fixed identifier, the original ID can still be used to represent a person even if that person disappears from the field of view and later reappears. In some embodiments, as shown in Figure 2, the face database 103 may store the time-stamped face sequence data for retrieval by other modules such as the face feature extraction module 160.
Specifically, as shown in Figure 2, the speaker detection module 170 may include a first multi-modal fusion module 171 and a speaking state detection module 172. The first multi-modal fusion module 171 can fuse the above facial features and human voice audio features into a first multi-modal feature by timestamp based on the face sequence data; the speaking state detection module 172 can input the first multi-modal feature into a deep learning network and predict, one by one, the speaking state of each person in the crowd at the current moment, thereby determining the speaker at the current moment and the corresponding speaker information. In some embodiments, as shown in Figure 2, the speaker database 105 may store the speaker information at the current moment for retrieval by other modules such as the speaking object recognition module 180.
In addition, in some embodiments, the facial features and the human voice audio features can be fused into the first multi-modal feature by concatenation. For example, when the facial feature and the human voice audio feature are both 128-dimensional vectors, the first multi-modal feature obtained by feature concatenation will be a 256-dimensional vector.
Specifically, as shown in Figure 2, the speaking object recognition module 180 may include a second multi-modal fusion module 181 and a speaking object detection module 182. The second multi-modal fusion module 181 can fuse the above facial features, human voice audio features, text semantic features and the scene features from the scene database 107 into a second multi-modal feature by timestamp based on the face sequence data; the speaking object detection module 182 can input the second multi-modal feature into a deep learning network to predict, one by one, whether each person in the crowd and each robot is the speaking object of the speaker at the current moment, and determine the speaking object information at the current moment accordingly. In some embodiments, as shown in Figure 2, the speaking object database 106 may store the speaking object information at the current moment for retrieval by other modules such as the scene database 107. Alternatively, the speaking object information at the current moment may be output directly as the result.
In addition, as shown in Figure 2, the scene database 107 can store the speaker information and the speaking object information at the previous moment for use by the speaking object recognition module 180.
Specifically, as shown in Figure 2, the text generation module 120 may include a speech recognition module 121. The speech recognition module 121 can generate time-stamped text information at different levels, such as the word level, the sentence level and the conversation topic level, by performing speech recognition on the audio frame data. In some embodiments, as shown in Figure 2, the text database 104 may store this time-stamped text information in chronological order and by level for retrieval by other modules such as the text feature extraction module 140.
Figure 3 is a schematic flow chart of a method for speaking object detection in a multi-person human-computer interaction scenario according to an embodiment of the present invention. As shown in Figure 3, the method may include the following steps S1 to S5.
In step S1, the audio and video collection module 110 collects time-stamped video frame data in real time, for example using a camera, and collects time-stamped audio frame data, for example using a microphone. The multiple video frames included in the video frame data and the multiple audio frames included in the audio frame data may be stored in chronological order in the video frame database or the audio frame database, so that the video and audio captured at the same moment can be synchronized based on their timestamps.
Specifically, the video frame at the current moment may refer to a color image obtained in real time during actual operation. For example, in a robot system using the Robot Operating System (ROS), the color images captured by a monocular camera are published as a ROS topic, so that color images can be obtained in real time by subscribing to the image topic. The audio collected by an array microphone can likewise be published as a ROS topic, so that the audio can be obtained in real time by subscribing to the audio topic.
In step S2, the text generation module 120 performs speech recognition on the audio frame data in real time to generate time-stamped text information at different levels, such as the word level, the sentence level and the conversation topic level, and the text feature extraction module 140 extracts text semantic features from the time-stamped text information. In some embodiments, this text information can be stored in the text database 104.
In step S3, the face processing module 130 detects, by machine vision, the faces in the video frame data and tracks the same person across multiple video frames to obtain face sequence data; the face feature extraction module 160 extracts facial features of persons from the face sequence data, and the audio feature extraction module 150 extracts human voice audio features from the audio frame data.
In an exemplary embodiment, YOLO can be used for face detection and a Deep SORT model can be used for multi-target tracking. The result of the tracking is that each person is assigned an ID, and throughout the entire process each person's ID is unique and fixed.
In step S4, the speaker detection module 170 identifies, through machine learning or deep learning methods, the speaker at the current moment in the crowd based on the facial features and the human voice audio features, so as to obtain the speaker information at the current moment.
Specifically, step S4 may further include: performing fusion coding on the facial features and the human voice audio features by timestamp based on the face sequence data, that is, performing multi-modal feature fusion, to obtain a first multi-modal feature; and using a deep learning method to predict the speaker at the current moment in the crowd based on the first multi-modal feature.
In step S5, the speaking object recognition module 180 identifies, through machine learning or deep learning methods, the speaking object of the speaker at the current moment in the crowd based on scene features, the text semantic features, the human voice audio features and the facial features, so as to detect whether the speaking object of the speaker at the current moment is a robot.
Specifically, step S5 may further include: performing fusion coding on the scene features, the text semantic features, the human voice audio features and the facial features by timestamp based on the face sequence data, that is, performing multi-modal feature fusion, to obtain a second multi-modal feature; and using a deep learning method to predict, one by one based on the second multi-modal feature, the probability that each person in the crowd is the speaking object of the speaker at the current moment.
In an exemplary embodiment, the deep learning method that performs prediction based on the first/second multi-modal features may be implemented with a Transformer model, which is well known to those skilled in the art. In general, a Transformer model comprises an input, an encoder, a decoder and an output.
The input of the Transformer model is an encoded sequence. For example, for video frame data, the frame images are generally divided into patches and arranged into an image sequence, and the acquisition time of each frame image serves as an element of that image sequence. For text information, a passage of text is first tokenized into a word sequence, and word encoding is then performed on each token of the word sequence to generate a text encoding sequence. Audio frame data likewise needs to be encoded into an audio sequence before it can be used as input to the Transformer model.
The encoder of the Transformer model mainly consists of six layers of encoding modules. Each encoding module mainly includes a multi-head self-attention mechanism layer and a fully connected feed-forward layer, both of which employ residual connections and normalization. The multi-head self-attention mechanism layer takes the sequence encoding of the previous layer as input and generates the q, k and v values of the query-key-value triplet (query, key, value) through fully connected layers. The q, k and v values may all be feature vectors of length 64. Across the sequences, each q performs attention over each k; the calculation is the scaled dot-product attention Attention(q, k, v) = softmax(q·kᵀ / √d_k) · v, where d_k denotes the length of the feature vector and equals 64.
Similarly, the decoder of the Transformer model mainly consists of six layers of decoding modules. Each decoding module includes two multi-head self-attention mechanism layers and a fully connected feed-forward layer. The input of the decoder includes the output of the encoder and the previous output of the decoder. In particular, the output of the decoder is the output of the Transformer model.
In the following, the application of the Transformer model in the embodiments of the present application is outlined by taking the prediction of the speaking object based on the second multi-modal feature as an example.
As shown in Figure 4, in order to effectively identify the speaking object of the speaker, the input data include the speaker's face image sequence, the face image sequences of the other people, the audio frame data of the corresponding time period, and the text information of the corresponding time period. First, features are extracted separately from the image information, the audio information and the text information to obtain the corresponding facial feature vectors, human voice audio feature vectors and text semantic feature vectors. Next, in the multi-modal fusion module, all the feature vectors are concatenated to achieve multi-modal fusion, thereby obtaining the second multi-modal features corresponding to the speaker and to each other person. The fused second multi-modal features are then encoded by the Transformer encoder to obtain the second multi-modal encoding feature vectors of the speaker and of each other person. Finally, these second multi-modal encoding feature vectors are passed into the Transformer decoder to predict the probability that each other person is the speaking object of the speaker. The prediction made by the Transformer decoder can be sequential: for example, the probability that the robot is the speaking object can be predicted first, followed by the probability that each other person is the speaking object. In some embodiments, as shown in Figure 4, the result of the speaking object prediction for the previous person can be fed back into the Transformer decoder as input when the decoder predicts the speaking object for the next person. In other words, during speaking object recognition the Transformer decoder makes predictions one by one for the people in the crowd other than the speaker. The first output of the Transformer decoder is the probability that the robot is the speaking object, and the subsequent outputs are, in turn, the probabilities that each other person is the speaking object. When the probability represented by an output of the Transformer decoder is greater than a preset threshold, the corresponding robot or person is regarded as the speaking object; for example, when the probability represented by the first output is greater than the preset threshold, the robot is the speaking object of the speaker at the current moment.
It will be appreciated that the present invention has been described through a number of embodiments. As known to those skilled in the art, various changes or equivalent substitutions may be made to these features and embodiments without departing from the spirit and scope of the present invention. In addition, under the teaching of the present invention, these features and embodiments may be modified to suit specific situations and materials without departing from the spirit and scope of the present invention. Therefore, the present invention is not limited by the specific embodiments disclosed herein; all embodiments falling within the scope of the claims of the present application fall within the scope of protection of the present invention.
Claims (17)
- A device for speaking object detection in a multi-person human-computer interaction scenario, the multi-person human-computer interaction involving a crowd of multiple persons and at least one robot, characterized in that the device comprises: an audio and video collection module (110) for collecting time-stamped video frame data and time-stamped audio frame data in real time, wherein the multiple video frames included in the video frame data and the multiple audio frames included in the audio frame data are synchronized according to the timestamps; a text generation module (120) for generating time-stamped text information based on the audio frame data; a face processing module (130) for detecting, by machine vision, the faces in each video frame of the video frame data and tracking the same person across multiple video frames to obtain face sequence data; a text feature extraction module (140) for extracting text semantic features from the time-stamped text information by a machine learning or deep learning method; an audio feature extraction module (150) for extracting human voice audio features from the audio frame data by a machine learning or deep learning method; a face feature extraction module (160) for extracting facial features of persons from the face sequence data by a machine learning or deep learning method, the facial features including temporal features and spatial features of the person's face; a speaker detection module (170) for identifying, by a machine learning or deep learning method, the speaker at the current moment in the crowd based on the facial features in the face sequence data and the human voice audio features, so as to obtain the speaker information at the current moment; and a speaking object recognition module (180) for identifying, by a machine learning or deep learning method, the speaking object of the speaker at the current moment in the crowd based on scene features, the text semantic features, the human voice audio features and the facial features in the face sequence data, so as to detect whether the speaking object of the speaker at the current moment is a robot, wherein the scene features include the speaker information and the speaking object information at the previous moment.
- The device according to claim 1, characterized in that the audio and video collection module (110) comprises: a video collection module (111) for collecting the time-stamped video frame data in real time using a camera; and an audio collection module (112) for collecting the time-stamped audio frame data using a microphone.
- The device according to claim 1 or 2, characterized by further comprising: a video frame database (101) for storing the video frame data in chronological order; and an audio frame database (102) for storing the audio frame data in chronological order.
- The device according to claim 1, characterized in that the face processing module (130) comprises: a face detection module (131), which uses a deep learning method to detect the faces in the video frames included in the video frame data and assigns a unique, fixed identifier to the same face detected in two or more video frames to represent that person; and a face tracking module (132) for tracking the same person across multiple said video frames based on the detection results output by the face detection module (131), so as to obtain time-stamped face sequence data.
- The device according to claim 4, characterized by further comprising: a face database (103) for storing the time-stamped face sequence data.
- The device according to claim 1, characterized in that the speaker detection module (170) comprises: a first multi-modal fusion module (171) for fusing the facial features and the human voice audio features into a first multi-modal feature by timestamp based on the face sequence data; and a speaking state detection module (172) for inputting the first multi-modal feature into a deep learning network to predict, one by one, the speaking state of each person in the crowd at the current moment, thereby determining the speaker at the current moment and the corresponding speaker information.
- The device according to claim 6, characterized by further comprising: a speaker database (105) for storing the speaker information by timestamp.
- The device according to claim 1, characterized in that the speaking object recognition module (180) comprises: a second multi-modal fusion module (181) for fusing the facial features, the human voice audio features, the text semantic features and the scene features into a second multi-modal feature by timestamp based on the face sequence data; and a speaking object detection module (182) for inputting the second multi-modal feature into a deep learning network to predict, one by one, whether each person in the crowd and each robot is the speaking object of the speaker at the current moment, and to determine the speaking object information at the current moment accordingly.
- The device according to claim 1 or 8, characterized by further comprising: a speaking object database (106) for storing the speaking object information by timestamp.
- The device according to claim 1 or 8, characterized by further comprising: a scene database (107) for storing the scene features.
- The device according to claim 1, characterized in that the text generation module (120) comprises a speech recognition module (121); the speech recognition module (121) is configured to generate, based on the audio frame data, time-stamped text information respectively corresponding to multiple levels, wherein the multiple levels include the word level, the sentence level and the conversation topic level.
- The device according to claim 11, characterized by further comprising: a text database (104) for storing the text information in chronological order and by said level.
- A method for speaking object detection in a multi-person human-computer interaction scenario, the multi-person human-computer interaction involving a crowd of multiple persons and at least one robot, characterized in that the method comprises: step S1, an audio and video collection module (110) collects time-stamped video frame data and time-stamped audio frame data in real time, wherein the multiple video frames included in the video frame data and the multiple audio frames included in the audio frame data are synchronized according to the timestamps; step S2, a text generation module (120) generates time-stamped text information based on the audio frame data in real time, and a text feature extraction module (140) extracts text semantic features from the time-stamped text information; step S3, a face processing module (130) detects, by machine vision, the faces in each video frame of the video frame data and tracks the same person across multiple video frames to obtain face sequence data, a face feature extraction module (160) extracts facial features of persons from the face sequence data, and an audio feature extraction module (150) extracts human voice audio features from the audio frame data; step S4, a speaker detection module (170) identifies, by a machine learning or deep learning method, the speaker at the current moment in the crowd based on the facial features and the human voice audio features, so as to obtain the speaker information at the current moment; step S5, a speaking object recognition module (180) identifies, by a machine learning or deep learning method, the speaking object of the speaker at the current moment in the crowd based on scene features, the text semantic features, the human voice audio features and the facial features, so as to detect whether the speaking object of the speaker at the current moment is a robot, wherein the scene features include the speaker information and the speaking object information at the previous moment.
- The method according to claim 13, characterized in that, in step S1, the video frame data is published as a ROS topic and obtained in real time by subscribing to the image topic, and the audio frame data is published as a ROS topic and obtained in real time by subscribing to the audio topic; and in step S2, YOLO is used for face detection and a Deep SORT model is used for multi-target tracking, the result of the tracking being that each person is assigned an ID, and throughout the entire process each person's ID is unique and fixed.
- The method according to claim 13, characterized in that step S4 comprises the following specific steps: performing fusion coding on the facial features and the human voice audio features by timestamp based on the face sequence data to obtain a first multi-modal feature; and using a deep learning method to predict, based on the first multi-modal feature, the speaker at the current moment in the crowd.
- The method according to claim 13, characterized in that step S5 comprises the following specific steps: performing fusion coding on the scene features, the text semantic features, the human voice audio features and the facial features by timestamp based on the face sequence data to obtain a second multi-modal feature; and using a deep learning method to predict, one by one based on the second multi-modal feature, the probability that each person in the crowd is the speaking object of the speaker at the current moment.
- The method according to claim 15 or 16, characterized in that a Transformer model is used to perform the deep learning method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2023548657A JP2024532640A (ja) | 2022-08-12 | 2023-06-21 | マルチヒューマンコンピュータインタラクションシーンでの話し相手の検出 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210966740.5 | 2022-08-12 | ||
CN202210966740.5A CN115376187A (zh) | 2022-08-12 | 2022-08-12 | 一种多人机交互场景下说话对象检测装置及方法 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024032159A1 true WO2024032159A1 (zh) | 2024-02-15 |
Family
ID=84064895
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/101635 WO2024032159A1 (zh) | 2022-08-12 | 2023-06-21 | 多人机交互场景下的说话对象检测 |
Country Status (3)
Country | Link |
---|---|
JP (1) | JP2024532640A (zh) |
CN (1) | CN115376187A (zh) |
WO (1) | WO2024032159A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115376187A (zh) * | 2022-08-12 | 2022-11-22 | 之江实验室 | 一种多人机交互场景下说话对象检测装置及方法 |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107230476A (zh) * | 2017-05-05 | 2017-10-03 | 众安信息技术服务有限公司 | 一种自然的人机语音交互方法和系统 |
CN111078010A (zh) * | 2019-12-06 | 2020-04-28 | 智语科技(江门)有限公司 | 一种人机交互方法、装置、终端设备及可读存储介质 |
CN113408385A (zh) * | 2021-06-10 | 2021-09-17 | 华南理工大学 | 一种音视频多模态情感分类方法及系统 |
CN114519880A (zh) * | 2022-02-09 | 2022-05-20 | 复旦大学 | 基于跨模态自监督学习的主动说话人识别方法 |
CN114819110A (zh) * | 2022-06-23 | 2022-07-29 | 之江实验室 | 一种实时识别视频中说话人的方法及装置 |
CN115376187A (zh) * | 2022-08-12 | 2022-11-22 | 之江实验室 | 一种多人机交互场景下说话对象检测装置及方法 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117854535A (zh) * | 2024-03-08 | 2024-04-09 | 中国海洋大学 | 基于交叉注意力的视听语音增强方法及其模型搭建方法 |
CN117854535B (zh) * | 2024-03-08 | 2024-05-07 | 中国海洋大学 | 基于交叉注意力的视听语音增强方法及其模型搭建方法 |
Also Published As
Publication number | Publication date |
---|---|
CN115376187A (zh) | 2022-11-22 |
JP2024532640A (ja) | 2024-09-10 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| ENP | Entry into the national phase | Ref document number: 2023548657; Country of ref document: JP; Kind code of ref document: A
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23851386; Country of ref document: EP; Kind code of ref document: A1