WO2023226755A1 - An Emotion Recognition Method Based on Human-Object Spatio-Temporal Interaction Behavior - Google Patents

An Emotion Recognition Method Based on Human-Object Spatio-Temporal Interaction Behavior

Info

Publication number
WO2023226755A1
WO2023226755A1 PCT/CN2023/093128 CN2023093128W WO2023226755A1 WO 2023226755 A1 WO2023226755 A1 WO 2023226755A1 CN 2023093128 W CN2023093128 W CN 2023093128W WO 2023226755 A1 WO2023226755 A1 WO 2023226755A1
Authority
WO
WIPO (PCT)
Prior art keywords
human, interaction behavior, spatio-temporal, interaction
Prior art date
Application number
PCT/CN2023/093128
Other languages
English (en)
French (fr)
Inventor
李新德
胡川飞
Original Assignee
东南大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东南大学 filed Critical 东南大学
Priority to US18/244,225 priority Critical patent/US20240037992A1/en
Publication of WO2023226755A1 publication Critical patent/WO2023226755A1/zh

Links

Classifications

    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods
    • G06T 7/74 Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/776 Validation; Performance evaluation
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 10/987 Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns with the intervention of an operator
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/30196 Human being; Person
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The invention relates to the fields of computer vision and pattern recognition, and in particular to an emotion recognition method based on human-object spatio-temporal interaction behavior.
  • The purpose of the present invention is to overcome the influence of data factors on emotion recognition results and to improve the accuracy of those results by providing an emotion recognition method based on human-object spatio-temporal interaction behavior, in which the unavoidable interactions between people and objects in daily life serve as the data source for building a more accurate and reliable emotion recognition method.
  • An emotion recognition method based on human-object spatio-temporal interaction behavior, which specifically includes the following steps:
  • Step S1: Collect video data of the process of people interacting with objects;
  • Step S2: Annotate the positions of people and objects, as well as the interaction behaviors and emotions displayed by the people;
  • Step S3: Construct a deep-learning-based feature extraction model, extract the interaction behavior features of people and objects in the spatio-temporal dimension, and detect the position and category of the human-object interaction behavior;
  • Step S4: Map the detected interaction behavior categories into vector form through a word vector model;
  • Step S5: Construct a deep-learning-based fusion model, fuse the interaction behavior vector with the spatio-temporal interaction behavior features, and recognize the emotion expressed by the interacting person.
  • In this scheme, human-object spatio-temporal interaction behavior is used as the data basis for emotion recognition for the first time, which overcomes the influence of target subjectivity and unreliable collection methods on the data sources used by existing recognition methods.
  • Moreover, the recognition model is not built directly from a human-object interaction video alone: the method introduces a human-object interaction detection stage (S3, S4) and fuses the human-object interaction features with the vectorized detection results (S5), so that emotion recognition is performed on the basis of feature-level and semantic-level fusion and the recognition results are more interpretable, as outlined in the sketch below.
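  • For illustration only, the following minimal Python sketch strings steps S1-S5 together; the component names (detector, word_vector_model, fusion_model, classifier) are hypothetical placeholders for the models described in the embodiments, not identifiers from the patent.
```python
import torch

def recognize_emotion(video_clip: torch.Tensor,
                      detector, word_vector_model, fusion_model, classifier):
    """video_clip: (C, T, H, W) tensor of T consecutive frames (step S1 data)."""
    # S3: detect the human-object interaction and extract spatio-temporal features
    interaction_label, st_features = detector(video_clip)   # e.g. "拿杯子喝水", (1, N, 2048)
    # S4: map the detected interaction category to a word vector
    behavior_vec = word_vector_model(interaction_label)     # e.g. (1, 1, 768)
    # S5: fuse the behavior vector (Query) with the features (Key/Value), then classify
    fused = fusion_model(behavior_vec, st_features, st_features)
    probs = classifier(fused).softmax(dim=-1)               # 8 emotion classes
    return probs.argmax(dim=-1)                             # index of the recognized emotion
```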
  • The collection scenes involved in the video data of step S1 include the bedrooms, kitchens, bathrooms and studies of residential homes, as well as shopping-mall information desks and ticket offices.
  • The interaction behavior refers to a person's actions of using objects, including drinking from a cup, reading a book, answering the phone, operating a TV remote control, operating a computer, turning over bed sheets, brushing teeth with a toothbrush, washing the face with a towel, pushing/closing a door, pushing a shopping cart, and holding a queuing railing.
  • The behaviors listed here cover representative, emotion-bearing human-object interactions in daily life, work, personal hygiene and so on; the advantage of this setting is its generality.
  • The data annotation in step S2 involves three stages.
  • First, an object detection network generates the initial positions of the people and objects in the video data and the object categories; the generated initial positions and categories are then manually corrected, so that inaccurate detection results are fixed and accurate position and category information is obtained; finally, the interaction behaviors and emotions displayed by the people in the video data are annotated.
  • The position of a person or object refers to the smallest rectangle, parallel to the video image, that contains the person or object; it is represented by the rectangle's center coordinates, width and height (see the conversion sketch below). Annotating an interaction behavior means marking the interaction category and the positions of the corresponding person and object.
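  • A small illustrative sketch of converting a corner-format box (an assumed input format, common to detectors such as Faster R-CNN) into the center/width/height annotation format described above:
```python
def corners_to_center(x1: float, y1: float, x2: float, y2: float):
    """Convert an (x1, y1, x2, y2) corner-format box into the annotation format
    used here: rectangle center coordinates plus width and height."""
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    return cx, cy, w, h

# Example: a 100 x 60 box whose top-left corner is at (40, 20)
print(corners_to_center(40, 20, 140, 80))  # (90.0, 50.0, 100, 60)
```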
  • The emotions include happiness, frustration, anxiety, anger, surprise, fear, excitement, and neutral; neutral means that no obvious emotion is shown.
  • This scheme describes the three stages of the data annotation process, which can be regarded as the dataset construction process for the human-object interaction emotion recognition method.
  • Its advantage is that the automatic detection performed by the algorithm in the first stage, combined with the manual correction and annotation of the second and third stages, forms a semi-automatic annotation pipeline that improves the efficiency of dataset production; a first-stage sketch follows.
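  • A hedged sketch of the first, automatic stage using torchvision's pre-trained Faster R-CNN (the detector named in the embodiments); the checkpoint variant and the 0.7 score threshold are assumptions for illustration, not values specified in the patent.
```python
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

@torch.no_grad()
def propose_initial_boxes(frame: torch.Tensor, score_thr: float = 0.7):
    """frame: (3, H, W) float tensor in [0, 1]; returns draft annotations that
    human annotators correct in the second and third stages."""
    pred = detector([frame])[0]             # dict with 'boxes', 'labels', 'scores'
    keep = pred["scores"] >= score_thr
    return {"boxes": pred["boxes"][keep],   # corner-format boxes, to be corrected
            "labels": pred["labels"][keep]} # COCO category ids (person, cup, ...)
```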
  • The feature extraction model in step S3 uses an object detection network pre-trained on a general-purpose dataset and fine-tuned on the collected video data, so that the positions of people and interacting objects and the interaction categories are detected accurately.
  • The fine-tuning means freezing most of the learnable parameters of the network on the basis of pre-training on the general-purpose dataset, and retraining only the last two layers of the network on the training data, as in the sketch below.
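  • A minimal freezing sketch, assuming PyTorch and treating "the last two layers" as the final two child modules of the network (the patent does not name the exact layers):
```python
import torch.nn as nn

def freeze_all_but_last_two(model: nn.Module):
    children = list(model.children())
    for module in children[:-2]:
        for p in module.parameters():
            p.requires_grad = False           # frozen: pretrained weights are kept
    # only the parameters of the last two layers are handed to the optimizer
    return [p for module in children[-2:] for p in module.parameters()]

# e.g. optimizer = torch.optim.Adam(freeze_all_but_last_two(detector), lr=1e-4)
```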
  • the spatio-temporal dimension in step S3 refers to a three-dimensional tensor of fixed time length, including one time dimension and two spatial dimensions; the time length is defined by the number of video frames.
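  • For illustration, a brief sketch of assembling such a fixed-length spatio-temporal input follows; T = 20 matches the embodiments, while the 224 x 224 spatial size is an assumption used only for this example.
```python
import torch

T, H, W = 20, 224, 224
frames = [torch.rand(3, H, W) for _ in range(T)]  # T RGB frames taken from the video
clip = torch.stack(frames, dim=1)                 # one temporal + two spatial dimensions
print(clip.shape)                                 # torch.Size([3, 20, 224, 224])
```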
  • The interaction behavior vector fused in step S5 refers to the vector-form interaction behavior obtained in step S4.
  • The spatio-temporal interaction behavior features in step S5 refer to the interaction behavior features of people and objects in the spatio-temporal dimension obtained in step S3.
  • Recognizing the emotion expressed by the interacting person in step S5 means classifying the fused features output by the fusion model. This design makes full use of the large data volume of the general-purpose dataset and, at the same time, reduces the training time of the feature extraction model on the human-object interaction detection task.
  • Compared with the prior art, the present invention has the following beneficial effects:
  • 1) In the choice of data source, the invention uses the spatio-temporal behavior of people and objects as the modeling basis of the emotion recognition method, exploiting the objectivity and ease of collection of human-object interactions to overcome the influence of target subjectivity and unreliable collection methods on emotion recognition modeling;
  • 2) The invention models emotion recognition in the spatio-temporal dimension, exploiting the continuity of spatio-temporal information and representing the temporal causal relations of human-object interaction actions, which improves the accuracy of the emotion recognition model;
  • 3) The invention incorporates semantic-level information about the human-object interaction, further improving the accuracy of the recognition results and the interpretability of modeling based on human-object interaction.
  • Figure 1 is a schematic flow diagram of the present invention.
  • Figure 2 is a schematic diagram of the data annotation process in an example of the present invention.
  • (Embodiment 1) An emotion recognition method based on human-object spatio-temporal interaction behavior, implemented with a residential bedroom as the scene, specifically includes the following steps:
  • Step S1: Collect video data of the process of people interacting with objects.
  • In this example, the scene is a residential bedroom.
  • The interaction behaviors in the video data include drinking from a cup, reading a book, answering the phone, operating a computer, and pushing/closing a door.
  • Compared with facial or physiological signals, using human-object interaction behavior as the data source greatly reduces the difficulty of video data collection: facial signals require that the face not be occluded and physiological signals require contact sensors, whereas human-object interaction only requires that the interacting body parts and the interacted objects appear in the frame. This relaxes the collection constraints on the data source and gives the invention a much wider range of application scenarios.
  • Step S2: Annotate the positions of people and objects, as well as the interaction behaviors and emotions displayed by the people.
  • In this example, the annotation process is divided into three stages, as shown in Figure 2.
  • First, the Faster R-CNN object detection network is applied to all collected video data to generate the initial positions of people and objects and the object categories.
  • Then, annotation tools are used to manually correct the initial positions and categories, fixing inaccurate initial detections and yielding accurate position and category information.
  • Finally, all collected video data are annotated with interaction behaviors and emotions, where the emotions include happiness, frustration, anxiety, anger, surprise, fear, excitement, and neutral.
  • Step S3: Construct a deep-learning-based feature extraction model, extract the interaction behavior features of people and objects in the spatio-temporal dimension, and detect the position and category of the human-object interaction behavior.
  • In this example, an object detection network based on 3D-DETR is used as the feature extraction model with a fine-tuning strategy: the network weights pre-trained on the V-COCO dataset are partially retained, and only the last two layers of the model are trained on the dataset collected in this example. The model extracts the spatio-temporal interaction behavior features of people and objects and detects the position and category of the human-object interaction behavior.
  • The fine-tuning strategy improves the training efficiency of the feature extraction model on the human-object interaction behavior dataset of this example.
  • The dimension of the interaction behavior features is 2048, and the time length T is 20 video frames.
  • Step S4: Map the detected interaction behavior categories into vector form through the word vector model.
  • In this example, a Chinese BERT model trained on the Chinese Wikipedia corpus is used as the word vector model to map the detected interaction behavior categories into vector form; for example, the Chinese phrase "拿杯子喝水" ("drink from a cup") is mapped to a one-dimensional vector.
  • The pre-training task is whole-word masking, and the vector dimension is 768; a hedged sketch of this mapping follows.
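  • A hedged sketch of the mapping with the Hugging Face transformers library; the checkpoint name "hfl/chinese-bert-wwm" (a publicly released whole-word-masking Chinese BERT) and the use of the [CLS] embedding as the 768-dimensional phrase vector are assumptions, since the patent only states that a whole-word-masking Chinese BERT trained on Chinese Wikipedia is used.
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-bert-wwm")
bert = AutoModel.from_pretrained("hfl/chinese-bert-wwm")

@torch.no_grad()
def phrase_to_vector(phrase: str) -> torch.Tensor:
    inputs = tokenizer(phrase, return_tensors="pt")
    outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]   # [CLS] embedding, shape (1, 768)

vec = phrase_to_vector("拿杯子喝水")          # "drink from a cup"
print(vec.shape)                              # torch.Size([1, 768])
```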
  • Step S5: Construct a deep-learning-based fusion model, fuse the interaction behavior vector with the spatio-temporal interaction behavior features, and recognize the emotion expressed by the interacting person.
  • In this example, a multimodal Transformer model is used as the fusion model to fuse the interaction behavior vector and the spatio-temporal interaction behavior features.
  • The interaction behavior vector serves as the Query of the model, and the spatio-temporal interaction behavior features serve as the Key and Value.
  • Finally, a Softmax classifier consisting of a single fully connected layer classifies the fused features by emotion, and the emotion corresponding to the classifier node with the maximum value is taken as the final emotion recognition result; a sketch of this fusion-and-classification step is given below.
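  • A minimal sketch of the fusion-and-classification step, assuming PyTorch cross-attention with the 768-dimensional behavior vector as Query and the 2048-dimensional spatio-temporal features as Key and Value, followed by a single fully connected layer with Softmax over the eight emotions; the number of attention heads and the feature-sequence length are illustrative assumptions.
```python
import torch
import torch.nn as nn

class FusionEmotionClassifier(nn.Module):
    def __init__(self, q_dim=768, kv_dim=2048, num_emotions=8, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=q_dim, num_heads=num_heads,
            kdim=kv_dim, vdim=kv_dim, batch_first=True)
        self.classifier = nn.Linear(q_dim, num_emotions)   # single FC layer

    def forward(self, behavior_vec, st_features):
        # behavior_vec: (B, 1, 768) Query; st_features: (B, N, 2048) Key/Value
        fused, _ = self.cross_attn(behavior_vec, st_features, st_features)
        return self.classifier(fused.squeeze(1)).softmax(dim=-1)  # (B, 8)

model = FusionEmotionClassifier()
probs = model(torch.rand(1, 1, 768), torch.rand(1, 20, 2048))
emotion_idx = probs.argmax(dim=-1)   # node with the maximum value = recognized emotion
```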
  • (Embodiment 2) An emotion recognition method based on human-object spatio-temporal interaction behavior, implemented with a ticket office as the scene, specifically includes the following steps:
  • Step S1: Collect video data of the process of people interacting with objects.
  • In this example, the scene is a ticket office.
  • The interaction behaviors in the video data include drinking from a cup, reading a book, answering the phone, pushing/closing a door, and holding a queuing railing.
  • Step S2: Annotate the positions of people and objects, as well as the interaction behaviors and emotions displayed by the people.
  • In this example, the annotation process is divided into three stages, as shown in Figure 2.
  • First, the Faster R-CNN object detection network is applied to all collected video data to generate the initial positions of people and objects and the object categories.
  • Then, annotation tools are used to manually correct the initial positions and categories, fixing inaccurate initial detections and yielding accurate position and category information.
  • Finally, all collected video data are annotated with interaction behaviors and emotions, where the emotions include happiness, frustration, anxiety, anger, surprise, fear, excitement, and neutral.
  • Step S3: Construct a deep-learning-based feature extraction model, extract the interaction behavior features of people and objects in the spatio-temporal dimension, and detect the position and category of the human-object interaction behavior.
  • In this example, an object detection network based on 3D-DETR is used as the feature extraction model with a fine-tuning strategy: the network weights pre-trained on the V-COCO dataset are partially retained, and only the last two layers of the model are trained on the dataset collected in this example. The model extracts the spatio-temporal interaction behavior features of people and objects and detects the position and category of the human-object interaction behavior.
  • The fine-tuning strategy improves the training efficiency of the feature extraction model on the human-object interaction behavior dataset of this example.
  • The dimension of the interaction behavior features is 2048, and the time length T is 20 video frames.
  • Step S4: Map the detected interaction behavior categories into vector form through the word vector model.
  • In this example, a Chinese BERT model trained on the Chinese Wikipedia corpus is used as the word vector model to map the detected interaction behavior categories into vector form; for example, the Chinese phrase "扶握排队栏杆" ("hold the queuing railing") is mapped to a one-dimensional vector.
  • The pre-training task is whole-word masking, and the vector dimension is 768.
  • Step S5: Construct a deep-learning-based fusion model, fuse the interaction behavior vector with the spatio-temporal interaction behavior features, and recognize the emotion expressed by the interacting person.
  • In this example, a multimodal Transformer model is used as the fusion model to fuse the interaction behavior vector and the spatio-temporal interaction behavior features.
  • The interaction behavior vector serves as the Query of the model, and the spatio-temporal interaction behavior features serve as the Key and Value.
  • Finally, a Softmax classifier consisting of a single fully connected layer classifies the fused features by emotion, and the emotion corresponding to the classifier node with the maximum value is taken as the final emotion recognition result.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an emotion recognition method based on human-object spatio-temporal interaction behavior. The process is as follows: collect video data of the process of people interacting with objects; annotate the positions of people and objects as well as the interaction behaviors and emotions displayed by the people; construct a deep-learning-based feature extraction model to extract the spatio-temporal interaction behavior features of people and objects and to detect the position and category of the human-object interaction behavior; map the detected interaction behavior categories into vector form through a word vector model; and finally construct a deep-learning-based fusion model that fuses the interaction behavior vector with the spatio-temporal interaction behavior features to recognize the emotion expressed by the interacting person. By using the spatio-temporal interaction information between people and objects, the invention provides an objective and continuous basis for judging the target's emotion, avoids the influence of target subjectivity and collection methods on the emotion recognition results, and recognizes the target's true emotional state more accurately.

Description

An Emotion Recognition Method Based on Human-Object Spatio-Temporal Interaction Behavior
Technical Field
The invention relates to the fields of computer vision and pattern recognition, and in particular to an emotion recognition method based on human-object spatio-temporal interaction behavior.
Background Art
With the development of artificial intelligence technology, endowing machines with the ability to understand human emotions has gradually become a research hotspot, greatly extending the depth of application of intelligent devices in all areas of human society. For example, in guidance services provided by machines, observing, recognizing and understanding the words and actions of an inquirer in order to judge his or her true inner emotion enables guidance interactions that are as natural, vivid and friendly as those of a human, so that the inquirer experiences the intelligent device as natural, smooth and warm. Building accurate emotion recognition technology is therefore of great practical significance for making machines more intelligent and human-like.
In existing emotion recognition methods, facial images, speech and physiological signals of the target are collected as the basis for building the emotion recognition model. However, the reliability of emotion recognition methods built on these data sources is usually limited by the subjectivity of the target's expression and by the reliability of the collection method. Specifically, facial images and speech can generally be regarded as intuitive cues that reveal human emotion, but in some special situations people conform to others or disguise their feelings, which confuses emotion recognition methods modeled on subjective facial or vocal expression and biases the recognition of the target's true emotion. In contrast, physiological signals such as heart rate, respiration rate, skin conductance and EEG are objective cues that the target cannot easily disguise; however, they are mostly collected with contact sensors, which makes the target feel intruded upon and mixes uncertain, non-emotion-related factors into the physiological signals. In addition, contact-based collection greatly narrows the range of applications of emotion recognition methods.
In summary, because the data sources used by existing emotion recognition methods for modeling are affected by target subjectivity and unreliable collection methods, the accuracy of the emotion recognition results is low.
Summary of the Invention
The purpose of the present invention is to overcome the influence of data factors on emotion recognition results and to improve the accuracy of those results by providing an emotion recognition method based on human-object spatio-temporal interaction behavior, which takes the unavoidable interactions between people and objects in daily life as the data source to build a more accurate and reliable emotion recognition method.
To achieve the above purpose, the present invention provides the following technical solution:
An emotion recognition method based on human-object spatio-temporal interaction behavior, specifically comprising the following steps:
Step S1: collecting video data of the process of people interacting with objects;
Step S2: annotating the positions of people and objects, as well as the interaction behaviors and emotions displayed by the people;
Step S3: constructing a deep-learning-based feature extraction model, extracting the interaction behavior features of people and objects in the spatio-temporal dimension, and detecting the position and category of the human-object interaction behavior;
Step S4: mapping the detected interaction behavior categories into vector form through a word vector model;
Step S5: constructing a deep-learning-based fusion model, fusing the interaction behavior vector with the spatio-temporal interaction behavior features, and recognizing the emotion expressed by the interacting person. In this solution, human-object spatio-temporal interaction behavior is used for the first time as the data basis for emotion recognition, which overcomes the influence of target subjectivity and unreliable collection methods on the data sources used by existing recognition methods. Furthermore, instead of building the recognition model directly from a single human-object interaction video, the method introduces a human-object interaction detection stage (S3, S4) and fuses the human-object interaction features with the vectorized detection results (S5); emotion recognition is performed on the basis of feature-level and semantic-level fusion, making the recognition results more interpretable.
Preferably, the collection scenes involved in the video data of step S1 include the bedrooms, kitchens, bathrooms and studies of residential homes, as well as shopping-mall information desks and ticket offices; the interaction behavior refers to a person's actions of using objects, including drinking from a cup, reading a book, answering the phone, operating a TV remote control, operating a computer, turning over bed sheets, brushing teeth with a toothbrush, washing the face with a towel, pushing/closing a door, pushing a shopping cart, and holding a queuing railing. The behaviors listed here cover representative, emotion-bearing human-object interactions in daily life, work, personal hygiene and so on; the advantage of this setting is its generality.
Preferably, the data annotation in step S2 involves three stages: first, an object detection network generates the initial positions of the people and objects in the video data and the object categories; then the generated initial positions and categories are manually corrected, so that inaccurate detection results are fixed and accurate position and category information is obtained; finally, the interaction behaviors and emotions displayed by the people in the video data are annotated. The position of a person or object refers to the smallest rectangle, parallel to the video image, that contains the person or object, represented by the rectangle's center coordinates, width and height; annotating an interaction behavior means marking the interaction category and the positions of the corresponding person and object; the emotions include happiness, frustration, anxiety, anger, surprise, fear, excitement, and neutral, where neutral means that no obvious emotion is shown. This solution describes the three stages of the data annotation process, which can be regarded as the dataset construction process for the human-object interaction emotion recognition method. Its advantage is that the automatic detection performed by the algorithm in the first stage, combined with the manual correction and annotation of the second and third stages, forms a semi-automatic annotation pipeline that improves the efficiency of dataset production.
Preferably, the feature extraction model in step S3 uses an object detection network pre-trained on a general-purpose dataset and fine-tuned on the collected video data, so that the positions of people and interacting objects and the interaction categories are detected accurately.
Preferably, the fine-tuning means freezing most of the learnable parameters of the network on the basis of pre-training on the general-purpose dataset, and retraining only the last two layers of the network on the training data.
Preferably, the spatio-temporal dimension in step S3 refers to a three-dimensional tensor of fixed time length, containing one temporal dimension and two spatial dimensions; the time length is defined by the number of video frames.
Preferably, the fused interaction behavior vector in step S5 refers to the vector-form interaction behavior obtained in step S4.
Preferably, the spatio-temporal interaction behavior features in step S5 refer to the interaction behavior features of people and objects in the spatio-temporal dimension obtained in step S3.
Preferably, recognizing the emotion expressed by the interacting person in step S5 means classifying the fused features output by the fusion model. This makes full use of the large data volume of the general-purpose dataset and, at the same time, reduces the training time of the feature extraction model on the human-object interaction detection task.
Compared with the prior art, the present invention has the following beneficial effects:
1) In the choice of data source, the invention uses the spatio-temporal behavior of people and objects as the modeling basis of the emotion recognition method, exploiting the objectivity and ease of collection of human-object interaction behaviors to overcome the influence of target subjectivity and unreliable collection methods on emotion recognition modeling;
2) The invention models emotion recognition in the spatio-temporal dimension, exploiting the continuity of spatio-temporal information and representing the temporal causal relations of human-object interaction actions, which improves the accuracy of the emotion recognition model;
3) The invention incorporates semantic-level information about the human-object interaction, further improving the accuracy of the recognition results of the emotion recognition model and the interpretability of modeling based on human-object interaction.
Brief Description of the Drawings
Figure 1 is a schematic flow diagram of the present invention.
Figure 2 is a schematic diagram of the data annotation process in an embodiment of the present invention.
Detailed Description of the Embodiments
The present invention is described in detail below with reference to the drawings and specific embodiments. The embodiments are implemented on the premise of the technical solution of the present invention; as shown in Figure 1, detailed implementations and specific operating procedures are given, but the scope of protection of the present invention is not limited to the following embodiments.
Embodiment 1:
An emotion recognition method based on human-object spatio-temporal interaction behavior, implemented with a residential bedroom as the scene, specifically includes the following steps:
Step S1: collect video data of the process of people interacting with objects.
In this embodiment, the scene is a residential bedroom. The interaction behaviors in the video data include drinking from a cup, reading a book, answering the phone, operating a computer, and pushing/closing a door. Compared with facial or physiological signals as data sources, using human-object interaction behavior as the data source greatly reduces the difficulty of collecting video data: facial signals require that the face not be occluded and physiological signals require contact sensors, whereas human-object interaction only requires that the interacting body parts and the interacted objects appear in the video. This relaxes the collection constraints on the data source and gives the invention a much wider range of application scenarios.
Step S2: annotate the positions of people and objects, as well as the interaction behaviors and emotions displayed by the people.
In this embodiment, the annotation process is divided into three stages, as shown in Figure 2. First, the Faster R-CNN object detection network is applied to all collected video data to generate the initial positions of people and objects and the object categories. Then, annotation tools are used to manually correct the initial positions and categories, fixing inaccurate initial detection results and obtaining accurate position and category information. Finally, all collected video data are annotated with interaction behaviors and emotions, where the emotions include happiness, frustration, anxiety, anger, surprise, fear, excitement, and neutral.
Step S3: construct a deep-learning-based feature extraction model, extract the interaction behavior features of people and objects in the spatio-temporal dimension, and detect the position and category of the human-object interaction behavior.
In this embodiment, an object detection network based on 3D-DETR is used as the feature extraction model with a fine-tuning strategy: the network weights pre-trained on the V-COCO dataset are partially retained, and only the last two layers of the model are trained on the dataset collected in this embodiment, in order to extract the spatio-temporal interaction behavior features of people and objects and detect the position and category of the human-object interaction behavior. The fine-tuning strategy improves the training efficiency of the feature extraction model on the human-object interaction behavior dataset of this embodiment. The dimension of the interaction behavior features is 2048, and the time length T is 20 video frames.
Step S4: map the detected interaction behavior categories into vector form through the word vector model.
In this embodiment, a Chinese BERT model trained on the Chinese Wikipedia corpus is used as the word vector model to map the detected interaction behavior categories into vector form; for example, the Chinese phrase "拿杯子喝水" ("drink from a cup") is mapped to a one-dimensional vector. The pre-training task is whole-word masking, and the vector dimension is 768.
Step S5: construct a deep-learning-based fusion model, fuse the interaction behavior vector with the spatio-temporal interaction behavior features, and recognize the emotion expressed by the interacting person.
In this embodiment, a multimodal Transformer model is used as the fusion model to fuse the interaction behavior vector and the spatio-temporal interaction behavior features, where the interaction behavior vector serves as the Query of the model and the spatio-temporal interaction behavior features serve as the Key and Value. Finally, a Softmax classifier consisting of a single fully connected layer classifies the fused features by emotion, and the emotion corresponding to the classifier node with the maximum value is taken as the final emotion recognition result.
Embodiment 2:
An emotion recognition method based on human-object spatio-temporal interaction behavior, implemented with a ticket office as the scene, specifically includes the following steps:
Step S1: collect video data of the process of people interacting with objects.
In this embodiment, the scene is a ticket office. The interaction behaviors in the video data include drinking from a cup, reading a book, answering the phone, pushing/closing a door, and holding a queuing railing.
Step S2: annotate the positions of people and objects, as well as the interaction behaviors and emotions displayed by the people.
In this embodiment, the annotation process is divided into three stages, as shown in Figure 2. First, the Faster R-CNN object detection network is applied to all collected video data to generate the initial positions of people and objects and the object categories. Then, annotation tools are used to manually correct the initial positions and categories, fixing inaccurate initial detection results and obtaining accurate position and category information. Finally, all collected video data are annotated with interaction behaviors and emotions, where the emotions include happiness, frustration, anxiety, anger, surprise, fear, excitement, and neutral.
Step S3: construct a deep-learning-based feature extraction model, extract the interaction behavior features of people and objects in the spatio-temporal dimension, and detect the position and category of the human-object interaction behavior.
In this embodiment, an object detection network based on 3D-DETR is used as the feature extraction model with a fine-tuning strategy: the network weights pre-trained on the V-COCO dataset are partially retained, and only the last two layers of the model are trained on the dataset collected in this embodiment, in order to extract the spatio-temporal interaction behavior features of people and objects and detect the position and category of the human-object interaction behavior. The fine-tuning strategy improves the training efficiency of the feature extraction model on the human-object interaction behavior dataset of this embodiment. The dimension of the interaction behavior features is 2048, and the time length T is 20 video frames.
Step S4: map the detected interaction behavior categories into vector form through the word vector model.
In this embodiment, a Chinese BERT model trained on the Chinese Wikipedia corpus is used as the word vector model to map the detected interaction behavior categories into vector form; for example, the Chinese phrase "扶握排队栏杆" ("hold the queuing railing") is mapped to a one-dimensional vector. The pre-training task is whole-word masking, and the vector dimension is 768.
Step S5: construct a deep-learning-based fusion model, fuse the interaction behavior vector with the spatio-temporal interaction behavior features, and recognize the emotion expressed by the interacting person.
In this embodiment, a multimodal Transformer model is used as the fusion model to fuse the interaction behavior vector and the spatio-temporal interaction behavior features, where the interaction behavior vector serves as the Query of the model and the spatio-temporal interaction behavior features serve as the Key and Value. Finally, a Softmax classifier consisting of a single fully connected layer classifies the fused features by emotion, and the emotion corresponding to the classifier node with the maximum value is taken as the final emotion recognition result.
In addition, it should be noted that the specific embodiments described in this specification may differ in naming, and the above content is merely an illustration of the structure of the present invention. All equivalent or minor changes made according to the construction, features and principles of the inventive concept are included within the scope of protection of the present invention. Those skilled in the art to which the present invention pertains may make various modifications or additions to the described embodiments or adopt similar methods; as long as they do not depart from the structure of the present invention or go beyond the scope defined by the claims, such changes fall within the scope of protection of the present invention.

Claims (10)

  1. An emotion recognition method based on human-object spatio-temporal interaction behavior, characterized in that it specifically comprises the following steps:
    Step S1: collecting video data of the process of people interacting with objects;
    Step S2: annotating the positions of people and objects, as well as the interaction behaviors and emotions displayed by the people;
    Step S3: constructing a deep-learning-based feature extraction model, extracting the interaction behavior features of people and objects in the spatio-temporal dimension, and detecting the position and category of the human-object interaction behavior;
    Step S4: mapping the detected interaction behavior categories into vector form through a word vector model;
    Step S5: constructing a deep-learning-based fusion model, fusing the interaction behavior vector with the spatio-temporal interaction behavior features, and recognizing the emotion expressed by the interacting person;
    fusing the human-object interaction features with the vectorized detection results (S5), and performing emotion recognition on the basis of feature-level and semantic-level fusion, so that the recognition results are more interpretable;
    using a multimodal Transformer model as the fusion model to fuse the interaction behavior vector and the spatio-temporal interaction behavior features, wherein the interaction behavior vector serves as the Query of the model and the spatio-temporal interaction behavior features serve as the Key and Value, and finally constructing a Softmax classifier consisting of a single fully connected layer to classify the fused features by emotion, and taking the emotion corresponding to the classifier node with the maximum value as the final emotion recognition result.
  2. The emotion recognition method based on human-object spatio-temporal interaction behavior according to claim 1, characterized in that the collection scenes involved in the video data of step S1 include the bedrooms, kitchens, bathrooms and studies of residential homes, as well as shopping-mall information desks and ticket offices.
  3. The emotion recognition method based on human-object spatio-temporal interaction behavior according to claim 1, characterized in that the interaction behavior in step S1 refers to a person's actions of using objects, including drinking from a cup, reading a book, answering the phone, operating a TV remote control, operating a computer, turning over bed sheets, brushing teeth with a toothbrush, washing the face with a towel, pushing/closing a door, pushing a shopping cart, and holding a queuing railing.
  4. The emotion recognition method based on human-object spatio-temporal interaction behavior according to claim 1, characterized in that the data annotation in step S2 involves three stages: first, an object detection network generates the initial positions of the people and objects in the video data and the object categories; then the generated initial positions and categories are manually corrected, so that inaccurate detection results are fixed and accurate position and category information is obtained; finally, the interaction behaviors and emotions displayed by the people in the video data are annotated.
  5. The emotion recognition method based on human-object spatio-temporal interaction behavior according to claim 4, characterized in that the position of a person or object refers to the smallest rectangle, parallel to the video image, that contains the person or object, represented by the rectangle's center coordinates, width and height;
    annotating an interaction behavior means marking the interaction category and the positions of the corresponding person and object;
    the emotions include happiness, frustration, anxiety, anger, surprise, fear, excitement, and neutral;
    neutral means that no obvious emotion is shown.
  6. The emotion recognition method based on human-object spatio-temporal interaction behavior according to claim 1, characterized in that the feature extraction model in step S3 uses an object detection network pre-trained on a general-purpose dataset and fine-tuned on the collected video data, so that the positions of people and interacting objects and the interaction categories are detected accurately;
    the fine-tuning means freezing most of the learnable parameters of the network on the basis of pre-training on the general-purpose dataset, and retraining only the last two layers of the network on the training data.
  7. The emotion recognition method based on human-object spatio-temporal interaction behavior according to claim 1, characterized in that the spatio-temporal dimension in step S3 refers to a three-dimensional tensor of fixed time length, containing one temporal dimension and two spatial dimensions;
    the time length is defined by the number of video frames.
  8. The emotion recognition method based on human-object spatio-temporal interaction behavior according to claim 1, characterized in that the fused interaction behavior vector in step S5 refers to the vector-form interaction behavior of step S4.
  9. The emotion recognition method based on human-object spatio-temporal interaction behavior according to claim 1, characterized in that the spatio-temporal interaction behavior features in step S5 refer to the interaction behavior features of people and objects in the spatio-temporal dimension of step S3.
  10. The emotion recognition method based on human-object spatio-temporal interaction behavior according to claim 1, characterized in that recognizing the emotion expressed by the interacting person in step S5 means classifying the fused features output by the fusion model.
PCT/CN2023/093128 2022-05-26 2023-05-10 An emotion recognition method based on human-object spatio-temporal interaction behavior WO2023226755A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/244,225 US20240037992A1 (en) 2022-05-26 2023-09-09 Method for emotion recognition based on human-object time-space interaction behavior

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210583163.1 2022-05-26
CN202210583163.1A CN114926837B (zh) 2022-05-26 2022-05-26 一种基于人-物时空交互行为的情感识别方法

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/244,225 Continuation US20240037992A1 (en) 2022-05-26 2023-09-09 Method for emotion recognition based on human-object time-space interaction behavior

Publications (1)

Publication Number Publication Date
WO2023226755A1 true WO2023226755A1 (zh) 2023-11-30

Family

ID=82810385

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/093128 WO2023226755A1 (zh) 2022-05-26 2023-05-10 一种基于人-物时空交互行为的情感识别方法

Country Status (3)

Country Link
US (1) US20240037992A1 (zh)
CN (1) CN114926837B (zh)
WO (1) WO2023226755A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926837B (zh) * 2022-05-26 2023-08-04 东南大学 一种基于人-物时空交互行为的情感识别方法
CN116186310B (zh) * 2023-05-04 2023-06-30 苏芯物联技术(南京)有限公司 一种融合ai通用助手的ar空间标注及展示方法
CN116214527B (zh) * 2023-05-09 2023-08-11 南京泛美利机器人科技有限公司 一种增强人机协作适应性的三体协同智能决策方法和系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8219438B1 (en) * 2008-06-30 2012-07-10 Videomining Corporation Method and system for measuring shopper response to products based on behavior and facial expression
CN112381072A (zh) * 2021-01-11 2021-02-19 西南交通大学 一种基于时空信息及人、物交互的人体异常行为检测方法
CN112464875A (zh) * 2020-12-09 2021-03-09 南京大学 一种视频中的人-物交互关系检测方法及装置
CN113392781A (zh) * 2021-06-18 2021-09-14 山东浪潮科学研究院有限公司 一种基于图神经网络的视频情感语义分析方法
CN114926837A (zh) * 2022-05-26 2022-08-19 东南大学 一种基于人-物时空交互行为的情感识别方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005199403A (ja) * 2004-01-16 2005-07-28 Sony Corp 情動認識装置及び方法、ロボット装置の情動認識方法、ロボット装置の学習方法、並びにロボット装置
CN108664932B (zh) * 2017-05-12 2021-07-09 华中师范大学 一种基于多源信息融合的学习情感状态识别方法
CN108805087B (zh) * 2018-06-14 2021-06-15 南京云思创智信息科技有限公司 基于多模态情绪识别系统的时序语义融合关联判断子系统
US11861940B2 (en) * 2020-06-16 2024-01-02 University Of Maryland, College Park Human emotion recognition in images or video
CN112784798B (zh) * 2021-02-01 2022-11-08 东南大学 一种基于特征-时间注意力机制的多模态情感识别方法
CN113592251B (zh) * 2021-07-12 2023-04-14 北京师范大学 一种多模态融合的教态分析系统
CN114140885A (zh) * 2021-11-30 2022-03-04 网易(杭州)网络有限公司 一种情感分析模型的生成方法、装置、电子设备以及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8219438B1 (en) * 2008-06-30 2012-07-10 Videomining Corporation Method and system for measuring shopper response to products based on behavior and facial expression
CN112464875A (zh) * 2020-12-09 2021-03-09 南京大学 一种视频中的人-物交互关系检测方法及装置
CN112381072A (zh) * 2021-01-11 2021-02-19 西南交通大学 一种基于时空信息及人、物交互的人体异常行为检测方法
CN113392781A (zh) * 2021-06-18 2021-09-14 山东浪潮科学研究院有限公司 一种基于图神经网络的视频情感语义分析方法
CN114926837A (zh) * 2022-05-26 2022-08-19 东南大学 一种基于人-物时空交互行为的情感识别方法

Also Published As

Publication number Publication date
CN114926837B (zh) 2023-08-04
US20240037992A1 (en) 2024-02-01
CN114926837A (zh) 2022-08-19

Similar Documents

Publication Publication Date Title
WO2023226755A1 (zh) An emotion recognition method based on human-object spatio-temporal interaction behavior
Clarkson Life patterns: structure from wearable sensors
Guanghui et al. Multi-modal emotion recognition by fusing correlation features of speech-visual
CN103268495B (zh) 计算机系统中基于先验知识聚类的人体行为建模识别方法
Yaddaden et al. User action and facial expression recognition for error detection system in an ambient assisted environment
US20170032186A1 (en) Information processing apparatus, information processing method, and program
US20220215175A1 (en) Place recognition method based on knowledge graph inference
CN105739688A (zh) 一种基于情感体系的人机交互方法、装置和交互系统
Karpouzis et al. Modeling naturalistic affective states via facial, vocal, and bodily expressions recognition
TW201201115A (en) Facial expression recognition systems and methods and computer program products thereof
Zhang et al. ISEE Smart Home (ISH): Smart video analysis for home security
CN109765991A (zh) 社交互动系统、用于帮助用户进行社交互动的系统及非暂时性计算机可读存储介质
Zhao et al. Video Captioning with Tube Features.
Camgöz et al. Sign language recognition for assisting the deaf in hospitals
Zhou et al. A New Remote Health‐Care System Based on Moving Robot Intended for the Elderly at Home
Belissen et al. Dicta-Sign-LSF-v2: remake of a continuous French sign language dialogue corpus and a first baseline for automatic sign language processing
CN110889335A (zh) 基于多通道时空融合网络人体骨架双人交互行为识别方法
JP2021086274A (ja) 読唇装置及び読唇方法
de Dios et al. Landmark-based methods for temporal alignment of human motions
Adibuzzaman et al. In situ affect detection in mobile devices: a multimodal approach for advertisement using social network
Vacher et al. The Sweet-Home project: Audio processing and decision making in smart home to improve well-being and reliance
Wöllmer et al. Fully automatic audiovisual emotion recognition: Voice, words, and the face
CN111178141B (zh) 一种基于注意力机制的lstm人体行为识别方法
CN114386432A (zh) 语义识别方法、装置、机器人和智能设备
Mule et al. In-house object detection system for visually impaired

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23810828

Country of ref document: EP

Kind code of ref document: A1