CN116721322A - Multi-mode-based character interaction relation detection method and detection system thereof - Google Patents

Multi-mode-based character interaction relation detection method and detection system thereof

Info

Publication number
CN116721322A
CN116721322A
Authority
CN
China
Prior art keywords
node
interaction
human
features
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310626312.2A
Other languages
Chinese (zh)
Inventor
叶青
徐秀菊
张永梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Technology
Original Assignee
North China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN202310626312.2A priority Critical patent/CN116721322A/en
Publication of CN116721322A publication Critical patent/CN116721322A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal human-object interaction relation detection method and a detection system, wherein the method comprises the following steps: S1: performing target detection on an input image and outputting a target detection result; S2: extracting human body posture features from the target detection result with an improved cascaded pyramid network, and extracting human-object visual features from the target detection result with a graph model structure; S3: passing the classification labels in the target detection result through the algorithm of a human-object semantic enhancement module to obtain the most similar sample phrase embedding vector features and the corresponding similarity scores; S4: obtaining the human-object interaction region features most relevant in the original image, namely the interaction feature prediction scores of the visual feature part, through the linear weighted summation calculation of the visual-semantic external attention module; S5: obtaining the final human-object interaction relation detection result through multi-modal fusion reasoning and interaction detection.

Description

Multi-mode-based character interaction relation detection method and detection system thereof
Technical Field
The invention relates to the technical field of computer vision, in particular to a multi-mode-based character interaction relation detection method and a multi-mode-based character interaction relation detection system, which are used for analysis and research of character interaction relation detection.
Background
Character interaction relation detection, i.e. human-object interaction (HOI) detection, lies at the intersection of object detection, action recognition and visual relation detection, and has become an emerging research direction in computer vision in recent years. The difficulty of HOI detection is that it usually requires reasoning about ambiguous, complex and hard-to-identify interaction behaviors: given a picture, the spatial positions of the persons and objects in the image must be detected, and in addition the interaction relationship between them must be identified. In short, the task can be summarized as detecting triplets of the form <person, verb, object>. This is a very challenging problem because it involves both human-centric fine-grained action distinctions (e.g., leading a horse versus feeding a horse) and multiple simultaneous actions (e.g., sitting on a chair and holding fruit while using a computer). In recent years, with the spread of informatization and the rapid development of computer vision, HOI detection, as a subtask of visual relation understanding, has become a research direction of practical and forward-looking significance for identifying the interrelationship between persons and objects in specific application scenarios. At present, HOI detection is widely applied in scenarios such as intelligent vehicle cabins, security monitoring, human-computer interaction and video surveillance; in particular, in the detection of dangerous driving behaviors of non-motor and motor vehicles in video surveillance, HOI relation detection has become a main solution.
Existing human-object interaction relation detection methods can be broadly divided into global-instance-based methods and local-instance-based methods. Global-instance-based HOI detection methods generally emphasize the integrity of the human body, the object and the image background. However, when interaction detection is performed on the original image, extracting features for every pixel of the whole image greatly increases the computational burden, because only little of that information is effective for deciding whether an interaction exists or which interaction it is, and unnecessary computational cost is thus incurred. In addition, HOI categories are numerous in real application scenarios and the interaction actions between the same person and object are not necessarily unique, so a reasoning scheme that relies only on instance-level features is still deficient when processing and predicting rare interaction categories, and its overall recognition accuracy is not high. Local-instance-based methods mainly analyze the internal relation between the person and the object based on local features of the target subject such as skeletons, postures and body parts; they mainly address the integration of interactions between body parts in different postures and objects and the improvement of model efficiency. However, when persons and objects occlude each other, such methods cannot fully exploit the visual-spatial information of the human-object interaction in the original image or the latent information between label instances, which greatly affects the accuracy of interaction detection and recognition.
In summary, the existing image-based human-object interaction relation detection methods still need to progress from theoretical research to practical application, and suffer from the following two problems:
1) The long-tail problem: in real application scenarios the interaction categories are numerous and the interaction actions between the same person and object are not necessarily unique, while the action categories annotated in existing data samples are limited. Training data for rare HOI category samples is therefore lacking, the generalization ability for small-sample categories is insufficient, and the improvement of interaction detection accuracy is greatly restricted.
2) Insufficient use of visual-spatial information: in real scenes, especially in public places with many people, mutual occlusion between persons, and between persons and objects, occurs frequently, which greatly affects the accuracy of interaction detection and recognition.
In view of the above shortcomings of the prior art, there is a need in the art for a human-object interaction relation detection method and detection system that can improve both the accuracy of HOI detection and the generalization capability of the network model.
Disclosure of Invention
In order to solve the above problems, the invention aims to provide a multi-modal human-object interaction relation detection method and detection system. The human and object instance labels generated by the target detector are combined into samples and encoded into word vector features by a human-object semantic enhancement module, which increases the training data of rare interaction category samples and thereby improves the generalization capability of the network model. The graph model structural features are combined with the human body posture features so as to make full use of the contextual visual, spatial and label information contained in the main and auxiliary human-object relations in the image. A visual-semantic external attention mechanism is established: the potential instance labels generated by the target detector are combined by the human-object semantic enhancement module and then encoded into sample word vectors, and the label category similarity is generated by table lookup (i.e. from a pre-trained corpus); the label sample word vectors and the corresponding similarities serve as the keys and values of the external dictionary of the visual-semantic external attention mechanism, the fused visual features serve as the query, and a linear weighted calculation is performed to obtain the human-object interaction region features most relevant in the original image, so that the global information is exploited to the greatest extent and the accuracy of interaction detection is further improved.
In order to achieve the above object, the present invention provides a method for detecting a person interaction relationship based on multiple modes, comprising the steps of:
step S1: inputting an image, carrying out target detection on the input image through a target detection module, outputting a target detection result, wherein the target detection result comprises a human and object boundary box and a corresponding classification label thereof, and taking each human and object boundary box and the classification label thereof as an example;
step S2: extracting the human body posture features from the target detection result with an improved cascaded pyramid network, extracting the human-object visual features from the target detection result with a graph model structure, and then fusing the human body posture features and the human-object visual features to obtain a final visual feature F_m;
Step S3: combining, through a human-object semantic enhancement module, the classification labels in the target detection result as predicted values with samples of a preset data set, generating new phrase embeddings through a sample combination algorithm, then generating sample phrase word vectors through an instance label sample coding algorithm, and generating the label category similarity by table lookup, so as to obtain the most similar sample phrase embedding vector features and the corresponding similarity scores;
Step S4: obtaining, through the linear weighted summation calculation of the visual-semantic external attention module, the human-object interaction region features most relevant in the original image, namely the interaction feature prediction score of the visual feature part;
Step S5: and obtaining a final character interaction relation detection result through multi-mode fusion reasoning and interaction detection.
In one embodiment of the present invention, in step S1, the input image is specifically subjected to target detection by using an improved Faster R-CNN network, wherein the backbone of the improved Faster R-CNN network is ResNet-50.
In an embodiment of the present invention, the improved cascaded pyramid network in step S2 specifically adds a 1x1 convolution filter to the final residual block of the different convolution features of the cascaded pyramid network to generate heatmaps of the keypoints; in addition, in the last layer of the different convolution features, the improved cascaded pyramid network selects only the odd-numbered layer features for element-wise summation before they are fed together into the hard-keypoint extraction part to detect hard keypoints, and the specific process of extracting the human body posture features comprises the following steps:
inputting the character boundary box generated from the target detection result into an improved cascading pyramid network to perform positioning evaluation on human body key points in the human body boundary box, and generating a human body posture estimation graph;
mapping the human body posture estimation map and the human body bounding box to the global features, suppressing the persons unlikely to be involved in an interaction relation based on the position information of the posture estimation map and the bounding box, and taking the person features likely to be involved in an interaction as the human body posture features.
In an embodiment of the present invention, the specific process by which the graph model structure in step S2 extracts the human-object visual features from the target detection result and then obtains the final visual feature F_m through fusion comprises:
step S201: aiming at an original input image I, an initial dense undirected graph model G is constructed by taking a human and object boundary box in a target detection result as a node and taking a potential interaction relation existing between a human and an object as an edge through the following formula:
G=(V,E) (1)
wherein V is a set formed by all nodes, and E is a set of all edges;
wherein each human and object bounding box derived from the original input image I corresponds to a node v_i of the initial dense undirected graph model G, i.e. v_i ∈ V; the potential interaction between different person and object bounding boxes is represented as an edge e_ij of the initial dense undirected graph model G, i.e. e_ij ∈ E;
Step S202: assuming that n instances are obtained from the original input image I, an initial dense undirected graph model G with n(n-1)/2 edges is obtained and represented as an n×n adjacency matrix A_{n×n} whose elements a_ij ∈ {0,1}, where 0 indicates that no interaction relation exists between node i and node j, and 1 indicates that an interaction relation exists between node i and node j;
Step S203: taking i as a central node, the graph convolution operator is calculated by the following formula:
h_i^{(n+1)} = σ( Σ_{j∈N_i} (1/C_ij) · W_{R_j} · h_j^{(n)} )   (2)
where h_i^{(n+1)} denotes the feature representation of node i at the (n+1)-th layer; C_ij denotes the normalization factor; N_i denotes the set of neighbor nodes of node i, including the node's own information; R_j denotes the type of node j; W_{R_j} denotes the transformation weight parameter for nodes of type R_j; and σ(·) denotes a nonlinear activation function;
step S204: each node transmits the characteristic information of the node to the adjacent node after transformation, and then the characteristic information of the adjacent node is gathered and subjected to nonlinear transformation, wherein the traversal process of each node is specifically expressed as a formula (3) and a formula (4):
where h denotes a target object determined to be a person, i.e. a person node, and H denotes the total number of persons in the image; o denotes a target object determined to be an object, i.e. an object node, and O denotes the total number of objects in the image; W is a weight matrix; f_ho denotes the interaction feature between person node h and object node o; f_ho' denotes the interaction feature between person node h and another object node o' (o' ∈ O and o' ≠ o); f̂_ho denotes the updated interaction relation between person node h and object node o; and S denotes the probability score of the updated interaction relation between person node h and object node o;
step S205: and (3) obtaining the interaction characteristics of the human node h and the object node o through the interaction characteristic generation algorithm of the formulas (5) and (6):
where F denotes the newly generated interaction feature vector, f denotes the feature vector extracted from the original image I, and V denotes the human-object interaction relation;
Step S206: fusing the human body posture features and the human-object interaction features by element-wise addition (the add method) to generate the final visual feature F_m.
In an embodiment of the present invention, step S3 specifically includes:
Step S301: the classification labels in the target detection result are used as a predicted value to be combined with a sample of a preset data set and are sent into a person-object instance coding algorithm to obtain corresponding word embedding;
step S302: sending words representing verbs and objects into a word class coding algorithm to generate a verb set and an object set;
step S303: all word embedding obtained in the step S302 is sent to a sample combination algorithm to generate more training sample data, and then related word embedding is connected in the form of 'people, verbs and objects', so as to form phrase embedding;
step S304: embedding the formed phrase into the generated sample phrase word vector features through an example tag sample coding algorithm;
Step S305: finding, in a given lookup table, the vector feature most similar to the sample phrase word vector feature, and obtaining the corresponding similarity score.
In an embodiment of the present invention, when more training sample data is generated in step S303, the training sample data combination is fine-tuned by using a triplet loss function, and word vectors that are judged to be similar but different in category are far away from each other, where the triplet loss function is defined as:
L_triplet = max( d(A, P) − d(A, N) + m, 0 )
wherein A is an anchor point; a word belonging to the same category as the anchor A is regarded as a positive sample and denoted P; otherwise it is regarded as a negative sample and denoted N; m denotes an offset (margin); d(A, P) denotes the distance between the anchor A and the positive sample P, and d(A, N) denotes the distance between the anchor A and the negative sample N.
In an embodiment of the present invention, step S4 specifically includes:
Step S401: using the final visual feature F_m obtained in step S2 as the Query of the visual-semantic external attention dictionary, using the most similar sample phrase embedding vector features obtained in step S3 as the keys D_key of the external attention dictionary, and using the corresponding similarity scores as the values D_val of the external attention dictionary;
Step S402: performing a linear weighted summation of the Query and the keys D_key to obtain the most discriminative part of the global features;
Step S403: normalizing the result obtained in step S402 to obtain the similarity weight matrix A_tt;
Step S404: performing a linear weighted summation of the similarity weight matrix A_tt and the values D_val to learn the label class score f_cls of the part most similar to the external semantic dictionary features;
Step S405: feeding all obtained label class scores f_cls into a Softmax classifier to perform the human-object interaction prediction classification operation, and normalizing the result into a probability distribution to obtain the final interaction feature prediction score of the visual feature part.
In an embodiment of the present invention, step S5 specifically includes:
Step S501: performing an addition operation on the interaction feature prediction score of the visual feature part obtained in step S4 and the similarity score, obtained in step S3, corresponding to the most similar sample phrase embedding vector feature, and outputting a result score;
step S502: normalizing the output result Score by the following formula to obtain a final interaction category probability Score:
where β represents a learnable hyper-parameter.
The invention also provides a multi-mode-based person interaction relation detection system for executing the above method, which comprises:
The target detection module is used for detecting targets of the input images;
the feature extraction module is connected with the target detection module and is used for extracting human-object visual features;
the human-object semantic enhancement module is connected with the target detection module and is used for calculating the most similar sample phrase embedded vector characteristics and the corresponding similarity scores;
the visual-semantic external attention module is respectively connected with the feature extraction module and the human-object semantic enhancement module and is used for calculating an interactive feature prediction score of the visual feature part;
the multi-mode fusion reasoning and interaction detection module is respectively connected with the human-object semantic enhancement module and the visual-semantic external attention module and is used for calculating a final human interaction relation detection result.
Compared with the prior art, the method and the system for detecting the interaction relationship of the characters based on the multiple modes have the following advantages:
1) A new visual-semantic external attention mechanism is provided, and an example label sample obtained by a human-object semantic enhancement module is used for combining word vectors and similarity as keywords and values of an external dictionary; taking the final character relation feature obtained by fusing the graph model feature and the human body posture feature as a query condition, and then carrying out linear weighted summation on key human and object features in the original image to obtain the interactive relevant region feature with the most distinguishing property in the global feature, so as to further improve the HOI detection rate;
2) The external attention network structure based on the visual-semantic features comprises a feature extraction module, a human-object semantic enhancement module and a visual-semantic external attention module, wherein human and object bounding boxes generated by a target detection network and corresponding instance class labels are respectively sent to the feature extraction module and the human-object semantic enhancement module, obtained label combination sample word vectors and similarity scores are used as keywords and values of the external attention module of the visual-semantic features, obtained visual features are used as query conditions, linear weighted calculation is performed, interaction region features most relevant to the visual-semantic parts in an original image are obtained, and finally a final character interaction relation detection result is obtained through multi-mode fusion reasoning;
3) An Improved Cascading Pyramid Network (ICPN) is provided to achieve the purposes of maintaining the spatial resolution and semantic information of a feature layer and reducing the running cost of the network; and finally, the character which cannot possibly generate interaction relation is restrained (i.e. filtered) through the position information of the human body posture diagram and the object boundary box in the network algorithm, character features which can possibly generate interaction are sent into the convolutional neural network to perform next feature extraction, so that the extraction of invalid features is reduced, and the weight proportion of the effective features is increased.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a method for detecting human interaction relationship based on multiple modes according to an embodiment of the invention;
FIG. 2A is a schematic diagram of a prior art Cascading Pyramid Network (CPN) framework;
FIG. 2B is a schematic diagram of an Improved Cascaded Pyramid Network (ICPN) framework employed by an embodiment of the present invention;
FIG. 3 is an example of a human-object semantic enhancement module sample combination process according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of the triplet loss function principle;
fig. 5 is a schematic diagram of the structure of a visual-semantic external attention module according to an embodiment of the present invention.
Reference numerals illustrate: 101-a target detection module; 102-a feature extraction module; 103-human-object semantic enhancement module; 104-a vision-semantic external attention module; 105-multimode fusion reasoning and interaction detection module.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a schematic diagram of a multi-mode-based person interaction relation detection method according to an embodiment of the present invention, and as shown in fig. 1, the embodiment provides a multi-mode-based person interaction relation detection method, which includes the following steps:
step S1: inputting an image, carrying out target detection on the input image through a target detection module, outputting a target detection result, wherein the target detection result comprises a human and object boundary box and a corresponding classification label thereof, and taking each human and object boundary box and the classification label thereof as an example;
the target detection is to input an entire original input image, every time an image is input, people and object boundary boxes and instance labels with confidence degrees larger than a preset threshold value are selected through a network positioning frame and screened out to serve as input of feature extraction in the next stage, and therefore effective people and objects for researching interaction behavior analysis and reasoning are finally researched and mainly depend on the confidence degrees.
In this embodiment, the input image is subjected to target detection in step S1 by using an improved Faster R-CNN network. Faster R-CNN is a commonly used target detection network, and the principle and specific process of target detection are not described here. The improved Faster R-CNN network of this embodiment differs slightly from the existing Faster R-CNN detection network structure: the existing Faster R-CNN detection network is formed by combining a CNN network with an RPN network, and its backbone is VGGNet (Visual Geometry Group Network). In order to extract the valid person and object bounding boxes and instance labels in the image more accurately, the VGGNet in the original Faster R-CNN detector is replaced with ResNet-50 (Residual Network), and the confidence threshold in the original RPN algorithm is reset to, for example but not limited to, 0.5; the threshold is set so as to include as many target boxes as possible and to avoid the problem of a low detection rate caused by insufficient instances.
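The following minimal sketch illustrates this detection-and-screening step. It uses the off-the-shelf torchvision Faster R-CNN with a ResNet-50 FPN backbone as a stand-in for the patent's improved detector; the exact network modifications, the label space and the weight-loading argument are assumptions of this sketch:

```python
import torch
import torchvision

def detect_instances(image_chw: torch.Tensor, score_thresh: float = 0.5):
    """Return person/object bounding boxes, labels and scores above score_thresh."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()
    with torch.no_grad():
        pred = model([image_chw])[0]          # dict with 'boxes', 'labels', 'scores'
    keep = pred["scores"] >= score_thresh     # confidence screening described above
    return pred["boxes"][keep], pred["labels"][keep], pred["scores"][keep]

# hypothetical usage with a 3x480x640 image tensor scaled to [0, 1]:
# boxes, labels, scores = detect_instances(torch.rand(3, 480, 640))
```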
Step S2: extracting human body posture characteristics in target detection results by combining with an improved cascading pyramid network (Improving Cascaded Pyramid Network, ICPN), and adopting a graph model structure to perform extraction on Extracting human-object visual characteristics in the target detection result, and then fusing human posture characteristics and human-object visual characteristics to obtain final visual characteristics F m
Because the graph model structure can not well extract the blocked or blurred visual features, and can easily cause the false detection or omission of rare HOI categories, the embodiment of the invention combines the improved cascading pyramid network (Improving Cascaded Pyramid Network, ICPN) to extract the human body posture features, thereby enhancing the effective extraction of HOI key features and reducing the occurrence of over-fitting or gradient disappearance.
In order to accurately locate the 17 human body keypoints while reducing the network operation cost as much as possible, the invention proposes an improved cascaded pyramid network to extract the human body posture features and thereby further strengthen the extraction of features relevant to the human interaction region. FIG. 2A is a schematic diagram of the frame structure of the existing Cascaded Pyramid Network (CPN), and FIG. 2B is a schematic diagram of the frame structure of the Improved Cascaded Pyramid Network (ICPN) used in an embodiment of the present invention. As shown in FIG. 2A and FIG. 2B, the ICPN is improved from the CPN. Unlike the CPN, in this embodiment a 1x1 convolution filter is added to the final residual block of the different convolution features of the CPN to generate the keypoint heatmaps; in addition, in the last layer of the different convolution features, the ICPN of this embodiment selects only the odd-numbered layer features for element-wise summation before they are fed together into the hard-keypoint extraction part to detect hard keypoints, which maintains the spatial resolution and semantic information of the feature layers while reducing the running cost of the network. Some functions of the network architecture design also differ. Specifically, the person bounding boxes generated from the target detection result are input into the ICPN, which locates and evaluates the human body keypoints within the human bounding boxes and generates a human body posture estimation map; the posture estimation map and the object bounding boxes are then mapped to the global features, the persons unlikely to be involved in an interaction relation are suppressed (i.e. filtered) based on the position information of the posture estimation map and the object bounding boxes, and the person features likely to be involved in an interaction are sent into the convolutional neural network for the next stage of feature extraction, thereby reducing the extraction of invalid features and increasing the weight proportion of the valid features.
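The odd-layer fusion idea can be sketched as follows, under the assumption that the CPN backbone yields a list of feature maps already projected to a common channel width; the layer indexing, tensor shapes and the 1x1 heatmap head are illustrative rather than the patent's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OddLayerFusion(nn.Module):
    def __init__(self, channels: int = 256, num_keypoints: int = 17):
        super().__init__()
        # 1x1 convolution appended after the final residual block to predict keypoint heatmaps
        self.heatmap_head = nn.Conv2d(channels, num_keypoints, kernel_size=1)

    def forward(self, pyramid_feats):
        # keep only the odd-numbered layers (1st, 3rd, ... counting from 1)
        odd_feats = [f for i, f in enumerate(pyramid_feats) if (i + 1) % 2 == 1]
        target_size = odd_feats[0].shape[-2:]
        # element-wise summation after resizing to a common spatial resolution
        fused = sum(F.interpolate(f, size=target_size, mode="bilinear", align_corners=False)
                    for f in odd_feats)
        # the heatmaps are then passed on to the hard-keypoint extraction part
        return fused, self.heatmap_head(fused)
```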
In order to fully utilize the potential information such as vision, spatial position, instance labels and the like in the main character relation and auxiliary relation in the original input image, the invention proposes to construct an initial dense undirected graph model based on interaction characteristics by taking an effective human and object instance bounding box generated by a target detection module as nodes and taking the potential interaction relation existing between the effective human and object instance bounding box as edges; inputting the graph model into an interaction characteristic generation algorithm in the form of an adjacency matrix to learn and update the structural relation between the person and the object, namely, adaptively traversing each node in the graph model to update the graph model, so as to obtain a better figure interaction characteristic graph model; and finally, inputting the character interaction characteristic diagram model into a diagram convolutional neural network, and obtaining a final character interaction characteristic diagram through continuous learning.
In this embodiment, the specific process by which the graph model structure in step S2 extracts the human-object visual features from the target detection result and then obtains the final visual feature F_m through fusion comprises:
step S201: aiming at an original input image I, an initial dense undirected graph model G is constructed by taking a human and object boundary box in a target detection result as a node and taking a potential interaction relation existing between a human and an object as an edge through the following formula:
G=(V,E) (1)
Wherein V is a set formed by all nodes, and E is a set of all edges;
wherein each human and object bounding box obtained from the original input image I corresponds to a node v_i of the initial dense undirected graph model G, i.e. v_i ∈ V; the potential interaction between different person and object bounding boxes is represented as an edge e_ij of the initial dense undirected graph model G, i.e. e_ij ∈ E (e_ij denotes the edge connecting node i and node j, i.e. the edge along which the two bounding boxes interact);
Step S202: setting n instances obtained from the original input image I, an initial dense undirected graph model G with n(n-1)/2 edges is obtained;
since the graph-structured data is transformed from the instances into adjacency-matrix form for computation, the initial dense undirected graph model G is represented as an n×n adjacency matrix A_{n×n} whose elements a_ij ∈ {0,1}, where 0 indicates that no interaction relation exists between node i and node j, and 1 indicates that an interaction relation exists between node i and node j;
Step S203: taking i as a central node, the graph convolution operator is calculated by the following formula:
h_i^{(n+1)} = σ( Σ_{j∈N_i} (1/C_ij) · W_{R_j} · h_j^{(n)} )   (2)
where h_i^{(n+1)} denotes the feature representation of node i at the (n+1)-th layer; C_ij denotes the normalization factor, e.g. the reciprocal of the node degree; N_i denotes the set of neighbor nodes of node i, including the node's own information; R_j denotes the type of node j; W_{R_j} denotes the transformation weight parameter for nodes of type R_j; and σ(·) denotes a nonlinear activation function;
because the graph convolutional neural network is a network capable of performing deep learning on the graph data structure, the computed graph convolutional operator can be applied to the graph convolutional neural network, and the graph model needs to traverse and continuously learn updated nodes in an interactive feature generation algorithm.
Step S204: each node transmits the characteristic information of the node to the adjacent node after transformation, and then the characteristic information of the adjacent node is gathered and subjected to nonlinear transformation, so that better interaction characteristics are generated according to interaction relations among people, and the expression capacity of a model is further improved, wherein the traversal process of each node is as shown in a formula (3) and a formula (4):
where h denotes a target object determined to be a person, i.e. a person node, and H denotes the total number of persons in the image; o denotes a target object determined to be an object, i.e. an object node, and O denotes the total number of objects in the image; W is a weight matrix; f_ho denotes the interaction feature between person node h and object node o; f_ho' denotes the interaction feature between person node h and another object node o' (o' ∈ O and o' ≠ o); f̂_ho denotes the updated interaction relation between person node h and object node o; and S denotes the probability score of the updated interaction relation between person node h and object node o;
Step S205: the interactive characteristics of the human node h and the object node o are obtained through the interactive characteristic generating algorithm of the following formulas (5) and (6):
where F denotes the newly generated interaction feature vector, f denotes the feature vector extracted from the original image I, and V denotes the human-object interaction relation;
Step S206: fusing the human body posture features and the human-object interaction features by element-wise addition (the add method) to generate the final visual feature F_m.
Step S3: combining, through a human-object semantic enhancement module, the classification labels in the target detection result as predicted values with samples of a preset data set, generating new phrase embeddings through a sample combination algorithm, then generating sample phrase word vectors through an instance label sample coding algorithm, and generating the label category similarity by table lookup, so as to obtain the most similar sample phrase embedding vector features and the corresponding similarity scores.
The module is mainly composed of an example tag coding algorithm so as to realize sample combination and coding of example tags, and aims to solve the long tail problem in character interaction detection.
In this embodiment, step S3 specifically includes:
Step S301: the classification labels in the target detection result are used as a predicted value to be combined with a sample of a preset data set and are sent into a person-object instance coding algorithm to obtain corresponding word embedding;
Step S302: sending words representing verbs and objects into a word class coding algorithm to generate a verb set and an object set;
step S303: all word embedding obtained in the step S302 is sent to a sample combination algorithm to generate more training sample data, and then related word embedding is connected in the form of 'people, verbs and objects', so as to form phrase embedding;
step S304: embedding the formed phrase into the generated sample phrase word vector features through an example tag sample coding algorithm;
step S305: finding the most similar sample phrase embedding in a given lookup table (LUT) for the sample phrase word vector, and obtaining the corresponding similarity score.
FIG. 3 is an example of the sample combination process of the human-object semantic enhancement module according to an embodiment of the present invention. As shown in FIG. 3, a person, a verb and an object are used to form a phrase consisting of three words; for example, the phrase "human riding horse" may be derived from the original human-object interaction triplet (human, riding, horse). The sample combination takes the HICO-DET dataset (a human-object interaction detection dataset) as an example, which contains 117 different verbs and 80 different objects. For each verb, this embodiment finds the K words (i.e. top-K) with the closest meaning in the whole dataset, but keeps only those whose similarity is greater than a given similarity threshold t_sim; thus, each verb has its own neighborhood set. For example, given the query term "ride" and a similarity threshold t_sim = 0.7, the neighborhood set found may be {('stretch', 0.81), ('sit', 0.76), ('hold', 0.71)}, where the floating-point numbers in brackets indicate the similarity to the query term "ride". Each object is processed in the same way to obtain its object neighborhood set.
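A minimal sketch of building such a neighborhood set from a word-vector dictionary; the embedding dictionary, the cosine-similarity measure and the parameter names are illustrative assumptions:

```python
import numpy as np

def neighborhood_set(query: str, embeddings: dict, t_sim: float = 0.7, top_k: int = 5):
    """Return the top-k words most similar to `query` whose similarity exceeds t_sim."""
    q = embeddings[query]
    scored = []
    for word, vec in embeddings.items():
        if word == query:
            continue
        sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec) + 1e-12))
        if sim > t_sim:
            scored.append((word, round(sim, 2)))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# e.g. neighborhood_set("ride", word_vectors) might yield [("sit", 0.76), ("hold", 0.71), ...]
```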
Then, sample combination is carried out through the sample combination algorithm. During training, a ground-truth value is given (it may be the original value annotated for the input picture). Still taking "human riding horse" as an example: for the verb "riding", a sample may be drawn from its neighborhood set {'pulling', 'holding', 'crossing', ...} with probability p_v, while "riding" itself is kept with probability 1-p_v; for the object "horse", a sample may be drawn from its neighborhood set, e.g. {'wild horse', 'war horse', 'courser horse', 'dwarf horse', ...}, with probability p_o, while "horse" itself is kept with probability 1-p_o. In this way, the original ground truth "human riding horse" can be expanded into 20 different samples (cross combinations of the verb and object candidates), which greatly increases the diversity of the instance label samples and further enriches the training data of rare HOI interaction categories.
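A minimal sketch of this sampling step; the probabilities, neighbor lists and phrase format are illustrative:

```python
import random

def combine_sample(verb: str, obj: str, verb_neighbors: list, obj_neighbors: list,
                   p_v: float = 0.5, p_o: float = 0.5) -> str:
    """Replace the verb/object of a ground-truth triplet by a sampled neighbor with probability p_v/p_o."""
    v = random.choice(verb_neighbors) if verb_neighbors and random.random() < p_v else verb
    o = random.choice(obj_neighbors) if obj_neighbors and random.random() < p_o else obj
    return f"person {v} {o}"   # the phrase is later encoded into a sample phrase word vector

# with 3 verb neighbors and 4 object neighbors, repeated draws cover (1+3) x (1+4) = 20 phrases
```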
Finally, the word vector O output by the instance label sample coding algorithm is used to query a given LUT for the most similar embedding and its similarity score. The LUT is a dense dictionary obtained by screening a pre-trained corpus (e.g., a soaring word vector corpus, but not limited thereto), and ultimately contains only the sample phrase embeddings that occur in the training set. The similarity score may be calculated using equation (7).
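A minimal sketch of this lookup; cosine similarity is an assumption here, since equation (7) is not reproduced in this text:

```python
import numpy as np

def lut_lookup(phrase_vec: np.ndarray, lut: dict):
    """Return the most similar sample phrase embedding key in the LUT and its similarity score."""
    best_key, best_score = None, -1.0
    for key, vec in lut.items():
        score = float(np.dot(phrase_vec, vec) /
                      (np.linalg.norm(phrase_vec) * np.linalg.norm(vec) + 1e-12))
        if score > best_score:
            best_key, best_score = key, score
    return best_key, best_score
```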
label sample combining can be understood as a process of extraction of a word vector whose purpose is to predict a phrase embedding approaching a given truth value, while this process only attempts to approximate the relationship between the predicted word and the truth value, and does not take into account the relationship between the predicted word and the predicted word. In particular, we should aggregate all the predictors around their corresponding truth values, but it is also possible that the vectors of predictors belonging to different classes are close to each other. Thus, to address this problem, the present implementation selects a classical triplet loss function to fine tune it.
Fig. 4 is a schematic diagram of the principle of the triplet loss function, as shown in fig. 4, in this embodiment, when more training sample data is generated in step S303, the triplet loss function is used to fine tune the training sample data combination, and word vectors that are judged to be similar but different in category are far away from each other, where the definition of the triplet loss function is as shown in formula (8):
L_triplet = max( d(A, P) − d(A, N) + m, 0 )   (8)
where A is the anchor, i.e. each output embedding itself (e.g. "riding"); a word belonging to the same category as the anchor A is regarded as a positive sample and denoted P; otherwise it is regarded as a negative sample and denoted N; m denotes an offset (margin); d(A, P) denotes the distance between the anchor A and the positive sample P, and d(A, N) denotes the distance between the anchor A and the negative sample N. The effect of the triplet loss adopted in this embodiment is to pull words of the same category closer together while pushing apart words that are similar but belong to different categories.
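A minimal fine-tuning sketch using PyTorch's built-in triplet margin loss, which has the max(d(A,P) − d(A,N) + m, 0) form of formula (8) as reconstructed above; the encoder, optimizer and batch construction are assumptions:

```python
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.2, p=2)   # the margin plays the role of the offset m

def fine_tune_step(encoder, optimizer, anchor_words, positive_words, negative_words):
    a = encoder(anchor_words)     # embedding of the anchor itself (e.g. "riding")
    p = encoder(positive_words)   # word of the same category as the anchor
    n = encoder(negative_words)   # similar word of a different category
    loss = triplet_loss(a, p, n)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```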
Step S4: obtaining, through the linear weighted summation calculation of the visual-semantic external attention module, the human-object interaction region features most relevant in the original image, namely the interaction feature prediction score of the visual feature part.
The main purpose of the visual-semantic external attention module is to obtain the most distinguishable features in the global features through linear weighted summation calculation, namely, similar characterization can be carried out on similar objects in different samples through learning potential relations in the samples.
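A minimal sketch of this linear weighted summation, following formulas (9)-(11) described in steps S401-S405 below; the softmax form of the Norm operation and the tensor shapes are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def external_attention(query: torch.Tensor,      # (B, d)  fused visual feature F_m
                       d_key: torch.Tensor,      # (M, d)  sample phrase embedding keys
                       d_val: torch.Tensor):     # (M, C)  similarity-score values per entry
    att = F.softmax(query @ d_key.t(), dim=-1)   # A_tt = Norm(Query · D_key)            (9)
    f_cls = att @ d_val                          # f_cls = A_tt · D_val                  (10)
    return F.softmax(f_cls, dim=-1)              # interaction feature prediction score  (11)
```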
Fig. 5 is a schematic structural diagram of the visual-semantic external attention module according to an embodiment of the present invention. As shown in Fig. 5, in this embodiment step S4 specifically includes:
Step S401: using the final visual feature F_m obtained in step S2 as the Query of the visual-semantic external attention dictionary, using the most similar sample phrase embedding vector features obtained in step S3 as the keys D_key of the external attention dictionary, and using the corresponding similarity scores as the values D_val of the external attention dictionary;
Step S402: performing a linear weighted summation of the Query and the keys D_key to obtain the most discriminative part of the global features (the corresponding operator in Fig. 5 denotes a linear weighted summation);
Step S403: normalizing the result obtained in step S402 (the Norm operation) to obtain the similarity weight matrix A_tt, which can be expressed as:
A_tt = Norm(Query · D_key)   (9);
Step S404: performing a linear weighted summation of the similarity weight matrix A_tt and the values D_val to learn the label class score f_cls of the part most similar to the external semantic dictionary features, where f_cls can be expressed as:
f_cls = A_tt · D_val   (10);
Step S405: feeding all obtained label class scores f_cls into a Softmax classifier to perform the human-object interaction prediction classification operation, and normalizing the result into a probability distribution to obtain the final interaction feature prediction score Ŝ^v of the visual feature part, which can be expressed as:
Ŝ^v = Softmax(f_cls)   (11).
step S5: and obtaining a final character interaction relation detection result through multi-mode fusion reasoning and interaction detection.
Because the visual features obtained in step S4 and the semantic features obtained in step S3 belong to two different modalities, they are fused before the human-object interaction detection reasoning is carried out. This embodiment adopts a multi-modal joint training method in which the features of each modality are first classified and the classification results are then fused, so that erroneous interference between the modalities can be avoided while the most relevant interaction region features are still obtained from the semantic learning task, thereby improving the HOI detection performance. Accordingly, the last layer of the convolutional neural network of the visual feature extraction part in step S4 is fully connected to a Softmax layer (the Softmax classifier of step S405) to obtain the final prediction score Ŝ^v.
In this embodiment, step S5 specifically includes:
Step S501: performing an addition operation (add operation) on the interaction feature prediction score Ŝ^v of the visual feature part obtained in step S4 and the similarity score Ŝ^s, obtained in step S3, corresponding to the most similar sample phrase embedding vector feature, and outputting a result score;
step S502: normalizing the output result Score by the following formula to obtain a final interaction category probability Score:
Score = σ( Ŝ^v + β · Ŝ^s )   (12)
where the superscript v denotes the visual feature vector, the superscript s denotes the semantic feature vector, σ(·) denotes the normalization function, and β denotes a learnable hyper-parameter, typically a predetermined value.
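A minimal sketch of this late fusion under the reconstruction of formula (12) above; the sigmoid choice for the normalization σ(·) is an assumption of this sketch:

```python
import torch

def fuse_scores(score_visual: torch.Tensor, score_semantic: torch.Tensor,
                beta: torch.Tensor) -> torch.Tensor:
    """Add the visual prediction score and the beta-weighted semantic similarity score, then normalize."""
    return torch.sigmoid(score_visual + beta * score_semantic)

# beta can be registered as nn.Parameter(torch.tensor(1.0)) so that it is learned jointly
```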
Referring to fig. 1 again, the present embodiment further provides a multi-modal-based person interaction relation detection system for executing the method described above, which includes:
a target detection module 101 for performing target detection on an input image;
the feature extraction module 102 is connected with the target detection module 101 and is used for extracting human-object visual features;
the human-object semantic enhancement module 103 is connected with the target detection module 101 and is used for calculating the most similar sample phrase embedded vector features and the corresponding similarity scores;
a visual-semantic external attention module 104 connected to the feature extraction module 102 and the human-object semantic enhancement module 103, respectively, for calculating an interactive feature prediction score of the visual feature part;
the multi-mode fusion reasoning and interaction detection module 105 is respectively connected with the human-object semantic enhancement module 103 and the visual-semantic external attention module 104 and is used for calculating a final human interaction relation detection result.
The method and the system for detecting the interaction relationship of the characters based on the multiple modes have the following advantages:
1) A new visual-semantic external attention mechanism is provided, and an example label sample obtained by a human-object semantic enhancement module is used for combining word vectors and similarity as keywords and values of an external dictionary; taking the final character relation feature map obtained by fusing the map model features and the human body posture features as query conditions, and then carrying out linear weighted summation on key human and object features in the original image to obtain interactive related region features with the most distinguishing features in the global features, so that the HOI detection rate is further improved;
2) The external attention network structure based on the visual-semantic features comprises a feature extraction module, a human-object semantic enhancement module and a visual-semantic external attention module, wherein human and object bounding boxes generated by a target detection network and corresponding instance class labels are respectively sent to the feature extraction module and the human-object semantic enhancement module, obtained label combination sample word vectors and similarity scores are used as keywords and values of the external attention module of the visual-semantic features, visual features obtained after fusion of human body posture features and human interaction features are used as query conditions, linear weighting calculation is performed, interaction region features most relevant to the visual-semantic portions in an original image are obtained, and finally a final human interaction relation detection result is obtained through multi-modal fusion reasoning;
3) The Improved Cascade Pyramid Network (ICPN) is provided, so that the spatial resolution and semantic information of a feature layer can be maintained, and the running cost of the network can be reduced; and finally, the character which cannot possibly generate interaction relation is restrained (i.e. filtered) through the position information of the human body posture diagram and the object boundary box in the network algorithm, character features which can possibly generate interaction are sent into the convolutional neural network to perform next feature extraction, so that the extraction of invalid features is reduced, and the weight proportion of the effective features is increased.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
Those of ordinary skill in the art will appreciate that: the modules in the apparatus of the embodiments may be distributed in the apparatus of the embodiments according to the description of the embodiments, or may be located, with corresponding changes, in one or more apparatuses different from those of the present embodiments. The modules of the above embodiments may be combined into one module, or may be further split into a plurality of sub-modules.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. The character interaction relation detection method based on the multiple modes is characterized by comprising the following steps:
step S1: inputting an image, carrying out target detection on the input image through a target detection module, and outputting a target detection result, wherein the target detection result comprises human and object bounding boxes and their corresponding classification labels, and each human and object bounding box together with its classification label is taken as an instance;
step S2: extracting human body posture features from the target detection result using an improved cascade pyramid network, extracting human-object visual features from the target detection result using a graph model structure, and then fusing the human body posture features and the human-object visual features to obtain a final visual feature F_m;
step S3: combining, through a human-object semantic enhancement module, the classification labels in the target detection result as predicted values with samples of a preset data set, generating new phrase embeddings through a sample combination algorithm, then generating sample phrase word vectors through an instance label sample coding algorithm, and generating label category similarities by table lookup, thereby obtaining the most similar sample phrase embedding vector features and the corresponding similarity scores;
step S4: obtaining, through a linear weighted summation calculation in the visual-semantic external attention module, the character interaction region features in the original image that are most relevant, namely the interaction feature prediction scores of the visual feature part;
Step S5: and obtaining a final character interaction relation detection result through multi-mode fusion reasoning and interaction detection.
2. The method for detecting human interaction relationship based on multiple modes according to claim 1, wherein in step S1, target detection is performed on the input image specifically using a modified Faster R-CNN network, and the backbone of the modified Faster R-CNN network is ResNet-50.
3. The multi-modal based person interaction relation detection method according to claim 1, wherein the improved cascade pyramid network in step S2 specifically adds a 1×1 convolution filter to the last residual block of each of the different convolution features of the cascade pyramid network to generate heat maps of key points; in the last layer of the different convolution features, the improved cascade pyramid network selects only the odd-numbered layer features for element-wise summation, and the summed result is then input to the difficult-key-point extraction part to detect difficult key points; the specific process of extracting the human body posture features comprises:
inputting the character bounding boxes generated from the target detection result into the improved cascade pyramid network to locate and evaluate the human body key points within the human bounding boxes, and generating a human body posture estimation map;
mapping the human body posture estimation map and the human bounding boxes onto the global features, and suppressing, based on the position information of the human body posture estimation map and the human bounding boxes, the human bodies that are unlikely to be in an interaction relation, wherein the human body features that may produce an interaction are taken as the human body posture features.
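The sketch below illustrates, under stated assumptions (a four-stage backbone and 17 key points), the two modifications described in claim 3: a 1×1 convolution on the last residual block of each convolution feature to produce key point heat maps, and element-wise summation of only the odd-numbered feature layers before the difficult-key-point extraction part. It is an illustrative approximation, not the patented ICPN.

```python
# Illustrative ICPN head: 1x1 heat-map convolutions per stage plus summation of
# odd-numbered layers only. Channel counts and key point number are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImprovedCPNHead(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), num_keypoints=17):
        super().__init__()
        # 1x1 convolution filter appended to the last residual block of each stage
        self.heatmap_convs = nn.ModuleList(
            [nn.Conv2d(c, num_keypoints, kernel_size=1) for c in in_channels]
        )

    def forward(self, stage_features):
        # stage_features: list of (B, C_i, H_i, W_i) tensors from the backbone
        heatmaps = [conv(f) for conv, f in zip(self.heatmap_convs, stage_features)]
        # bring every heat map to the resolution of the highest-resolution stage
        target = heatmaps[0].shape[-2:]
        heatmaps = [F.interpolate(h, size=target, mode="bilinear", align_corners=False)
                    for h in heatmaps]
        # element-wise summation of the odd-numbered layers only (1st, 3rd, ...)
        fused = sum(h for i, h in enumerate(heatmaps) if (i + 1) % 2 == 1)
        return fused  # would be fed to the difficult-key-point extraction part

if __name__ == "__main__":
    feats = [torch.randn(1, c, 64 // 2 ** i, 48 // 2 ** i)
             for i, c in enumerate((256, 512, 1024, 2048))]
    print(ImprovedCPNHead()(feats).shape)  # torch.Size([1, 17, 64, 48])
```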
4. The method for detecting human interaction relation based on multiple modes according to claim 1, wherein the graph model structure in step S2 extracts the human-object visual features from the target detection result, and the final visual feature F_m is then obtained through fusion; the specific process comprises:
step S201: for an original input image I, constructing an initial dense undirected graph model G by taking the human and object bounding boxes in the target detection result as nodes and the potential interaction relations existing between humans and objects as edges, according to the following formula:
G = (V, E)    (1)
wherein V is the set formed by all nodes, and E is the set of all edges;
wherein each human and object bounding box derived from the original input image I corresponds to a node v_i of the initial dense undirected graph model G, i.e. v_i ∈ V; the interaction between different human and object bounding boxes is represented as an edge e_ij of the initial dense undirected graph model G, i.e. e_ij ∈ E;
step S202: assuming that n instances are obtained from the original input image I, the initial dense undirected graph model G has n(n-1)/2 edges in total and is represented as an n×n adjacency matrix A_(n×n), whose matrix elements take values in {0, 1}, wherein 0 indicates that no interaction relation exists between node i and node j, and 1 indicates that an interaction relation exists between node i and node j;
step S203: taking i as the central node, calculating the graph convolution operator by the following formula:
x_i^(n+1) = σ( Σ_(j∈N_j) (1/C_ij) · W_(R_j) · x_j^(n) )    (2)
wherein x_i^(n+1) denotes the feature representation of node i at the (n+1)-th layer; C_ij denotes the normalization factor; N_j denotes the neighbor nodes of node j, including the node's own information; R_j denotes the type of node j; and W_(R_j) denotes the transform weight parameter for nodes of type R_j;
step S204: each node transforms its own feature information and transmits it to its adjacent nodes, then gathers the feature information of its adjacent nodes and applies a nonlinear transformation; the traversal process of each node is specifically expressed as formula (3) and formula (4):
wherein h denotes a target object determined to be a person, namely a person node, and H denotes the total number of persons in the image; o denotes a target object determined to be an object, namely an object node, and O denotes the total number of objects in the image; W is a weight matrix; x_(h,o) is the interaction between person node h and object node o; x_(h,o') is the character interaction relation between person node h and another object node o', wherein o' ∈ O and o' ≠ o; x'_(h,o) is the updated interaction relation between person node h and object node o; and S is the probability score of the updated interaction relation between person node h and object node o;
step S205: obtaining the interaction features of person node h and object node o through the interaction feature generation algorithm of formulas (5) and (6):
wherein F denotes the newly generated interaction feature vector, f denotes the feature vector extracted from the original image I, and V denotes the character interaction relation;
step S206: fusing the human body posture features and the character interaction features by element-wise addition to generate the final visual feature F_m.
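The sketch below illustrates the graph construction of steps S201-S204: a dense undirected graph over the n detected instances, its n×n adjacency matrix, and one typed message-passing step followed by feature fusion by addition. The two node types, the degree-based normalisation and the feature dimension are assumptions made for illustration.

```python
# Dense undirected graph over detected instances plus one typed message-passing
# step; the normalisation and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

def build_dense_adjacency(n: int) -> torch.Tensor:
    # n(n-1)/2 undirected edges: every instance pair is a potential interaction
    return torch.ones(n, n) - torch.eye(n)   # A[i, j] = 1 means a potential edge

class TypedGraphConv(nn.Module):
    """Each node's feature is transformed with the weight of that node's type,
    sent to its neighbours, aggregated with a normalisation factor, and passed
    through a non-linearity, in the spirit of the traversal around formulas (3)-(4)."""
    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.ModuleDict({"human": nn.Linear(dim, dim),
                                "object": nn.Linear(dim, dim)})

    def forward(self, x, adj, node_types):
        # x: (n, dim) node features; adj: (n, n); node_types: "human" / "object" per node
        messages = torch.stack([self.w[t](x[j]) for j, t in enumerate(node_types)])
        degree = adj.sum(dim=1, keepdim=True).clamp(min=1)   # simple normalisation factor
        aggregated = adj @ messages / degree                 # gather neighbour messages
        return torch.relu(x + aggregated)                    # fuse by addition

if __name__ == "__main__":
    n, dim = 4, 128
    x = torch.randn(n, dim)
    adj = build_dense_adjacency(n)
    layer = TypedGraphConv(dim)
    print(layer(x, adj, ["human", "human", "object", "object"]).shape)  # (4, 128)
```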
5. The method for detecting human interaction relation based on multiple modes according to claim 1, wherein step S3 specifically comprises:
step S301: combining the classification labels in the target detection result, as predicted values, with the samples of the preset data set, and sending them into a person-object instance coding algorithm to obtain the corresponding word embeddings;
step S302: sending the words representing verbs and objects into a word class coding algorithm to generate a verb set and an object set;
step S303: sending all the word embeddings obtained in step S302 into a sample combination algorithm to generate more training sample data, and then connecting the related word embeddings in the form of "person, verb, object" to form phrase embeddings;
step S304: encoding the formed phrase embeddings into sample phrase word vector features through an instance label sample coding algorithm;
step S305: comparing the sample phrase word vector features with a given lookup table to find the most similar sample phrase embedding vector features and obtain the corresponding similarity score.
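The sketch below illustrates steps S303-S305: connecting "person, verb, object" word vectors into a phrase embedding and retrieving the most similar entry from a lookup table together with its similarity score. The use of cosine similarity and the random tensors standing in for real word vectors are assumptions made for illustration.

```python
# Phrase embedding by concatenation and nearest-neighbour lookup; cosine
# similarity and the stand-in word vectors are illustrative assumptions.
import torch
import torch.nn.functional as F

def make_phrase_embedding(person_vec, verb_vec, object_vec):
    # connect the related word embeddings in the form "person, verb, object"
    return torch.cat([person_vec, verb_vec, object_vec], dim=-1)

def most_similar_phrase(phrase_vec, lookup_table):
    # lookup_table: (M, 3*D) sample phrase embedding vectors
    sims = F.cosine_similarity(phrase_vec.unsqueeze(0), lookup_table, dim=-1)
    score, idx = sims.max(dim=0)
    return lookup_table[idx], score   # most similar phrase embedding + similarity score

if __name__ == "__main__":
    d = 300
    query = make_phrase_embedding(torch.randn(d), torch.randn(d), torch.randn(d))
    table = torch.randn(50, 3 * d)    # 50 sample phrases from the combination algorithm
    emb, score = most_similar_phrase(query, table)
    print(emb.shape, float(score))
```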
6. The method for detecting interaction relation between characters based on multiple modes according to claim 1, wherein, when more training sample data are generated in step S303, a triplet loss function is used to fine-tune the training sample data combination so that word vectors which are judged to be similar but belong to different categories are pushed away from each other, wherein the triplet loss function is defined as:
L(A, P, N) = max(d(A, P) - d(A, N) + m, 0)
wherein A is the anchor point; a word belonging to the same category as the anchor point A is regarded as a positive sample, denoted P; otherwise it is regarded as a negative sample, denoted N; m denotes the offset (margin); d(A, P) denotes the distance between the anchor point A and the positive sample P, and d(A, N) denotes the distance between the anchor point A and the negative sample N.
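The sketch below implements the triplet loss of claim 6 in its standard margin form; the Euclidean distance and the margin value are illustrative choices. PyTorch's built-in nn.TripletMarginLoss provides an equivalent formulation.

```python
# Standard margin-based triplet loss: L = max(d(A, P) - d(A, N) + m, 0).
# Distance metric and margin value are illustrative assumptions.
import torch

def triplet_loss(anchor, positive, negative, margin: float = 0.2):
    d_ap = torch.norm(anchor - positive, dim=-1)   # d(A, P)
    d_an = torch.norm(anchor - negative, dim=-1)   # d(A, N)
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()

if __name__ == "__main__":
    a, p, n = torch.randn(8, 300), torch.randn(8, 300), torch.randn(8, 300)
    print(float(triplet_loss(a, p, n)))
```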
7. The method for detecting a human interaction relationship based on multiple modes according to claim 1, wherein step S4 specifically comprises:
step S401: using the final visual feature F_m obtained in step S2 as the query condition Query of the visual-semantic external attention dictionary, using the most similar sample phrase embedding vector features obtained in step S3 as the keywords D_key of the external attention dictionary, and using the corresponding similarity scores as the values D_val of the external attention dictionary;
step S402: performing a linear weighted summation operation on the query condition Query and the keywords D_key to obtain the most discriminative part of the global features;
step S403: normalizing the result obtained in step S402 to obtain a similarity weight matrix A_tt;
step S404: performing a linear weighted summation operation on the similarity weight matrix A_tt and the values D_val to learn the label class scores f_cls of the parts most similar to the external semantic dictionary features;
step S405: sending all the obtained label class scores f_cls into a Softmax classifier to perform the character interaction prediction classification operation, and normalizing the result into a probability distribution to obtain the final interaction feature prediction scores.
8. The method for detecting interaction relation between people based on multiple modes according to claim 1, wherein step S5 specifically comprises:
step S501: adding the interaction feature prediction scores of the visual feature part obtained in step S4 and the similarity scores corresponding to the most similar sample phrase embedding vector features obtained in step S3, and outputting a result score;
step S502: normalizing the output result score by the following formula to obtain the final interaction category probability Score:
wherein β represents a learnable hyper-parameter.
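The sketch below illustrates the fusion step of claim 8. Because the normalisation formula itself is not reproduced above, the sigmoid scaled by the learnable β used here is purely an assumed stand-in, not the formula of the patent.

```python
# Score fusion with a learnable beta; the sigmoid normalisation is an assumption.
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(1.0))   # learnable hyper-parameter

    def forward(self, visual_score, semantic_score):
        # step S501: addition of the visual-part score and the semantic similarity score
        score = visual_score + semantic_score
        # step S502 (assumed form): normalise into the final interaction probability
        return torch.sigmoid(self.beta * score)

if __name__ == "__main__":
    fusion = ScoreFusion()
    print(fusion(torch.rand(2, 117), torch.rand(2, 117)).shape)  # torch.Size([2, 117])
```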
9. A multimodal-based person interaction detection system for performing the method of any of claims 1-8, comprising:
the target detection module is used for detecting targets of the input images;
the feature extraction module is connected with the target detection module and is used for extracting human-object visual features;
the human-object semantic enhancement module is connected with the target detection module and is used for calculating the most similar sample phrase embedded vector characteristics and the corresponding similarity scores;
the visual-semantic external attention module is respectively connected with the feature extraction module and the human-object semantic enhancement module and is used for calculating an interactive feature prediction score of the visual feature part;
the multi-mode fusion reasoning and interaction detection module is respectively connected with the human-object semantic enhancement module and the visual-semantic external attention module and is used for calculating a final human interaction relation detection result.
CN202310626312.2A 2023-05-30 2023-05-30 Multi-mode-based character interaction relation detection method and detection system thereof Pending CN116721322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310626312.2A CN116721322A (en) 2023-05-30 2023-05-30 Multi-mode-based character interaction relation detection method and detection system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310626312.2A CN116721322A (en) 2023-05-30 2023-05-30 Multi-mode-based character interaction relation detection method and detection system thereof

Publications (1)

Publication Number Publication Date
CN116721322A true CN116721322A (en) 2023-09-08

Family

ID=87872581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310626312.2A Pending CN116721322A (en) 2023-05-30 2023-05-30 Multi-mode-based character interaction relation detection method and detection system thereof

Country Status (1)

Country Link
CN (1) CN116721322A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953590A (en) * 2024-03-27 2024-04-30 武汉工程大学 Ternary interaction detection method, system, equipment and medium

Similar Documents

Publication Publication Date Title
US10902243B2 (en) Vision based target tracking that distinguishes facial feature targets
Li et al. Object detection using convolutional neural networks in a coarse-to-fine manner
CN110633632A (en) Weak supervision combined target detection and semantic segmentation method based on loop guidance
CN111985525B (en) Text recognition method based on multi-mode information fusion processing
CN115019039B (en) Instance segmentation method and system combining self-supervision and global information enhancement
WO2021243947A1 (en) Object re-identification method and apparatus, and terminal and storage medium
Fathalla et al. A deep learning pipeline for semantic facade segmentation
CN111626291B (en) Image visual relationship detection method, system and terminal
Jemilda et al. Moving object detection and tracking using genetic algorithm enabled extreme learning machine
CN116721322A (en) Multi-mode-based character interaction relation detection method and detection system thereof
CN114332893A (en) Table structure identification method and device, computer equipment and storage medium
Kaur et al. A systematic review of object detection from images using deep learning
Quiroga et al. A study of convolutional architectures for handshape recognition applied to sign language
Cai et al. Vehicle detection based on visual saliency and deep sparse convolution hierarchical model
Shi Object detection algorithms: a comparison
CN115965968A (en) Small sample target detection and identification method based on knowledge guidance
Langenkämper et al. COATL-a learning architecture for online real-time detection and classification assistance for environmental data
Lyu et al. Distinguishing text/non-text natural images with multi-dimensional recurrent neural networks
CN114462490A (en) Retrieval method, retrieval device, electronic device and storage medium of image object
Asami et al. Data Augmentation with Synthesized Damaged Roof Images Generated by GAN.
CN111160078B (en) Human interaction behavior recognition method, system and device based on video image
Zhang et al. Robust Hierarchical Scene Graph Generation
Gao et al. Complex Labels Text Detection Algorithm Based on Improved YOLOv5.
Sarikaya Basturk Forest fire detection in aerial vehicle videos using a deep ensemble neural network model
Kumar et al. Robust object tracking based on adaptive multicue feature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication