CN111783475B - Semantic visual positioning method and device based on phrase relation propagation - Google Patents

Semantic visual positioning method and device based on phrase relation propagation

Info

Publication number
CN111783475B
Authority
CN
China
Prior art keywords
feature
nodes
node
features
subject
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010736120.3A
Other languages
Chinese (zh)
Other versions
CN111783475A (en)
Inventor
俞益洲
史业民
杨思蓓
吴子丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Original Assignee
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenrui Bolian Technology Co Ltd, Shenzhen Deepwise Bolian Technology Co Ltd filed Critical Beijing Shenrui Bolian Technology Co Ltd
Priority to CN202010736120.3A priority Critical patent/CN111783475B/en
Publication of CN111783475A publication Critical patent/CN111783475A/en
Application granted granted Critical
Publication of CN111783475B publication Critical patent/CN111783475B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a semantic visual positioning method and device based on phrase relation propagation. The method includes the following steps: extracting multi-scale features from the image information and fusing them with the position information of each feature to obtain image spatial features; acquiring the text description information and generating a syntax graph with a syntactic parsing tool; learning phrase features, fusing the phrase features with the image spatial features to obtain multi-modal features, and learning phrase-enhanced feature maps; obtaining semantic features and inputting the semantic features and multi-modal features of the subject nodes and of the object nodes into a relationship propagation module to obtain relationship-enhanced feature maps; performing node merging on each node to obtain the final relationship-enhanced feature map; and, for each spatial position of the final relationship-enhanced feature map, selecting the detection box with the highest confidence as the output and outputting the prediction result.

Description

Semantic visual positioning method and device based on phrase relation propagation
Technical Field
The invention relates to the field of computers, in particular to a semantic visual positioning method and device based on phrase relation propagation.
Background
For a computer to understand human behavior and support efficient, accurate human-computer interaction, the key is that it understands natural language and visual scenes simultaneously. To accurately capture the correspondence between language elements and visual elements, the language elements need to be semantically aligned with visual regions. This problem is defined as semantic visual positioning; its main goal is to find the visual region corresponding to a noun phrase in a sentence. It is similar to object detection, but the corresponding region must be detected according to unrestricted noun phrases rather than a fixed set of categories. The main challenges are: 1. the qualifying language attached to a noun affects which object it refers to; 2. selecting the right object may depend on the relative relations among multiple phrases in the sentence. For example, in the sentence "a man playing a violin stands beside another man playing a stringed instrument", detecting the position of the violin helps identify the first man, and elements such as the other instrument likewise play an important role in detecting the corresponding person.
Most existing work considers only the correlation between phrases and objects and performs feature fusion while ignoring the correlations among phrases. Only a few works consider phrase context information, and even these focus only on partial or sparse context, such as referring relations, phrase pairs or context feedback, without explicitly representing linguistic relations. As a result, existing methods struggle with complex semantics, complex linguistic qualifiers or multiple dependency relations, which leads to wrong detections, mismatches and other errors on objects in the scene.
Disclosure of Invention
The present invention aims to provide a semantic visual positioning method and apparatus based on phrase relation propagation that overcome, or at least partially solve, the above-mentioned problems.
To achieve this purpose, the technical solution of the invention is realized as follows:
One aspect of the present invention provides a semantic visual positioning method based on phrase relation propagation, including: acquiring image information, processing the image information with a deep CNN, and extracting multi-scale features of the image information at a plurality of depths; acquiring the position information of each feature; fusing the multi-scale features with the position information of each feature to obtain image spatial features; acquiring text description information and generating a syntax graph with a syntactic parsing tool, wherein the syntax graph comprises nodes and edges, the nodes comprise object nodes and subject nodes, each node corresponds to a word sequence in the text description information, the edges comprise an object-node edge set and a subject-node edge set, and each edge corresponds to a relation between a subject and an object; encoding each word into a word embedding vector, setting the initial phrase embedding feature of each node to the average of the embedding features of all words in the phrase, and learning the phrase features; fusing the phrase features with the image spatial features to obtain multi-modal features, and learning phrase-enhanced feature maps; for each edge, combining the subject node, the object node and the preposition or verb into a sequence, inputting the embedding vector of each word into a bidirectional LSTM, and obtaining the semantic feature by concatenating the hidden vectors of the forward and backward LSTMs; inputting the semantic features and multi-modal features of the subject nodes and the semantic features and multi-modal features of the object nodes into a relationship propagation module to obtain relationship-enhanced feature maps; performing node merging on each node to obtain the final relationship-enhanced feature map; and, for each spatial position of the final relationship-enhanced feature map, matching 3 anchor boxes on the feature maps of multiple scales, selecting the detection box with the highest confidence as the output, and outputting the prediction result.
Inputting the semantic features and multi-modal features of the subject node and the semantic features and multi-modal features of the object node into the relationship propagation module to obtain the relationship-enhanced feature maps includes: computing the relationship-enhanced feature of the subject node by
R_sub = Conv(γ(M′_sub + Tile(g_obj)))
and computing the relationship-enhanced feature of the object node by
R_obj = Conv(γ(M′_obj + Tile(g_sub)))
[the formulas defining the intermediate quantities M′_sub, M′_obj, g_sub and g_obj are given only as equation images], where v_sub is the subject node, M_sub is the multi-modal feature map of the subject node, S_sub is the phrase-enhanced feature map of the subject node, v_obj is the object node, M_obj is the multi-modal feature map of the object node, S_obj is the phrase-enhanced feature map of the object node, and h is the semantic feature of the edge; Linear denotes multilayer fully connected layers, AvgPool denotes global average pooling, ⊙ denotes element-wise multiplication, Conv denotes a convolution layer, and γ denotes the ReLU activation function.
Performing node merging on each node to obtain the final relationship-enhanced feature map includes: combining the relationship-enhanced feature maps generated by the edges in the subject-node edge set and the object-node edge set with the phrase-enhanced feature map, and performing multiple iterations to obtain the final relationship-enhanced feature map.
Obtaining the position information of each feature includes: calculating, for each position, the ratios of its coordinates to the image width and height and the reciprocals of the width and height, and organizing all values into the same dimensionality as the multi-scale features to obtain the position information of each feature. Fusing the multi-scale features with the position information of each feature to obtain the image features includes: normalizing the multi-scale features and concatenating them with the position information of each feature.
In another aspect, the present invention provides a semantic visual positioning apparatus based on phrase relation propagation, including: an image feature extraction module, configured to acquire image information, process the image information with a deep CNN, extract multi-scale features of the image information at a plurality of depths, acquire the position information of each feature, and fuse the multi-scale features with the position information of each feature to obtain image spatial features; a syntax graph construction module, configured to acquire the text description information and generate a syntax graph with a syntactic parsing tool, wherein the syntax graph comprises nodes and edges, the nodes comprise object nodes and subject nodes, each node corresponds to a word sequence in the text description information, the edges comprise an object-node edge set and a subject-node edge set, and each edge corresponds to a relation between a subject and an object; a syntax graph propagation module, configured to encode each word into a word embedding vector, set the initial phrase embedding feature of each node to the average of the embedding features of all words in the phrase, learn the phrase features, fuse the phrase features with the image spatial features to obtain multi-modal features, learn phrase-enhanced feature maps, and, for each edge, combine the subject node, the object node and the preposition or verb into a sequence, input the embedding vector of each word into a bidirectional LSTM, and obtain the semantic feature by concatenating the hidden vectors of the forward and backward LSTMs; a relationship propagation module, configured to receive the semantic features and multi-modal features of the subject nodes and of the object nodes to obtain relationship-enhanced feature maps, and to perform node merging on each node to obtain the final relationship-enhanced feature map; and a prediction module, configured to match 3 anchor boxes on the feature maps of multiple scales for each spatial position of the final relationship-enhanced feature map, select the detection box with the highest confidence as the output, and output the prediction result.
The relationship propagation module obtains the relationship-enhanced feature maps from the semantic features and multi-modal features of the subject nodes and of the object nodes as follows: the relationship propagation module is specifically configured to compute the relationship-enhanced feature of the subject node by
R_sub = Conv(γ(M′_sub + Tile(g_obj)))
and the relationship-enhanced feature of the object node by
R_obj = Conv(γ(M′_obj + Tile(g_sub)))
[the formulas defining the intermediate quantities M′_sub, M′_obj, g_sub and g_obj are given only as equation images], where v_sub is the subject node, M_sub is the multi-modal feature map of the subject node, S_sub is the phrase-enhanced feature map of the subject node, v_obj is the object node, M_obj is the multi-modal feature map of the object node, S_obj is the phrase-enhanced feature map of the object node, and h is the semantic feature of the edge; Linear denotes multilayer fully connected layers, AvgPool denotes global average pooling, ⊙ denotes element-wise multiplication, Conv denotes a convolution layer, and γ denotes the ReLU activation function.
The relationship propagation module performs node merging on each node as follows to obtain the final relationship-enhanced feature map: the relationship propagation module is specifically configured to combine the relationship-enhanced feature maps generated by the edges in the subject-node edge set and the object-node edge set with the phrase-enhanced feature map, and to perform multiple iterations to obtain the final relationship-enhanced feature map.
The image feature extraction module acquires the position information of each feature as follows: it calculates, for each position, the ratios of its coordinates to the image width and height and the reciprocals of the width and height, and organizes all values into the same dimensionality as the multi-scale features to obtain the position information of each feature. The image feature extraction module fuses the multi-scale features with the position information of each feature as follows: it normalizes the multi-scale features and concatenates them with the position information of each feature.
It can be seen that the semantic visual positioning method and apparatus based on phrase relation propagation provided by the invention model arbitrary bidirectional and multiple semantic relations through a relation propagation mechanism, so that different targets in a scene can be accurately distinguished; they provide an effective analysis framework for complex directed graphs, achieving efficient modeling by decomposing multiple relations into several pairwise relations; they support modeling of arbitrary grammatical relations, improving the ability to analyze complex scenes; and by breaking complex relations down into pairwise relations, the analysis difficulty is greatly reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a flowchart of a semantic visual positioning method based on phrase relation propagation according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a semantic visual positioning apparatus based on phrase relationship propagation according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The core of the invention is as follows: by introducing pairwise relations among all nodes of the syntax graph and a relation propagation mechanism in the neural network, the network learns complex multiple relations, and by fusing the information of the same node under different relations it acquires global dependency relations. The invention aims to address the two main challenges of semantic visual positioning simultaneously and to improve the detection accuracy of the algorithm in practical applications.
Considering the important influence of the relative relations among phrases in natural language on object localization, a linguistic-structure-guided propagation network (LSPN) is provided, whose core comprises at least syntax graph construction, phrase relation propagation, and single-stage syntax-graph visual positioning.
Fig. 1 is a flowchart illustrating a semantic visual positioning method based on phrase relationship propagation according to an embodiment of the present invention, and referring to fig. 1, the semantic visual positioning method based on phrase relationship propagation according to an embodiment of the present invention includes:
S1, acquiring image information, processing the image information with a deep CNN, and extracting multi-scale features of the image information at a plurality of depths;
S2, acquiring the position information of each feature;
S3, fusing the multi-scale features with the position information of each feature to obtain image spatial features.
Specifically, steps S1-S3 above are the image feature extraction steps. Spatial features are extracted from the image; spatial features generally refer to the texture features of the image. The position information can be intuitively understood as the specific position of each object and pixel in the current image; when related to language, it corresponds to position descriptions such as "left side" or "middle".
As an optional implementation of the embodiment of the present invention, acquiring the position information of each feature includes: calculating, for each position, the ratios of its coordinates to the image width and height and the reciprocals of the width and height, and organizing all values into the same dimensionality as the multi-scale features to obtain the position information of each feature.
As an optional implementation of the embodiment of the present invention, fusing the multi-scale features with the position information of each feature to obtain the image features includes: normalizing the multi-scale features and concatenating them with the position information of each feature.
In particular, the image may be processed with a deep CNN, and multi-scale features, denoted V, may be extracted at multiple depths.
To further represent the position information of the features, the invention calculates the normalized coordinates x/W and y/H and the reciprocal values 1/W and 1/H at each location and organizes all values into the same dimension as V, denoted P.
The fusion of the spatial features and the position information is then realized by concatenation, expressed as:
F=[L2(V);P]
where L2 denotes L2 normalization and the semicolon denotes feature concatenation. L2 normalization means dividing each element of the vector by the L2 norm of the vector.
For example, for an image of size [300,400], the position information of the pixel at position [100,100] can be represented as [100/300,100/400,1/300,1/400]. This tells the subsequent machine-learning algorithm the normalized coordinates of the current pixel in the image (its position relative to the borders), which plays an important role when the model interprets position words such as "left side" or "middle" in the language and can be used to localize a specific object.
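As an illustration of this fusion, the following sketch (PyTorch; tensor shapes, channel counts and function names are assumptions made for this example, not code from the patent) builds the position map P and the fused feature F = [L2(V); P]:
```python
import torch
import torch.nn.functional as F

def build_position_map(height: int, width: int) -> torch.Tensor:
    """Return a (4, H, W) map holding x/W, y/H, 1/W and 1/H for every location."""
    ys = torch.arange(height, dtype=torch.float32).view(height, 1).expand(height, width)
    xs = torch.arange(width, dtype=torch.float32).view(1, width).expand(height, width)
    return torch.stack([
        xs / width,                        # normalized x coordinate
        ys / height,                       # normalized y coordinate
        torch.full_like(xs, 1.0 / width),  # reciprocal of the width
        torch.full_like(ys, 1.0 / height), # reciprocal of the height
    ], dim=0)

def fuse_visual_and_position(V: torch.Tensor) -> torch.Tensor:
    """V: (C, H, W) CNN feature map -> fused map [L2(V); P] of shape (C + 4, H, W)."""
    _, H, W = V.shape
    V_norm = F.normalize(V, p=2, dim=0)   # L2-normalize the channel vector at each location
    P = build_position_map(H, W)
    return torch.cat([V_norm, P], dim=0)  # channel-wise concatenation

# Example: a 256-channel 13x13 feature map becomes a 260-channel fused map.
fused = fuse_visual_and_position(torch.randn(256, 13, 13))
```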
S4, acquiring the text description information and generating a syntax graph with a syntactic parsing tool, wherein the syntax graph comprises nodes and edges, the nodes comprise object nodes and subject nodes, each node corresponds to a word sequence in the text description information, the edges comprise an object-node edge set and a subject-node edge set, and each edge corresponds to a relation between a subject and an object.
Specifically, step S4 is the syntax graph construction step. It should be noted that steps S1-S3 may be executed simultaneously with step S4, steps S1-S3 may be executed first and then step S4, or step S4 may be executed first and then steps S1-S3; the invention does not limit this order.
The present invention represents statements as graph structures by utilizing syntactic parsing tools such as LTP and the Stanford Parser.
Let L denote the language description. The generated syntax graph is represented as G = {V, E}, where V = {v_n} denotes the set of all nodes and E = {e_k} denotes the set of directed edges. Each v_n corresponds to a word sequence L_n in L, and each edge e_k is expressed as a triple consisting of a subject node, an object node and a relation word, where the subject node and the object node are elements of V and the relation word is a preposition or verb corresponding to the relation between the subject and the object. For this patent, ε_in(v_n) denotes the set of edges taking v_n as object node, ε_out(v_n) denotes the set of edges taking v_n as subject node, and d_n denotes the degree of v_n.
Here, the "in" set is the set formed by all edges taking v_n as object node, and the "out" set is the set of all edges taking v_n as subject node. For ease of understanding, one can imagine a network made of subjects and objects: all edges arriving at a node as its object form the in set, and all edges leaving a node as its subject form the out set.
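A minimal data-structure sketch of such a syntax graph is given below (Python; the node and edge classes and the example sentence are illustrative assumptions — the patent only names external parsers such as LTP and the Stanford Parser, whose APIs are not reproduced here):
```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    index: int
    words: List[str]            # word sequence L_n of the noun phrase

@dataclass
class Edge:
    subject: int                # index of the subject node
    obj: int                    # index of the object node
    relation: List[str]         # preposition or verb words linking subject and object

@dataclass
class SyntaxGraph:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    def in_edges(self, n: int) -> List[Edge]:
        """Edges whose object node is n (the 'in' set)."""
        return [e for e in self.edges if e.obj == n]

    def out_edges(self, n: int) -> List[Edge]:
        """Edges whose subject node is n (the 'out' set)."""
        return [e for e in self.edges if e.subject == n]

    def degree(self, n: int) -> int:
        return len(self.in_edges(n)) + len(self.out_edges(n))

# Example graph for "a man playing a violin stands beside another man ...":
g = SyntaxGraph(
    nodes=[Node(0, ["a", "man"]), Node(1, ["a", "violin"]), Node(2, ["another", "man"])],
    edges=[Edge(0, 1, ["playing"]), Edge(0, 2, ["stands", "beside"])],
)
```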
S5, encoding each word into a word embedding vector, setting the initial phrase embedding feature of each node to the average of the embedding features of all words in the phrase, and learning the phrase features;
S6, fusing the phrase features with the image spatial features to obtain multi-modal features, and learning phrase-enhanced feature maps;
S7, for each edge, combining the subject node, the object node and the preposition or verb into a sequence, inputting the embedding vector of each word into a bidirectional LSTM, and obtaining the semantic feature by concatenating the hidden vectors of the forward and backward LSTMs;
S8, inputting the semantic features and multi-modal features of the subject nodes and the semantic features and multi-modal features of the object nodes into the relationship propagation module to obtain relationship-enhanced feature maps;
S9, performing node merging on each node to obtain the final relationship-enhanced feature map.
Specifically, steps S5-S9 above are the syntax graph propagation steps.
As an optional implementation of the embodiment of the present invention, performing node merging on each node to obtain the final relationship-enhanced feature map includes: combining the relationship-enhanced feature maps generated by the edges in the subject-node edge set and the object-node edge set with the phrase-enhanced feature map, and performing multiple iterations to obtain the final relationship-enhanced feature map.
As an optional implementation of the embodiment of the present invention, inputting the semantic features and multi-modal features of the subject node and the semantic features and multi-modal features of the object node into the relationship propagation module to obtain the relationship-enhanced feature maps includes:
computing the relationship-enhanced feature of the subject node by
R_sub = Conv(γ(M′_sub + Tile(g_obj)))
and computing the relationship-enhanced feature of the object node by
R_obj = Conv(γ(M′_obj + Tile(g_sub)))
[the formulas defining the intermediate quantities M′_sub, M′_obj, g_sub and g_obj are given only as equation images], where v_sub is the subject node, M_sub is the multi-modal feature map of the subject node, S_sub is the phrase-enhanced feature map of the subject node, v_obj is the object node, M_obj is the multi-modal feature map of the object node, S_obj is the phrase-enhanced feature map of the object node, and h is the semantic feature of the edge; Linear denotes multilayer fully connected layers, AvgPool denotes global average pooling, ⊙ denotes element-wise multiplication, Conv denotes a convolution layer, and γ denotes the ReLU activation function.
In specific implementation, in order to extract phrase features, each word is encoded into a word embedding vector, and the initial phrase embedding feature of each node is set to the average of the embedding features of all words in the phrase. For node v_n whose noun phrase is L_n, the phrase embedding feature is denoted w_n. The phrase feature w′_n is then learned through fully connected layers:
w′_n = L2(Linear(w_n))
where Linear denotes multilayer fully connected layers and L2 denotes L2 normalization. Then, by fusing the phrase feature w′_n with the spatial feature F, the multi-modal feature M_n is obtained, and the phrase-enhanced feature map S_n is learned at the same time:
M_n = L2(Conv([F;Tile(w′_n)]))
S_n = σ(Conv(Conv(F) + Tile(Fc(w′_n))))
where Tile denotes copying the vector into a W×H matrix, Conv denotes a convolution layer, Fc denotes a fully connected layer, and σ denotes the sigmoid activation function.
Here M_n is the multi-modal feature map and S_n is the feature map after local enhancement by the phrase feature.
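A hedged PyTorch sketch of this phrase-feature and fusion step is given below; the embedding size, channel widths, and the single-channel form of S_n are assumptions for illustration, not values taken from the patent:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseFusion(nn.Module):
    """Computes w'_n, M_n and S_n for one node; layer sizes are assumptions."""
    def __init__(self, embed_dim=300, feat_dim=260, hidden_dim=256):
        super().__init__()
        self.linear = nn.Sequential(                      # "Linear": multilayer fully connected
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim))
        self.fuse_conv = nn.Conv2d(feat_dim + hidden_dim, hidden_dim, 1)  # Conv([F; Tile(w'_n)])
        self.vis_conv = nn.Conv2d(feat_dim, hidden_dim, 1)                # inner Conv(F)
        self.fc = nn.Linear(hidden_dim, hidden_dim)                       # Fc(w'_n)
        self.score_conv = nn.Conv2d(hidden_dim, 1, 1)                     # outer Conv before sigmoid

    def forward(self, word_embeds: torch.Tensor, F_map: torch.Tensor):
        # word_embeds: (num_words, embed_dim); F_map: (feat_dim, H, W) spatial feature F
        w_n = word_embeds.mean(dim=0)                     # initial phrase embedding: word average
        w_prime = F.normalize(self.linear(w_n), dim=0)    # w'_n = L2(Linear(w_n))
        _, H, W = F_map.shape
        tiled = w_prime.view(-1, 1, 1).expand(-1, H, W)   # Tile(w'_n)
        M_n = F.normalize(                                # M_n = L2(Conv([F; Tile(w'_n)]))
            self.fuse_conv(torch.cat([F_map, tiled], dim=0).unsqueeze(0)), dim=1)
        s = self.vis_conv(F_map.unsqueeze(0)) + self.fc(w_prime).view(1, -1, 1, 1)
        S_n = torch.sigmoid(self.score_conv(s))           # S_n = sigma(Conv(Conv(F)+Tile(Fc(w'_n))))
        return M_n, S_n                                   # (1, hidden_dim, H, W), (1, 1, H, W)

# Usage: five word embeddings of "a man playing a violin" and a 260-channel spatial map.
M_n, S_n = PhraseFusion()(torch.randn(5, 300), torch.randn(260, 13, 13))
```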
Having obtained the multi-modal features of all nodes, the invention propagates them along the edges E. Specifically, for an edge e_k, the phrases of its subject node, its preposition or verb, and its object node are combined into one sequence, the embedding vector of each word is input into a bidirectional LSTM, and the semantic feature of the edge, denoted h_k, is obtained by concatenating the hidden vectors of the forward and backward LSTMs.
Then, the semantic features and the multi-modal features of the subject node and the object node of the edge are input into the relationship propagation module to obtain the relationship-enhanced feature maps of the two nodes, denoted R_sub and R_obj.
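The edge semantic feature h_k can be sketched as follows (PyTorch; embedding and hidden dimensions are assumed):
```python
import torch
import torch.nn as nn

class EdgeEncoder(nn.Module):
    """Encodes one edge's word sequence (subject phrase + relation word + object phrase)."""
    def __init__(self, embed_dim=300, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, word_embeds: torch.Tensor) -> torch.Tensor:
        # word_embeds: (seq_len, embed_dim) embedding vectors of the combined word sequence
        _, (h_n, _) = self.lstm(word_embeds.unsqueeze(0))   # h_n: (2, 1, hidden_dim)
        return torch.cat([h_n[0, 0], h_n[1, 0]], dim=0)     # concat forward and backward states

# Usage: embeddings of "a man" + "playing" + "a violin" give one 512-dimensional edge feature h_k.
h_k = EdgeEncoder()(torch.randn(5, 300))
```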
the relation propagation module respectively calculates the subject nodes and the object nodes: the following parameters were used: subject node vsubMulti-modal feature map MsubAnd phrase enhanced feature map SsubObject node vobjM of (A)objAnd SobjAnd semantic feature h of the edge;
wherein: relationship enhancement features for subject nodes
Figure GDA00026624024900000712
The calculation is as follows:
Figure GDA00026624024900000713
Figure GDA00026624024900000714
Rsub=Conv(γ(M′sub+Tile(gobj))),
relationship enhancing features for object nodes
Figure GDA00026624024900000715
Can be expressed as:
Figure GDA0002662402490000081
Figure GDA0002662402490000082
Robj=Conv(γ(M′obj+Tile(gsub))),
wherein Linear represents a multilayer fully-connected layer, AvgPool represents global average pooling,
Figure GDA0002662402490000083
representing the element multiplication, Conv representing the convolution layer, and gamma representing the ReLU activation function.
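The sketch below illustrates this module in PyTorch. Important caveat: only the final equations R_sub = Conv(γ(M′_sub + Tile(g_obj))) and R_obj = Conv(γ(M′_obj + Tile(g_sub))) are stated explicitly in the text; the definitions of M′ and g used here (gating by the phrase-enhanced map, adding a projected edge feature, and global average pooling followed by a fully connected layer) are assumptions built only from the operators the patent lists (Linear, AvgPool, ⊙, Conv, γ), not the patented formulas:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationPropagation(nn.Module):
    """One propagation step; only R = Conv(ReLU(M' + Tile(g_partner))) is taken from the text."""
    def __init__(self, feat_dim=256, edge_dim=512):
        super().__init__()
        self.edge_linear = nn.Linear(edge_dim, feat_dim)   # project the edge feature h (assumed)
        self.pool_linear = nn.Linear(feat_dim, feat_dim)   # "Linear" applied after AvgPool (assumed)
        self.out_conv = nn.Conv2d(feat_dim, feat_dim, 3, padding=1)

    def _modulate(self, M, S, h):
        # Assumed M': multi-modal map gated by its phrase-enhanced map plus the edge feature.
        return M * S + self.edge_linear(h).view(1, -1, 1, 1)

    def _summary(self, M_prime):
        # Assumed g: global average pooling followed by a fully connected layer.
        return self.pool_linear(F.adaptive_avg_pool2d(M_prime, 1).flatten(1))

    def forward(self, M_sub, S_sub, M_obj, S_obj, h):
        # M_*: (1, C, H, W) multi-modal maps; S_*: (1, 1, H, W) phrase-enhanced maps; h: (edge_dim,)
        Mp_sub, Mp_obj = self._modulate(M_sub, S_sub, h), self._modulate(M_obj, S_obj, h)
        g_sub, g_obj = self._summary(Mp_sub), self._summary(Mp_obj)
        # The equations stated explicitly in the text:
        R_sub = self.out_conv(F.relu(Mp_sub + g_obj.view(1, -1, 1, 1)))
        R_obj = self.out_conv(F.relu(Mp_obj + g_sub.view(1, -1, 1, 1)))
        return R_sub, R_obj
```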
Thereafter, node merging is carried out for each node v_n: the invention combines the relationship-enhanced feature maps generated by the edges in ε_in(v_n) and ε_out(v_n) to obtain the final relationship-enhanced feature map, and further combines it with the initial phrase-enhanced feature map S_n [the merging formulas are given only as equation images], where d_n denotes the degree of node v_n.
The above node merging process may be iterated multiple times; at each iteration, the merged enhanced feature map of the previous step replaces the phrase-enhanced feature map as the input to the relationship propagation module.
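Tying the previous sketches together, the following Python sketch (again an assumption, since the merging formulas themselves are not reproduced in the text) averages the relation-enhanced maps of a node's in and out edges by its degree d_n, combines the result with S_n, and iterates, feeding the merged map back in place of the phrase-enhanced map:
```python
import torch

def merge_node(relation_maps, S_n):
    """relation_maps: list of (1, C, H, W) relation-enhanced maps from the node's in/out edges."""
    d_n = max(len(relation_maps), 1)                        # node degree d_n
    R_n = torch.stack(relation_maps, dim=0).sum(dim=0) / d_n
    return R_n * S_n                                        # assumed combination with S_n

def propagate_iteratively(graph, propagation, node_feats, edge_feats, num_iters=2):
    """graph: SyntaxGraph sketch above; propagation: RelationPropagation sketch above;
    node_feats: dict n -> (M_n, S_n); edge_feats: dict k -> h_k from the EdgeEncoder sketch."""
    S = {n: s for n, (_, s) in node_feats.items()}
    for _ in range(num_iters):
        collected = {n: [] for n in node_feats}
        for k, e in enumerate(graph.edges):
            M_s, M_o = node_feats[e.subject][0], node_feats[e.obj][0]
            R_s, R_o = propagation(M_s, S[e.subject], M_o, S[e.obj], edge_feats[k])
            collected[e.subject].append(R_s)
            collected[e.obj].append(R_o)
        # the merged maps replace the phrase-enhanced maps as input to the next iteration
        S = {n: merge_node(maps, S[n]) if maps else S[n] for n, maps in collected.items()}
    return S
```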
S10, for each spatial position of the final relationship-enhanced feature map, matching 3 anchor boxes on the feature maps of multiple scales, selecting the detection box with the highest confidence as the output, and outputting the prediction result.
Specifically, the invention can use a structure similar to YOLOv3 for object position prediction. For each spatial position of the feature map, 3 anchor boxes are matched over the feature maps at multiple scales, and the detection box with the highest confidence is selected as the output.
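A simplified sketch of this prediction step (PyTorch; the head width and the omission of anchor-box decoding against anchor sizes and strides are assumptions) is:
```python
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    """YOLOv3-style head sketch: 3 anchors x (dx, dy, dw, dh, conf) per location."""
    def __init__(self, feat_dim=256, num_anchors=3):
        super().__init__()
        self.num_anchors = num_anchors
        self.pred = nn.Conv2d(feat_dim, num_anchors * 5, 1)

    def forward(self, relation_maps):
        best = None                                          # (confidence, raw box offsets)
        for fmap in relation_maps:                           # one relation-enhanced map per scale
            out = self.pred(fmap)                            # (1, 3*5, H, W)
            _, _, H, W = out.shape
            out = out.view(1, self.num_anchors, 5, H, W)
            conf = torch.sigmoid(out[0, :, 4])               # (3, H, W) confidences
            a, rem = divmod(int(conf.flatten().argmax()), H * W)
            y, x = divmod(rem, W)
            if best is None or conf[a, y, x] > best[0]:
                best = (conf[a, y, x], out[0, a, :4, y, x])  # keep the most confident detection box
        return best

# Usage: three scales of relation-enhanced maps -> the single highest-confidence box (raw offsets).
head = GroundingHead()
best = head([torch.randn(1, 256, s, s) for s in (13, 26, 52)])
```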
It can thus be seen that the semantic visual positioning method based on phrase relation propagation provided by the invention models arbitrary bidirectional and multiple semantic relations through a relation propagation mechanism, so that different targets in a scene can be accurately distinguished; it provides an effective analysis framework for complex directed graphs, achieving efficient modeling by decomposing multiple relations into several pairwise relations; it supports modeling of arbitrary grammatical relations, improving the ability to analyze complex scenes; and by breaking complex relations down into pairwise relations, the analysis difficulty is greatly reduced.
Fig. 2 is a schematic structural diagram of the semantic visual positioning apparatus based on phrase relation propagation according to an embodiment of the present invention. The method above is applied to this apparatus; only its structure is briefly explained below, and for everything else please refer to the related description of the semantic visual positioning method above. Referring to fig. 2, the semantic visual positioning apparatus based on phrase relation propagation according to an embodiment of the present invention includes:
the image feature extraction module is used for acquiring image information, processing the image information by using the depth CNN and extracting multi-scale features of the image information at a plurality of depths; acquiring position information of each feature; fusing the multi-scale features with the position information of each feature to obtain image space features;
the grammar graph building module is used for acquiring the character description information and generating a grammar graph by utilizing a syntax analysis tool, wherein the grammar graph comprises nodes and edges, the nodes comprise object nodes and subject nodes, each node corresponds to a word sequence in the character description information, the edges comprise an object node edge set and a subject node edge set, and each edge corresponds to a relationship between a subject and an object;
the grammar graph propagation module is used for coding each word into a word embedding vector, setting the initial phrase embedding characteristics of each node as the embedding characteristic average value of all words in the phrase and learning the phrase characteristics; fusing the phrase characteristics and the image space characteristics to obtain multi-modal characteristics, and learning a phrase enhanced characteristic diagram; combining a subject node, an object node and a preposition or verb of each edge of the multi-modal feature into a sequence, inputting an embedded vector of each word into a bidirectional LSTM, and obtaining semantic features by splicing hidden vectors of forward and reverse LSTM;
the relation propagation module is used for inputting the semantic features of the subject nodes, the multi-modal features of the subject nodes, the semantic features of the object nodes and the multi-modal features of the object nodes into the relation propagation module to obtain a relation-enhanced feature map; carrying out node combination on each node to obtain a final relation enhancement feature graph;
and the prediction module is used for matching 3 anchor point frames on the feature maps of multiple scales for each space position of the final feature map with enhanced relationship, selecting the detection frame with the highest confidence coefficient as output, and outputting a prediction result.
As an optional implementation of the embodiment of the present invention, the image feature extraction module acquires the position information of each feature as follows: it calculates, for each position, the ratios of its coordinates to the image width and height and the reciprocals of the width and height, and organizes all values into the same dimensionality as the multi-scale features to obtain the position information of each feature.
As an optional implementation of the embodiment of the present invention, the image feature extraction module fuses the multi-scale features with the position information of each feature as follows: it normalizes the multi-scale features and concatenates them with the position information of each feature.
As an optional implementation of the embodiment of the present invention, the relationship propagation module performs node merging on each node as follows to obtain the final relationship-enhanced feature map: it combines the relationship-enhanced feature maps generated by the edges in the subject-node edge set and the object-node edge set with the phrase-enhanced feature map, and performs multiple iterations to obtain the final relationship-enhanced feature map.
As an optional implementation of the embodiment of the present invention, the relationship propagation module obtains the relationship-enhanced feature maps from the semantic features and multi-modal features of the subject node and of the object node as follows:
the relationship propagation module is specifically configured to compute the relationship-enhanced feature of the subject node by
R_sub = Conv(γ(M′_sub + Tile(g_obj)))
and the relationship-enhanced feature of the object node by
R_obj = Conv(γ(M′_obj + Tile(g_sub)))
[the formulas defining the intermediate quantities M′_sub, M′_obj, g_sub and g_obj are given only as equation images], where v_sub is the subject node, M_sub is the multi-modal feature map of the subject node, S_sub is the phrase-enhanced feature map of the subject node, v_obj is the object node, M_obj is the multi-modal feature map of the object node, S_obj is the phrase-enhanced feature map of the object node, and h is the semantic feature of the edge; Linear denotes multilayer fully connected layers, AvgPool denotes global average pooling, ⊙ denotes element-wise multiplication, Conv denotes a convolution layer, and γ denotes the ReLU activation function.
It can thus be seen that the semantic visual positioning apparatus based on phrase relation propagation models arbitrary bidirectional and multiple semantic relations through a relation propagation mechanism, so that different targets in a scene can be accurately distinguished; it provides an effective analysis framework for complex directed graphs, achieving efficient modeling by decomposing multiple relations into several pairwise relations; it supports modeling of arbitrary grammatical relations, improving the ability to analyze complex scenes; and by breaking complex relations down into pairwise relations, the analysis difficulty is greatly reduced.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (4)

1. A semantic visual positioning method based on phrase relation propagation, characterized by comprising the following steps:
acquiring image information, processing the image information with a deep CNN, and extracting multi-scale features of the image information at a plurality of depths;
acquiring the position information of each feature;
fusing the multi-scale features with the position information of each feature to obtain image spatial features;
acquiring text description information and generating a syntax graph with a syntactic parsing tool, wherein the syntax graph comprises nodes and edges, the nodes comprise object nodes and subject nodes, each node corresponds to a word sequence in the text description information, the edges comprise an object-node edge set and a subject-node edge set, and each edge corresponds to a relation between a subject and an object;
encoding each word into a word embedding vector, setting the initial phrase embedding feature of each node to the average of the embedding features of all words in the phrase, and learning the phrase features;
fusing the phrase features with the image spatial features to obtain multi-modal features, and learning phrase-enhanced feature maps;
for each edge, combining the subject node, the object node and the preposition or verb into a sequence, inputting the embedding vector of each word into a bidirectional LSTM, and obtaining the semantic feature by concatenating the hidden vectors of the forward and backward LSTMs;
inputting the semantic features and multi-modal features of the subject nodes and the semantic features and multi-modal features of the object nodes into a relationship propagation module to obtain relationship-enhanced feature maps;
performing node merging on each node to obtain a final relationship-enhanced feature map;
for each spatial position of the final relationship-enhanced feature map, matching 3 anchor boxes on the feature maps of multiple scales, selecting the detection box with the highest confidence as the output, and outputting the prediction result;
wherein inputting the semantic features and multi-modal features of the subject nodes and the semantic features and multi-modal features of the object nodes into the relationship propagation module to obtain the relationship-enhanced feature maps comprises:
computing the relationship-enhanced feature of the subject node by
R_sub = Conv(γ(M′_sub + Tile(g_obj)))
and computing the relationship-enhanced feature of the object node by
R_obj = Conv(γ(M′_obj + Tile(g_sub)))
[the formulas defining the intermediate quantities M′_sub, M′_obj, g_sub and g_obj are given only as equation images], wherein v_sub is the subject node, M_sub is the multi-modal feature map of the subject node, S_sub is the phrase-enhanced feature map of the subject node, v_obj is the object node, M_obj is the multi-modal feature map of the object node, S_obj is the phrase-enhanced feature map of the object node, and h is the semantic feature of the edge; Linear denotes multilayer fully connected layers, AvgPool denotes global average pooling, ⊙ denotes element-wise multiplication, Conv denotes a convolution layer, and γ denotes the ReLU activation function;
and wherein performing node merging on each node to obtain the final relationship-enhanced feature map comprises: combining the relationship-enhanced feature maps generated by the edges in the subject-node edge set and the object-node edge set with the phrase-enhanced feature map, and performing multiple iterations to obtain the final relationship-enhanced feature map.
2. The method according to claim 1, characterized in that
acquiring the position information of each feature comprises:
calculating, for each position, the ratios of its coordinates to the image width and height and the reciprocals of the width and height, and organizing all values into the same dimensionality as the multi-scale features to obtain the position information of each feature;
and fusing the multi-scale features with the position information of each feature to obtain the image features comprises:
normalizing the multi-scale features and concatenating them with the position information of each feature.
3. A semantic visual positioning apparatus based on phrase relation propagation, characterized by comprising:
an image feature extraction module, configured to acquire image information, process the image information with a deep CNN, extract multi-scale features of the image information at a plurality of depths, acquire the position information of each feature, and fuse the multi-scale features with the position information of each feature to obtain image spatial features;
a syntax graph construction module, configured to acquire the text description information and generate a syntax graph with a syntactic parsing tool, wherein the syntax graph comprises nodes and edges, the nodes comprise object nodes and subject nodes, each node corresponds to a word sequence in the text description information, the edges comprise an object-node edge set and a subject-node edge set, and each edge corresponds to a relation between a subject and an object;
a syntax graph propagation module, configured to encode each word into a word embedding vector, set the initial phrase embedding feature of each node to the average of the embedding features of all words in the phrase, learn the phrase features, fuse the phrase features with the image spatial features to obtain multi-modal features, learn phrase-enhanced feature maps, and, for each edge, combine the subject node, the object node and the preposition or verb into a sequence, input the embedding vector of each word into a bidirectional LSTM, and obtain the semantic feature by concatenating the hidden vectors of the forward and backward LSTMs;
a relationship propagation module, configured to receive the semantic features and multi-modal features of the subject nodes and of the object nodes to obtain relationship-enhanced feature maps, and to perform node merging on each node to obtain a final relationship-enhanced feature map;
and a prediction module, configured to match 3 anchor boxes on the feature maps of multiple scales for each spatial position of the final relationship-enhanced feature map, select the detection box with the highest confidence as the output, and output the prediction result;
the relation propagation module inputs the semantic features of the subject nodes, the multi-modal features of the subject nodes, and the semantic features of the object nodes and the multi-modal features of the object nodes to the relation propagation module in the following way, so as to obtain a feature graph with enhanced relation:
the relationship propagation module is specifically configured to calculate a relationship enhancement feature of the subject node by the following formula:
Figure FDA0002961989970000031
Figure FDA0002961989970000032
Rsub=Conv(γ(M′sub+Tile(gobj))),
calculating the relationship enhancement characteristics of the object nodes by the following formula:
Figure FDA0002961989970000033
Figure FDA0002961989970000034
Robj=Conv(γ(M′obj+Tile(gsub))),
wherein v issubIs a subject node, MsubMultimodal feature map, S, being subject nodesubFeature graph, v, enhanced for subject node phrasesobjObject node, MobjMultimodal feature map and S for object nodesobjA phrase enhanced feature graph of object nodes and a semantic feature with h as an edge; linear stands for multilayer fully connected layers, AvgPool stands for global average pooling,
Figure FDA0002961989970000035
representing element multiplication, Conv representing convolution layer, and gamma representing ReLU activation function;
the relationship propagation module performs node combination on each node in the following mode to obtain a final relationship enhancement feature map: the relationship propagation module is specifically configured to combine the relationship enhancement feature map generated by combining the edges in the subject node edge set and the object node edge set with the phrase enhancement feature map, and perform multiple iterations to obtain a final relationship enhancement feature map.
4. The apparatus according to claim 3, characterized in that
the image feature extraction module acquires the position information of each feature as follows:
the image feature extraction module is specifically configured to calculate, for each position, the ratios of its coordinates to the image width and height and the reciprocals of the width and height, and to organize all values into the same dimensionality as the multi-scale features to obtain the position information of each feature;
and the image feature extraction module fuses the multi-scale features with the position information of each feature as follows to obtain the image features:
the image feature extraction module is specifically configured to normalize the multi-scale features and concatenate them with the position information of each feature.
CN202010736120.3A 2020-07-28 2020-07-28 Semantic visual positioning method and device based on phrase relation propagation Active CN111783475B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010736120.3A CN111783475B (en) 2020-07-28 2020-07-28 Semantic visual positioning method and device based on phrase relation propagation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010736120.3A CN111783475B (en) 2020-07-28 2020-07-28 Semantic visual positioning method and device based on phrase relation propagation

Publications (2)

Publication Number Publication Date
CN111783475A CN111783475A (en) 2020-10-16
CN111783475B true CN111783475B (en) 2021-05-11

Family

ID=72765045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010736120.3A Active CN111783475B (en) 2020-07-28 2020-07-28 Semantic visual positioning method and device based on phrase relation propagation

Country Status (1)

Country Link
CN (1) CN111783475B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112258564B (en) * 2020-10-20 2022-02-08 推想医疗科技股份有限公司 Method and device for generating fusion feature set
CN113191140B (en) * 2021-07-01 2021-10-15 北京世纪好未来教育科技有限公司 Text processing method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682059A (en) * 2015-11-11 2017-05-17 Adobe Inc. Structured knowledge modeling and extraction from images
CN110751220A (en) * 2019-10-24 2020-02-04 江西应用技术职业学院 Machine vision indoor positioning method based on improved convolutional neural network structure
CN111126175A (en) * 2019-12-05 2020-05-08 厦门大象东方科技有限公司 Facial image recognition algorithm based on deep convolutional neural network
CN111259768A (en) * 2020-01-13 2020-06-09 清华大学 Image target positioning method based on attention mechanism and combined with natural language
EP3667561A1 (en) * 2018-12-10 2020-06-17 HERE Global B.V. Method and apparatus for creating a visual map without dynamic content
CN111325279A (en) * 2020-02-26 2020-06-23 福州大学 Pedestrian and personal sensitive article tracking method fusing visual relationship

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521877B (en) * 2011-12-12 2014-04-16 北京航空航天大学 Method for reconstructing Chinese ancient building meaning model and component gallery from single image
US9251433B2 (en) * 2012-12-10 2016-02-02 International Business Machines Corporation Techniques for spatial semantic attribute matching for location identification
CN104735468B (en) * 2015-04-03 2018-08-31 北京威扬科技有限公司 A kind of method and system that image is synthesized to new video based on semantic analysis
US20180060487A1 (en) * 2016-08-28 2018-03-01 International Business Machines Corporation Method for automatic visual annotation of radiological images from patient clinical data
CN108737530A (en) * 2018-05-11 2018-11-02 深圳双猴科技有限公司 A kind of content share method and system
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN110609891B (en) * 2019-09-18 2021-06-08 合肥工业大学 Visual dialog generation method based on context awareness graph neural network
CN111259851B (en) * 2020-01-23 2021-04-23 清华大学 Multi-mode event detection method and device
CN111402260A (en) * 2020-02-17 2020-07-10 北京深睿博联科技有限责任公司 Medical image segmentation method, system, terminal and storage medium based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682059A (en) * 2015-11-11 2017-05-17 Adobe Inc. Structured knowledge modeling and extraction from images
EP3667561A1 (en) * 2018-12-10 2020-06-17 HERE Global B.V. Method and apparatus for creating a visual map without dynamic content
CN110751220A (en) * 2019-10-24 2020-02-04 江西应用技术职业学院 Machine vision indoor positioning method based on improved convolutional neural network structure
CN111126175A (en) * 2019-12-05 2020-05-08 厦门大象东方科技有限公司 Facial image recognition algorithm based on deep convolutional neural network
CN111259768A (en) * 2020-01-13 2020-06-09 清华大学 Image target positioning method based on attention mechanism and combined with natural language
CN111325279A (en) * 2020-02-26 2020-06-23 福州大学 Pedestrian and personal sensitive article tracking method fusing visual relationship

Also Published As

Publication number Publication date
CN111783475A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
Prakash et al. Neural paraphrase generation with stacked residual LSTM networks
CN108804530B (en) Subtitling areas of an image
US11972365B2 (en) Question responding apparatus, question responding method and program
WO2020140487A1 (en) Speech recognition method for human-machine interaction of smart apparatus, and system
CN108763535B (en) Information acquisition method and device
Mathur et al. Camera2Caption: a real-time image caption generator
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN111783475B (en) Semantic visual positioning method and device based on phrase relation propagation
US20240119268A1 (en) Data processing method and related device
CN116304748B (en) Text similarity calculation method, system, equipment and medium
Yang et al. Propagating over phrase relations for one-stage visual grounding
JP2023022845A (en) Method of processing video, method of querying video, method of training model, device, electronic apparatus, storage medium and computer program
CN111783457A (en) Semantic visual positioning method and device based on multi-modal graph convolutional network
Madhfar et al. Effective deep learning models for automatic diacritization of Arabic text
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Jie et al. Dependency-based hybrid trees for semantic parsing
Zhang et al. Chinese named-entity recognition via self-attention mechanism and position-aware influence propagation embedding
KR20220098628A (en) System and method for processing natural language using correspondence learning
WO2023169301A1 (en) Text processing method and apparatus, and electronic device
CN116776287A (en) Multi-mode emotion analysis method and system integrating multi-granularity vision and text characteristics
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
WO2023137903A1 (en) Reply statement determination method and apparatus based on rough semantics, and electronic device
CN114332288B (en) Method for generating text generation image of confrontation network based on phrase drive and network
CN113536797A (en) Slice document key information single model extraction method and system
Xie et al. Focusing attention network for answer ranking

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Semantic Visual Localization Method and Device Based on Phrase Relationship Propagation

Effective date of registration: 20231007

Granted publication date: 20210511

Pledgee: Guotou Taikang Trust Co.,Ltd.

Pledgor: SHENZHEN DEEPWISE BOLIAN TECHNOLOGY Co.,Ltd.

Registration number: Y2023980059614

PE01 Entry into force of the registration of the contract for pledge of patent right