CN110390340B - Feature coding model, training method and detection method of visual relation detection model

Feature coding model, training method and detection method of visual relation detection model

Info

Publication number
CN110390340B
CN110390340B CN201910650283.7A CN201910650283A
Authority
CN
China
Prior art keywords
target
feature
model
coding model
feature coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910650283.7A
Other languages
Chinese (zh)
Other versions
CN110390340A (en)
Inventor
朱艺
梁小丹
林倞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DMAI Guangzhou Co Ltd
Original Assignee
DMAI Guangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DMAI Guangzhou Co Ltd filed Critical DMAI Guangzhou Co Ltd
Priority to CN201910650283.7A priority Critical patent/CN110390340B/en
Publication of CN110390340A publication Critical patent/CN110390340A/en
Application granted granted Critical
Publication of CN110390340B publication Critical patent/CN110390340B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/422 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation for representing the structure of the pattern or shape of an object therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/19 Recognition using electronic means
    • G06V 30/196 Recognition using electronic means using sequential comparisons of the image signals with a plurality of references
    • G06V 30/1983 Syntactic or structural pattern recognition, e.g. symbolic string recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the technical field of visual relationship detection, and in particular to a feature coding model, a training method of a visual relationship detection model, and a detection method. The training method of the feature coding model comprises: obtaining an initial feature coding model; acquiring sample data; inputting each sample data into the initial feature coding model; extracting a guide map from visual common-sense data based on the categories; and training the initial feature coding model according to the guide map, adjusting the transformation matrices so as to update the target features of each target region and obtain the target feature code of each target region. Using the guide map associated with the categories in the visual common sense compensates, on the one hand, for the shortage of sample data, so that there is sufficient sample support when the target features are re-encoded; on the other hand, it ensures that relationship awareness is introduced when the target features are encoded, which provides conditions for the subsequent detection of visual relationships and can further improve the accuracy of visual relationship detection.

Description

Feature coding model, training method and detection method of visual relation detection model
Technical Field
The invention relates to the technical field of visual relationship detection, in particular to a feature coding model, a training method and a detection method of a visual relationship detection model.
Background
In recent years, deep learning has made breakthroughs in image recognition tasks (e.g., image classification, object detection, object segmentation, etc.). For a computer to understand a scene, an important component is visual relationship detection, that is, for an input picture, predicting the positions and categories of the target objects in the picture and the categories of the relationships between those targets.
The method commonly adopted for visual relationship detection is to encode the targets and the relationships and to predict the target categories and relationship categories through classifiers. Such methods usually use a recurrent neural network to gradually fuse the region features, so that each region feature finally incorporates the information of all other regions; the region features are then combined pairwise and input into a relationship classifier to obtain the final visual relationship prediction result.
The recurrent neural network model adopted in such detection methods needs to be trained in advance with a large amount of sample data, and the categories of visual relationships in real scenes often suffer from a serious imbalance problem: some common relationships (such as <person-wearing-jeans>) occur far more frequently than uncommon ones (such as <cat-sleeping-on-vehicle>). As a result, methods based on big-data learning fail in the prediction of uncommon relationships because sufficient samples cannot be obtained, which in turn affects the accuracy of visual relationship detection.
Disclosure of Invention
In view of this, embodiments of the present invention provide a feature coding model, a training method for a visual relationship detection model, and a detection method, so as to solve the problem that accuracy of visual relationship detection is low.
According to a first aspect, an embodiment of the present invention provides a method for training a feature coding model, including:
acquiring an initial feature coding model; wherein the initial feature coding model comprises at least one layer of cascaded multi-head attention modules, and the parameters of each multi-head attention module comprise a group of mutually independent transformation matrices;
acquiring sample data; wherein each sample data comprises a target feature of a target region in the sample image and a corresponding category;
inputting each sample data into the initial feature coding model;
extracting a guide map from visual common-sense data based on the categories; wherein the guide map is used to represent the visual common sense corresponding to the target categories;
and training the initial feature coding model according to the guide map, and adjusting the transformation matrices to update the target feature of each target region and obtain the target feature code of each target region.
In the training method of the feature coding model provided by the embodiment of the invention, the guide map that corresponds in the visual common sense to the categories of the target regions is added to the training of the feature coding model. Using the guide map associated with the categories in the visual common sense compensates, on the one hand, for the shortage of sample data, so that there is sufficient sample support when the target features are re-encoded; on the other hand, it ensures that relationship awareness is introduced when the target features are encoded, which provides conditions for the subsequent detection of visual relationships and can further improve the accuracy of visual relationship detection.
With reference to the first aspect, in a first implementation manner of the first aspect, the training the initial feature coding model according to the guide map and adjusting the transformation matrices to update the target feature of each target region and obtain the target feature code of each target region includes:
for each sample data, calculating an attention matrix of each sample image based on the transformation matrix and the target features; wherein the attention matrix is used for representing the attention of each target area to other target areas in the sample image;
combining the outputs of all the multi-head attention modules by using the transformation matrices and the attention matrix, and adding the target features to obtain the target feature code of each target region;
calculating a value of a loss function based on the target feature code and the guidance map;
and training the initial feature coding model by using the value of the loss function and a first learning rate, and adjusting the transformation matrices to update the target feature codes.
In the training method of the feature coding model provided by the embodiment of the invention, when the target regions are further encoded, the inter-region relationship information reflected by the attention matrix is used in the encoding. For each target region, encoding the context information of its related regions better helps to predict the category of that target region and the visual relationships it is involved in, so the accuracy of subsequent visual relationship detection can be improved.
With reference to the first embodiment of the first aspect, in a second embodiment of the first aspect, the loss function is defined as follows:
L_{attn} = \sum_{s_i \in S} \sum_{h} f(A_h, s_i)

wherein L_{attn} is the value of the loss function; S is the guide map sequence; s_i is the i-th guide map in the guide map sequence; f(·) is a loss function; h is the index of each multi-head attention module; and A_h is the attention matrix.
According to the training method of the feature coding model provided by the embodiment of the invention, the visual common sense is introduced as the learning of the feature coding, so that the learning of the explicit auxiliary visual relationship can be guided in the subsequent visual relationship detection, and conditions are provided for improving the accuracy of the visual relationship detection.
With reference to the second implementation manner of the first aspect, in a third implementation manner of the first aspect, the attention matrix and the target feature code are calculated by using the following formulas:

A_h(v_i, v_j) = \mathrm{softmax}_j\!\left(\frac{(W_Q^h v_i)^{\top}(W_K^h v_j)}{\sqrt{d}}\right)

\tilde{v}_i = v_i + \mathrm{Concat}_{h=1}^{H}\!\left(\sum_{j=1}^{N} A_h(v_i, v_j)\, W_V^h v_j\right)

wherein v_i, v_j are any two target features in the sample image; W_Q^h, W_K^h, W_V^h are a group of mutually independent transformation matrices; A_h(v_i, v_j) is the attention of target feature v_i to target feature v_j; d is the dimension of the target features; \tilde{v}_i is the target feature code corresponding to target feature v_i; and N is the number of target regions in the sample image.
With reference to the first implementation manner of the first aspect, in a fourth implementation manner of the first aspect, the training the initial feature coding model by using the value of the loss function and a first learning rate to update the transformation matrix includes:
calculating a first gradient estimate using the value of the loss function;
updating the transformation matrix using the first gradient estimate and the first learning rate.
According to the training method of the feature coding model provided by the embodiment of the invention, the conversion matrix is updated by adopting a gradient estimation method, so that the training efficiency of the feature coding model can be improved.
According to a second aspect, an embodiment of the present invention further provides a training method of a visual relationship detection model, including:
acquiring a target detection model; the target detection model is used for detecting target candidate regions in a second sample image, target characteristics of each target candidate region and corresponding categories of the target characteristics;
acquiring a feature coding model; wherein, the feature coding model is obtained by training according to the first aspect of the present invention or the training method of the feature coding model of any one of the first aspect; the feature coding model comprises a target feature coding model and/or a relation feature coding model; the input of the target feature coding model comprises the target features of the target candidate region and the word vectors corresponding to the categories, and the output is the target feature coding of the target candidate region; the input of the relation feature coding model comprises a target feature code of the target candidate region and a word vector corresponding to the category of the target feature code, and the output is the relation feature code of the target candidate region;
cascading the target detection model and the feature coding model to obtain an initial visual relationship detection model; wherein the feature coding model is connected with the output through a classification model;
training the initial visual relationship detection model based on a second learning rate, and adjusting parameters of the feature coding model to obtain a visual relationship detection model; wherein the second learning rate is less than a learning rate of training the feature coding model.
In the training method of the visual relationship detection model provided by the embodiment of the invention, the target feature coding model and the relationship feature coding model are both trained with the guide map. Using the guide map associated with the categories in the visual common sense compensates, on the one hand, for the shortage of sample data, so that there is sufficient sample support when the target features are re-encoded; on the other hand, it ensures that relationship awareness is introduced when the target features are encoded, so the accuracy can be improved when the model is subsequently used for visual relationship detection.
With reference to the second aspect, in a first implementation manner of the second aspect, the training the initial visual relationship detection model based on the second learning rate, and adjusting parameters of the target feature coding model and the relationship feature coding model to obtain a visual relationship detection model includes:
calculating a value of a loss function of the feature coding model;
calculating a second gradient estimate using the value of the loss function;
updating the transformation matrix in the feature coding model using the second gradient estimation and the second learning rate.
In the training method of the visual relationship detection model provided by the embodiment of the invention, the transformation matrices in the target feature coding model and the relationship feature coding model are fine-tuned with a second learning rate smaller than the first learning rate, which on the one hand ensures the accuracy of the transformation matrices and on the other hand, because a smaller learning rate is adopted, keeps the training efficient.
With reference to the second aspect, in a second embodiment of the second aspect, the feature coding model is a cascade of the target feature coding model and the relationship feature coding model; the target feature coding model is cascaded with the relationship feature coding model through a feature classification model, and the relationship feature coding model is connected to the output through a relationship classification model.
With reference to the first embodiment of the second aspect, in a third embodiment of the second aspect, the feature classification model is a first fully-connected layer, and the relationship classification model is a second fully-connected layer.
According to a third aspect, an embodiment of the present invention further provides a visual relationship detection method, including:
acquiring an image to be detected;
inputting the image to be detected into a visual relationship detection model to obtain the visual relationship of the image to be detected; wherein the visual relationship detection model is obtained by training according to the training method of the visual relationship detection model in the second aspect of the present invention or any one of the embodiments of the second aspect.
In the visual relationship detection method provided by the embodiment of the invention, visual common sense is introduced into the visual relationship detection model, and the guide map associated with the categories in the visual common sense compensates for the shortage of sample data, so that there is sufficient sample support when the target features are re-encoded; on the other hand, relationship awareness is introduced when the features are encoded, which improves the accuracy of visual relationship detection.
With reference to the third aspect, in a first implementation manner of the third aspect, the inputting the image to be detected into a visual relationship detection model to obtain the visual relationship of the image to be detected includes:
inputting the image to be detected into the target detection model, and outputting at least one target candidate region, a feature vector of each target candidate region and a category probability vector;
obtaining target feature codes of the target candidate regions by using the target feature coding model based on the feature vectors and the category probability vectors;
inputting the target feature codes into a feature classification model to obtain corresponding target category vectors;
obtaining a relation feature code of the target candidate region by using the relation feature coding model based on the target feature code and the target category vector;
and combining every two relation feature codes corresponding to all the target candidate regions, and inputting the relation feature codes into a relation classification model to obtain the visual relation of any two target candidate regions.
In the visual relationship detection method provided by the embodiment of the invention, the target feature coding model outputs relationship-aware feature codes for each target candidate region, and the feature classification model then predicts more accurate target categories; the relationship feature coding model likewise outputs relationship-aware feature codes, which further improves the accuracy of the predicted relationship categories.
With reference to the first implementation manner of the third aspect, in a second implementation manner of the third aspect, the obtaining, by using the target feature coding model, a target feature code of the target candidate region based on the feature vector and the class probability vector includes:
extracting a first word vector corresponding to the category with the highest probability in each category probability vector;
for each target candidate region, combining the feature vector, the first word vector and the full-image feature vector of the target candidate region to obtain a first combined feature vector of each target candidate region;
and sequentially inputting each first joint feature vector into a target feature coding model to obtain target feature codes of each target candidate region.
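For illustration only, the combination step described above might look like the following Python sketch; the patent does not fix the combination operation, so concatenation is assumed here, and all function and variable names are hypothetical.

```python
import torch

def first_joint_feature(region_feat, class_prob_vec, word_embeddings, full_image_feat):
    """Combine (here: concatenate) the region feature vector, the word vector of the
    most probable category, and the full-image feature vector of one candidate region."""
    best_class = int(torch.argmax(class_prob_vec))        # category with the highest probability
    first_word_vec = word_embeddings[best_class]          # its first word vector
    return torch.cat([region_feat, first_word_vec, full_image_feat], dim=-1)
```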
With reference to the first implementation manner of the third aspect, in the third implementation manner of the third aspect, the obtaining, by using the relational feature coding model, a relational feature code of the target candidate region based on the target feature code and the target category vector includes:
extracting a second word vector of a category with the highest score value in the target category vectors of each target candidate region;
for each target candidate region, combining the target feature coding vector and the second word vector to obtain a second combined feature vector of each target candidate region;
and sequentially inputting each second combined feature vector into a relational feature coding model to obtain a relational feature code of each target candidate region.
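Analogously, a hypothetical sketch of the second combined feature vector, again assuming that "combining" means concatenation:

```python
import torch

def second_joint_feature(target_code, target_cls_vec, word_embeddings):
    """Combine (concatenate) the target feature code of one candidate region with the
    word vector of its highest-scoring category."""
    best_class = int(torch.argmax(target_cls_vec))        # category with the highest score
    return torch.cat([target_code, word_embeddings[best_class]], dim=-1)
```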
According to a fourth aspect, an embodiment of the present invention further provides an electronic device, including:
a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, and the processor executing the computer instructions to perform a method for training a feature coding model according to the first aspect of the present invention or any of the embodiments of the first aspect, or perform a method for training a visual relationship detection model according to the second aspect of the present invention or any of the embodiments of the second aspect, or perform a method for visual relationship detection according to the third aspect of the present invention or any of the embodiments of the third aspect.
According to a fifth aspect, the embodiments of the present invention further provide a computer-readable storage medium, in which computer instructions are stored, the computer instructions being configured to cause the computer to perform the training method of the feature coding model according to the first aspect of the present invention or any one of the embodiments of the first aspect, or perform the training method of the visual relationship detection model according to the second aspect of the present invention or any one of the embodiments of the second aspect, or perform the visual relationship detection method according to the third aspect of the present invention or any one of the embodiments of the third aspect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow diagram of a method of training a feature coding model according to an embodiment of the invention;
FIG. 2 is an annotated sample image according to an embodiment of the invention;
FIG. 3 is a schematic illustration of visual common sense according to an embodiment of the present invention;
FIG. 4 is a guide diagram according to an embodiment of the present invention;
FIG. 5 is a relational feature diagram according to an embodiment of the invention;
FIG. 6 is a flow diagram of a method of training a feature coding model according to an embodiment of the present invention;
FIG. 7 is a block diagram of a multi-head attention module according to an embodiment of the invention;
FIG. 8 is a flow chart of a method of training a visual relationship detection model according to an embodiment of the invention;
FIG. 9 is a flow chart of a method of training a visual relationship detection model according to an embodiment of the invention;
FIG. 10 is a schematic diagram of a visual relationship detection model according to an embodiment of the invention;
FIG. 11 is a schematic diagram of a visual relationship detection model according to an embodiment of the invention;
FIG. 12 is a flow chart of a visual relationship detection method according to an embodiment of the invention;
FIG. 13 is a flow chart of a visual relationship detection method according to an embodiment of the invention;
FIG. 14 is a partial flow diagram of a visual relationship detection method according to an embodiment of the invention;
FIG. 15 is a partial flow diagram of a visual relationship detection method according to an embodiment of the invention;
FIG. 16 is data set information according to an embodiment of the present invention;
FIG. 17 is an evaluation of 3 visual tasks on VG and VG-MSDN in accordance with an embodiment of the invention;
FIG. 18 is an evaluation of 3 visual tasks on VG-MSDN and VG-DR-Net according to an embodiment of the invention;
FIG. 19 is an average per round training time and model parameters for a model according to an embodiment of the invention;
FIG. 20 is a block diagram of an apparatus for training a feature coding model according to an embodiment of the present invention;
FIG. 21 is a block diagram of a training apparatus for a visual relationship detection model according to an embodiment of the present invention;
fig. 22 is a block diagram of the structure of a visual relationship detection apparatus according to an embodiment of the present invention;
fig. 23 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, in the course of research on visual relationship detection, the inventors found that many visual relationship detection methods based on big-data learning fail in the prediction of unusual relationships, because the categories of visual relationships in real scenes suffer from a serious imbalance problem and sufficient samples of the unusual relationships cannot be obtained. Meanwhile, the inventors noticed that when humans recognize the visual relationships between targets in a scene, they can still give accurate predictions for unusual relationships, because humans accumulate their own visual common sense in daily life and can readily recognize the visual relationships or patterns contained in that common sense.
The visual common sense referred to in the present invention may be understood as a data table storing the relationships between targets under human visual common sense. The visual common-sense data are target relationships extracted from what humans see in the world, counted from a large data set. For example, on seeing the two targets "person" and "bike", visual common sense may suggest that the probability of the relationship "ride" is 0.7 and that of "next to" is 0.3; this provides candidate relationship categories when the target classification result of a picture contains "person" and "bike".
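As a purely illustrative sketch (not part of the claimed method), the visual common-sense data can be pictured as a simple lookup table; the category names and probabilities below are the hypothetical values from the example in the preceding paragraph.

```python
# A minimal sketch of visual common-sense data as a lookup table; in practice the
# table would be counted from a large annotated data set.
visual_common_sense = {
    ("person", "bike"): {"ride": 0.7, "next to": 0.3},
}

def candidate_relations(subject_cls, object_cls):
    """Return the prior relation distribution recorded for a pair of target categories."""
    return visual_common_sense.get((subject_cls, object_cls), {})

print(candidate_relations("person", "bike"))   # {'ride': 0.7, 'next to': 0.3}
```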
Therefore, the feature coding model referred to in the present application predicts the actual relationship category under the guidance of visual common sense and the specific target features, introducing relationship awareness on the basis of the existing feature coding and thereby encoding the features again. Specifically, the feature coding model provided in the embodiment of the present invention is a Progressive Knowledge-driven feature Transformation module (PKT), which performs relationship coding on the detected targets based on existing visual common sense, so as to help the model detect visual relationships, especially uncommon visual relationships.
In accordance with an embodiment of the present invention, there is provided an embodiment of a method for training a feature coding model, it is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
In this embodiment, a method for training a feature coding model is provided, which can be used in a mobile terminal such as a mobile phone or a tablet computer. Fig. 1 is a flowchart of a method for training a feature coding model according to an embodiment of the present invention; as shown in fig. 1, the flow includes the following steps:
and S11, acquiring an initial feature coding model.
Wherein the initial feature coding model comprises at least one layer of cascaded multi-head attention modules, and the parameters of each multi-head attention module comprise a group of mutually independent transformation matrices.
Specifically, the initial feature coding model includes at least one layer of cascade of multi-headed attention modules, for example, 3 layers, 4 layers, and the like, and the number of layers of the multi-headed attention modules in the cascade may be specifically set according to the actual situation. Taking cascade connection of 4 layers of multi-head attention modules as an example, the initial feature coding model is formed by cascade connection of a first layer of multi-head attention module, a second layer of multi-head attention module, a third layer of multi-head attention module and a fourth layer of multi-head attention module in sequence, and the output of the previous layer of multi-head attention module is connected with the input of the next layer of multi-head attention module.
For each multi-headed attention module, it may include 3, 4, or 5 attention modules, and so on. The number of the attention modules included in each multi-head attention module may be set according to specific situations, and is not limited herein.
The initial feature coding model acquired by the electronic device may be acquired from the outside or may be constructed in real time during training. The conversion matrix in each multi-head attention module may be obtained by random initialization, may also be obtained by pre-definition, and the like.
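For illustration, the structure just described could be sketched as follows; this assumes a PyTorch implementation, and the class names and default layer and head counts are illustrative choices, not the patented implementation.

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """One attention module holding a group of mutually independent transformation matrices."""
    def __init__(self, dim):
        super().__init__()
        # Randomly initialised transformation matrices, realised here as linear maps.
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)

class MultiHeadAttentionLayer(nn.Module):
    """One layer of the feature coding model: several independent attention modules."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList(AttentionHead(dim) for _ in range(num_heads))

class InitialFeatureCodingModel(nn.Module):
    """Cascade of multi-head attention layers; the output of each layer feeds the next."""
    def __init__(self, dim, num_layers=4, num_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            MultiHeadAttentionLayer(dim, num_heads) for _ in range(num_layers)
        )
```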
And S12, acquiring sample data.
Each sample data comprises a target feature of a target area in the sample image and a corresponding category.
The sample data acquired by the electronic device may be object features of each object region and categories corresponding to each object feature directly extracted from the labeled sample image. Wherein the category may be a plurality of categories, each category corresponding to a probability value; or the category with the highest probability among all categories, etc.
Optionally, the sample data may also be target areas in the sample image and categories corresponding to the target areas, which are manually marked; the target feature for the target area may be obtained by a target detector, may be obtained by other means, and so on.
And S13, inputting each sample data into the initial feature coding model.
In training the initial feature coding model, the electronic device needs to input the sample data obtained in S12 into the initial feature coding model.
S14, extracting the guide map from the visual common-sense data based on the categories.
Wherein the guide map is used to represent the visual common sense corresponding to the target categories.
After the electronic device obtains the category of each target region in the sample image in S12, the electronic device may extract a guide map corresponding to the category from the common sense of vision.
Specifically, referring to fig. 2, fig. 2 shows 4 objects marked on a sample image, which are bear, water, tree and branch. After the electronic device obtains the category corresponding to the target area, the corresponding guide map is extracted from the visual common sense. Therein, the visual common sense can be represented in the form of fig. 3, which characterizes all possible relations between the above 4 objects in the visual common sense. The electronic device extracts the guide map corresponding to fig. 2 from the visual sense in fig. 3 based on the visual sense in fig. 3 and the respective objects in the sample image in fig. 2, as shown in fig. 4.
In a subsequent step, the electronic device may detect the visual relationship of the target in fig. 2 based on the guide map of fig. 4 as shown in fig. 5. This will be described in detail below.
And S15, training the initial feature coding model according to the guide map, and adjusting the conversion matrix to update the target features of each target area to obtain the target feature codes of each target area.
When the electronic device trains the initial feature coding model, it takes the guide map as the training reference: after each round of training it calculates the loss function of the initial feature coding model to obtain the difference between the training result and the actual value, and then adjusts the transformation matrices based on this difference. After the electronic device adjusts the transformation matrices, the initial feature coding model is updated accordingly, that is, the output target feature code values are updated as well.
The electronic device trains the initial feature coding model with the guide map to adjust the transformation matrices, and finally obtains the target feature code of each target region.
In the training method of the feature coding model provided in this embodiment, the guide map that corresponds in the visual common sense to the categories of the target regions is added to the training of the feature coding model. Using the guide map associated with the categories in the visual common sense compensates, on the one hand, for the shortage of sample data, so that there is sufficient sample support when the target features are re-encoded; on the other hand, it ensures that relationship awareness is introduced when the target features are encoded, which provides conditions for the subsequent detection of visual relationships and further improves the accuracy of visual relationship detection.
In this embodiment, a method for training a feature coding model is provided, which can be used in a mobile terminal such as a mobile phone or a tablet computer. Fig. 6 is a flowchart of a method for training a feature coding model according to an embodiment of the present invention; as shown in fig. 6, the flow includes the following steps:
and S21, acquiring an initial feature coding model.
Wherein the initial feature coding model comprises at least one layer of cascaded multi-head attention modules, and the parameters of each multi-head attention module comprise a group of mutually independent transformation matrices.
Specifically, as shown in fig. 7, fig. 7 shows one layer of multi-head attention modules; the overall feature coding model may be a cascade of 2, 3 or more of the multi-head attention modules of fig. 7.
For each layer of multi-head attention modules, the number of attention modules it contains is H; fig. 7 is illustrated with H being 3. The input to the multi-head attention module is the feature v_i. Each attention module has 3 transformation matrices, and the initial transformation matrices may be denoted W_Q^h, W_K^h and W_V^h. The output of each layer of multi-head attention modules is the concatenation (Concat) of the outputs of all the attention modules in that layer.
And S22, acquiring sample data.
Each sample data comprises a target feature of a target area in the sample image and a corresponding category.
Please refer to S12 in fig. 1, which is not described herein again.
And S23, inputting each sample data into the initial feature coding model.
Please refer to S13 in fig. 1, which is not described herein again.
S24, extracting the guide map from the visual common-sense data based on the categories.
Wherein the guide map is used to represent the visual common sense corresponding to the target categories.
Please refer to S14 in fig. 1, which is not described herein again.
And S25, training the initial feature coding model according to the guide map, and adjusting the conversion matrix to update the target features of each target area to obtain the target feature codes of each target area.
Specifically, S25 may include the steps of:
and S251, for each sample data, calculating an attention matrix of each sample image based on the conversion matrix and the target characteristics.
Wherein the attention matrix is used for representing the attention of each target area to other target areas in the sample image.
Specifically, if the sample image has N target regions, the attention matrix corresponding to the sample image is an N × N matrix, in which each element represents the attention of one target region to another, i.e., the association between the regions. For example, the element at position (i, j) in the attention matrix represents the attention of v_i to v_j, i.e., the correlation between the two target regions.
As shown in fig. 7, the attention matrix can be calculated using the following equation:

A_h(v_i, v_j) = \mathrm{softmax}_j\!\left(\frac{(W_Q^h v_i)^{\top}(W_K^h v_j)}{\sqrt{d}}\right)

wherein v_i, v_j are any two target features in the sample image; W_Q^h, W_K^h, W_V^h are a group of mutually independent transformation matrices; A_h(v_i, v_j) is the attention of target feature v_i to target feature v_j; and d is the dimension of the target features.
and S252, combining the outputs of all the multi-head attention modules by using the conversion matrix and the attention matrix, and adding the target features to obtain the target feature code of each target region.
Specifically, the target feature code is calculated by the following formula:

\tilde{v}_i = v_i + \mathrm{Concat}_{h=1}^{H}\!\left(\sum_{j=1}^{N} A_h(v_i, v_j)\, W_V^h v_j\right)

wherein \tilde{v}_i is the target feature code corresponding to the target feature v_i; N is the number of target regions in the sample image; and H is the number of attention modules in each multi-head attention module.
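For illustration, a minimal sketch of how one layer could compute the attention matrices A_h and the updated codes is given below. It assumes a PyTorch implementation, uses the W_Q, W_K, W_V notation of the reconstruction above, and assumes that each attention module maps the features to dimension d/H so that the concatenated head outputs can be added back to the input; it is a sketch, not the patented implementation.

```python
import math
import torch

def multi_head_layer(v, W_q, W_k, W_v):
    """
    v:   (N, d) target features of the N target regions of one sample image.
    W_q, W_k, W_v: lists of H matrices, one triple per attention module; each matrix
                   is assumed to map d -> d/H so that concatenating the H head
                   outputs yields a d-dimensional vector.
    Returns the per-head attention matrices A_h and the updated target feature codes.
    """
    N, d = v.shape
    attn_per_head, head_outputs = [], []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        q, k, val = v @ Wq.T, v @ Wk.T, v @ Wv.T       # transformed features, (N, d/H) each
        scores = (q @ k.T) / math.sqrt(d)              # raw pairwise attention, (N, N)
        A_h = torch.softmax(scores, dim=-1)            # attention of v_i to every v_j
        attn_per_head.append(A_h)
        head_outputs.append(A_h @ val)                 # each region gathers its related regions
    v_code = v + torch.cat(head_outputs, dim=-1)       # combine all heads, add the residual features
    return attn_per_head, v_code
```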
And S253, calculating the value of the loss function based on the target feature code and the guide map.
The connections in the attention matrix are constrained by introducing a guide map, where the guide map of each picture is obtained by taking a sub-graph from the visual common-sense data according to the predicted target categories in that picture. The guide map is a directed graph with the target regions as nodes and the inter-region relationships as edges. The guide map influences the attention modules in the form of external supervision, with the loss function defined as follows:

L_{attn} = \sum_{s_i \in S} \sum_{h} f(A_h, s_i)

wherein L_{attn} is the value of the loss function; S is the guide map sequence; s_i is the i-th guide map in the guide map sequence; f(·) is a loss function; h is the index of each multi-head attention module; and A_h is the attention matrix. It should be noted that the specific form of the loss function f(·) is not limited at all and may be set according to the actual situation.
Specifically, the guide map is obtained by extracting a sub-graph from the visual common sense according to the categories of the current targets. In the actual calculation, the visual common-sense data is a matrix representing the relationship between each target category and every other target category. The guide map is constructed by extracting the rows or columns of the corresponding categories from the visual common-sense matrix according to the currently predicted target categories, forming a new matrix that expresses the relationships between the targets.
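A small sketch of the two operations just described, extracting the rows and columns of the predicted categories from the common-sense matrix and penalising attention matrices that deviate from the resulting guide map, is given below. The choice of f(·) as a squared (L2) distance is an assumption, since the patent leaves f(·) unspecified, and all names are hypothetical.

```python
import torch

def extract_guide_map(common_sense, category_ids):
    """
    common_sense: (C, C) matrix giving the relation strength between every pair of
                  target categories (the visual common-sense data).
    category_ids: predicted category index of each of the N target regions.
    Returns the (N, N) guide map relating the regions of this image.
    """
    idx = torch.as_tensor(category_ids)
    return common_sense[idx][:, idx]        # rows and columns of the predicted categories

def attention_loss(attn_per_head, guide_map):
    """L_attn summed over attention heads; f is taken to be a squared distance (an assumption)."""
    return sum(torch.sum((A_h - guide_map) ** 2) for A_h in attn_per_head)
```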
And S254, training the initial feature coding model by using the value of the loss function and the first learning rate, and adjusting the transformation matrices to update the target feature codes.
After obtaining the value of the loss function in S253, the electronic device trains the initial feature coding model. The training may use a gradient descent method (e.g., the stochastic gradient descent method) or other methods. The purpose of training the initial feature coding model is to adjust the transformation matrices so that the target feature codes calculated in S252 are updated.
The iteration stop condition for training the initial feature coding model may be that the transformation matrices no longer change, or a fixed number of iterations may be set, and so on.
For example, S254 may be implemented as follows:
(1) using the value of the loss function, a first gradient estimate is calculated.
Here, the stochastic gradient descent method is used to calculate the gradient estimate from the difference between the attention matrix and the guide map.
(2) The conversion matrix is updated using the first gradient estimate and the first learning rate.
And on the basis of the first gradient estimation, combining the first learning rate to obtain an updated conversion matrix.
As a specific application example of this embodiment, the feature coding model may be trained with the stochastic gradient descent (SGD) algorithm, using a first learning rate of 0.001 and a batch size of 6; the number of layers of multi-head attention modules in the feature coding model is 4, the number of attention modules in each layer is 4, and training runs for no more than 15 rounds, until the model parameters are no longer updated.
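Under the hyper-parameters listed above (SGD, learning rate 0.001, batch size 6, 4 layers of 4 attention modules, at most 15 rounds), the training loop might look roughly like the following sketch, which reuses the multi_head_layer and attention_loss helpers sketched earlier; a single randomly generated stand-in image replaces the real data, and a full model would stack 4 such layers.

```python
import torch

d, N, H = 512, 10, 4
W_q = [torch.randn(d // H, d, requires_grad=True) for _ in range(H)]
W_k = [torch.randn(d // H, d, requires_grad=True) for _ in range(H)]
W_v = [torch.randn(d // H, d, requires_grad=True) for _ in range(H)]
optimizer = torch.optim.SGD(W_q + W_k + W_v, lr=0.001)    # SGD with the first learning rate

for epoch in range(15):                                   # train no more than 15 rounds
    # Each batch would hold 6 sample images; one random stand-in image is used here.
    target_features = torch.randn(N, d)
    guide_map = torch.rand(N, N)                           # stand-in for the extracted guide map
    attn_per_head, codes = multi_head_layer(target_features, W_q, W_k, W_v)
    loss = attention_loss(attn_per_head, guide_map)
    optimizer.zero_grad()
    loss.backward()                                        # first gradient estimate from the loss value
    optimizer.step()                                       # update the transformation matrices
```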
In the training method of the feature coding model provided in this embodiment, when the target regions are further encoded, the inter-region relationship information reflected by the attention matrix is used in the encoding; for each target region, encoding the context information of its related regions better helps to predict the category of the target region and the visual relationships it is involved in, so the accuracy of subsequent visual relationship detection can be improved.
It should be noted that the progressive knowledge-driven feature transformation module (PKT) provided by the present invention plays an important role as a relationship-aware feature encoder in both the target classification model and the relationship classification model. In fact, PKT can act as a feature transcoder on any feature set with a certain degree of correlation. In general, for a set of input features of picture I, PKT first learns a group of mutually independent transformation matrices; each v_i attends to its related regions v_j according to the attention matrix A; by combining the results of the multi-head attention and adding the residual v_i, an updated feature code is obtained.
Meanwhile, the connections in the attention matrix are constrained by introducing a guide map, and the guide map of each picture is obtained by taking a sub-graph from the visual common-sense data according to the predicted target categories in that picture. The guide map is a directed graph with the target regions as nodes and the inter-region relationships as edges. The guide map may influence the attention modules in the form of external supervision.
In the further encoding of the region features, the information about the relationships between regions is added to the encoding; it is believed that, for each region feature, encoding the context information of its related regions better helps to predict the region category and the visual relationships it is involved in. Visual common sense is introduced as a guide to explicitly assist the learning of visual relationships, rather than learning the relationships implicitly from the training of region classification and relationship classification.
In accordance with an embodiment of the present invention, there is provided an embodiment of a method for training a visual relationship detection model. It is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that, while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one here.
In this embodiment, a training method of a visual relationship detection model is provided, which can be used in the above-mentioned mobile terminal, such as a mobile phone, a tablet computer, and the like, fig. 8 is a flowchart of the training method of the visual relationship detection model according to the embodiment of the present invention, and as shown in fig. 8, the flowchart includes the following steps:
and S31, acquiring the target detection model.
The target detection model is used for detecting target candidate regions in a second sample image, target features of each target candidate region and corresponding categories of the target features.
The target detection model may be constructed based on the fast RCNN framework of VGGNet, may also be constructed by other frameworks, and the like, and only the target detection model needs to be ensured to be able to detect the target candidate region in the second sample image, the target feature of each target candidate region, and the category corresponding to each target feature.
Taking a target detection model constructed with the Fast RCNN framework based on VGGNet as an example: the target detection model is trained with the stochastic gradient descent algorithm SGD, using a learning rate of 0.001 and a batch size of 6, for 50 rounds of parallel training on 3 NVIDIA GeForce GTX 1080Ti GPUs. The size of the second sample image is 592 × 592, and among the output results the 64 highest-scoring target detection boxes of each picture are selected as the target candidate regions.
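The last step mentioned above, keeping only the 64 highest-scoring detection boxes of each picture as target candidate regions, could be sketched as follows (names hypothetical):

```python
import torch

def select_candidate_regions(boxes, scores, k=64):
    """Keep the k highest-scoring detection boxes of one picture as target candidate regions."""
    k = min(k, scores.numel())
    top = torch.topk(scores, k).indices
    return boxes[top], scores[top]
```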
And S32, acquiring the feature coding model.
Wherein, the feature coding model is obtained by training according to the training method of the feature coding model in any one of the above embodiments; the feature coding model comprises a target feature coding model and/or a relation feature coding model; the input of the target feature coding model comprises the target features of the target candidate region and the word vectors corresponding to the categories, and the output is the target feature coding of the target candidate region; the input of the relation feature coding model comprises a target feature code of the target candidate region and a word vector corresponding to the category of the target feature code, and the output is the relation feature code of the target candidate region.
Specifically, for the visual relationship detection model, the included feature coding model may be only the target feature coding, only the relationship feature coding, or both the target feature coding and the relationship feature coding.
Hereinafter, the visual relationship detection model is described in detail by taking as an example a model that includes both a target feature coding model and a relationship feature coding model. The training methods of the target feature coding model and the relationship feature coding model are the same as the training method of the feature coding model in the above embodiments; they differ only in the sample data used, and hence in the resulting transformation matrices.
For the target feature coding model, the function is that feature codes with relation perception are added into input features to obtain target feature codes; for the relational feature coding model, the role is to add the relational perception again to the features of the input to obtain the relational coding.
As described above, the input of the target feature coding model in the visual relationship detection model is the target feature of the target candidate region output by the target detection model in S31, the word vector corresponding to the category thereof, and the full map feature of the target candidate region; and outputting the target feature codes corresponding to the target candidate regions.
The input of the relationship feature coding model in the visual relationship detection model is the target feature codes, together with the word vectors corresponding to the categories obtained after the target feature codes are classified; the output is the relationship feature code of each target candidate region.
And S33, cascading the target detection model and the feature coding model to obtain an initial visual relationship detection model.
Wherein the feature coding model is connected to the output by a classification model.
After the target detection model and the feature coding model are acquired in S31 and S32, the two models are cascaded and an output is connected after the feature coding model, thereby obtaining an initial visual relationship detection model. The feature coding model may be connected to the output layer through a fully connected layer.
And S34, training the initial visual relationship detection model based on the second learning rate, and adjusting parameters of the feature coding model to obtain the visual relationship detection model.
Wherein the second learning rate is less than a learning rate of a training feature coding model.
After the electronic device constructs the initial visual relationship detection model in S33, the parameters of the target detection model obtained in S31 are fixed, and only the parameters of the feature coding model are adjusted when the initial visual relationship detection model is trained.
In the training method of the visual relationship detection model provided in this embodiment, the target feature coding model and the relationship feature coding model are both trained with the guide map. Using the guide map associated with the categories in the visual common sense compensates, on the one hand, for the shortage of sample data, so that there is sufficient sample support when the target features are re-encoded; on the other hand, it ensures that relationship awareness is introduced when the target features are encoded, so the accuracy can be improved when the model is subsequently used for visual relationship detection.
In this embodiment, a training method of a visual relationship detection model is provided, which can be used in the above-mentioned mobile terminal, such as a mobile phone, a tablet computer, and the like, fig. 9 is a flowchart of the training method of the visual relationship detection model according to the embodiment of the present invention, and as shown in fig. 9, the flowchart includes the following steps:
and S41, acquiring the target detection model.
The target detection model is used for detecting target candidate regions in a second sample image, target features of each target candidate region and corresponding categories of the target features.
Please refer to S31 in fig. 8, which is not described herein.
And S42, acquiring the feature coding model.
Wherein, the feature coding model is obtained by training according to the training method of the feature coding model in any one of the above embodiments; the feature coding model comprises a target feature coding model and/or a relation feature coding model; the input of the target feature coding model comprises the target features of the target candidate region and the word vectors corresponding to the categories, and the output is the target feature coding of the target candidate region; the input of the relation feature coding model comprises a target feature code of the target candidate region and a word vector corresponding to the category of the target feature code, and the output is the relation feature code of the target candidate region.
Please refer to S32 in fig. 8, which is not described herein.
And S43, cascading the target detection model and the feature coding model to obtain an initial visual relationship detection model.
Wherein the feature coding model is connected to the output by a classification model.
As shown in fig. 10, the visual relationship detection model sequentially includes a target detection model, a target feature coding model, a feature classification model, a relationship feature coding model, and a relationship classification model. The feature classification model is used for classifying the target feature codes output by the target feature coding model, so that the class of the target detection region output by the target detection model is further identified. The function of the relational classification model is to output a relational class to the two inputted relational feature codes to characterize the relationship between the two relational feature codes, or to characterize the relationship between two target detection regions.
And S44, training the initial visual relationship detection model based on the second learning rate, and adjusting parameters of the feature coding model to obtain the visual relationship detection model.
Wherein the second learning rate is less than a learning rate of training the feature coding model.
Specifically, S44 may include the steps of:
s441, a value of a loss function of the feature coding model is calculated.
As shown in fig. 10, the feature coding model includes a target feature coding model and a relation feature coding model, and then the loss function of the feature coding model has a value of the sum of the loss functions of the two coding models. Please refer to S25 in the embodiment shown in fig. 6, which is not repeated herein.
At S442, a second gradient estimate is calculated using the value of the loss function.
The electronic device may use a random gradient descent method to calculate the second gradient estimate when training the initial visual relationship detection model, which may be referred to as S254 in the embodiment shown in fig. 6.
S443, the transformation matrix in the feature coding model is updated using the second gradient estimation and the second learning rate.
Please refer to S254 of the embodiment shown in fig. 6, which is not described herein again.
As an alternative implementation of this embodiment, the visual relationship detection model may be trained as follows. The Fast RCNN framework based on VGGNet is used as the target detection model; the input picture size of the target detection model is 592 × 592, and the 64 highest-scoring target detection boxes are selected from each picture. The target detection model is trained with the stochastic gradient descent algorithm SGD, using a learning rate of 0.001 and a batch size of 6, for 50 rounds of parallel training on 3 NVIDIA GeForce GTX 1080Ti GPUs. After the training of the target detection model is finished, its parameters are fixed; region features are then extracted from the ground-truth detection boxes and input into the target feature coding model and the relationship feature coding model, both of which are progressive knowledge-driven feature transformation models and may be called PKT. The target feature coding model may be called PKT_obj, and the relationship feature coding model PKT_rel. The SGD algorithm is also used in this stage of training, with a learning rate of 0.001 and a batch size of 6; the number of layers of both PKTs is 4, the number of attention heads is 4, and training generally runs for no more than 15 rounds, until the model parameters are no longer updated. The reason why ground-truth detection boxes rather than the boxes predicted by the target detection model are used at this stage is that the predictions of the target detection model contain some errors; the region features and target categories of wrongly predicted boxes are not very accurate, so using them to train the two PKTs would introduce more data noise and affect the training of the relationship coding of the PKT models. After this training is finished, the parameters of the PKT models are fine-tuned on the target boxes predicted by the target detection model with a smaller learning rate of 0.0001, until the model parameters are no longer updated, generally for no more than 15 rounds.
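The staged procedure described in this paragraph can be summarised by the following sketch. Only the frozen detector, the ground-truth versus predicted boxes, and the switch from a 0.001 to a 0.0001 learning rate are taken from the text; the train_stage callable and the module objects are hypothetical placeholders.

```python
import torch

def train_visual_relation_model(target_detection_model, pkt_obj, pkt_rel, train_stage):
    """Sketch of the staged optimisation; train_stage runs one training phase."""
    # Stage 1: the detector is already trained, so freeze its parameters.
    for p in target_detection_model.parameters():
        p.requires_grad = False

    pkt_params = list(pkt_obj.parameters()) + list(pkt_rel.parameters())

    # Stage 2: train PKT_obj and PKT_rel on ground-truth boxes with lr = 0.001.
    train_stage(boxes="ground_truth",
                optimizer=torch.optim.SGD(pkt_params, lr=0.001),
                max_rounds=15)

    # Stage 3: fine-tune the same parameters on predicted boxes with a smaller lr = 0.0001.
    train_stage(boxes="predicted",
                optimizer=torch.optim.SGD(pkt_params, lr=0.0001),
                max_rounds=15)
```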
In the training method for the visual relationship detection model provided by this embodiment, the transformation matrices in the target feature coding model and the relation feature coding model are fine-tuned with a second learning rate that is smaller than the first learning rate. On the one hand this preserves the accuracy of the transformation matrices; on the other hand, because only a small adjustment of already-trained parameters is required, training remains efficient.
As an optional implementation of this embodiment, the feature coding model is the cascade of the target feature coding model and the relation feature coding model: the target feature coding model is cascaded with the relation feature coding model through a feature classification model, and the relation feature coding model is connected to the output through a relation classification model.
Further optionally, the feature classification model is a first fully-connected layer, and the relationship classification model is a second fully-connected layer.
As a specific application example of this embodiment, fig. 11 shows a schematic diagram of a visual relationship detection model, which sequentially includes: a target detection model, which predicts the target detection regions of an input image; a target feature coding model PKT_obj, which introduces the guide map on top of the output of the target detection model so as to output relation-aware target features; a classifier connected to the output of PKT_obj, which classifies the target features output by PKT_obj to predict their categories; a relation feature coding model PKT_rel, which reintroduces the guide map on top of the input relation-aware target features and target category codes to output relation features; and a classifier connected to the output of PKT_rel, which classifies the pairwise-combined relation features to determine the relationship between two target detection regions.
In accordance with an embodiment of the present invention, an embodiment of a visual relationship detection method is provided. It is noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system capable of executing a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in a different order than here.
In this embodiment, a visual relationship detection method is provided, which can be used in the above-mentioned mobile terminals, such as mobile phones and tablet computers. Fig. 12 is a flowchart of the visual relationship detection method according to an embodiment of the present invention; as shown in fig. 12, the flow includes the following steps:
and S51, acquiring an image to be detected.
And S52, inputting the image to be detected into the visual relation detection model to obtain the visual relation of the image to be detected.
The visual relationship detection model is obtained by training according to the third aspect of the present invention or the training method of the visual relationship detection model according to any one of the embodiments of the third aspect.
Please refer to fig. 8 or fig. 9 for details of the embodiment, which will not be described herein.
According to the visual relationship detection method provided by this embodiment of the invention, because visual common sense is introduced into the visual relationship detection model, the category-related guide map drawn from visual common sense compensates for the shortage of sample data, so that re-encoding the target features is adequately supported even with limited samples; on the other hand, relation awareness is introduced when the features are encoded, which improves the accuracy of visual relationship detection.
In this embodiment, a visual relationship detection method is provided, which can be used in the above-mentioned mobile terminals, such as mobile phones and tablet computers. Fig. 13 is a flowchart of the visual relationship detection method according to an embodiment of the present invention; as shown in fig. 13, the flow includes the following steps:
and S61, acquiring an image to be detected.
And S62, inputting the image to be detected into the visual relation detection model to obtain the visual relation of the image to be detected.
The visual relationship detection model is obtained by training according to the third aspect of the present invention or the training method of the visual relationship detection model according to any one of the embodiments of the third aspect.
This step is described in detail below with reference to the visual relationship detection model shown in fig. 11. First, a structured graph representation is defined for the visual relationships in an image I, which comprises a set of target candidate regions B = {b_1, ..., b_n}, the corresponding target classes O = {o_1, ..., o_n}, and the relationships between targets R = {r_1, ..., r_m}. As shown in fig. 11, the visual relationship detection model is decomposed into three models: a target detection model P(B|I), a target classification model P(O|B, I), and a relation classification model P(R|O, B, I).
Specifically, the steps include:
s621, inputting the image to be detected into the target detection model, and outputting at least one target candidate region, the feature vector of each target candidate region and the category probability vector.
The electronic apparatus inputs the image to be detected acquired in S61 into a target detection model that predicts target candidate regions in the image to be detected, feature vectors of the respective target candidate regions, and category probability vectors.
For example, target region detection is performed with a conventional target detector (e.g., Faster R-CNN). For an input picture I, the output of the target detector is a set of target candidate regions B = {b_1, ..., b_n}, and for each b_i ∈ B it outputs a feature vector f_i and a class probability vector l_i.
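To make the detector interface concrete, the following sketch wraps the per-region outputs (candidate box b_i, feature vector f_i, class probability vector l_i) in a small data structure; the class and field names are our own, and Faster R-CNN is only assumed to expose its outputs in this shape.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class RegionOutput:
    """Detector output for one target candidate region b_i."""
    box: np.ndarray          # shape (4,): candidate region b_i as (x1, y1, x2, y2)
    feature: np.ndarray      # shape (d,): region feature vector f_i
    class_probs: np.ndarray  # shape (C,): class probability vector l_i

def wrap_detector_output(boxes, features, class_probs) -> List[RegionOutput]:
    # boxes: (n, 4), features: (n, d), class_probs: (n, C), e.g. from Faster R-CNN
    return [RegionOutput(b, f, l) for b, f, l in zip(boxes, features, class_probs)]
```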
And S622, obtaining target feature codes of the target candidate regions by using a target feature coding model based on the feature vectors and the category probability vectors.
The electronic device uses the target feature coding model to obtain relation-aware target features from this input; for the specific structure of the target feature coding model, please refer to S32 in the embodiment shown in fig. 8, which is not repeated here. The specific implementation of this step is described in detail below.
S623, inputting the target feature codes into the feature classification model to obtain corresponding target class vectors.
And inputting the target characteristics with the relation perception into the characteristic classification model, and predicting the category of the input target characteristics by using the characteristic classification model.
And S624, obtaining the relation feature codes of the target candidate regions by using a relation feature coding model based on the target feature codes and the target category vectors.
And combining the target features with the relation perception with the target category codes, and inputting the combined target features and the target category codes into a relation feature coding model to output the relation feature codes of the target candidate regions. Details of the specific implementation of this step will be described in detail below.
And S625, combining every two relation feature codes corresponding to all target candidate regions, and inputting the combined relation feature codes into a relation classification model to obtain the visual relation of any two target candidate regions.
The electronic device inputs the relation feature codes obtained in S624 into the relation classification model in pairwise combination, so as to obtain the visual relationship between any two target candidate regions.
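The pairwise combination in S625 can be sketched as follows; combining two relation feature codes by concatenation is an assumption on our part, and relation_classifier is a hypothetical stand-in for the relation classification model.

```python
import itertools
import torch

def classify_pairwise_relations(rel_codes, relation_classifier):
    """rel_codes: tensor of shape (n, d), one relation feature code per candidate region.

    Every ordered pair (i, j), i != j, is combined and scored by the relation
    classifier to obtain the visual relationship between regions i and j.
    """
    predictions = {}
    for i, j in itertools.permutations(range(rel_codes.size(0)), 2):
        pair = torch.cat([rel_codes[i], rel_codes[j]], dim=-1)   # pairwise combination (assumed concat)
        predictions[(i, j)] = relation_classifier(pair).softmax(dim=-1)
    return predictions   # {(i, j): relation class probabilities}
```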
According to the visual relationship detection method provided by this embodiment of the invention, the target feature coding model outputs a relation-aware feature code for each target candidate region, from which the feature classification model predicts more accurate target categories; the relation feature coding model likewise outputs relation-aware feature codes, which further improves the accuracy of the predicted relation categories.
As an alternative implementation manner of this embodiment, as shown in fig. 14, the step S622 may include the following steps:
S6221, extracting the first word vector corresponding to the category with the highest probability in each category probability vector.
After the category probability vector of each target candidate region is obtained in S621, the category with the highest probability is determined from the category probability vector, and the first word vector corresponding to that category is extracted. The word vectors may be word vector codes obtained in advance for all categories.
S6222, for each target candidate region, combining the feature vector, the first word vector and the full-map feature vector of the target candidate region to obtain a first combined feature vector of each target candidate region.
For each target candidate region, the feature vector of the target candidate region obtained in S621, the first word vector in S6221, and the full-map feature vector of the target candidate region are combined to obtain a first combined feature vector corresponding to each target candidate region. And the full-image feature vector is the feature of the target candidate region in the whole image to be detected.
For example, the input of PKT_obj is the joint feature {x; f_i; y_i} of each target candidate box, where x is the full-image feature of the picture, f_i is the region feature vector of the target detection box b_i, and y_i is the word vector code of the initial predicted category of b_i; the initial predicted category is the category with the highest score in the category probability vector l_i output by the target detection model.
S6223, sequentially inputting each first joint feature vector into the target feature coding model to obtain the target feature code of each target candidate region.
The electronic equipment inputs the first joint feature vector corresponding to the target candidate region into the target feature coding model, so that a target feature code can be obtained corresponding to each target candidate region.
PKT_obj first learns an attention matrix from the features of the picture I and of the target detection boxes B. The attention matrix has size n × n, and each element represents the attention of one target region to another, so that the matrix captures the association relationships between regions. PKT_obj then encodes the features of each target region according to the inter-region relationships reflected in the attention matrix, and finally outputs a relation-aware feature code for each target region.
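The attention-based re-encoding in PKT_obj can be illustrated with a single-head sketch; scaled dot-product attention is used here only as a stand-in, since the exact formulation is given by the formulas referenced in the claims, and the function and tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def relation_aware_encoding(joint_feats, W_q, W_k, W_v):
    """joint_feats: (n, d) joint features [x; f_i; y_i] after projection to dimension d.
    W_q, W_k, W_v: (d, d) mutually independent transformation matrices (one attention head).

    Returns the (n, n) attention matrix over the n target regions and the
    relation-aware code of each region.
    """
    q, k, v = joint_feats @ W_q, joint_feats @ W_k, joint_feats @ W_v
    attn = F.softmax(q @ k.T / (q.size(-1) ** 0.5), dim=-1)   # (n, n): attention between regions
    codes = joint_feats + attn @ v                            # re-encode each region with a residual
    return attn, codes
```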
As another alternative implementation of this embodiment, as shown in fig. 15, the step S624 may include the following steps:
S6241, extracting the second word vector of the category with the highest score value in the target category vector of each target candidate region.
After the target category vector corresponding to each target candidate region is obtained in S623, a category with the highest score value is determined from the target category vectors, and a second word vector corresponding to the category is extracted.
S6242, for each target candidate region, combining the target feature encoding vector and the second word vector to obtain a second combined feature vector of each target candidate region.
For each target candidate region, the target feature code of the target candidate region obtained in S622 and the second word vector obtained in S6241 are combined to obtain the second combined feature vector of each target candidate region.
And S6243, sequentially inputting each second joint feature vector into the relation feature coding model to obtain the relation feature code of each target candidate region.
And the electronic equipment inputs the second combined feature vector corresponding to the target candidate region into the relational feature coding model, so that a relational feature code can be obtained corresponding to each target candidate region.
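As a sketch of S6241-S6243, the second joint feature vector can be built by concatenating each target feature code with the word vector of its highest-scoring category and then passing it to the relation feature coding model; the function and tensor names below are illustrative assumptions.

```python
import torch

def build_second_joint_features(target_codes, class_scores, word_embeddings):
    """target_codes:    (n, d) target feature codes from the target feature coding model.
    class_scores:    (n, C) target category vectors from the feature classification model.
    word_embeddings: (C, e) pre-computed word vectors, one per category.
    """
    top_classes = class_scores.argmax(dim=-1)                    # S6241: highest-scoring category
    second_word_vecs = word_embeddings[top_classes]              # (n, e) second word vectors
    return torch.cat([target_codes, second_word_vecs], dim=-1)   # S6242: second joint features

# S6243 (illustrative): rel_codes = pkt_rel(build_second_joint_features(codes, scores, embeddings))
```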
The visual relationship detection method provided by this embodiment of the invention was evaluated using 2 evaluation metrics, 3 datasets and 4 evaluation tasks related to visual relationship prediction.
Evaluation metrics: the recall (R@K) is the proportion, for each sample, of the labeled visual relationships that are successfully matched against the truth labels, where a match requires: the intersection-over-union (IoU) between each of the two predicted target boxes and the corresponding ground-truth target box is greater than 0.5, the target class predictions are correct, and the predicted relationship between the targets is correct. Because recall only counts the number of matched visual relationships and does not take the relationship categories into account, the recall metric is dominated by common categories and ignores the recognition of uncommon categories when the relationship categories are imbalanced. Therefore, the class-averaged recall (mR@K) is also used as an evaluation metric: the number of matches is counted per relationship class, and the recall is then averaged over the classes.
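A minimal sketch of the two metrics is given below; it assumes the matching step (IoU > 0.5 for both boxes, correct classes, correct relation) has already been performed and only aggregates the counts, with field names chosen by us.

```python
from collections import defaultdict

def recall_at_k(matched_triplets, gt_triplets):
    """R@K: fraction of ground-truth relationships matched by the top-K predictions."""
    return len(matched_triplets) / max(len(gt_triplets), 1)

def mean_recall_at_k(matched_triplets, gt_triplets):
    """mR@K: recall computed per relation class, then averaged over classes."""
    matched, total = defaultdict(int), defaultdict(int)
    for t in gt_triplets:
        total[t["relation"]] += 1
    for t in matched_triplets:
        matched[t["relation"]] += 1
    per_class = [matched[c] / total[c] for c in total]
    return sum(per_class) / max(len(per_class), 1)
```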
Data set: the data set information we use is shown in figure 16. The number of pictures (# Img), the coefficient of correlation (# Rel), the coefficient of correlation (Ratio) in each average picture, and the number of target classes (# ObjCls) and the number of relationship classes (# RelCls) are shown for the three datasets VG, VG-MSDN, VG-DR-Net training set, and test set we used. Because the labeling deviation in the VG dataset is large, two cleaned versions of the VG dataset, namely VG-MSDN and VG-DR-Net, are introduced.
Visual relationship evaluation tasks: 1) PredCls: for an input picture, predict the relationships between targets given all ground-truth target boxes and target categories in the picture; 2) SGCls: for an input picture, predict the target categories and the relationships between targets given all ground-truth target boxes; 3) SGGen: for an input picture, predict the target boxes, target categories and relationships between targets; 4) PhrDet: for an input picture, predict a single bounding box enclosing the two targets, the target categories, and the relationship between the targets.
R@50, mR@50, R@100 and mR@100 were evaluated on VG and VG-MSDN for 3 vision tasks (SGGen, SGCls, PredCls), as shown in fig. 17. PhrDet was evaluated on VG-MSDN and VG-DR-Net, as shown in fig. 18.
The experimental results show that, on multiple visual relationship detection tasks, the method achieves performance close to or higher than that of MOTIFS, currently the best-performing method. In addition, the average training time per epoch (Time) and the number of model parameters (Params) of the model proposed by this method were evaluated, as shown in fig. 19. Compared with MOTIFS, the method performs better, the model is lighter, and the training time is greatly reduced.
The present embodiment further provides a training device for a feature coding model, a training device for a visual relationship detection model, and a visual relationship detection device, which are used to implement the corresponding embodiments and preferred implementations described above; what has already been explained is not repeated here. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a training apparatus for a feature coding model, as shown in fig. 20, including:
a first obtaining module 2001, configured to obtain an initial feature coding model; wherein the initial feature coding model comprises a plurality of attention modules cascaded in at least one layer, and the parameters of each of the plurality of attention modules comprise a set of mutually independent transformation matrices.
A second obtaining module 2002, configured to obtain sample data; wherein each sample data comprises a target feature of a target region in the sample image and a corresponding category.
A first input module 2003, configured to input each sample data into the initial feature coding model.
An extraction module 2004, configured to extract a guide map from the visual common sense data based on the category; wherein the guide map is used to represent the visual common sense corresponding to a target category among the categories.
A first training module 2005, configured to train the initial feature coding model according to the guide map, and adjust the transformation matrix to update the target feature of each target region, so as to obtain a target feature code of each target region.
The embodiment further provides a training apparatus for a visual relationship detection model, as shown in fig. 21, including:
a third obtaining module 2101, configured to obtain a target detection model; the target detection model is used for detecting target candidate regions in a second sample image, target features of each target candidate region and corresponding categories of the target features.
A fourth obtaining module 2102 configured to obtain a feature coding model; wherein, the feature coding model is obtained by training according to the first aspect of the present invention or the training method of the feature coding model described in any embodiment of the first aspect; the feature coding model comprises a target feature coding model and/or a relation feature coding model; the input of the target feature coding model comprises the target features of the target candidate region and the word vectors corresponding to the categories, and the output is the target feature coding of the target candidate region; the input of the relation feature coding model comprises a target feature code of the target candidate region and a word vector corresponding to the category of the target feature code, and the output is the relation feature code of the target candidate region.
A cascading module 2103, configured to cascade the target detection model and the feature coding model to obtain an initial visual relationship detection model; wherein the feature coding model is connected to the output by a classification model.
A second training module 2104 for training the initial visual relationship detection model based on a second learning rate, and adjusting parameters of the feature coding model to obtain a visual relationship detection model; wherein the second learning rate is less than a learning rate of training the feature coding model.
The present embodiment further provides a visual relationship detection apparatus, as shown in fig. 22, including:
a fifth acquiring module 2201, configured to acquire an image to be detected.
A detection module 2202, configured to input the image to be detected into a visual relationship detection model to obtain a visual relationship of the image to be detected; the visual relationship detection model is obtained by training according to the second aspect of the present invention, or the training method of the visual relationship detection model described in any embodiment of the second aspect.
The training device for the feature coding model, the training device for the visual relationship detection model, and the visual relationship detection device in the embodiments of the present invention are presented in the form of functional units, where a unit may refer to an ASIC, a processor and a memory executing one or more pieces of software or firmware, and/or other devices capable of providing the above functions.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device having at least one of the apparatuses shown in fig. 20 to 22.
Referring to fig. 23, fig. 23 is a schematic structural diagram of an electronic device according to an alternative embodiment of the present invention. As shown in fig. 23, the electronic device may include: at least one processor 2301, such as a CPU (Central Processing Unit); at least one communication interface 2303; a memory 2304; and at least one communication bus 2302. The communication bus 2302 is used to enable connection and communication between these components. The communication interface 2303 may include a display (Display) and a keyboard (Keyboard), and may optionally also include a standard wired interface and a standard wireless interface. The memory 2304 may be a high-speed RAM (volatile random access memory) or a non-volatile memory, such as at least one disk memory. The memory 2304 may optionally also be at least one storage device located remotely from the processor 2301. The processor 2301 may be combined with the devices described in fig. 20 to fig. 22; an application program is stored in the memory 2304, and the processor 2301 calls the program code stored in the memory 2304 to perform any of the method steps described above.
The communication bus 2302 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 2302 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 23, but it is not intended that there be only one bus or one type of bus.
The memory 2304 may include a volatile memory, such as a random-access memory (RAM); it may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 2304 may also include a combination of the above types of memory.
The processor 2301 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor 2301 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
Optionally, the memory 2304 is also used for storing program instructions. The processor 2301 may call program instructions to implement a training method of a feature coding model as shown in the embodiments of fig. 1 and fig. 6 of the present application, or a training method of a visual relationship detection model as shown in the embodiments of fig. 8 and fig. 9 of the present application, or a visual relationship detection method as shown in the embodiments of fig. 12-fig. 15 of the present application.
The embodiment of the invention also provides a non-transitory computer storage medium storing computer-executable instructions, which can execute the training method of the feature coding model, the training method of the visual relationship detection model, or the visual relationship detection method in any of the method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of the above kinds of memory.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (15)

1. A method for training a feature coding model, comprising:
acquiring an initial feature coding model; wherein the initial feature coding model comprises a plurality of attention modules cascaded in at least one layer, and parameters of each of the plurality of attention modules comprise a group of mutually independent transformation matrices;
acquiring sample data; wherein each sample data comprises a target feature of a target region in the sample image and a corresponding category;
inputting each sample data into the initial feature coding model;
extracting a guide map from the visual common sense data based on the category; wherein the guide map is used for representing the visual common sense corresponding to a target category among the categories;
and training the initial feature coding model according to the guide graph, and adjusting the conversion matrix to update the target feature of each target area to obtain the target feature code of each target area.
2. The method of claim 1, wherein the training the initial feature coding model according to the guide map and adjusting the transformation matrix to update the target features of each target region to obtain the target feature code of each target region comprises:
for each sample data, calculating an attention matrix of each sample image based on the transformation matrix and the target features; wherein the attention matrix is used for representing the attention of each target area to other target areas in the sample image;
combining the outputs of all the multi-head attention modules by using the conversion matrix and the attention matrix, and adding the target features to obtain target feature codes of each target region;
calculating a value of a loss function based on the target feature code and the guidance map;
and training the initial feature coding model by using the value of the loss function and a first learning rate, and adjusting the conversion matrix to update the target feature code.
3. The method of claim 2, wherein the loss function is defined as follows:
Figure FDA0002134978610000011
wherein L isattnIs the value of the loss function; s is a guide diagram sequence; siIs the ith guide map in the guide map sequence; f (-) is a loss function; h is the number of each multi-head attention module; a. thehIs the attention matrix.
4. The method of claim 3, wherein the attention matrix and the target feature code are calculated using the following formulas:
[formula images FDA0002134978610000021 and FDA0002134978610000022]
wherein v_i and v_j are any two target features in the sample image; [FDA0002134978610000023] denotes a group of mutually independent transformation matrices; A_h(v_i, v_j) is the attention of target feature v_i to target feature v_j; d is the dimension of the target features; [FDA0002134978610000024] is the target feature code corresponding to target feature v_i; and N is the number of target regions in the sample image.
5. The method of claim 2, wherein the training the initial feature coding model using the value of the loss function and a first learning rate to update the transformation matrix comprises:
calculating a first gradient estimate using the value of the loss function;
updating the transformation matrix using the first gradient estimate and the first learning rate.
6. A training method of a visual relation detection model is characterized by comprising the following steps:
acquiring a target detection model; the target detection model is used for detecting target candidate regions in a second sample image, target characteristics of each target candidate region and corresponding categories of the target characteristics;
acquiring a feature coding model; wherein the feature coding model is obtained by training according to the training method of the feature coding model of any one of claims 1-5; the feature coding model comprises a target feature coding model and/or a relation feature coding model; the input of the target feature coding model comprises the target features of the target candidate region and the word vectors corresponding to the categories, and the output is the target feature coding of the target candidate region; the input of the relation feature coding model comprises a target feature code of the target candidate region and a word vector corresponding to the category of the target feature code, and the output is the relation feature code of the target candidate region;
cascading the target detection model and the feature coding model to obtain an initial visual relationship detection model; wherein the feature coding model is connected with the output through a classification model;
training the initial visual relationship detection model based on a second learning rate, and adjusting parameters of the feature coding model to obtain a visual relationship detection model; wherein the second learning rate is less than a learning rate of training the feature coding model.
7. The method of claim 6, wherein training the initial visual relationship detection model based on the second learning rate, and adjusting parameters of the feature coding model to obtain the visual relationship detection model comprises:
calculating a value of a loss function of the feature coding model;
calculating a second gradient estimate using the value of the loss function;
updating the transformation matrix in the feature coding model using the second gradient estimation and the second learning rate.
8. The method according to claim 6, wherein the feature coding model is a cascade of the target feature coding model and the relation feature coding model; the target feature coding model is cascaded with the relation feature coding model through a feature classification model, and the relation feature coding model is connected to the output through a relation classification model.
9. The method of claim 8, wherein the feature classification model is a first fully-connected layer and the relationship classification model is a second fully-connected layer.
10. A visual relationship detection method, comprising:
acquiring an image to be detected;
inputting the image to be detected into a visual relation detection model to obtain the visual relation of the image to be detected; wherein the visual relation detection model is trained according to the training method of the visual relation detection model of any one of claims 6-9.
11. The method according to claim 10, wherein the inputting the image to be detected into a visual relationship detection model to obtain the visual relationship of the image to be detected comprises:
inputting the image to be detected into the target detection model, and outputting at least one target candidate region, a feature vector of each target candidate region and a category probability vector;
obtaining target feature codes of the target candidate regions by using the target feature coding model based on the feature vectors and the category probability vectors;
inputting the target feature codes into a feature classification model to obtain corresponding target category vectors;
obtaining a relation feature code of the target candidate region by using the relation feature coding model based on the target feature code and the target category vector;
and combining every two relation feature codes corresponding to all the target candidate regions, and inputting the relation feature codes into a relation classification model to obtain the visual relation of any two target candidate regions.
12. The method according to claim 11, wherein the deriving the target feature code of the target candidate region using the target feature coding model based on the feature vector and the class probability vector comprises:
extracting a first word vector corresponding to the category with the highest probability in each category probability vector;
for each target candidate region, combining the feature vector, the first word vector and the full-image feature vector of the target candidate region to obtain a first combined feature vector of each target candidate region;
and sequentially inputting each first joint feature vector into a target feature coding model to obtain target feature codes of each target candidate region.
13. The method according to claim 11, wherein the deriving the relational feature code of the target candidate region using the relational feature coding model based on the target feature code and the target class vector comprises:
extracting a second word vector of a category with the highest score value in the target category vectors of each target candidate region;
for each target candidate region, combining the target feature coding vector and the second word vector to obtain a second combined feature vector of each target candidate region;
and sequentially inputting each second combined feature vector into a relational feature coding model to obtain a relational feature code of each target candidate region.
14. An electronic device, comprising:
a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method for training a feature coding model according to any one of claims 1 to 5, or to perform the method for training a visual relationship detection model according to any one of claims 6 to 9, or to perform the method for visual relationship detection according to any one of claims 10 to 13.
15. A computer-readable storage medium storing computer instructions for causing a computer to perform the method for training a feature coding model according to any one of claims 1 to 5, or the method for training a visual relationship detection model according to any one of claims 6 to 9, or the method for visual relationship detection according to any one of claims 10 to 13.
CN201910650283.7A 2019-07-18 2019-07-18 Feature coding model, training method and detection method of visual relation detection model Active CN110390340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910650283.7A CN110390340B (en) 2019-07-18 2019-07-18 Feature coding model, training method and detection method of visual relation detection model

Publications (2)

Publication Number Publication Date
CN110390340A CN110390340A (en) 2019-10-29
CN110390340B true CN110390340B (en) 2021-06-01





