CN116012685B - Image description generation method based on fusion of relation sequence and visual sequence

Image description generation method based on fusion of relation sequence and visual sequence

Info

Publication number
CN116012685B
CN116012685B
Authority
CN
China
Prior art keywords
sequence
target
feature
image
sub
Prior art date
Legal status
Active
Application number
CN202211642392.2A
Other languages
Chinese (zh)
Other versions
CN116012685A (en)
Inventor
张文凯
陈佳良
冯瑛超
李硕轲
李霁豪
杜润岩
周瑞雪
Current Assignee
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date
Filing date
Publication date
Application filed by Aerospace Information Research Institute of CAS
Priority to CN202211642392.2A
Publication of CN116012685A
Application granted
Publication of CN116012685B

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the field of image processing and discloses an image description generation method based on the fusion of a relation sequence and a visual sequence. The method includes acquiring an initial image, generating a visual sequence v and a relation sequence r, and encoding v and r to generate a corresponding first visual sequence feature v1 and first relation sequence feature r1. First cross-attention encoding is performed on the first combined sequences to generate a second visual sequence feature v2, and second cross-attention encoding is performed on the second combined sequences to generate a second relation sequence feature r2. Image description information of the initial image is then generated according to v2, r2 and the fusion weight β. By adding the relation sequence between the target objects, the invention increases the receptive field corresponding to the features, so that the interrelationships between the target objects can be captured more clearly. The generated image description information is therefore more accurate and fine-grained, with higher precision.

Description

Image description generation method based on fusion of relation sequence and visual sequence
Technical Field
The invention relates to the field of image processing, in particular to an image description generation method based on fusion of a relation sequence and a visual sequence.
Background
In the big data age, large amounts of image data require large amounts of human resources to process. With the development of machine learning and deep learning technologies, target-centric image understanding tasks such as image classification, object detection and image segmentation have achieved good results. However, these tasks can only provide content information contained in the current image, such as the target category, the target position, or the pixel category to which a target belongs. Combining these contents to extract the subject matter and semantic information contained in the image, i.e. image semantic description (image captioning), remains a challenge. This task aims at a one-way conversion between the image and text modalities, converting an input image into a natural language description that conforms to grammar rules and is consistent with the image content. The technology of image semantic description has wide application scenarios. For example, in a massive image data management system under a remote sensing scenario, describing image semantics based on an understanding of the semantic topics of the images makes it more convenient to distinguish images that contain the same targets but have different semantic topics. Other applications include interpreting the illustration photos accompanying a newspaper article, providing a text description for a chart or a map, and providing scene descriptions for visually impaired people.
In the prior art, the text description of the semantic information of each image and of the subject matter it expresses is generated by decoding the visual features corresponding to the targets. However, the prior-art methods suffer from low accuracy of the generated text descriptions.
Disclosure of Invention
Aiming at the technical problems, the invention adopts the following technical scheme:
according to one aspect of the present invention, there is provided an image description generation method based on fusion of a relational sequence with a visual sequence, the method comprising the steps of:
and acquiring an initial image, wherein the initial image comprises images corresponding to the N target objects. N is E [10,100].
And performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics. The initial target features include image features corresponding to the N target objects. Wherein each target object corresponds to a target frame.
Sequence encoding is performed on the initial target features to generate a visual sequence v, wherein v includes a visual sub-feature sequence corresponding to each target frame.
Acquiring a joint frame corresponding to each target object in the initial target features. The image area selected by the joint frame is larger than the image area selected by the corresponding target frame.
Feature extraction is performed on each joint-frame-selected image region using a ResNet152 network to generate the relation feature corresponding to each joint frame.
All the relation features are sequence-encoded to generate a relation sequence r, wherein r includes a relation sub-feature sequence corresponding to each joint frame.
Self-attention encoding is performed on v and r respectively to generate a corresponding first visual sequence feature v1 and first relation sequence feature r1, wherein v1 includes a first visual sub-feature sequence corresponding to each target frame and r1 includes a first relation sub-feature sequence corresponding to each joint frame.
The self-attention weight W in the self-attention encoding satisfies a geometry-aware condition that depends on the geometric relationship feature between the two objects for which self-attention is computed.
Each first visual sub-feature sequence in v1 is combined with its corresponding target relation sub-feature sequence in r1 into a first combined sequence corresponding to that first visual sub-feature sequence. The target relation sub-feature sequence is the collection of first relation sub-feature sequences corresponding to at least one joint frame overlapping the target frame corresponding to the first visual sub-feature sequence.
First cross-attention encoding is performed on each first combined sequence to generate a second visual sequence feature v2. The self-attention of the i-th first combined sequence in the first cross-attention encoding satisfies a condition defined over: the i-th first visual sub-feature sequence; the target relation sub-feature sequence corresponding to it; the image area selected by its corresponding target frame; the image area selected by each of its corresponding joint frames; and, in the first cross-attention encoding corresponding to the i-th first combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence.
Each first relation sub-feature sequence in r1 is combined with its corresponding target visual sub-feature sequence in v1 into a second combined sequence corresponding to that first relation sub-feature sequence. The target visual sub-feature sequence is the collection of first visual sub-feature sequences corresponding to at least one target frame that overlaps the joint frame corresponding to the first relation sub-feature sequence with an overlap area larger than the area threshold λ.
Second cross-attention encoding is performed on each second combined sequence to generate a second relation sequence feature r2. The self-attention of the i-th second combined sequence in the second cross-attention encoding satisfies a condition defined over: the i-th first relation sub-feature sequence; the target visual sub-feature sequence corresponding to it; the image area selected by its corresponding joint frame; the image area selected by each of its corresponding target frames; and, in the second cross-attention encoding corresponding to the i-th second combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence.
According to the text sequence S generated before the current time t, a fusion weight β corresponding to the current time t is generated; β indicates the fusion ratio to be applied at the current time t. Here, S is the image description information of the initial image correspondingly generated at time t-1.
According to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t is generated.
According to a second aspect of the present invention, there is provided a non-transitory computer readable storage medium storing a computer program which, when executed by a processor, implements an image description generation method based on fusion of a relational sequence with a visual sequence as described above.
According to a third aspect of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing a method of generating an image description based on fusion of a relational sequence with a visual sequence as described above when the computer program is executed by the processor.
The invention has at least the following beneficial effects:
the invention jointly generates the description text of the initial image by setting the two types of characteristics of the visual sequence and the relation sequence. By adding the relation sequence between the target objects, the receptive field corresponding to the features can be increased, and further the machine can be helped to obtain the interrelationship between the target objects more clearly. Therefore, finally, according to the visual sequence and the relation sequence, the generated image description information of the initial image is more accurate and fine, and the precision is higher.
In addition, in the invention, the self-attention calculation in the independent modes is respectively carried out on the visual sequence and the relation sequence, and the attention calculation between the modes is also carried out, so that the receptive fields of the obtained second visual sequence characteristics and the second relation sequence characteristics can be further increased, and further, the relation information between the richer characteristics can be contained. Correspondingly, the image description information of the initial image, which can be generated during decoding, is more accurate and fine, and the accuracy is higher.
Meanwhile, the corresponding fusion weight β is determined according to the text sequence S, and then the proportions in which v2 and r2 are respectively fused to generate the target fusion feature sequence F are determined according to β. The input values of the dual-stream decoder corresponding to each moment can thereby be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an image description generating method based on fusion of a relationship sequence and a visual sequence according to an embodiment of the present invention.
Fig. 2 shows the test results of the model corresponding to the method of the present invention in MSCOCO online test.
Fig. 3 shows the test results of the model corresponding to the method of the present invention in MSCOCO karpathy split.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Specifically, the method of the invention is implemented by adopting a multi-modal Transformer architecture as the baseline model. The model mainly comprises two parts: an encoder and a decoder. The general working process of the invention comprises (1) construction of the visual sequence and the relation sequence, (2) encoding by a dual-stream encoder, and (3) decoding by the dual-stream decoder according to the target fusion feature sequence F so as to generate the description information of the initial image.
As a possible embodiment of the present invention, as shown in fig. 1, there is provided an image description generating method based on fusion of a relationship sequence and a visual sequence, the method comprising the steps of:
S100, acquiring an initial image, wherein the initial image comprises images corresponding to N target objects; N ∈ [10,100].
S101, performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics. The initial target features include image features corresponding to the N target objects. Wherein each target object corresponds to a target frame.
Specifically, at the input end, a group of 2048-dimensional object features corresponding to N target objects are firstly extracted through a target detection network Fast-RCNN, and then the object features are mapped to 512 dimensions so as to adapt to the input dimension of the encoder.
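As a non-limiting illustration of this input mapping, the following Python sketch assumes that the 2048-dimensional region features have already been produced by a Fast-RCNN-style detector (the detector itself is not shown) and only demonstrates the projection to the 512-dimensional encoder input; the variable names and the random stand-in tensors are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: N detected target objects, each with a 2048-dim region feature.
N, det_dim, model_dim = 10, 2048, 512
region_feats = torch.randn(N, det_dim)   # placeholder for Fast-RCNN object features
proj = nn.Linear(det_dim, model_dim)     # maps the 2048-dim object features to the 512-dim encoder input
visual_tokens = proj(region_feats)       # one 512-dim visual sub-feature per target frame
print(visual_tokens.shape)               # torch.Size([10, 512])
```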
S200, performing sequence encoding on the initial target features to generate a visual sequence v, wherein v includes a visual sub-feature sequence corresponding to each target frame.
S300, generating a relation sequence r according to the initial target features, comprising the following steps:
s301, acquiring a joint frame corresponding to each target object in the initial target characteristics. The combined frame selected image area is larger than the corresponding target frame selected image area.
Through the step, a joint frame corresponding to any two target objects in the initial target characteristics can be generated. And the image area contained in the joint box may represent a relationship between two target objects, such as a positional relationship. Therefore, feature extraction is carried out on the image area selected by the combined frame, and the generated relation feature can be ensured to contain the semantic feature of the relation between the two corresponding target objects.
S302, extracting features of each joint-frame-selected image area using a ResNet152 network to generate the relation feature corresponding to each joint frame.
S303, performing sequence encoding on all the relation features to generate a relation sequence r, wherein r includes a relation sub-feature sequence corresponding to each joint frame.
After the above steps, each of the relationship features in the relationship sequence has environmental information about the periphery of the target, while the features in the visual sequence mainly express specific details of a certain target.
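The relation-feature extraction of S302 can be sketched as follows; this is an assumption-laden illustration (crop size 224x224, joint boxes given in pixel coordinates, untrained weights, hypothetical function names) rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

# Sketch: extract a 2048-dim relation feature from each joint-frame-selected image region.
backbone = models.resnet152(weights=None)                  # pretrained weights would normally be loaded
backbone = nn.Sequential(*list(backbone.children())[:-1])  # keep everything up to the global average pool
backbone.eval()

def joint_box_features(image, joint_boxes):
    # image: (3, H, W) float tensor; joint_boxes: iterable of (x1, y1, x2, y2) pixel coordinates.
    feats = []
    with torch.no_grad():
        for x1, y1, x2, y2 in joint_boxes:
            crop = image[:, y1:y2, x1:x2].unsqueeze(0)     # image region selected by the joint frame
            crop = F.interpolate(crop, size=(224, 224), mode="bilinear", align_corners=False)
            feats.append(backbone(crop).flatten(1))        # (1, 2048) relation feature for this joint frame
    return torch.cat(feats, dim=0)                         # (num_joint_frames, 2048)

image = torch.rand(3, 480, 640)
print(joint_box_features(image, [(10, 20, 200, 220), (50, 60, 400, 300)]).shape)
```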
S400, performing self-attention encoding on v and r respectively to generate a corresponding first visual sequence feature v1 and first relation sequence feature r1, wherein v1 includes a first visual sub-feature sequence corresponding to each target frame and r1 includes a first relation sub-feature sequence corresponding to each joint frame.
Specifically, suppose for example that there are 10 target objects in the initial image, i.e. N = 10. Then v is a visual sequence composed of the encoded sub-sequences (visual sub-feature sequences) of the images framed by the target frames corresponding to the 10 target objects respectively; that is, the 10 visual sub-feature sequences are the 10 corresponding elements that make up the visual sequence.
Similarly, r is a relation sequence composed of the encoded sub-sequences (relation sub-feature sequences) of the images framed by the joint frames corresponding to the 10 target objects respectively; that is, the 10 relation sub-feature sequences are the 10 corresponding elements that make up the relation sequence.
The self-attention encoding in this step performs self-attention encoding on the 10 elements in v to generate v1, where v1 likewise contains 10 recoded elements in one-to-one correspondence with v. Similarly, the 10 elements in r are self-attention encoded to generate r1, where r1 likewise contains 10 recoded elements in one-to-one correspondence with r.
The self-attention weight W in the self-attention encoding is conditioned on the geometric relationship feature between the two objects for which self-attention is computed. This geometric relationship feature is a 4-dimensional vector whose dimensions are the distance between the center points of the two target frames in the X direction, the distance between the center points in the Y direction, the ratio of the lengths of the two target frames, and the ratio of their widths. The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values used for the self-attention calculation, and the adjustment is likewise prior art and is not described in detail here.
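A minimal sketch of the 4-dimensional geometric relationship feature described above is given below. The exact scaling or normalization applied to these four quantities is not specified and is therefore left out; the function name and the (x1, y1, x2, y2) box format are assumptions.

```python
def geometric_relation(box_a, box_b):
    # Boxes as (x1, y1, x2, y2). Returns the 4-dim vector described above:
    # center distance in X, center distance in Y, length ratio, width ratio.
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    len_a, len_b = box_a[3] - box_a[1], box_b[3] - box_b[1]   # lengths (heights) of the two target frames
    wid_a, wid_b = box_a[2] - box_a[0], box_b[2] - box_b[0]   # widths of the two target frames
    return [ax - bx, ay - by, len_a / len_b, wid_a / wid_b]

print(geometric_relation((0, 0, 10, 20), (5, 5, 25, 45)))   # [-10.0, -15.0, 0.5, 0.5]
```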
S500, combining each first visual sub-feature sequence in v1 with its corresponding target relation sub-feature sequence in r1 into a first combined sequence corresponding to that first visual sub-feature sequence. The target relation sub-feature sequence is the collection of first relation sub-feature sequences corresponding to at least one joint frame overlapping the target frame corresponding to the first visual sub-feature sequence.
Specifically, in this step at least one target relation sub-feature sequence corresponding to each first visual sub-feature sequence is obtained according to the above condition, so as to generate the corresponding first combined sequence. That is, each first visual sub-feature sequence in v1 and its corresponding first relation sub-feature sequences in r1 form a new input sequence on which self-attention calculation is to be performed. For example, the 1st visual sub-feature sequence of v1 and the 3rd and 6th first relation sub-feature sequences of r1 form a corresponding first combined sequence.
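The pairing of visual sub-feature sequences with overlapping joint frames can be sketched as follows; overlap is taken here simply as a non-empty box intersection, and the list-based data layout and function names are assumptions made purely for illustration.

```python
def boxes_overlap(box_a, box_b):
    # True if two (x1, y1, x2, y2) boxes have a non-empty intersection.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    return ix2 > ix1 and iy2 > iy1

def first_combined_sequences(v1, target_boxes, r1, joint_boxes):
    # v1: first visual sub-feature sequences (one per target frame);
    # r1: first relation sub-feature sequences (one per joint frame).
    # Each visual element is grouped with the relation elements whose joint frame overlaps its target frame.
    combined = []
    for vis_feat, t_box in zip(v1, target_boxes):
        partners = [rel_feat for rel_feat, j_box in zip(r1, joint_boxes) if boxes_overlap(t_box, j_box)]
        combined.append([vis_feat] + partners)   # first combined sequence for this visual sub-feature
    return combined

print(first_combined_sequences(["v1_1"], [(0, 0, 10, 10)],
                               ["r1_1", "r1_2"], [(5, 5, 20, 20), (50, 50, 60, 60)]))
# [['v1_1', 'r1_1']]
```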
S501, performing first cross-attention encoding on each first combined sequence to generate a second visual sequence feature v2. The self-attention of the i-th first combined sequence in the first cross-attention encoding, i.e. the self-attention of the i-th first visual sub-feature sequence, satisfies a condition defined over: the i-th first visual sub-feature sequence; the target relation sub-feature sequence corresponding to it; the image area selected by its corresponding target frame; the image area selected by each of its corresponding joint frames; and, in the first cross-attention encoding corresponding to the i-th first combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence.
For example, for the 1st first visual sub-feature sequence, the geometric relationship features between its target frame and the joint frames corresponding to the 3rd and 6th first relation sub-feature sequences can be obtained.
The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values used for the self-attention calculation, and the adjustment is likewise prior art and is not described in detail here.
S600, combining each first relation sub-feature sequence in r1 with its corresponding target visual sub-feature sequence in v1 into a second combined sequence corresponding to that first relation sub-feature sequence. The target visual sub-feature sequence is the collection of first visual sub-feature sequences corresponding to at least one target frame that overlaps the joint frame corresponding to the first relation sub-feature sequence with an overlap area larger than the area threshold λ.
The principle of forming the second combined sequence in this step is similar to that in S500, except that the screening condition is different, and is not repeated here.
In practical use, to avoid introducing too many associated target frames and thereby too much noise, preferably λ = 0.3.
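The screening condition for the second combined sequences can be sketched in the same spirit; it is assumed here that "overlap area larger than the area threshold λ" is measured as the intersection area divided by the target-frame area, which is only one possible reading.

```python
AREA_THRESHOLD = 0.3   # the preferred value of the area threshold stated above

def overlap_ratio(joint_box, target_box):
    # Assumption: overlap measured as intersection area / target-frame area.
    ix1, iy1 = max(joint_box[0], target_box[0]), max(joint_box[1], target_box[1])
    ix2, iy2 = min(joint_box[2], target_box[2]), min(joint_box[3], target_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    target_area = (target_box[2] - target_box[0]) * (target_box[3] - target_box[1])
    return inter / target_area if target_area > 0 else 0.0

def second_combined_sequences(r1, joint_boxes, v1, target_boxes, thr=AREA_THRESHOLD):
    # Each relation element is grouped with the visual elements whose target frame
    # overlaps its joint frame with a ratio above the threshold.
    combined = []
    for rel_feat, j_box in zip(r1, joint_boxes):
        partners = [vis_feat for vis_feat, t_box in zip(v1, target_boxes)
                    if overlap_ratio(j_box, t_box) > thr]
        combined.append([rel_feat] + partners)
    return combined
```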
S601, performing second cross-attention encoding on each second combined sequence to generate a second relation sequence feature r2. The self-attention of the i-th second combined sequence in the second cross-attention encoding satisfies a condition defined over: the i-th first relation sub-feature sequence; the target visual sub-feature sequence corresponding to it; the image area selected by its corresponding joint frame; the image area selected by each of its corresponding target frames; and, in the second cross-attention encoding corresponding to the i-th second combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence.
The calculation in this step is the same as that in S501, except that the input values for the cross-attention calculation are different.
The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values used for the self-attention calculation, and the adjustment is likewise prior art. The attention operators are expanded in a multi-head attention manner: each operator performs its calculation in its own feature subspace, and the results are concatenated to form the final output.
S100-S600 are mainly implemented by a dual-stream encoder, i.e. a co-representation learning encoder.
After the processing in S100-S300, each relation feature in the relation sequence carries environmental information about the surroundings of its target, while the features in the visual sequence mainly express the specific details of a particular target. Thus, through the following S500-S700, the two can mutually compensate for each other's information by means of cross-attention during the execution of the attention mechanism.
The present invention divides the visual and relational attention calculation process into two stages. In the first, self-attention stage, the visual sequence and the relation sequence each execute their own attention operator, so that the model first learns intra-modal interactions on the visual and relational modalities respectively. In the second, cross-attention stage, the visual sequence and the relation sequence interact and execute attention operators on each other, so that the visual features and the relation features can exploit each other to further improve their representations.
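The two-stage scheme can be sketched with standard multi-head attention as follows. This is a simplified illustration: the geometry-aware weighting and the per-query screening into combined sequences described above are omitted, every query simply attends to the full sequence of the other modality, and the layer names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DualStreamEncoderLayer(nn.Module):
    # Stage 1: intra-modal self-attention on v and r; Stage 2: cross-attention between them.
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_r = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_v = nn.MultiheadAttention(dim, heads, batch_first=True)  # visual queries over relation keys/values
        self.cross_r = nn.MultiheadAttention(dim, heads, batch_first=True)  # relation queries over visual keys/values

    def forward(self, v, r):
        v1, _ = self.self_v(v, v, v)        # first visual sequence feature
        r1, _ = self.self_r(r, r, r)        # first relation sequence feature
        v2, _ = self.cross_v(v1, r1, r1)    # second visual sequence feature
        r2, _ = self.cross_r(r1, v1, v1)    # second relation sequence feature
        return v2, r2

layer = DualStreamEncoderLayer()
v = torch.randn(1, 10, 512)   # 10 visual sub-feature sequences
r = torch.randn(1, 10, 512)   # 10 relation sub-feature sequences
v2, r2 = layer(v, r)
print(v2.shape, r2.shape)     # torch.Size([1, 10, 512]) torch.Size([1, 10, 512])
```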
S700, generating, according to the text sequence S generated before the current time t, the fusion weight β corresponding to the current time t. β indicates the fusion ratio to be applied at the current time t. Here, S is the image description information of the initial image correspondingly generated at time t-1.
S800, generating, according to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t.
Steps S700-S800 are mainly implemented by a dual-stream decoder, also referred to as a co-representation learning decoder. Specifically, in the process in which the visual-relation dual-stream decoder generates the description sentence word by word, when the description word of the initial image at time t is generated, the word category (visual vocabulary or relational vocabulary) that needs to be generated at the current time t can be determined according to the previously generated content S. Then, the ratio in which the visual sequence and the relation sequence are fused, i.e. the fusion weight β, is determined according to the part-of-speech category. Finally, according to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t is generated.
Thus, the corresponding fusion weight β is determined according to S, and then the proportions in which v2 and r2 are respectively fused to generate the target fusion feature sequence F are determined according to β. The input values of the dual-stream decoder corresponding to each moment can thereby be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
The invention jointly generates the description text of the initial image by setting the two types of characteristics of the visual sequence and the relation sequence. By adding the relation sequence between the target objects, the receptive field corresponding to the features can be increased, and further the machine can be helped to obtain the interrelationship between the target objects more clearly. Therefore, finally, according to the visual sequence and the relation sequence, the generated image description information of the initial image is more accurate and fine, and the precision is higher.
In addition, in the invention, the self-attention calculation in the independent modes is respectively carried out on the visual sequence and the relation sequence, and the attention calculation between the modes is also carried out, so that the receptive fields of the obtained second visual sequence characteristics and the second relation sequence characteristics can be further increased, and further, the relation information between the richer characteristics can be contained. Correspondingly, the image description information of the initial image, which can be generated during decoding, is more accurate and fine, and the accuracy is higher.
As a possible embodiment of the present invention, S301, acquiring a joint frame corresponding to each target object in the initial target features, comprises:
S311, counting the co-occurrence value set of each category in the MSCOCO dataset, A1, A2, …, Ai, …, Az, where Ai = (Ai1, Ai2, …, Aim, …, Aiz). Ai is the co-occurrence value set for the i-th category, and Aim is the co-occurrence value between the i-th category and the m-th category, i.e. the total number of co-occurrences of targets of the i-th category and targets of the m-th category in all images of the MSCOCO dataset. z is the total number of categories in the MSCOCO dataset, and i = 1, 2, …, z.
S321, according to A1, A2, …, Ai, …, Az, determining a joint object corresponding to each target object from the initial target features. The joint object is the other target object in the initial target features having the maximum co-occurrence value with the target object.
S331, generating a joint frame corresponding to each target object according to each target object and its corresponding joint object. The image area selected by the joint frame comprises the image areas corresponding to the target object and the joint object respectively.
If, in S301, joint frames were generated for all pairwise relationships in the initial image and used directly to learn the relation feature map, a significant amount of time and computing resources would be consumed due to the large amount of data.
Meanwhile, among all possible pairwise relationships there are erroneous relationships that do not accord with common sense. For example, suppose the target objects include a boat, a person, water and a hat. The relationships "person-boat" and "water-boat" frequently occur together in reality and are therefore reasonable, correct relationships, whereas the relationships "hat-boat" and "hat-water" hardly ever occur together in reality and are therefore unreasonable, erroneous relationships. It is thus necessary to remove the noise relationships (erroneous relationships) among all the pairwise relationships, thereby improving the generalization and effectiveness of the resulting joint frames.
In this embodiment, the denoising is performed according to the co-occurrence values, so as to solve the above problems.
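A toy sketch of the co-occurrence-based selection of joint objects and joint frames follows. It assumes the class co-occurrence counts have been precomputed from the MSCOCO annotations and that the joint frame is the minimum enclosing box of the two target frames, which is consistent with, but not stated verbatim in, the description above; all names are hypothetical.

```python
def union_box(box_a, box_b):
    # Minimum enclosing box containing both target frames (assumed form of the joint frame).
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def build_joint_boxes(detections, cooccurrence):
    # detections: list of (category_id, box); cooccurrence[i][m]: co-occurrence count of
    # categories i and m over the dataset (assumed precomputed from MSCOCO annotations).
    joint_boxes = []
    for idx, (cat, box) in enumerate(detections):
        best_box, best_score = None, -1
        for jdx, (other_cat, other_box) in enumerate(detections):
            if jdx == idx:
                continue
            if cooccurrence[cat][other_cat] > best_score:
                best_box, best_score = other_box, cooccurrence[cat][other_cat]
        if best_box is not None:
            joint_boxes.append(union_box(box, best_box))   # joint frame for this target object
    return joint_boxes

cooc = [[0, 5, 1], [5, 0, 2], [1, 2, 0]]                  # toy 3-category co-occurrence table
dets = [(0, (0, 0, 10, 10)), (1, (20, 20, 30, 30)), (2, (40, 40, 50, 50))]
print(build_joint_boxes(dets, cooc))
# [(0, 0, 30, 30), (0, 0, 30, 30), (20, 20, 50, 50)]
```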
As a possible embodiment of the invention, S700, generating, from the text sequence S generated before the current time t, the fusion weight β corresponding to the current time t, comprises:
S701, inputting S into a first multi-layer perceptron (MLP) to generate a part-of-speech probability corresponding to each piece of image description information in S. The part-of-speech probability is the probability of the part-of-speech category corresponding to that image description information.
Further, S701 comprises:
S711, inputting S into a first fully connected layer to generate a part-of-speech feature corresponding to each piece of image description information in S. The part-of-speech feature represents the likelihood that the corresponding image description information is a relational vocabulary word.
S721, normalizing all the part-of-speech features with a first sigmoid activation function to generate a part-of-speech probability corresponding to each part-of-speech feature. Each part-of-speech probability lies within the preset value interval [0,1].
S702, inputting all the part-of-speech probabilities corresponding to S into a second multi-layer perceptron to generate the fusion weight β corresponding to S.
Further, the second multi-layer perceptron comprises a second fully connected layer. The second fully connected layer is used for performing weighted average processing on all the part-of-speech probabilities to generate a fusion weight feature.
Further, the second multi-layer perceptron further comprises a second sigmoid activation function. The second sigmoid activation function is used for generating β according to the fusion weight feature.
In this embodiment, the fusion weight β is generated by composing a gate function from the two multi-layer perceptrons.
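A minimal sketch of this gate is given below, under the assumptions that the input is a sequence of feature vectors for the words generated so far and that the "weighted average processing" can be approximated by mean pooling followed by a linear layer; the hidden sizes and class names are hypothetical.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    # First MLP: per-word part-of-speech probability (likelihood of a relational word).
    # Second MLP: pooled probabilities -> fusion weight beta.
    def __init__(self, dim=512):
        super().__init__()
        self.pos_fc = nn.Linear(dim, 1)    # first fully connected layer -> part-of-speech feature
        self.fuse_fc = nn.Linear(1, 1)     # second fully connected layer -> fusion weight feature

    def forward(self, generated_word_feats):                         # (T-1, dim) features of the words in S
        pos_prob = torch.sigmoid(self.pos_fc(generated_word_feats))  # first sigmoid, probabilities in [0, 1]
        fused = self.fuse_fc(pos_prob.mean(dim=0, keepdim=True))     # mean pooling stands in for the weighted average
        return torch.sigmoid(fused)                                  # second sigmoid -> fusion weight beta

gate = FusionGate()
print(gate(torch.randn(4, 512)).item())   # a scalar beta in (0, 1)
```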
As a possible embodiment of the invention, S800, generating, according to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t, comprises:
S801, fusing v2 and r2 according to β to generate a target fusion feature sequence F, wherein F satisfies a fusion condition defined over β, v2 and r2.
S802, decoding according to S and F to generate the image description information of the initial image corresponding to the current time t.
In this embodiment, the corresponding fusion weight β is determined according to S, and then the proportions in which v2 and r2 are respectively fused to generate the target fusion feature sequence F are determined according to β. The input values of the dual-stream decoder corresponding to each moment can thereby be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
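The fusion and decoding step can be sketched as follows. The exact condition that F satisfies is given in the patent as a formula that is not reproduced here; the convex combination below is only an assumed reading consistent with β being a fusion ratio, and the Transformer decoder layer is a stand-in for the dual-stream decoder, not the patent's implementation.

```python
import torch
import torch.nn as nn

def fuse_features(v2, r2, beta):
    # Assumed form of the target fusion feature sequence F: a beta-weighted combination
    # of the second relation and second visual sequence features.
    return beta * r2 + (1.0 - beta) * v2

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
v2 = torch.randn(1, 10, 512)                   # second visual sequence feature
r2 = torch.randn(1, 10, 512)                   # second relation sequence feature
beta = torch.tensor(0.4)                       # fusion weight produced by the gate for time t
F_seq = fuse_features(v2, r2, beta)            # target fusion feature sequence F
prev_words = torch.randn(1, 5, 512)            # embeddings of the text sequence S generated before t
step_out = decoder_layer(prev_words, F_seq)    # decoder output used to predict the word at time t
print(step_out.shape)                          # torch.Size([1, 5, 512])
```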
Fig. 2 and Fig. 3 show the test data of various performance indicators after the model of the method of the present invention was evaluated with two conventional test protocols. Specifically, Fig. 2 shows the results of the online test (MSCOCO online test), and Fig. 3 shows the results of the offline test (MSCOCO Karpathy split).
Among the listed results, the entries corresponding to the proposed model are the test results of the method of the present invention; the other entries are the test results of the existing related models.
According to the test results, essentially all indicators of the method of the present invention are improved to a certain extent, and the method achieves better performance than the existing methods.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium that may be disposed in an electronic device to store at least one instruction or at least one program for implementing the method embodiments, the at least one instruction or the at least one program being loaded and executed by a processor to implement the methods provided by the embodiments described above.
Embodiments of the present invention also provide an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.
Embodiments of the present invention also provide a computer program product comprising program code for causing an electronic device to carry out the steps of the method according to the various exemplary embodiments of the invention described in the present specification when the program product is run on the electronic device.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (8)

1. An image description generation method based on fusion of a relation sequence and a visual sequence is characterized by comprising the following steps:
acquiring an initial image, wherein the initial image comprises images corresponding to N target objects; N ∈ [10,100];
performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics; the initial target features comprise image features corresponding to N target objects; wherein each target object corresponds to a target frame;
performing sequence coding on the initial target features to generate a visual sequence v; wherein v comprises a visual sub-feature sequence corresponding to each target frame;
acquiring a joint frame corresponding to each target object in the initial target features; the image area selected by the joint frame is larger than the image area selected by the corresponding target frame; the image region contained in the joint frame may represent a relationship between two target objects;
extracting features of the image area selected by each joint frame by using a ResNet152 network; generating a relation feature corresponding to each joint frame;
performing sequence coding on all the relation features to generate a relation sequence r; wherein r comprises a relation sub-feature sequence corresponding to each joint frame;
performing self-attention coding on v and r respectively to generate a corresponding first visual sequence feature v1 and a first relation sequence feature r1; wherein v1 comprises a first visual sub-feature sequence corresponding to each target frame; r1 comprises a first relation sub-feature sequence corresponding to each joint frame;
the self-attention weight W in the self-attention coding satisfies a geometry-aware condition that depends on the geometric relationship feature between the two objects performing the self-attention computation;
combining each first visual sub-feature sequence in v1 with its corresponding target relation sub-feature sequence in r1 into a first combined sequence corresponding to each first visual sub-feature sequence; the target relation sub-feature sequence is a collection of first relation sub-feature sequences corresponding to at least one joint frame overlapping the target frame corresponding to the first visual sub-feature sequence;
performing first cross-attention coding on each first combined sequence to generate a second visual sequence feature v2; the self-attention of the i-th first combined sequence in the first cross-attention coding satisfies a condition defined over the i-th first visual sub-feature sequence, its corresponding target relation sub-feature sequence, the image area selected by its corresponding target frame, the image area selected by each of its corresponding joint frames, and, in the first cross-attention coding corresponding to the i-th first combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence;
combining each first relation sub-feature sequence in r1 with its corresponding target visual sub-feature sequence in v1 into a second combined sequence corresponding to each first relation sub-feature sequence; the target visual sub-feature sequence is a collection of first visual sub-feature sequences corresponding to at least one target frame that overlaps the joint frame corresponding to the first relation sub-feature sequence with an overlap area larger than the area threshold λ;
performing second cross-attention coding on each second combined sequence to generate a second relation sequence feature r2; the self-attention of the i-th second combined sequence in the second cross-attention coding satisfies a condition defined over the i-th first relation sub-feature sequence, its corresponding target visual sub-feature sequence, the image area selected by its corresponding joint frame, the image area selected by each of its corresponding target frames, and, in the second cross-attention coding corresponding to the i-th second combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence;
generating, from a text sequence S generated before the current time t, a fusion weight β corresponding to the current time t; β is used for indicating the corresponding fusion ratio; wherein S is the image description information of the initial image correspondingly generated at time t-1;
generating, according to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t;
generating, from the text sequence S generated before the current time t, the fusion weight β corresponding to the current time t comprises:
inputting S into a first multi-layer perceptron to generate a part-of-speech probability corresponding to each piece of image description information in S; the part-of-speech probability is the probability of the part-of-speech category corresponding to that image description information;
inputting all the part-of-speech probabilities corresponding to S into a second multi-layer perceptron to generate the fusion weight β corresponding to S;
generating, according to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t comprises:
fusing v2 and r2 according to β to generate a target fusion feature sequence F; F satisfies a fusion condition defined over β, v2 and r2;
decoding according to S and F to generate the image description information of the initial image corresponding to the current time t.
2. The method according to claim 1, wherein acquiring a joint frame corresponding to each target object in the initial target features comprises:
counting the co-occurrence value set of each category in the MSCOCO dataset, A1, A2, …, Ai, …, Az, where Ai = (Ai1, Ai2, …, Aim, …, Aiz); wherein Ai is the co-occurrence value set for the i-th category; Aim is the co-occurrence value between the i-th category and the m-th category, i.e. the total number of co-occurrences of targets of the i-th category and targets of the m-th category in all images of the MSCOCO dataset; z is the total number of categories in the MSCOCO dataset; i = 1, 2, …, z;
according to A1, A2, …, Ai, …, Az, determining a joint object corresponding to each target object from the initial target features; the joint object is the other target object in the initial target features having the maximum co-occurrence value with the target object;
generating a joint frame corresponding to each target object according to each target object and the corresponding joint object; the image areas selected by the joint frame comprise image areas corresponding to the target object and the joint object respectively.
3. The method of claim 1, wherein inputting S into the first multi-layer perceptron to generate part-of-speech probabilities for each image description in S comprises:
inputting S into a first full connection layer; generating part-of-speech features corresponding to each piece of image description information in S;
normalizing all the part-of-speech features by using a first sigmoid activation function to generate part-of-speech probability corresponding to each part-of-speech feature; each part-of-speech probability is within a preset numerical interval.
4. The method of claim 1, wherein the second multi-layer perceptron comprises a second fully-connected layer;
and the second full-connection layer is used for carrying out weighted average processing on all part-of-speech probabilities to generate fusion weight characteristics.
5. The method according to claim 4, wherein the second multi-layer perceptron further comprises a second sigmoid activation function;
the second sigmoid activation function is used for generating β according to the fusion weight feature.
6. The method according to claim 1, wherein the area threshold λ = 0.3.
7. a non-transitory computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a method of generating an image description based on fusion of a relational sequence with a visual sequence according to any one of claims 1 to 6.
8. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements a method of generating an image description based on fusion of a relational sequence with a visual sequence according to any one of claims 1 to 6.
CN202211642392.2A 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence Active CN116012685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211642392.2A CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211642392.2A CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Publications (2)

Publication Number Publication Date
CN116012685A CN116012685A (en) 2023-04-25
CN116012685B true CN116012685B (en) 2023-06-16

Family

ID=86029043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211642392.2A Active CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Country Status (1)

Country Link
CN (1) CN116012685B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444968A (en) * 2020-03-30 2020-07-24 哈尔滨工程大学 Image description generation method based on attention fusion
WO2021223323A1 (en) * 2020-05-06 2021-11-11 首都师范大学 Image content automatic description method based on construction of chinese visual vocabulary list
CN113609326A (en) * 2021-08-25 2021-11-05 广西师范大学 Image description generation method based on external knowledge and target relation
CN115311598A (en) * 2022-07-29 2022-11-08 复旦大学 Video description generation system based on relation perception

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
跨层多模型特征融合与因果卷积解码的图像描述 (Image captioning with cross-layer multi-model feature fusion and causal convolution decoding); 罗会兰 (Luo Huilan); 岳亮亮 (Yue Liangliang); 中国图象图形学报 (Journal of Image and Graphics), no. 8, pp. 96-109 *

Also Published As

Publication number Publication date
CN116012685A (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN109447242B (en) Image description regeneration system and method based on iterative learning
CN111368993B (en) Data processing method and related equipment
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN111291212A (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN110598191B (en) Complex PDF structure analysis method and device based on neural network
US20220215159A1 (en) Sentence paraphrase method and apparatus, and method and apparatus for training sentence paraphrase model
CN114926835A (en) Text generation method and device, and model training method and device
CN114612767B (en) Scene graph-based image understanding and expressing method, system and storage medium
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN108804544A (en) Internet video display multi-source data fusion method and device
CN113868451B (en) Cross-modal conversation method and device for social network based on up-down Wen Jilian perception
CN117875395A (en) Training method, device and storage medium of multi-mode pre-training model
CN117392488A (en) Data processing method, neural network and related equipment
CN116012685B (en) Image description generation method based on fusion of relation sequence and visual sequence
WO2023116572A1 (en) Word or sentence generation method and related device
CN116704066A (en) Training method, training device, training terminal and training storage medium for image generation model
CN116758558A (en) Cross-modal generation countermeasure network-based image-text emotion classification method and system
CN113779244B (en) Document emotion classification method and device, storage medium and electronic equipment
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN115311598A (en) Video description generation system based on relation perception
CN115346132A (en) Method and device for detecting abnormal events of remote sensing images by multi-modal representation learning
CN114564568A (en) Knowledge enhancement and context awareness based dialog state tracking method and system
CN110442706B (en) Text abstract generation method, system, equipment and storage medium
Wang et al. TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant