CN116012685A - Image description generation method based on fusion of relation sequence and visual sequence - Google Patents

Image description generation method based on fusion of relation sequence and visual sequence

Info

Publication number
CN116012685A
Authority
CN
China
Prior art keywords
sequence
target
feature
image
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211642392.2A
Other languages
Chinese (zh)
Other versions
CN116012685B (en)
Inventor
张文凯
陈佳良
冯瑛超
李硕轲
李霁豪
杜润岩
周瑞雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Information Research Institute of CAS filed Critical Aerospace Information Research Institute of CAS
Priority to CN202211642392.2A priority Critical patent/CN116012685B/en
Publication of CN116012685A publication Critical patent/CN116012685A/en
Application granted granted Critical
Publication of CN116012685B publication Critical patent/CN116012685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of image processing, and discloses an image description generation method based on fusion of a relation sequence and a visual sequence. The method includes acquiring an initial image, generating a visual sequence v, and generating a relation sequence r. v and r are encoded to generate a corresponding first visual sequence feature v_1 and first relation sequence feature r_1. A first cross-attention encoding is performed on the first combined sequences to generate a second visual sequence feature v_2, and a second cross-attention encoding is performed on the second combined sequences to generate a second relation sequence feature r_2. Image description information of the initial image is generated according to v_2, r_2 and a fusion weight β. According to the invention, by adding the relation sequence between the target objects, the receptive field corresponding to the features can be increased, so that the interrelationships between the target objects can be acquired more clearly. Therefore, the generated image description information is more accurate and fine, with higher precision.

Description

Image description generation method based on fusion of relation sequence and visual sequence
Technical Field
The invention relates to the field of image processing, in particular to an image description generation method based on fusion of a relation sequence and a visual sequence.
Background
In the big data age, a large amount of image data requires a large amount of human resources to process. With the development of machine learning and deep learning technologies, target-centric image understanding tasks such as image classification, object detection and image segmentation have achieved good results. However, these tasks can only provide content information contained in the current image, such as the target category, the target position, or the pixel category to which a target belongs. Combining these contents to refine the subject matter and semantic information contained in the image, i.e. image semantic description (Image Captioning), remains a challenge. This task aims at a one-way conversion between the image and text modalities, converting an input image into a natural language description that conforms to grammatical rules and is consistent with the image content. Image semantic description has wide application scenarios. For example, in a massive image data management system under a remote sensing scenario, describing image semantics based on an understanding of the semantic topic of each image makes it more convenient to distinguish images that contain the same targets but have different semantic topics, given massive multi-target large-scene image data. Other applications include interpreting the illustration photos accompanying a newspaper article, providing a text description for a chart or a map, and providing scene descriptions for visually impaired people.
In the prior art, text descriptions of the semantic information of each image and the subject matter it expresses are generated by decoding the visual features of the corresponding targets. However, the prior-art methods suffer from low accuracy of the generated text descriptions.
Disclosure of Invention
Aiming at the technical problems, the invention adopts the following technical scheme:
according to one aspect of the present invention, there is provided an image description generation method based on fusion of a relational sequence with a visual sequence, the method comprising the steps of:
and acquiring an initial image, wherein the initial image comprises images corresponding to the N target objects. N is E [10,100].
And performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics. The initial target features include image features corresponding to the N target objects. Wherein each target object corresponds to a target frame.
And performing sequence encoding on the initial target features to generate a visual sequence v, wherein v comprises a visual sub-feature sequence corresponding to each target frame.
And acquiring a joint frame corresponding to each target object in the initial target features. The image area selected by the joint frame is larger than the image area selected by the corresponding target frame.
Feature extraction is performed on the image area selected by each joint frame using a ResNet152 network to generate a relation feature corresponding to each joint frame.
All the relation features are sequence-encoded to generate a relation sequence r, wherein r comprises a relation sub-feature sequence corresponding to each joint frame.
Self-attention encoding is performed on v and r respectively to generate a corresponding first visual sequence feature v_1 and first relation sequence feature r_1, wherein v_1 includes a first visual sub-feature sequence corresponding to each target frame, and r_1 includes a first relation sub-feature sequence corresponding to each joint frame.
The self-attention weight W in the self-attention encoding satisfies the following condition: [formula image]
where Ω is the geometric relationship feature between the two objects for which the self-attention calculation is performed in the self-attention encoding.
Each first visual sub-feature sequence in v_1 and the corresponding target relation sub-feature sequence in r_1 are combined into a first combined sequence corresponding to each first visual sub-feature sequence. The target relation sub-feature sequence is the set of first relation sub-feature sequences corresponding to the at least one joint frame overlapping the target frame corresponding to that first visual sub-feature sequence.
A first cross-attention encoding is performed on each first combined sequence to generate a second visual sequence feature v_2.
The self-attention W_i of the i-th first combined sequence in the first cross-attention encoding satisfies the following condition: [formula image]
where q_i is the i-th first visual sub-feature sequence and K_i is the target relation sub-feature sequence corresponding to q_i. The remaining symbols in the formula (given only as images in the original) denote, respectively: the image area selected by the target frame corresponding to q_i; the image area selected by each joint frame corresponding to K_i; and, in the first cross-attention encoding corresponding to the i-th first combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence.
Each first relation sub-feature sequence in r_1 and the corresponding target visual sub-feature sequence in v_1 are combined into a second combined sequence corresponding to each first relation sub-feature sequence. The target visual sub-feature sequence is the set of first visual sub-feature sequences corresponding to the at least one target frame that overlaps the joint frame corresponding to that first relation sub-feature sequence with an overlap area larger than the area threshold θ.
A second cross-attention encoding is performed on each second combined sequence to generate a second relation sequence feature r_2. The self-attention Y_i of the i-th second combined sequence in the second cross-attention encoding satisfies the following condition: [formula image]
where R_i is the i-th first relation sub-feature sequence and M_i is the target visual sub-feature sequence corresponding to R_i. The remaining symbols in the formula (given only as images in the original) denote, respectively: the image area selected by the joint frame corresponding to R_i; the image area selected by each target frame of M_i; and, in the second cross-attention encoding corresponding to the i-th second combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence.
A fusion weight β corresponding to the current time t is generated from the text sequence S = (w_0, w_1, …, w_{t-1}) generated before the current time t. β represents the fusion ratio corresponding to r_2, wherein w_{t-1} is the image description information of the initial image generated at time t-1.
Image description information w_t of the initial image corresponding to the current time t is generated according to v_2, r_2 and the fusion weight β corresponding to the current time t.
According to a second aspect of the present invention, there is provided a non-transitory computer readable storage medium storing a computer program which, when executed by a processor, implements an image description generation method based on fusion of a relational sequence with a visual sequence as described above.
According to a third aspect of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing a method of generating an image description based on fusion of a relational sequence with a visual sequence as described above when the computer program is executed by the processor.
The invention has at least the following beneficial effects:
the invention jointly generates the description text of the initial image by setting the two types of characteristics of the visual sequence and the relation sequence. By adding the relation sequence between the target objects, the receptive field corresponding to the features can be increased, and further the machine can be helped to obtain the interrelationship between the target objects more clearly. Therefore, finally, according to the visual sequence and the relation sequence, the generated image description information of the initial image is more accurate and fine, and the precision is higher.
In addition, in the invention, the self-attention calculation in the independent modes is respectively carried out on the visual sequence and the relation sequence, and the attention calculation between the modes is also carried out, so that the receptive fields of the obtained second visual sequence characteristics and the second relation sequence characteristics can be further increased, and further, the relation information between the richer characteristics can be contained. Correspondingly, the image description information of the initial image, which can be generated during decoding, is more accurate and fine, and the accuracy is higher.
Meanwhile, the fusion weight β corresponding to each time is determined according to S, and then the proportions in which v_2 and r_2 are respectively fused to generate the target fusion feature sequence F are determined according to β. Therefore, the input value of the dual-stream decoder corresponding to each time can be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an image description generating method based on fusion of a relationship sequence and a visual sequence according to an embodiment of the present invention.
FIG. 2 shows the experimental results of the model corresponding to the method of the present invention on the MSCOCO online test.
FIG. 3 shows the experimental results of the model corresponding to the method of the present invention on the MSCOCO Karpathy split.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Specifically, the method of the invention is implemented by adopting a multi-modal Transformer architecture as the baseline model. The model mainly comprises two parts: an encoder and a decoder. The general working process of the invention comprises (1) construction of the visual sequence and the relation sequence, (2) encoding by the dual-stream encoder, and (3) decoding by the dual-stream decoder according to the target fusion feature sequence F to generate the description information of the initial image.
As a possible embodiment of the present invention, as shown in fig. 1, there is provided an image description generating method based on fusion of a relationship sequence and a visual sequence, the method comprising the steps of:
s100, acquiring an initial image, wherein the initial image comprises images corresponding to N target objects. N is E [10,100].
S101, performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics. The initial target features include image features corresponding to the N target objects. Wherein each target object corresponds to a target frame.
Specifically, at the input end, a group of 2048-dimensional object features corresponding to N target objects are firstly extracted through a target detection network Fast-RCNN, and then the object features are mapped to 512 dimensions so as to adapt to the input dimension of the encoder.
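For illustration, the following PyTorch-style sketch shows one way this step could be realized: region features for the detected target frames are pooled to 2048 dimensions and then mapped to 512 dimensions. The use of a torchvision Faster R-CNN detector as a stand-in for the Fast-RCNN named in the text, the ResNet trunk used for pooling, and the names detector, trunk, proj and visual_sequence are assumptions of the sketch, not details taken from the original disclosure.

```python
import torch
import torchvision
from torch import nn

# Sketch only: the patent states that Fast-RCNN yields a 2048-dimensional
# feature per target frame, which is then mapped to 512 dimensions.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
backbone = torchvision.models.resnet152(weights="DEFAULT").eval()
trunk = nn.Sequential(*list(backbone.children())[:-2])   # conv trunk, 2048 channels, stride 32
proj = nn.Linear(2048, 512)                               # map to the encoder input dimension

@torch.no_grad()
def visual_sequence(image):                               # image: float tensor (3, H, W) in [0, 1]
    boxes = detector([image])[0]["boxes"]                 # detected target frames, shape (N, 4)
    fmap = trunk(image.unsqueeze(0))                      # (1, 2048, H/32, W/32)
    rois = torchvision.ops.roi_align(fmap, [boxes], output_size=1, spatial_scale=1.0 / 32)
    return proj(rois.flatten(1))                          # visual sequence v, shape (N, 512)
```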
S200, performing sequence encoding on the initial target features to generate a visual sequence v, wherein v comprises a visual sub-feature sequence corresponding to each target frame.
S300, generating a relation sequence r according to the initial target characteristics. Comprising the following steps:
S301, acquiring a joint frame corresponding to each target object in the initial target features. The image area selected by the joint frame is larger than the image area selected by the corresponding target frame.
Through the step, a joint frame corresponding to any two target objects in the initial target characteristics can be generated. And the image area contained in the joint box may represent a relationship between two target objects, such as a positional relationship. Therefore, feature extraction is carried out on the image area selected by the combined frame, and the generated relation feature can be ensured to contain the semantic feature of the relation between the two corresponding target objects.
S302, performing feature extraction on the image area selected by each joint frame using a ResNet152 network, and generating a relation feature corresponding to each joint frame.
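A minimal sketch of this step is given below, assuming the joint frame is the axis-aligned union of two target frames and that the cropped region is resized to 224×224 before being passed to ResNet152; the helpers union_box and relation_feature and the 2048-to-512 projection rel_proj are illustrative assumptions rather than details from the original disclosure.

```python
import torch
import torchvision
from torch import nn
from torch.nn.functional import interpolate

resnet = torchvision.models.resnet152(weights="DEFAULT").eval()
rel_backbone = nn.Sequential(*list(resnet.children())[:-1])   # ends with global average pooling
rel_proj = nn.Linear(2048, 512)                                # assumed projection to encoder dim

def union_box(box_a, box_b):
    """Joint frame enclosing two target frames; boxes are (x1, y1, x2, y2)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

@torch.no_grad()
def relation_feature(image, box_a, box_b):
    """Crop the joint-frame region of image (3, H, W) and encode it with ResNet152."""
    x1, y1, x2, y2 = (int(c) for c in union_box(box_a, box_b))
    crop = image[:, y1:y2, x1:x2].unsqueeze(0)
    crop = interpolate(crop, size=(224, 224), mode="bilinear", align_corners=False)
    return rel_proj(rel_backbone(crop).flatten(1))             # relation sub-feature, shape (1, 512)
```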
S303, performing sequence encoding on all the relation features to generate a relation sequence r, wherein r comprises a relation sub-feature sequence corresponding to each joint frame.
After the above steps, each of the relationship features in the relationship sequence has environmental information about the periphery of the target, while the features in the visual sequence mainly express specific details of a certain target.
S400, performing self-attention encoding on v and r respectively to generate a corresponding first visual sequence feature v_1 and first relation sequence feature r_1, wherein v_1 includes a first visual sub-feature sequence corresponding to each target frame, and r_1 includes a first relation sub-feature sequence corresponding to each joint frame.
Specifically, suppose for example that there are 10 target objects in the initial image, that is, N = 10. Then v is a visual sequence composed of the encoded sub-sequences (visual sub-feature sequences) of the images framed by the target frames corresponding to the 10 target objects respectively, i.e. the 10 visual sub-feature sequences are the 10 corresponding elements that make up the visual sequence.
Similarly, r is a relation sequence composed of coding subsequences (relation sub-feature sequences) of images framed by the joint frames corresponding to the 10 target objects respectively. I.e. the 10 relational sub-feature sequences are the 10 corresponding elements that make up the relational sequence.
The self-attention encoding in this step performs self-attention encoding on the 10 elements in v to generate v_1, where v_1 likewise contains 10 re-encoded elements in one-to-one correspondence with v.
Similarly, the 10 elements in r are self-attention encoded to generate r_1, where r_1 likewise contains 10 re-encoded elements in one-to-one correspondence with r.
The self-attention weight W in the self-attention encoding satisfies the following condition: [formula image]
where Ω is the geometric relationship feature between the two objects for which the self-attention calculation is performed in the self-attention encoding. Ω is a 4-dimensional vector; the 4 dimensions are specifically the distance in the X direction and the distance in the Y direction between the center points of the target frames corresponding to the two target objects for which the self-attention calculation is performed, and the ratio of the lengths and the ratio of the widths of the two target frames. The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values for the self-attention calculation, in a manner which is prior art and is not described in detail.
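The exact weighting formula is reproduced only as an image in the original publication. As a hedged illustration, the sketch below follows the commonly used relative-geometry attention form, in which the appearance term q·k/√d is biased by a learned transform of the 4-dimensional geometric feature Ω; the helpers box_geometry and geometry_aware_attention, the box format, and the specific bias log(ReLU(W_G·Ω)) are assumptions of the sketch, not the patent's formula.

```python
import torch
import torch.nn.functional as F

def box_geometry(boxes_q, boxes_k):
    """4-dimensional relative geometry feature for every query/key box pair.
    boxes_*: (N, 4) tensors in (cx, cy, w, h) format; the four dimensions are
    the centre offsets in x and y and the width and height ratios."""
    cxq, cyq, wq, hq = boxes_q.unbind(-1)
    cxk, cyk, wk, hk = boxes_k.unbind(-1)
    dx = (cxq[:, None] - cxk[None, :]) / wq[:, None]
    dy = (cyq[:, None] - cyk[None, :]) / hq[:, None]
    dw = torch.log(wq[:, None] / wk[None, :])
    dh = torch.log(hq[:, None] / hk[None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)              # (Nq, Nk, 4)

def geometry_aware_attention(q, k, v, omega, w_g):
    """Scaled dot-product attention biased by a geometric term (assumed form).
    q: (Nq, d), k/v: (Nk, d), omega: (Nq, Nk, 4), w_g: torch.nn.Linear(4, 1)."""
    d = q.size(-1)
    appearance = q @ k.t() / d ** 0.5                          # content-based weights
    geometric = torch.log(F.relu(w_g(omega)).squeeze(-1).clamp(min=1e-6))
    attn = torch.softmax(appearance + geometric, dim=-1)       # combined weight W
    return attn @ v

# Example: w_g = torch.nn.Linear(4, 1); out = geometry_aware_attention(q, k, v, box_geometry(bq, bk), w_g)
```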
S500, combining each first visual sub-feature sequence in v_1 with the corresponding target relation sub-feature sequence in r_1 to form a first combined sequence corresponding to each first visual sub-feature sequence. The target relation sub-feature sequence is the set of first relation sub-feature sequences corresponding to the at least one joint frame overlapping the target frame corresponding to that first visual sub-feature sequence.
Specifically, in this step, at least one target relation sub-feature sequence corresponding to each first visual sub-feature sequence is obtained according to the above conditions, so as to generate the corresponding first combined sequence. That is, each element of v_1 and several elements of r_1 form a new input sequence on which self-attention calculation is to be performed. For example, the 1st first visual sub-feature sequence of v_1 and the 3rd and 6th first relation sub-feature sequences of r_1 form a corresponding first combined sequence.
S501, performing a first cross-attention encoding on each first combined sequence to generate a second visual sequence feature v_2.
The self-attention W_i of the i-th first combined sequence in the first cross-attention encoding, i.e. the self-attention W_i of the i-th first visual sub-feature sequence, satisfies the following condition: [formula image]
where q_i is the i-th first visual sub-feature sequence and K_i is the target relation sub-feature sequence corresponding to q_i. The remaining symbols in the formula (given only as images in the original) denote, respectively: the image area selected by the target frame corresponding to q_i; the image area selected by each joint frame corresponding to K_i; and, in the first cross-attention encoding corresponding to the i-th first combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence.
For example, for the first combined sequence formed above, the geometric relationship features between the target frame corresponding to the 1st first visual sub-feature sequence and the joint frames corresponding to the 3rd and 6th first relation sub-feature sequences can be obtained.
The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values for the self-attention calculation, in a manner which is prior art and is not described in detail.
S600, combining each first relation sub-feature sequence in r_1 with the corresponding target visual sub-feature sequence in v_1 to form a second combined sequence corresponding to each first relation sub-feature sequence. The target visual sub-feature sequence is the set of first visual sub-feature sequences corresponding to the at least one target frame that overlaps the joint frame corresponding to that first relation sub-feature sequence with an overlap area larger than the area threshold θ.
The principle of formation of the second combined sequence in this step is similar to that in S500, except that the screening conditions are different, and will not be described here.
In practical use, in order to avoid introducing too many associated target frames and thereby too much noise, preferably θ = 0.3.
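As an illustrative sketch of how the target visual sub-feature sequences could be selected, the code below keeps the target frames whose overlap with a joint frame exceeds θ; normalizing the intersection area by the target-frame area so that θ = 0.3 acts as a ratio is an assumption of the sketch, as are the helper names overlap_area and target_visual_indices.

```python
def overlap_area(box_a, box_b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    return ix * iy

def target_visual_indices(joint_box, target_boxes, theta=0.3):
    """Indices of the target frames whose overlap with the joint frame exceeds theta.
    The intersection is normalised by the target-frame area (an assumption) so that
    theta behaves as a ratio, matching the preferred value theta = 0.3."""
    keep = []
    for i, tb in enumerate(target_boxes):
        area = max((tb[2] - tb[0]) * (tb[3] - tb[1]), 1e-6)
        if overlap_area(joint_box, tb) / area > theta:
            keep.append(i)
    return keep

# The second combined sequence for the i-th first relation sub-feature sequence is then
# r1[i] together with v1[j] for each j in target_visual_indices(joint_boxes[i], target_boxes).
```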
S601, performing a second cross-attention encoding on each second combined sequence to generate a second relation sequence feature r_2. The self-attention Y_i of the i-th second combined sequence in the second cross-attention encoding satisfies the following condition: [formula image]
where R_i is the i-th first relation sub-feature sequence and M_i is the target visual sub-feature sequence corresponding to R_i. The remaining symbols in the formula (given only as images in the original) denote, respectively: the image area selected by the joint frame corresponding to R_i; the image area selected by each target frame of M_i; and, in the second cross-attention encoding corresponding to the i-th second combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence.
The calculation in this step is the same as that in S501 described above, except that the input value for performing the cross-attention calculation is different.
The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values for the self-attention calculation, in a manner which is prior art and is not described in detail. The attention operators are expanded in a multi-head attention manner: each operator performs its calculation in its own feature subspace, and the calculation results are then concatenated to form the final output.
S100-S600 are mainly implemented by the dual-stream encoder, i.e. the co-representation learning encoder.
Since each of the relational features in the sequence of relationships has environmental information about the surroundings of the object after processing in S100-S300, the features in the visual sequence mainly express specific details of a certain object. Thus, through the following S500-S700, both can be enabled to mutually compensate for information by means of cross-attention during execution of the attention mechanism.
The present invention divides the vision and relationship attention calculation process into two stages. In the first self-attention computation phase, the visual sequence and the relationship sequence each execute an attention operator. Whereby the model first learns intra-modal interactions on visual and relational modalities, respectively. Then in a second self-attention computing phase, the visual sequence and the relation sequence are interacted to execute attention operators respectively. Thereby, the visual features and the relational features can be mutually utilized to further promote the representation.
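A compact sketch of the two-stage attention described above is given below, using standard multi-head attention layers; the dimensions, the use of nn.MultiheadAttention, the omission of the geometric bias term, and attending over the full other-modality sequence instead of only the overlapping boxes are simplifications assumed for illustration, not the patent's exact construction.

```python
import torch
from torch import nn

class DualStreamEncoderLayer(nn.Module):
    """Sketch of the two-stage attention: intra-modal self-attention first,
    then cross-attention between the visual and relation streams."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_r = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_r = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, v, r):                  # v, r: (batch, N, d_model)
        # Stage 1: intra-modal self-attention on the visual and relation sequences.
        v1, _ = self.self_v(v, v, v)
        r1, _ = self.self_r(r, r, r)
        # Stage 2: cross-attention; each visual sub-feature attends to relation
        # sub-features and vice versa (the patent restricts keys to overlapping
        # boxes, whereas full attention over the other modality is used here).
        v2, _ = self.cross_v(v1, r1, r1)
        r2, _ = self.cross_r(r1, v1, v1)
        return v2, r2
```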
S700, generating the fusion weight β corresponding to the current time t from the text sequence S = (w_0, w_1, …, w_{t-1}) generated before the current time t. β represents the fusion ratio corresponding to r_2, wherein w_{t-1} is the image description information of the initial image generated at time t-1.
S800, generating the image description information w_t of the initial image corresponding to the current time t according to v_2, r_2 and the fusion weight β corresponding to the current time t.
Steps S700-S800 are mainly implemented by the dual-stream decoder, also referred to as the co-representation learning decoder. Specifically, in the process of generating the description sentence word by word with the visual-relation dual-stream decoder, when the description word of the initial image at time t is generated, the content already generated, S = (w_0, w_1, …, w_{t-1}), is used to determine the word category (visual vocabulary or relational vocabulary) that needs to be generated at the current time t. The ratio of the visual sequence to the relation sequence, i.e. the fusion weight β, is then determined according to this part-of-speech category, and finally the image description information w_t of the initial image corresponding to the current time t is generated according to v_2, r_2 and the fusion weight β corresponding to the current time t.
Thus, the fusion weight β corresponding to each time is determined according to S, and then the proportions in which v_2 and r_2 are respectively fused to generate the target fusion feature sequence F are determined according to β. Therefore, the input value of the dual-stream decoder corresponding to each time can be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
The invention jointly generates the description text of the initial image by setting the two types of characteristics of the visual sequence and the relation sequence. By adding the relation sequence between the target objects, the receptive field corresponding to the features can be increased, and further the machine can be helped to obtain the interrelationship between the target objects more clearly. Therefore, finally, according to the visual sequence and the relation sequence, the generated image description information of the initial image is more accurate and fine, and the precision is higher.
In addition, in the invention, the self-attention calculation in the independent modes is respectively carried out on the visual sequence and the relation sequence, and the attention calculation between the modes is also carried out, so that the receptive fields of the obtained second visual sequence characteristics and the second relation sequence characteristics can be further increased, and further, the relation information between the richer characteristics can be contained. Correspondingly, the image description information of the initial image, which can be generated during decoding, is more accurate and fine, and the accuracy is higher.
As a possible embodiment of the present invention, S301, obtaining a joint box corresponding to each target object in the initial target feature includes:
S311, counting the co-occurrence value sets A_1, A_2, …, A_i, …, A_z of each category in the MSCOCO dataset, A_i = (A_{i1}, A_{i2}, …, A_{im}, …, A_{iz}), wherein A_i is the co-occurrence value set for the i-th category, A_{im} is the co-occurrence value between the i-th category and the m-th category, i.e. the total number of co-occurrences of targets of the i-th category and targets of the m-th category in all images of the MSCOCO dataset, z is the total number of categories in the MSCOCO dataset, and i = 1, 2, …, z.
S321, determining a joint object corresponding to each target object from the initial target features according to A_1, A_2, …, A_i, …, A_z. The joint object is the other target object in the initial target features that has the largest co-occurrence value with the target object.
S331, generating a joint frame corresponding to each target object according to each target object and the corresponding joint object. The image areas selected by the joint frame comprise image areas corresponding to the target object and the joint object respectively.
In S301 of the above embodiment, the generated joint box contains all the pairwise relationships in the initial image. If used directly to learn the relational feature map, a significant amount of time and computing resources are consumed due to the large amount of data.
Meanwhile, since all pairwise relationships are generated, there also exist erroneous relationships that do not accord with common sense. For example, suppose the target objects include a boat, a person, water and a hat. The relationship between "person" and "boat" and the relationship between "water" and "boat" frequently occur together in reality, and are therefore reasonable, correct relationships. However, the relationship between "hat" and "boat" and the relationship between "hat" and "water" hardly ever occur together in reality, and are therefore unreasonable, erroneous relationships. It is thus necessary to remove the noise relationships (erroneous relationships) from all the pairwise relationships, so as to improve the generalization and effectiveness of the resulting joint frames.
In this embodiment, the denoising process is performed according to the co-occurrence value, so as to solve the above-described problems.
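For illustration, a sketch of the co-occurrence statistics and joint-object selection is given below; counting each unordered category pair once per image and the helper names cooccurrence_matrix and joint_object are assumptions of the sketch rather than details from the original disclosure.

```python
from itertools import combinations

def cooccurrence_matrix(image_category_lists, num_classes):
    """A[i][m] counts how often categories i and m appear in the same image
    across the dataset (e.g. built from MSCOCO annotations)."""
    A = [[0] * num_classes for _ in range(num_classes)]
    for cats in image_category_lists:
        for i, m in combinations(sorted(set(cats)), 2):
            A[i][m] += 1
            A[m][i] += 1
    return A

def joint_object(target_idx, detected_cats, A):
    """Among the other detected targets, pick the one whose category has the
    largest co-occurrence value with the category of the given target."""
    ci = detected_cats[target_idx]
    others = [j for j in range(len(detected_cats)) if j != target_idx]
    return max(others, key=lambda j: A[ci][detected_cats[j]])
```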
As a possible embodiment of the invention, S700, generating the fusion weight β corresponding to the current time t from the text sequence S = (w_0, w_1, …, w_{t-1}) generated before the current time t, includes:
S701, inputting S into a first MLP (multi-layer perceptron) to generate a part-of-speech probability corresponding to each piece of image description information in S. The part-of-speech probability is the probability of the part-of-speech category corresponding to r_2.
Further, S701 includes:
s711, inputting S into the first full connection layer. And generating part-of-speech features corresponding to each piece of image description information in S. The part-of-speech feature represents the likelihood that the corresponding image description information is a relational vocabulary.
And S721, normalizing all the part-of-speech features using a first sigmoid activation function to generate a part-of-speech probability corresponding to each part-of-speech feature. Each part-of-speech probability lies within the preset value interval [0, 1].
S702, inputting all part-of-speech probabilities corresponding to S into a second MLP to generate the fusion weight β corresponding to r_2.
Further, the second MLP includes a second fully connected layer. The second full-connection layer is used for carrying out weighted average processing on all part-of-speech probabilities to generate fusion weight characteristics.
Further, the second MLP further includes a second sigmoid activation function. The second sigmoid activation function is used to generate β from the fusion weight features.
In this embodiment, β is generated from the fusion weight feature by the two MLPs, which together constitute a gate function.
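The gate function could be sketched as follows; the word-embedding input, the fixed maximum sequence length used to realize the weighted average as a linear layer, and the class name FusionGate are assumptions rather than details from the original disclosure.

```python
import torch
from torch import nn

class FusionGate(nn.Module):
    """Two-MLP gate: fc1 + sigmoid scores each generated word (part-of-speech
    probability of being a relational word); fc2 + sigmoid turns the scores
    into the fusion weight beta in (0, 1). Dimensions are assumptions."""

    def __init__(self, d_word=512, max_len=20):
        super().__init__()
        self.fc1 = nn.Linear(d_word, 1)     # part-of-speech feature per word
        self.fc2 = nn.Linear(max_len, 1)    # weighted average over the sequence
        self.max_len = max_len

    def forward(self, word_embeddings):     # (t, d_word), embeddings of w_0 .. w_{t-1}, t <= max_len
        p = torch.sigmoid(self.fc1(word_embeddings)).squeeze(-1)   # part-of-speech probabilities, (t,)
        padded = torch.zeros(self.max_len)
        padded[: p.size(0)] = p                                    # pad to a fixed length
        return torch.sigmoid(self.fc2(padded))                     # fusion weight beta
```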
As a possible embodiment of the invention, S800, generating the image description information w_t of the initial image corresponding to the current time t according to v_2, r_2 and the fusion weight β corresponding to the current time t, includes:
S801, fusing v_2 and r_2 according to β to generate a target fusion feature sequence F. F satisfies the following condition:
F = β * r_2 + (1 - β) * v_2
S802, decoding according to S and F to generate the image description information w_t of the initial image corresponding to the current time t.
In this embodiment, the fusion weight β corresponding to each time is determined according to S, and then the proportions in which v_2 and r_2 are respectively fused to generate the target fusion feature sequence F are determined according to β. Therefore, the input value of the dual-stream decoder corresponding to each time can be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
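Putting the fusion and the word-by-word decoding together, a schematic loop is given below; gate, decoder, embed and end_token are hypothetical components, and the element-wise alignment of v_2 and r_2 so that F = β * r_2 + (1 - β) * v_2 can be taken directly is an assumption that follows the formula in S801.

```python
import torch

def fuse(v2, r2, beta):
    """Target fusion feature sequence F = beta * r2 + (1 - beta) * v2.
    Assumes v2 and r2 have already been brought to a common shape."""
    return beta * r2 + (1.0 - beta) * v2

def generate(v2, r2, gate, decoder, embed, end_token, max_steps=20):
    """Schematic decoding loop: beta is recomputed at every step from the words
    generated so far, so the fused decoder input is adjusted dynamically.
    gate, decoder, embed and end_token are hypothetical components."""
    words = []
    for _ in range(max_steps):
        beta = gate(embed(words)) if words else torch.tensor(0.5)  # no prefix yet at t = 0
        fused = fuse(v2, r2, beta)
        w_t = decoder(fused, words)                                # next word given F and the prefix S
        words.append(w_t)
        if w_t == end_token:
            break
    return words
```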
Fig. 2 and Fig. 3 show test data for various performance indexes obtained by testing the model of the method of the present invention with two conventional evaluation protocols. Specifically, Fig. 2 shows the results of the online test (MSCOCO online test), and Fig. 3 shows the results of the offline test (MSCOCO Karpathy split).
The test results corresponding to the method of the present invention are those of the model described above; the other entries are the test results of existing related models. According to the test results, essentially all indexes of the method of the present invention are improved to some extent, and the method performs better than the existing methods.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium that may be disposed in an electronic device to store at least one instruction or at least one program for implementing the method embodiments, the at least one instruction or the at least one program being loaded and executed by a processor to implement the method provided by the embodiments described above.
Embodiments of the present invention also provide an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.
Embodiments of the present invention also provide a computer program product comprising program code for causing an electronic device to carry out the steps of the method according to the various exemplary embodiments of the invention described in the present specification when the program product is run on the electronic device.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. An image description generation method based on fusion of a relation sequence and a visual sequence is characterized by comprising the following steps:
acquiring an initial image, wherein the initial image comprises images corresponding to N target objects; N ∈ [10, 100];
performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics; the initial target features comprise image features corresponding to N target objects; wherein each target object corresponds to a target frame;
performing sequence encoding on the initial target features to generate a visual sequence v, wherein v comprises a visual sub-feature sequence corresponding to each target frame;
acquiring a joint frame corresponding to each target object in the initial target characteristics; the image area selected by the combined frame is larger than the image area selected by the corresponding target frame;
extracting features of the image area selected by each combined frame by using a ResNet152 network; generating a corresponding relation characteristic of each joint frame;
performing sequence encoding on all the relation features to generate a relation sequence r, wherein r comprises a relation sub-feature sequence corresponding to each joint frame;
performing self-attention encoding on v and r respectively to generate a corresponding first visual sequence feature v_1 and first relation sequence feature r_1; wherein v_1 comprises a first visual sub-feature sequence corresponding to each target frame, and r_1 comprises a first relation sub-feature sequence corresponding to each joint frame;
the self-attention weight W in the self-attention encoding satisfies the following condition: [formula image]
wherein Ω is the geometric relationship feature between the two objects for which the self-attention calculation is performed in the self-attention encoding;
combining each first visual sub-feature sequence in v_1 with the corresponding target relation sub-feature sequence in r_1 into a first combined sequence corresponding to each first visual sub-feature sequence; the target relation sub-feature sequence is the set of first relation sub-feature sequences corresponding to the at least one joint frame overlapping the target frame corresponding to that first visual sub-feature sequence;
performing a first cross-attention encoding on each first combined sequence to generate a second visual sequence feature v_2;
the self-attention W_i of the i-th first combined sequence in the first cross-attention encoding satisfies the following condition: [formula image]
wherein q_i is the i-th first visual sub-feature sequence and K_i is the target relation sub-feature sequence corresponding to q_i; the remaining symbols in the formula (given only as images in the original) denote, respectively, the image area selected by the target frame corresponding to q_i, the image area selected by each joint frame corresponding to K_i, and, in the first cross-attention encoding corresponding to the i-th first combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence;
combining each first relation sub-feature sequence in r_1 with the corresponding target visual sub-feature sequence in v_1 into a second combined sequence corresponding to each first relation sub-feature sequence; the target visual sub-feature sequence is the set of first visual sub-feature sequences corresponding to the at least one target frame that overlaps the joint frame corresponding to that first relation sub-feature sequence with an overlap area larger than the area threshold θ;
performing a second cross-attention encoding on each second combined sequence to generate a second relation sequence feature r_2; the self-attention Y_i of the i-th second combined sequence in the second cross-attention encoding satisfies the following condition: [formula image]
wherein R_i is the i-th first relation sub-feature sequence and M_i is the target visual sub-feature sequence corresponding to R_i; the remaining symbols in the formula (given only as images in the original) denote, respectively, the image area selected by the joint frame corresponding to R_i, the image area selected by each target frame of M_i, and, in the second cross-attention encoding corresponding to the i-th second combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence;
generating a fusion weight β corresponding to the current time t from the text sequence S = (w_0, w_1, …, w_{t-1}) generated before the current time t; β represents the fusion ratio corresponding to r_2; wherein w_{t-1} is the image description information of the initial image generated at time t-1;
generating image description information w_t of the initial image corresponding to the current time t according to v_2, r_2 and the fusion weight β corresponding to the current time t.
2. The method of claim 1, wherein acquiring a joint frame corresponding to each target object in the initial target features comprises:
counting the co-occurrence value sets A_1, A_2, …, A_i, …, A_z of each category in the MSCOCO dataset, A_i = (A_{i1}, A_{i2}, …, A_{im}, …, A_{iz}), wherein A_i is the co-occurrence value set for the i-th category; A_{im} is the co-occurrence value between the i-th category and the m-th category, namely the total number of co-occurrences of targets of the i-th category and targets of the m-th category in all images of the MSCOCO dataset; z is the total number of categories in the MSCOCO dataset; i = 1, 2, …, z;
determining a joint object corresponding to each target object from the initial target features according to A_1, A_2, …, A_i, …, A_z; the joint object is the other target object in the initial target features having the largest co-occurrence value with the target object;
generating a joint frame corresponding to each target object according to each target object and the corresponding joint object; the image areas selected by the joint frame comprise image areas corresponding to the target object and the joint object respectively.
3. The method according to claim 1, wherein generating the fusion weight β corresponding to the current time t from the text sequence S = (w_0, w_1, …, w_{t-1}) generated before the current time t comprises:
inputting S into a first MLP to generate a part-of-speech probability corresponding to each piece of image description information in S; the part-of-speech probability is the probability of the part-of-speech category corresponding to r_2;
inputting all part-of-speech probabilities corresponding to S into a second MLP to generate the fusion weight β corresponding to r_2.
4. The method according to claim 3, wherein generating the image description information w_t of the initial image corresponding to the current time t according to v_2, r_2 and the fusion weight β corresponding to the current time t comprises:
fusing v_2 and r_2 according to β to generate a target fusion feature sequence F, F satisfying the following condition: F = β * r_2 + (1 - β) * v_2;
decoding according to S and F to generate the image description information w_t of the initial image corresponding to the current time t.
5. The method of claim 3, wherein inputting S into the first MLP to generate a part-of-speech probability for each image description information in S comprises:
inputting S into a first full connection layer; generating part-of-speech features corresponding to each piece of image description information in S;
normalizing all the part-of-speech features by using a first sigmoid activation function to generate part-of-speech probability corresponding to each part-of-speech feature; each part-of-speech probability is within a preset numerical interval.
6. The method of claim 3, wherein the second MLP comprises a second fully connected layer;
and the second full-connection layer is used for carrying out weighted average processing on all part-of-speech probabilities to generate fusion weight characteristics.
7. The method of claim 6, wherein the second MLP further comprises a second sigmoid activation function;
the second sigmoid activation function is used for generating beta according to the fusion weight feature.
8. The method of claim 1, wherein θ = 0.3.
9. A non-transitory computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a method of generating an image description based on fusion of a relational sequence with a visual sequence according to any one of claims 1 to 8.
10. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements a method of generating an image description based on fusion of a relational sequence with a visual sequence according to any one of claims 1 to 8.
CN202211642392.2A 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence Active CN116012685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211642392.2A CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211642392.2A CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Publications (2)

Publication Number Publication Date
CN116012685A true CN116012685A (en) 2023-04-25
CN116012685B CN116012685B (en) 2023-06-16

Family

ID=86029043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211642392.2A Active CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Country Status (1)

Country Link
CN (1) CN116012685B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444968A (en) * 2020-03-30 2020-07-24 哈尔滨工程大学 Image description generation method based on attention fusion
CN113609326A (en) * 2021-08-25 2021-11-05 广西师范大学 Image description generation method based on external knowledge and target relation
WO2021223323A1 (en) * 2020-05-06 2021-11-11 首都师范大学 Image content automatic description method based on construction of chinese visual vocabulary list
CN115311598A (en) * 2022-07-29 2022-11-08 复旦大学 Video description generation system based on relation perception

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444968A (en) * 2020-03-30 2020-07-24 哈尔滨工程大学 Image description generation method based on attention fusion
WO2021223323A1 (en) * 2020-05-06 2021-11-11 首都师范大学 Image content automatic description method based on construction of chinese visual vocabulary list
CN113609326A (en) * 2021-08-25 2021-11-05 广西师范大学 Image description generation method based on external knowledge and target relation
CN115311598A (en) * 2022-07-29 2022-11-08 复旦大学 Video description generation system based on relation perception

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗会兰; 岳亮亮: "跨层多模型特征融合与因果卷积解码的图像描述" [Image description with cross-layer multi-model feature fusion and causal convolution decoding], 中国图象图形学报 (Journal of Image and Graphics), no. 08, pages 96-109 *

Also Published As

Publication number Publication date
CN116012685B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
US11093560B2 (en) Stacked cross-modal matching
US20190220691A1 (en) Segmentation of Data
CN109344404B (en) Context-aware dual-attention natural language reasoning method
GB2571825A (en) Semantic class localization digital environment
CN110390363A (en) A kind of Image Description Methods
CN111368993A (en) Data processing method and related equipment
CN111984772B (en) Medical image question-answering method and system based on deep learning
WO2021212601A1 (en) Image-based writing assisting method and apparatus, medium, and device
US20220215159A1 (en) Sentence paraphrase method and apparatus, and method and apparatus for training sentence paraphrase model
CN114926835A (en) Text generation method and device, and model training method and device
CN114612767B (en) Scene graph-based image understanding and expressing method, system and storage medium
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN116611024A (en) Multi-mode trans mock detection method based on facts and emotion oppositivity
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
US20220188636A1 (en) Meta pseudo-labels
CN117392488A (en) Data processing method, neural network and related equipment
WO2023116572A1 (en) Word or sentence generation method and related device
CN111783475A (en) Semantic visual positioning method and device based on phrase relation propagation
CN116704066A (en) Training method, training device, training terminal and training storage medium for image generation model
CN116012685B (en) Image description generation method based on fusion of relation sequence and visual sequence
CN113779244B (en) Document emotion classification method and device, storage medium and electronic equipment
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN115311598A (en) Video description generation system based on relation perception
CN115346132A (en) Method and device for detecting abnormal events of remote sensing images by multi-modal representation learning
CN110442706B (en) Text abstract generation method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant