CN116012685A - Image description generation method based on fusion of relation sequence and visual sequence - Google Patents

Image description generation method based on fusion of relation sequence and visual sequence

Info

Publication number
CN116012685A
Authority
CN
China
Prior art keywords
sequence
target
feature
image
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211642392.2A
Other languages
Chinese (zh)
Other versions
CN116012685B (en)
Inventor
张文凯
陈佳良
冯瑛超
李硕轲
李霁豪
杜润岩
周瑞雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Information Research Institute of CAS filed Critical Aerospace Information Research Institute of CAS
Priority to CN202211642392.2A priority Critical patent/CN116012685B/en
Publication of CN116012685A publication Critical patent/CN116012685A/en
Application granted granted Critical
Publication of CN116012685B publication Critical patent/CN116012685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of image processing, and discloses an image description generation method based on fusion of a relation sequence and a visual sequence. The method includes acquiring an initial image, generating a visual sequence v, and generating a relation sequence r. v and r are encoded to generate a corresponding first visual sequence feature v_1 and first relation sequence feature r_1. A first cross-attention encoding is performed on the first combined sequences to generate a second visual sequence feature v_2, and a second cross-attention encoding is performed on the second combined sequences to generate a second relation sequence feature r_2. Image description information of the initial image is generated according to v_2, r_2 and a fusion weight β. According to the invention, by adding the relation sequence between the target objects, the receptive field corresponding to the features can be increased, so that the interrelationships between the target objects can be acquired more clearly. Therefore, the generated image description information is more accurate and fine, with higher precision.

Description

Image description generation method based on fusion of relation sequence and visual sequence
Technical Field
The invention relates to the field of image processing, in particular to an image description generation method based on fusion of a relation sequence and a visual sequence.
Background
In the big data age, a large amount of image data requires a large amount of human resources to process. With the development of machine learning and deep learning technologies, target-centric image understanding tasks such as image classification, object detection and image segmentation have achieved good results. However, these tasks can only provide content information contained in the current image, such as the target category, the target position, or the pixel category to which a target belongs. Combining these contents to refine the subject matter and semantic information contained in the image, i.e. image semantic description (Image Captioning), remains a challenge. This task aims at a one-way conversion between the image and text modalities, converting an input image into a natural language description that conforms to grammatical rules and is consistent with the image content. Image semantic description has wide application scenarios. For example, in a massive image data management system under a remote sensing scenario, describing image semantics based on an understanding of the semantic topic of each image makes it more convenient to distinguish images that contain the same targets but have different semantic topics, given massive multi-target large-scene image data. Other applications include interpreting the illustration photos accompanying a newspaper article, providing a text description for a chart or a map, and providing scene descriptions for visually impaired people.
In the prior art, text descriptions of the semantic information of each image and the subject matter it expresses are generated by decoding the visual features of the corresponding targets. However, the prior-art methods suffer from low accuracy of the generated text descriptions.
Disclosure of Invention
Aiming at the technical problems, the invention adopts the following technical scheme:
according to one aspect of the present invention, there is provided an image description generation method based on fusion of a relational sequence with a visual sequence, the method comprising the steps of:
and acquiring an initial image, wherein the initial image comprises images corresponding to the N target objects. N is E [10,100].
And performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics. The initial target features include image features corresponding to the N target objects. Wherein each target object corresponds to a target frame.
And performing sequence encoding on the initial target features to generate a visual sequence v, wherein v comprises a visual sub-feature sequence corresponding to each target frame.
And acquiring a joint frame corresponding to each target object in the initial target features. The image area selected by the joint frame is larger than the image area selected by the corresponding target frame.
Feature extraction is performed on the image area selected by each joint frame using a ResNet152 network to generate a relation feature corresponding to each joint frame.
All the relation features are sequence-encoded to generate a relation sequence r, wherein r comprises a relation sub-feature sequence corresponding to each joint frame.
Self-attention encoding is performed on v and r respectively to generate a corresponding first visual sequence feature v_1 and first relation sequence feature r_1, wherein v_1 includes a first visual sub-feature sequence corresponding to each target frame, and r_1 includes a first relation sub-feature sequence corresponding to each joint frame.
The self-attention weight W in the self-attention encoding satisfies the following condition: [formula image]
where Ω is the geometric relationship feature between the two objects for which the self-attention calculation is performed in the self-attention encoding.
Each first visual sub-feature sequence in v_1 and the corresponding target relation sub-feature sequence in r_1 are combined into a first combined sequence corresponding to each first visual sub-feature sequence. The target relation sub-feature sequence is the set of first relation sub-feature sequences corresponding to the at least one joint frame overlapping the target frame corresponding to that first visual sub-feature sequence.
A first cross-attention encoding is performed on each first combined sequence to generate a second visual sequence feature v_2.
The self-attention W_i of the i-th first combined sequence in the first cross-attention encoding satisfies the following condition: [formula image]
where q_i is the i-th first visual sub-feature sequence and K_i is the target relation sub-feature sequence corresponding to q_i. The remaining symbols in the formula (given only as images in the original) denote, respectively: the image area selected by the target frame corresponding to q_i; the image area selected by each joint frame corresponding to K_i; and, in the first cross-attention encoding corresponding to the i-th first combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence.
Each first relation sub-feature sequence in r_1 and the corresponding target visual sub-feature sequence in v_1 are combined into a second combined sequence corresponding to each first relation sub-feature sequence. The target visual sub-feature sequence is the set of first visual sub-feature sequences corresponding to the at least one target frame that overlaps the joint frame corresponding to that first relation sub-feature sequence with an overlap area larger than the area threshold θ.
A second cross-attention encoding is performed on each second combined sequence to generate a second relation sequence feature r_2. The self-attention Y_i of the i-th second combined sequence in the second cross-attention encoding satisfies the following condition: [formula image]
where R_i is the i-th first relation sub-feature sequence and M_i is the target visual sub-feature sequence corresponding to R_i. The remaining symbols in the formula (given only as images in the original) denote, respectively: the image area selected by the joint frame corresponding to R_i; the image area selected by each target frame of M_i; and, in the second cross-attention encoding corresponding to the i-th second combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence.
A fusion weight β corresponding to the current time t is generated from the text sequence S = (w_0, w_1, …, w_{t-1}) generated before the current time t. β represents the fusion ratio corresponding to r_2, wherein w_{t-1} is the image description information of the initial image generated at time t-1.
Image description information w_t of the initial image corresponding to the current time t is generated according to v_2, r_2 and the fusion weight β corresponding to the current time t.
According to a second aspect of the present invention, there is provided a non-transitory computer readable storage medium storing a computer program which, when executed by a processor, implements an image description generation method based on fusion of a relational sequence with a visual sequence as described above.
According to a third aspect of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing a method of generating an image description based on fusion of a relational sequence with a visual sequence as described above when the computer program is executed by the processor.
The invention has at least the following beneficial effects:
the invention jointly generates the description text of the initial image by setting the two types of characteristics of the visual sequence and the relation sequence. By adding the relation sequence between the target objects, the receptive field corresponding to the features can be increased, and further the machine can be helped to obtain the interrelationship between the target objects more clearly. Therefore, finally, according to the visual sequence and the relation sequence, the generated image description information of the initial image is more accurate and fine, and the precision is higher.
In addition, in the invention, the self-attention calculation in the independent modes is respectively carried out on the visual sequence and the relation sequence, and the attention calculation between the modes is also carried out, so that the receptive fields of the obtained second visual sequence characteristics and the second relation sequence characteristics can be further increased, and further, the relation information between the richer characteristics can be contained. Correspondingly, the image description information of the initial image, which can be generated during decoding, is more accurate and fine, and the accuracy is higher.
Meanwhile, the fusion weight β corresponding to each time is determined according to S, and then the proportions in which v_2 and r_2 are respectively fused to generate the target fusion feature sequence F are determined according to β. Therefore, the input value of the dual-stream decoder corresponding to each time can be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an image description generating method based on fusion of a relationship sequence and a visual sequence according to an embodiment of the present invention.
FIG. 2 shows the experimental results of the model corresponding to the method of the present invention on the MSCOCO online test.
FIG. 3 shows the experimental results of the model corresponding to the method of the present invention on the MSCOCO Karpathy split.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Specifically, the method of the invention is implemented by adopting a multi-modal Transformer architecture as the baseline model. The model mainly comprises two parts: an encoder and a decoder. The general working process of the invention comprises (1) construction of the visual sequence and the relation sequence, (2) encoding by the dual-stream encoder, and (3) decoding by the dual-stream decoder according to the target fusion feature sequence F to generate the description information of the initial image.
As a possible embodiment of the present invention, as shown in fig. 1, there is provided an image description generating method based on fusion of a relationship sequence and a visual sequence, the method comprising the steps of:
s100, acquiring an initial image, wherein the initial image comprises images corresponding to N target objects. N is E [10,100].
S101, performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics. The initial target features include image features corresponding to the N target objects. Wherein each target object corresponds to a target frame.
Specifically, at the input end, a group of 2048-dimensional object features corresponding to N target objects are firstly extracted through a target detection network Fast-RCNN, and then the object features are mapped to 512 dimensions so as to adapt to the input dimension of the encoder.
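For illustration, the following PyTorch-style sketch shows one way this step could be realized: region features for the detected target frames are pooled to 2048 dimensions and then mapped to 512 dimensions. The use of a torchvision Faster R-CNN detector as a stand-in for the Fast-RCNN named in the text, the ResNet trunk used for pooling, and the names detector, trunk, proj and visual_sequence are assumptions of the sketch, not details taken from the original disclosure.

```python
import torch
import torchvision
from torch import nn

# Sketch only: the patent states that Fast-RCNN yields a 2048-dimensional
# feature per target frame, which is then mapped to 512 dimensions.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
backbone = torchvision.models.resnet152(weights="DEFAULT").eval()
trunk = nn.Sequential(*list(backbone.children())[:-2])   # conv trunk, 2048 channels, stride 32
proj = nn.Linear(2048, 512)                               # map to the encoder input dimension

@torch.no_grad()
def visual_sequence(image):                               # image: float tensor (3, H, W) in [0, 1]
    boxes = detector([image])[0]["boxes"]                 # detected target frames, shape (N, 4)
    fmap = trunk(image.unsqueeze(0))                      # (1, 2048, H/32, W/32)
    rois = torchvision.ops.roi_align(fmap, [boxes], output_size=1, spatial_scale=1.0 / 32)
    return proj(rois.flatten(1))                          # visual sequence v, shape (N, 512)
```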
S200, performing sequence encoding on the initial target features to generate a visual sequence v, wherein v comprises a visual sub-feature sequence corresponding to each target frame.
S300, generating a relation sequence r according to the initial target characteristics. Comprising the following steps:
S301, acquiring a joint frame corresponding to each target object in the initial target features. The image area selected by the joint frame is larger than the image area selected by the corresponding target frame.
Through the step, a joint frame corresponding to any two target objects in the initial target characteristics can be generated. And the image area contained in the joint box may represent a relationship between two target objects, such as a positional relationship. Therefore, feature extraction is carried out on the image area selected by the combined frame, and the generated relation feature can be ensured to contain the semantic feature of the relation between the two corresponding target objects.
S302, performing feature extraction on the image area selected by each joint frame using a ResNet152 network, and generating a relation feature corresponding to each joint frame.
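A minimal sketch of this step is given below, assuming the joint frame is the axis-aligned union of two target frames and that the cropped region is resized to 224×224 before being passed to ResNet152; the helpers union_box and relation_feature and the 2048-to-512 projection rel_proj are illustrative assumptions rather than details from the original disclosure.

```python
import torch
import torchvision
from torch import nn
from torch.nn.functional import interpolate

resnet = torchvision.models.resnet152(weights="DEFAULT").eval()
rel_backbone = nn.Sequential(*list(resnet.children())[:-1])   # ends with global average pooling
rel_proj = nn.Linear(2048, 512)                                # assumed projection to encoder dim

def union_box(box_a, box_b):
    """Joint frame enclosing two target frames; boxes are (x1, y1, x2, y2)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

@torch.no_grad()
def relation_feature(image, box_a, box_b):
    """Crop the joint-frame region of image (3, H, W) and encode it with ResNet152."""
    x1, y1, x2, y2 = (int(c) for c in union_box(box_a, box_b))
    crop = image[:, y1:y2, x1:x2].unsqueeze(0)
    crop = interpolate(crop, size=(224, 224), mode="bilinear", align_corners=False)
    return rel_proj(rel_backbone(crop).flatten(1))             # relation sub-feature, shape (1, 512)
```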
S303, performing sequence encoding on all the relation features to generate a relation sequence r, wherein r comprises a relation sub-feature sequence corresponding to each joint frame.
After the above steps, each of the relationship features in the relationship sequence has environmental information about the periphery of the target, while the features in the visual sequence mainly express specific details of a certain target.
S400, performing self-attention encoding on v and r respectively to generate a corresponding first visual sequence feature v_1 and first relation sequence feature r_1, wherein v_1 includes a first visual sub-feature sequence corresponding to each target frame, and r_1 includes a first relation sub-feature sequence corresponding to each joint frame.
Specifically, suppose for example that there are 10 target objects in the initial image, that is, N = 10. Then v is a visual sequence composed of the encoded sub-sequences (visual sub-feature sequences) of the images framed by the target frames corresponding to the 10 target objects respectively, i.e. the 10 visual sub-feature sequences are the 10 corresponding elements that make up the visual sequence.
Similarly, r is a relation sequence composed of coding subsequences (relation sub-feature sequences) of images framed by the joint frames corresponding to the 10 target objects respectively. I.e. the 10 relational sub-feature sequences are the 10 corresponding elements that make up the relational sequence.
The self-attention encoding in this step performs self-attention encoding on the 10 elements in v to generate v_1, where v_1 likewise contains 10 re-encoded elements in one-to-one correspondence with v.
Similarly, the 10 elements in r are self-attention encoded to generate r_1, where r_1 likewise contains 10 re-encoded elements in one-to-one correspondence with r.
The self-attention weight W in the self-attention encoding satisfies the following condition: [formula image]
where Ω is the geometric relationship feature between the two objects for which the self-attention calculation is performed in the self-attention encoding. Ω is a 4-dimensional vector; the 4 dimensions are specifically the distance in the X direction and the distance in the Y direction between the center points of the target frames corresponding to the two target objects for which the self-attention calculation is performed, and the ratio of the lengths and the ratio of the widths of the two target frames. The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values for the self-attention calculation, in a manner which is prior art and is not described in detail.
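The exact weighting formula is reproduced only as an image in the original publication. As a hedged illustration, the sketch below follows the commonly used relative-geometry attention form, in which the appearance term q·k/√d is biased by a learned transform of the 4-dimensional geometric feature Ω; the helpers box_geometry and geometry_aware_attention, the box format, and the specific bias log(ReLU(W_G·Ω)) are assumptions of the sketch, not the patent's formula.

```python
import torch
import torch.nn.functional as F

def box_geometry(boxes_q, boxes_k):
    """4-dimensional relative geometry feature for every query/key box pair.
    boxes_*: (N, 4) tensors in (cx, cy, w, h) format; the four dimensions are
    the centre offsets in x and y and the width and height ratios."""
    cxq, cyq, wq, hq = boxes_q.unbind(-1)
    cxk, cyk, wk, hk = boxes_k.unbind(-1)
    dx = (cxq[:, None] - cxk[None, :]) / wq[:, None]
    dy = (cyq[:, None] - cyk[None, :]) / hq[:, None]
    dw = torch.log(wq[:, None] / wk[None, :])
    dh = torch.log(hq[:, None] / hk[None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)              # (Nq, Nk, 4)

def geometry_aware_attention(q, k, v, omega, w_g):
    """Scaled dot-product attention biased by a geometric term (assumed form).
    q: (Nq, d), k/v: (Nk, d), omega: (Nq, Nk, 4), w_g: torch.nn.Linear(4, 1)."""
    d = q.size(-1)
    appearance = q @ k.t() / d ** 0.5                          # content-based weights
    geometric = torch.log(F.relu(w_g(omega)).squeeze(-1).clamp(min=1e-6))
    attn = torch.softmax(appearance + geometric, dim=-1)       # combined weight W
    return attn @ v

# Example: w_g = torch.nn.Linear(4, 1); out = geometry_aware_attention(q, k, v, box_geometry(bq, bk), w_g)
```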
S500, combining each first visual sub-feature sequence in v_1 with the corresponding target relation sub-feature sequence in r_1 to form a first combined sequence corresponding to each first visual sub-feature sequence. The target relation sub-feature sequence is the set of first relation sub-feature sequences corresponding to the at least one joint frame overlapping the target frame corresponding to that first visual sub-feature sequence.
Specifically, in this step, at least one target relation sub-feature sequence corresponding to each first visual sub-feature sequence is obtained according to the above conditions, so as to generate the corresponding first combined sequence. That is, each element of v_1 and several elements of r_1 form a new input sequence on which self-attention calculation is to be performed. For example, the 1st first visual sub-feature sequence of v_1 and the 3rd and 6th first relation sub-feature sequences of r_1 form a corresponding first combined sequence.
S501, performing a first cross-attention encoding on each first combined sequence to generate a second visual sequence feature v_2.
The self-attention W_i of the i-th first combined sequence in the first cross-attention encoding, i.e. the self-attention W_i of the i-th first visual sub-feature sequence, satisfies the following condition: [formula image]
where q_i is the i-th first visual sub-feature sequence and K_i is the target relation sub-feature sequence corresponding to q_i. The remaining symbols in the formula (given only as images in the original) denote, respectively: the image area selected by the target frame corresponding to q_i; the image area selected by each joint frame corresponding to K_i; and, in the first cross-attention encoding corresponding to the i-th first combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence.
For example, for the first combined sequence formed above, the geometric relationship features between the target frame corresponding to the 1st first visual sub-feature sequence and the joint frames corresponding to the 3rd and 6th first relation sub-feature sequences can be obtained.
The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values for the self-attention calculation, in a manner which is prior art and is not described in detail.
S600, combining each first relation sub-feature sequence in r_1 with the corresponding target visual sub-feature sequence in v_1 to form a second combined sequence corresponding to each first relation sub-feature sequence. The target visual sub-feature sequence is the set of first visual sub-feature sequences corresponding to the at least one target frame that overlaps the joint frame corresponding to that first relation sub-feature sequence with an overlap area larger than the area threshold θ.
The principle of formation of the second combined sequence in this step is similar to that in S500, except that the screening conditions are different, and will not be described here.
In practical use, in order to avoid introducing too many associated target frames and thereby too much noise, preferably θ = 0.3.
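As an illustrative sketch of how the target visual sub-feature sequences could be selected, the code below keeps the target frames whose overlap with a joint frame exceeds θ; normalizing the intersection area by the target-frame area so that θ = 0.3 acts as a ratio is an assumption of the sketch, as are the helper names overlap_area and target_visual_indices.

```python
def overlap_area(box_a, box_b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    return ix * iy

def target_visual_indices(joint_box, target_boxes, theta=0.3):
    """Indices of the target frames whose overlap with the joint frame exceeds theta.
    The intersection is normalised by the target-frame area (an assumption) so that
    theta behaves as a ratio, matching the preferred value theta = 0.3."""
    keep = []
    for i, tb in enumerate(target_boxes):
        area = max((tb[2] - tb[0]) * (tb[3] - tb[1]), 1e-6)
        if overlap_area(joint_box, tb) / area > theta:
            keep.append(i)
    return keep

# The second combined sequence for the i-th first relation sub-feature sequence is then
# r1[i] together with v1[j] for each j in target_visual_indices(joint_boxes[i], target_boxes).
```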
S601, performing a second cross-attention encoding on each second combined sequence to generate a second relation sequence feature r_2. The self-attention Y_i of the i-th second combined sequence in the second cross-attention encoding satisfies the following condition: [formula image]
where R_i is the i-th first relation sub-feature sequence and M_i is the target visual sub-feature sequence corresponding to R_i. The remaining symbols in the formula (given only as images in the original) denote, respectively: the image area selected by the joint frame corresponding to R_i; the image area selected by each target frame of M_i; and, in the second cross-attention encoding corresponding to the i-th second combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence.
The calculation in this step is the same as that in S501 described above, except that the input value for performing the cross-attention calculation is different.
The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values for the self-attention calculation, in a manner which is prior art and is not described in detail. The attention operators are expanded in a multi-head attention manner: each operator performs its calculation in its own feature subspace, and the calculation results are then concatenated to form the final output.
S100-S600 are mainly implemented by the dual-stream encoder, i.e. the co-representation learning encoder.
Since each of the relational features in the sequence of relationships has environmental information about the surroundings of the object after processing in S100-S300, the features in the visual sequence mainly express specific details of a certain object. Thus, through the following S500-S700, both can be enabled to mutually compensate for information by means of cross-attention during execution of the attention mechanism.
The present invention divides the vision and relationship attention calculation process into two stages. In the first self-attention computation phase, the visual sequence and the relationship sequence each execute an attention operator. Whereby the model first learns intra-modal interactions on visual and relational modalities, respectively. Then in a second self-attention computing phase, the visual sequence and the relation sequence are interacted to execute attention operators respectively. Thereby, the visual features and the relational features can be mutually utilized to further promote the representation.
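A compact sketch of the two-stage attention described above is given below, using standard multi-head attention layers; the dimensions, the use of nn.MultiheadAttention, the omission of the geometric bias term, and attending over the full other-modality sequence instead of only the overlapping boxes are simplifications assumed for illustration, not the patent's exact construction.

```python
import torch
from torch import nn

class DualStreamEncoderLayer(nn.Module):
    """Sketch of the two-stage attention: intra-modal self-attention first,
    then cross-attention between the visual and relation streams."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_r = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_r = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, v, r):                  # v, r: (batch, N, d_model)
        # Stage 1: intra-modal self-attention on the visual and relation sequences.
        v1, _ = self.self_v(v, v, v)
        r1, _ = self.self_r(r, r, r)
        # Stage 2: cross-attention; each visual sub-feature attends to relation
        # sub-features and vice versa (the patent restricts keys to overlapping
        # boxes, whereas full attention over the other modality is used here).
        v2, _ = self.cross_v(v1, r1, r1)
        r2, _ = self.cross_r(r1, v1, v1)
        return v2, r2
```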
S700, generating the fusion weight β corresponding to the current time t from the text sequence S = (w_0, w_1, …, w_{t-1}) generated before the current time t. β represents the fusion ratio corresponding to r_2, wherein w_{t-1} is the image description information of the initial image generated at time t-1.
S800, generating the image description information w_t of the initial image corresponding to the current time t according to v_2, r_2 and the fusion weight β corresponding to the current time t.
Steps S700-S800 are mainly implemented by the dual-stream decoder, also referred to as the co-representation learning decoder. Specifically, in the process of generating the description sentence word by word with the visual-relation dual-stream decoder, when the description word of the initial image at time t is generated, the content already generated, S = (w_0, w_1, …, w_{t-1}), is used to determine the word category (visual vocabulary or relational vocabulary) that needs to be generated at the current time t. The ratio of the visual sequence to the relation sequence, i.e. the fusion weight β, is then determined according to this part-of-speech category, and finally the image description information w_t of the initial image corresponding to the current time t is generated according to v_2, r_2 and the fusion weight β corresponding to the current time t.
Thus, the fusion weight β corresponding to each time is determined according to S, and then the proportions in which v_2 and r_2 are respectively fused to generate the target fusion feature sequence F are determined according to β. Therefore, the input value of the dual-stream decoder corresponding to each time can be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
The invention jointly generates the description text of the initial image by setting the two types of characteristics of the visual sequence and the relation sequence. By adding the relation sequence between the target objects, the receptive field corresponding to the features can be increased, and further the machine can be helped to obtain the interrelationship between the target objects more clearly. Therefore, finally, according to the visual sequence and the relation sequence, the generated image description information of the initial image is more accurate and fine, and the precision is higher.
In addition, in the invention, the self-attention calculation in the independent modes is respectively carried out on the visual sequence and the relation sequence, and the attention calculation between the modes is also carried out, so that the receptive fields of the obtained second visual sequence characteristics and the second relation sequence characteristics can be further increased, and further, the relation information between the richer characteristics can be contained. Correspondingly, the image description information of the initial image, which can be generated during decoding, is more accurate and fine, and the accuracy is higher.
As a possible embodiment of the present invention, S301, obtaining a joint box corresponding to each target object in the initial target feature includes:
S311, counting the co-occurrence value sets A_1, A_2, …, A_i, …, A_z of each category in the MSCOCO dataset, A_i = (A_{i1}, A_{i2}, …, A_{im}, …, A_{iz}), wherein A_i is the co-occurrence value set for the i-th category, A_{im} is the co-occurrence value between the i-th category and the m-th category, i.e. the total number of co-occurrences of targets of the i-th category and targets of the m-th category in all images of the MSCOCO dataset, z is the total number of categories in the MSCOCO dataset, and i = 1, 2, …, z.
S321, determining a joint object corresponding to each target object from the initial target features according to A_1, A_2, …, A_i, …, A_z. The joint object is the other target object in the initial target features that has the largest co-occurrence value with the target object.
S331, generating a joint frame corresponding to each target object according to each target object and the corresponding joint object. The image areas selected by the joint frame comprise image areas corresponding to the target object and the joint object respectively.
In S301 of the above embodiment, the generated joint box contains all the pairwise relationships in the initial image. If used directly to learn the relational feature map, a significant amount of time and computing resources are consumed due to the large amount of data.
Meanwhile, since all pairwise relationships are generated, there also exist erroneous relationships that do not accord with common sense. For example, suppose the target objects include a boat, a person, water and a hat. The relationship between "person" and "boat" and the relationship between "water" and "boat" frequently occur together in reality, and are therefore reasonable, correct relationships. However, the relationship between "hat" and "boat" and the relationship between "hat" and "water" hardly ever occur together in reality, and are therefore unreasonable, erroneous relationships. It is thus necessary to remove the noise relationships (erroneous relationships) from all the pairwise relationships, so as to improve the generalization and effectiveness of the resulting joint frames.
In this embodiment, the denoising process is performed according to the co-occurrence value, so as to solve the above-described problems.
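For illustration, a sketch of the co-occurrence statistics and joint-object selection is given below; counting each unordered category pair once per image and the helper names cooccurrence_matrix and joint_object are assumptions of the sketch rather than details from the original disclosure.

```python
from itertools import combinations

def cooccurrence_matrix(image_category_lists, num_classes):
    """A[i][m] counts how often categories i and m appear in the same image
    across the dataset (e.g. built from MSCOCO annotations)."""
    A = [[0] * num_classes for _ in range(num_classes)]
    for cats in image_category_lists:
        for i, m in combinations(sorted(set(cats)), 2):
            A[i][m] += 1
            A[m][i] += 1
    return A

def joint_object(target_idx, detected_cats, A):
    """Among the other detected targets, pick the one whose category has the
    largest co-occurrence value with the category of the given target."""
    ci = detected_cats[target_idx]
    others = [j for j in range(len(detected_cats)) if j != target_idx]
    return max(others, key=lambda j: A[ci][detected_cats[j]])
```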
As a possible embodiment of the invention, S700, generating the fusion weight β corresponding to the current time t from the text sequence S = (w_0, w_1, …, w_{t-1}) generated before the current time t, includes:
S701, inputting S into a first MLP (multi-layer perceptron) to generate a part-of-speech probability corresponding to each piece of image description information in S. The part-of-speech probability is the probability of the part-of-speech category corresponding to r_2.
Further, S701 includes:
s711, inputting S into the first full connection layer. And generating part-of-speech features corresponding to each piece of image description information in S. The part-of-speech feature represents the likelihood that the corresponding image description information is a relational vocabulary.
And S721, normalizing all the part-of-speech features using a first sigmoid activation function to generate a part-of-speech probability corresponding to each part-of-speech feature. Each part-of-speech probability lies within the preset value interval [0, 1].
S702, inputting all part-of-speech probabilities corresponding to S into a second MLP to generate the fusion weight β corresponding to r_2.
Further, the second MLP includes a second fully connected layer. The second full-connection layer is used for carrying out weighted average processing on all part-of-speech probabilities to generate fusion weight characteristics.
Further, the second MLP further includes a second sigmoid activation function. The second sigmoid activation function is used to generate β from the fusion weight features.
In this embodiment, β is generated from the fusion weight feature by the two MLPs, which together constitute a gate function.
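The gate function could be sketched as follows; the word-embedding input, the fixed maximum sequence length used to realize the weighted average as a linear layer, and the class name FusionGate are assumptions rather than details from the original disclosure.

```python
import torch
from torch import nn

class FusionGate(nn.Module):
    """Two-MLP gate: fc1 + sigmoid scores each generated word (part-of-speech
    probability of being a relational word); fc2 + sigmoid turns the scores
    into the fusion weight beta in (0, 1). Dimensions are assumptions."""

    def __init__(self, d_word=512, max_len=20):
        super().__init__()
        self.fc1 = nn.Linear(d_word, 1)     # part-of-speech feature per word
        self.fc2 = nn.Linear(max_len, 1)    # weighted average over the sequence
        self.max_len = max_len

    def forward(self, word_embeddings):     # (t, d_word), embeddings of w_0 .. w_{t-1}, t <= max_len
        p = torch.sigmoid(self.fc1(word_embeddings)).squeeze(-1)   # part-of-speech probabilities, (t,)
        padded = torch.zeros(self.max_len)
        padded[: p.size(0)] = p                                    # pad to a fixed length
        return torch.sigmoid(self.fc2(padded))                     # fusion weight beta
```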
As a possible embodiment of the invention, S800, generating the image description information w_t of the initial image corresponding to the current time t according to v_2, r_2 and the fusion weight β corresponding to the current time t, includes:
S801, fusing v_2 and r_2 according to β to generate a target fusion feature sequence F. F satisfies the following condition:
F = β * r_2 + (1 - β) * v_2
S802, decoding according to S and F to generate the image description information w_t of the initial image corresponding to the current time t.
In this embodiment, the fusion weight β corresponding to each time is determined according to S, and then the proportions in which v_2 and r_2 are respectively fused to generate the target fusion feature sequence F are determined according to β. Therefore, the input value of the dual-stream decoder corresponding to each time can be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
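Putting the fusion and the word-by-word decoding together, a schematic loop is given below; gate, decoder, embed and end_token are hypothetical components, and the element-wise alignment of v_2 and r_2 so that F = β * r_2 + (1 - β) * v_2 can be taken directly is an assumption that follows the formula in S801.

```python
import torch

def fuse(v2, r2, beta):
    """Target fusion feature sequence F = beta * r2 + (1 - beta) * v2.
    Assumes v2 and r2 have already been brought to a common shape."""
    return beta * r2 + (1.0 - beta) * v2

def generate(v2, r2, gate, decoder, embed, end_token, max_steps=20):
    """Schematic decoding loop: beta is recomputed at every step from the words
    generated so far, so the fused decoder input is adjusted dynamically.
    gate, decoder, embed and end_token are hypothetical components."""
    words = []
    for _ in range(max_steps):
        beta = gate(embed(words)) if words else torch.tensor(0.5)  # no prefix yet at t = 0
        fused = fuse(v2, r2, beta)
        w_t = decoder(fused, words)                                # next word given F and the prefix S
        words.append(w_t)
        if w_t == end_token:
            break
    return words
```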
Fig. 2 and Fig. 3 show test data for various performance indexes obtained by testing the model of the method of the present invention with two conventional evaluation protocols. Specifically, Fig. 2 shows the results of the online test (MSCOCO online test), and Fig. 3 shows the results of the offline test (MSCOCO Karpathy split).
The test results corresponding to the method of the present invention are those of the model described above; the other entries are the test results of existing related models. According to the test results, essentially all indexes of the method of the present invention are improved to some extent, and the method performs better than the existing methods.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium that may be disposed in an electronic device to store at least one instruction or at least one program for implementing the method embodiments, the at least one instruction or the at least one program being loaded and executed by a processor to implement the method provided by the embodiments described above.
Embodiments of the present invention also provide an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.
Embodiments of the present invention also provide a computer program product comprising program code for causing an electronic device to carry out the steps of the method according to the various exemplary embodiments of the invention described in the present specification when the program product is run on the electronic device.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. An image description generation method based on fusion of a relation sequence and a visual sequence is characterized by comprising the following steps:
acquiring an initial image, wherein the initial image comprises images corresponding to N target objects; N ∈ [10, 100];
performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics; the initial target features comprise image features corresponding to N target objects; wherein each target object corresponds to a target frame;
performing sequence encoding on the initial target features to generate a visual sequence v, wherein v comprises a visual sub-feature sequence corresponding to each target frame;
acquiring a joint frame corresponding to each target object in the initial target characteristics; the image area selected by the combined frame is larger than the image area selected by the corresponding target frame;
extracting features of the image area selected by each combined frame by using a ResNet152 network; generating a corresponding relation characteristic of each joint frame;
performing sequence encoding on all the relation features to generate a relation sequence r, wherein r comprises a relation sub-feature sequence corresponding to each joint frame;
performing self-attention encoding on v and r respectively to generate a corresponding first visual sequence feature v_1 and first relation sequence feature r_1; wherein v_1 comprises a first visual sub-feature sequence corresponding to each target frame, and r_1 comprises a first relation sub-feature sequence corresponding to each joint frame;
the self-attention weight W in the self-attention encoding satisfies the following condition: [formula image]
wherein Ω is the geometric relationship feature between the two objects for which the self-attention calculation is performed in the self-attention encoding;
combining each first visual sub-feature sequence in v_1 with the corresponding target relation sub-feature sequence in r_1 into a first combined sequence corresponding to each first visual sub-feature sequence; the target relation sub-feature sequence is the set of first relation sub-feature sequences corresponding to the at least one joint frame overlapping the target frame corresponding to that first visual sub-feature sequence;
performing a first cross-attention encoding on each first combined sequence to generate a second visual sequence feature v_2;
the self-attention W_i of the i-th first combined sequence in the first cross-attention encoding satisfies the following condition: [formula image]
wherein q_i is the i-th first visual sub-feature sequence and K_i is the target relation sub-feature sequence corresponding to q_i; the remaining symbols in the formula (given only as images in the original) denote, respectively, the image area selected by the target frame corresponding to q_i, the image area selected by each joint frame corresponding to K_i, and, in the first cross-attention encoding corresponding to the i-th first combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence;
combining each first relation sub-feature sequence in r_1 with the corresponding target visual sub-feature sequence in v_1 into a second combined sequence corresponding to each first relation sub-feature sequence; the target visual sub-feature sequence is the set of first visual sub-feature sequences corresponding to the at least one target frame that overlaps the joint frame corresponding to that first relation sub-feature sequence with an overlap area larger than the area threshold θ;
performing a second cross-attention encoding on each second combined sequence to generate a second relation sequence feature r_2; the self-attention Y_i of the i-th second combined sequence in the second cross-attention encoding satisfies the following condition: [formula image]
wherein R_i is the i-th first relation sub-feature sequence and M_i is the target visual sub-feature sequence corresponding to R_i; the remaining symbols in the formula (given only as images in the original) denote, respectively, the image area selected by the joint frame corresponding to R_i, the image area selected by each target frame of M_i, and, in the second cross-attention encoding corresponding to the i-th second combined sequence, the geometric relationship features between the two objects for which the self-attention calculation is performed and the dimension of each feature sequence in the corresponding key sequence;
generating a fusion weight β corresponding to the current time t from the text sequence S = (w_0, w_1, …, w_{t-1}) generated before the current time t; β represents the fusion ratio corresponding to r_2; wherein w_{t-1} is the image description information of the initial image generated at time t-1;
generating image description information w_t of the initial image corresponding to the current time t according to v_2, r_2 and the fusion weight β corresponding to the current time t.
2. The method of claim 1, wherein acquiring a joint frame corresponding to each target object in the initial target features comprises:
counting the co-occurrence value sets A_1, A_2, …, A_i, …, A_z of each category in the MSCOCO dataset, A_i = (A_{i1}, A_{i2}, …, A_{im}, …, A_{iz}), wherein A_i is the co-occurrence value set for the i-th category; A_{im} is the co-occurrence value between the i-th category and the m-th category, namely the total number of co-occurrences of targets of the i-th category and targets of the m-th category in all images of the MSCOCO dataset; z is the total number of categories in the MSCOCO dataset; i = 1, 2, …, z;
determining a joint object corresponding to each target object from the initial target features according to A_1, A_2, …, A_i, …, A_z; the joint object is the other target object in the initial target features having the largest co-occurrence value with the target object;
generating a joint frame corresponding to each target object according to each target object and the corresponding joint object; the image areas selected by the joint frame comprise image areas corresponding to the target object and the joint object respectively.
3. The method according to claim 1, wherein generating the fusion weight β corresponding to the current time t from the text sequence S = (w_0, w_1, …, w_{t-1}) generated before the current time t comprises:
inputting S into a first MLP to generate a part-of-speech probability corresponding to each piece of image description information in S; the part-of-speech probability is the probability of the part-of-speech category corresponding to r_2;
inputting all part-of-speech probabilities corresponding to S into a second MLP to generate the fusion weight β corresponding to r_2.
4. The method according to claim 3, wherein generating the image description information w_t of the initial image corresponding to the current time t according to v_2, r_2 and the fusion weight β corresponding to the current time t comprises:
fusing v_2 and r_2 according to β to generate a target fusion feature sequence F, F satisfying the following condition: F = β * r_2 + (1 - β) * v_2;
decoding according to S and F to generate the image description information w_t of the initial image corresponding to the current time t.
5. The method of claim 3, wherein inputting S into the first MLP to generate a part-of-speech probability for each image description information in S comprises:
inputting S into a first full connection layer; generating part-of-speech features corresponding to each piece of image description information in S;
normalizing all the part-of-speech features by using a first sigmoid activation function to generate part-of-speech probability corresponding to each part-of-speech feature; each part-of-speech probability is within a preset numerical interval.
6. The method of claim 3, wherein the second MLP comprises a second fully connected layer;
and the second full-connection layer is used for carrying out weighted average processing on all part-of-speech probabilities to generate fusion weight characteristics.
7. The method of claim 6, wherein the second MLP further comprises a second sigmoid activation function;
the second sigmoid activation function is used for generating beta according to the fusion weight feature.
8. The method of claim 1, wherein θ = 0.3.
9. A non-transitory computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a method of generating an image description based on fusion of a relational sequence with a visual sequence according to any one of claims 1 to 8.
10. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements a method of generating an image description based on fusion of a relational sequence with a visual sequence according to any one of claims 1 to 8.
CN202211642392.2A 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence Active CN116012685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211642392.2A CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211642392.2A CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Publications (2)

Publication Number Publication Date
CN116012685A true CN116012685A (en) 2023-04-25
CN116012685B CN116012685B (en) 2023-06-16

Family

ID=86029043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211642392.2A Active CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Country Status (1)

Country Link
CN (1) CN116012685B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444968A (en) * 2020-03-30 2020-07-24 哈尔滨工程大学 Image description generation method based on attention fusion
CN113609326A (en) * 2021-08-25 2021-11-05 广西师范大学 Image description generation method based on external knowledge and target relation
WO2021223323A1 (en) * 2020-05-06 2021-11-11 首都师范大学 Image content automatic description method based on construction of chinese visual vocabulary list
CN115311598A (en) * 2022-07-29 2022-11-08 复旦大学 Video description generation system based on relation perception

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444968A (en) * 2020-03-30 2020-07-24 哈尔滨工程大学 Image description generation method based on attention fusion
WO2021223323A1 (en) * 2020-05-06 2021-11-11 首都师范大学 Image content automatic description method based on construction of chinese visual vocabulary list
CN113609326A (en) * 2021-08-25 2021-11-05 广西师范大学 Image description generation method based on external knowledge and target relation
CN115311598A (en) * 2022-07-29 2022-11-08 复旦大学 Video description generation system based on relation perception

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗会兰; 岳亮亮: "跨层多模型特征融合与因果卷积解码的图像描述" [Image description with cross-layer multi-model feature fusion and causal convolution decoding], 中国图象图形学报 (Journal of Image and Graphics), no. 08, pages 96-109 *

Also Published As

Publication number Publication date
CN116012685B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
US11093560B2 (en) Stacked cross-modal matching
US20190220691A1 (en) Segmentation of Data
CN109344404B (en) Context-aware dual-attention natural language reasoning method
GB2571825A (en) Semantic class localization digital environment
CN110390363A (en) A kind of Image Description Methods
CN111368993A (en) Data processing method and related equipment
CN111984772B (en) Medical image question-answering method and system based on deep learning
WO2021212601A1 (en) Image-based writing assisting method and apparatus, medium, and device
US20220215159A1 (en) Sentence paraphrase method and apparatus, and method and apparatus for training sentence paraphrase model
CN114926835A (en) Text generation method and device, and model training method and device
CN114612767B (en) Scene graph-based image understanding and expressing method, system and storage medium
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN116611024A (en) Multi-mode trans mock detection method based on facts and emotion oppositivity
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
US20220188636A1 (en) Meta pseudo-labels
CN117392488A (en) Data processing method, neural network and related equipment
WO2023116572A1 (en) Word or sentence generation method and related device
CN111783475A (en) Semantic visual positioning method and device based on phrase relation propagation
CN116704066A (en) Training method, training device, training terminal and training storage medium for image generation model
CN116012685B (en) Image description generation method based on fusion of relation sequence and visual sequence
CN113779244B (en) Document emotion classification method and device, storage medium and electronic equipment
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN115311598A (en) Video description generation system based on relation perception
CN115346132A (en) Method and device for detecting abnormal events of remote sensing images by multi-modal representation learning
CN110442706B (en) Text abstract generation method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant