CN116012685B - Image description generation method based on fusion of relation sequence and visual sequence

Image description generation method based on fusion of relation sequence and visual sequence

Info

Publication number
CN116012685B
CN116012685B
Authority
CN
China
Prior art keywords
sequence
target
feature
image
sub
Prior art date
Legal status
Active
Application number
CN202211642392.2A
Other languages
Chinese (zh)
Other versions
CN116012685A (en)
Inventor
张文凯
陈佳良
冯瑛超
李硕轲
李霁豪
杜润岩
周瑞雪
Current Assignee
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date
Filing date
Publication date
Application filed by Aerospace Information Research Institute of CAS
Priority to CN202211642392.2A
Publication of CN116012685A
Application granted
Publication of CN116012685B

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the field of image processing and discloses an image description generation method based on the fusion of a relation sequence and a visual sequence. The method includes acquiring an initial image, generating a visual sequence v and a relation sequence r, and encoding v and r to generate a corresponding first visual sequence feature v1 and first relation sequence feature r1. First cross-attention encoding is performed on the first combined sequences to generate a second visual sequence feature v2, and second cross-attention encoding is performed on the second combined sequences to generate a second relation sequence feature r2. Image description information of the initial image is then generated according to v2, r2 and the fusion weight β. By adding the relation sequence between the target objects, the invention increases the receptive field corresponding to the features, so that the interrelationships between the target objects can be captured more clearly. The generated image description information is therefore more accurate and fine-grained, with higher precision.

Description

Image description generation method based on fusion of relation sequence and visual sequence
Technical Field
The invention relates to the field of image processing, in particular to an image description generation method based on fusion of a relation sequence and a visual sequence.
Background
In the big data age, large amounts of image data require large amounts of human resources to process. With the development of machine learning and deep learning technologies, target-centric image understanding tasks such as image classification, object detection and image segmentation have achieved good results. However, these tasks can only provide content information contained in the current image, such as the target category, the target position, or the pixel category to which a target belongs. Combining these contents to extract the subject matter and semantic information contained in the image, i.e. image semantic description (image captioning), remains a challenge. This task aims at a one-way conversion between the image and text modalities, converting an input image into a natural language description that conforms to grammar rules and is consistent with the image content. The technology of image semantic description has wide application scenarios. For example, in a massive image data management system under a remote sensing scenario, describing image semantics based on an understanding of the semantic topics of the images makes it more convenient to distinguish images that contain the same targets but have different semantic topics. Other applications include interpreting the illustration photos accompanying a newspaper article, providing a text description for a chart or a map, and providing scene descriptions for visually impaired people.
In the prior art, the text description of the semantic information of each image and of the subject matter it expresses is generated by decoding the visual features corresponding to the targets. However, the prior-art methods suffer from low accuracy of the generated text descriptions.
Disclosure of Invention
Aiming at the technical problems, the invention adopts the following technical scheme:
according to one aspect of the present invention, there is provided an image description generation method based on fusion of a relational sequence with a visual sequence, the method comprising the steps of:
and acquiring an initial image, wherein the initial image comprises images corresponding to the N target objects. N is E [10,100].
And performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics. The initial target features include image features corresponding to the N target objects. Wherein each target object corresponds to a target frame.
Sequence encoding is performed on the initial target features to generate a visual sequence v, wherein v includes a visual sub-feature sequence corresponding to each target frame.
Acquiring a joint frame corresponding to each target object in the initial target features. The image area selected by the joint frame is larger than the image area selected by the corresponding target frame.
Feature extraction is performed on each joint-frame-selected image region using a ResNet152 network to generate the relation feature corresponding to each joint frame.
All the relation features are sequence-encoded to generate a relation sequence r, wherein r includes a relation sub-feature sequence corresponding to each joint frame.
Self-attention encoding is performed on v and r respectively to generate a corresponding first visual sequence feature v1 and first relation sequence feature r1, wherein v1 includes a first visual sub-feature sequence corresponding to each target frame and r1 includes a first relation sub-feature sequence corresponding to each joint frame.
The self-attention weight W in the self-attention encoding satisfies a geometry-aware condition that depends on the geometric relationship feature between the two objects for which self-attention is computed.
Each first visual sub-feature sequence in v1 is combined with its corresponding target relation sub-feature sequence in r1 into a first combined sequence corresponding to that first visual sub-feature sequence. The target relation sub-feature sequence is the collection of first relation sub-feature sequences corresponding to at least one joint frame overlapping the target frame corresponding to the first visual sub-feature sequence.
First cross-attention encoding is performed on each first combined sequence to generate a second visual sequence feature v2. The self-attention of the i-th first combined sequence in the first cross-attention encoding satisfies a condition defined over: the i-th first visual sub-feature sequence; the target relation sub-feature sequence corresponding to it; the image area selected by its corresponding target frame; the image area selected by each of its corresponding joint frames; and, in the first cross-attention encoding corresponding to the i-th first combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence.
Each first relation sub-feature sequence in r1 is combined with its corresponding target visual sub-feature sequence in v1 into a second combined sequence corresponding to that first relation sub-feature sequence. The target visual sub-feature sequence is the collection of first visual sub-feature sequences corresponding to at least one target frame that overlaps the joint frame corresponding to the first relation sub-feature sequence with an overlap area larger than the area threshold λ.
Second cross-attention encoding is performed on each second combined sequence to generate a second relation sequence feature r2. The self-attention of the i-th second combined sequence in the second cross-attention encoding satisfies a condition defined over: the i-th first relation sub-feature sequence; the target visual sub-feature sequence corresponding to it; the image area selected by its corresponding joint frame; the image area selected by each of its corresponding target frames; and, in the second cross-attention encoding corresponding to the i-th second combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence.
According to the text sequence S generated before the current time t, a fusion weight β corresponding to the current time t is generated; β indicates the fusion ratio to be applied at the current time t. Here, S is the image description information of the initial image correspondingly generated at time t-1.
According to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t is generated.
According to a second aspect of the present invention, there is provided a non-transitory computer readable storage medium storing a computer program which, when executed by a processor, implements an image description generation method based on fusion of a relational sequence with a visual sequence as described above.
According to a third aspect of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing a method of generating an image description based on fusion of a relational sequence with a visual sequence as described above when the computer program is executed by the processor.
The invention has at least the following beneficial effects:
the invention jointly generates the description text of the initial image by setting the two types of characteristics of the visual sequence and the relation sequence. By adding the relation sequence between the target objects, the receptive field corresponding to the features can be increased, and further the machine can be helped to obtain the interrelationship between the target objects more clearly. Therefore, finally, according to the visual sequence and the relation sequence, the generated image description information of the initial image is more accurate and fine, and the precision is higher.
In addition, in the invention, the self-attention calculation in the independent modes is respectively carried out on the visual sequence and the relation sequence, and the attention calculation between the modes is also carried out, so that the receptive fields of the obtained second visual sequence characteristics and the second relation sequence characteristics can be further increased, and further, the relation information between the richer characteristics can be contained. Correspondingly, the image description information of the initial image, which can be generated during decoding, is more accurate and fine, and the accuracy is higher.
Meanwhile, the corresponding fusion weight β is determined according to the text sequence S, and then the proportions in which v2 and r2 are respectively fused to generate the target fusion feature sequence F are determined according to β. The input values of the dual-stream decoder corresponding to each moment can thereby be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an image description generating method based on fusion of a relationship sequence and a visual sequence according to an embodiment of the present invention.
Fig. 2 shows the test results of the model corresponding to the method of the present invention in MSCOCO online test.
Fig. 3 shows the test results of the model corresponding to the method of the present invention in MSCOCO karpathy split.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Specifically, the method of the invention is implemented by adopting a multi-modal Transformer architecture as the baseline model. The model mainly comprises two parts: an encoder and a decoder. The general working process of the invention comprises (1) construction of the visual sequence and the relation sequence, (2) encoding by a dual-stream encoder, and (3) decoding by the dual-stream decoder according to the target fusion feature sequence F so as to generate the description information of the initial image.
As a possible embodiment of the present invention, as shown in fig. 1, there is provided an image description generating method based on fusion of a relationship sequence and a visual sequence, the method comprising the steps of:
S100, acquiring an initial image, wherein the initial image comprises images corresponding to N target objects; N ∈ [10,100].
S101, performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics. The initial target features include image features corresponding to the N target objects. Wherein each target object corresponds to a target frame.
Specifically, at the input end, a group of 2048-dimensional object features corresponding to N target objects are firstly extracted through a target detection network Fast-RCNN, and then the object features are mapped to 512 dimensions so as to adapt to the input dimension of the encoder.
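As a non-limiting illustration of this input mapping, the following Python sketch assumes that the 2048-dimensional region features have already been produced by a Fast-RCNN-style detector (the detector itself is not shown) and only demonstrates the projection to the 512-dimensional encoder input; the variable names and the random stand-in tensors are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: N detected target objects, each with a 2048-dim region feature.
N, det_dim, model_dim = 10, 2048, 512
region_feats = torch.randn(N, det_dim)   # placeholder for Fast-RCNN object features
proj = nn.Linear(det_dim, model_dim)     # maps the 2048-dim object features to the 512-dim encoder input
visual_tokens = proj(region_feats)       # one 512-dim visual sub-feature per target frame
print(visual_tokens.shape)               # torch.Size([10, 512])
```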
S200, performing sequence encoding on the initial target features to generate a visual sequence v, wherein v includes a visual sub-feature sequence corresponding to each target frame.
S300, generating a relation sequence r according to the initial target features, comprising the following steps:
s301, acquiring a joint frame corresponding to each target object in the initial target characteristics. The combined frame selected image area is larger than the corresponding target frame selected image area.
Through the step, a joint frame corresponding to any two target objects in the initial target characteristics can be generated. And the image area contained in the joint box may represent a relationship between two target objects, such as a positional relationship. Therefore, feature extraction is carried out on the image area selected by the combined frame, and the generated relation feature can be ensured to contain the semantic feature of the relation between the two corresponding target objects.
S302, extracting features of each joint-frame-selected image area using a ResNet152 network to generate the relation feature corresponding to each joint frame.
S303, performing sequence encoding on all the relation features to generate a relation sequence r, wherein r includes a relation sub-feature sequence corresponding to each joint frame.
After the above steps, each of the relationship features in the relationship sequence has environmental information about the periphery of the target, while the features in the visual sequence mainly express specific details of a certain target.
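The relation-feature extraction of S302 can be sketched as follows; this is an assumption-laden illustration (crop size 224x224, joint boxes given in pixel coordinates, untrained weights, hypothetical function names) rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

# Sketch: extract a 2048-dim relation feature from each joint-frame-selected image region.
backbone = models.resnet152(weights=None)                  # pretrained weights would normally be loaded
backbone = nn.Sequential(*list(backbone.children())[:-1])  # keep everything up to the global average pool
backbone.eval()

def joint_box_features(image, joint_boxes):
    # image: (3, H, W) float tensor; joint_boxes: iterable of (x1, y1, x2, y2) pixel coordinates.
    feats = []
    with torch.no_grad():
        for x1, y1, x2, y2 in joint_boxes:
            crop = image[:, y1:y2, x1:x2].unsqueeze(0)     # image region selected by the joint frame
            crop = F.interpolate(crop, size=(224, 224), mode="bilinear", align_corners=False)
            feats.append(backbone(crop).flatten(1))        # (1, 2048) relation feature for this joint frame
    return torch.cat(feats, dim=0)                         # (num_joint_frames, 2048)

image = torch.rand(3, 480, 640)
print(joint_box_features(image, [(10, 20, 200, 220), (50, 60, 400, 300)]).shape)
```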
S400, performing self-attention encoding on v and r respectively to generate a corresponding first visual sequence feature v1 and first relation sequence feature r1, wherein v1 includes a first visual sub-feature sequence corresponding to each target frame and r1 includes a first relation sub-feature sequence corresponding to each joint frame.
Specifically, suppose for example that there are 10 target objects in the initial image, i.e. N = 10. Then v is a visual sequence composed of the encoded sub-sequences (visual sub-feature sequences) of the images framed by the target frames corresponding to the 10 target objects respectively; that is, the 10 visual sub-feature sequences are the 10 corresponding elements that make up the visual sequence.
Similarly, r is a relation sequence composed of the encoded sub-sequences (relation sub-feature sequences) of the images framed by the joint frames corresponding to the 10 target objects respectively; that is, the 10 relation sub-feature sequences are the 10 corresponding elements that make up the relation sequence.
The self-attention encoding in this step performs self-attention encoding on the 10 elements in v to generate v1, where v1 likewise contains 10 recoded elements in one-to-one correspondence with v. Similarly, the 10 elements in r are self-attention encoded to generate r1, where r1 likewise contains 10 recoded elements in one-to-one correspondence with r.
The self-attention weight W in the self-attention encoding is conditioned on the geometric relationship feature between the two objects for which self-attention is computed. This geometric relationship feature is a 4-dimensional vector whose dimensions are the distance between the center points of the two target frames in the X direction, the distance between the center points in the Y direction, the ratio of the lengths of the two target frames, and the ratio of their widths. The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values used for the self-attention calculation, and the adjustment is likewise prior art and is not described in detail here.
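A minimal sketch of the 4-dimensional geometric relationship feature described above is given below. The exact scaling or normalization applied to these four quantities is not specified and is therefore left out; the function name and the (x1, y1, x2, y2) box format are assumptions.

```python
def geometric_relation(box_a, box_b):
    # Boxes as (x1, y1, x2, y2). Returns the 4-dim vector described above:
    # center distance in X, center distance in Y, length ratio, width ratio.
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    len_a, len_b = box_a[3] - box_a[1], box_b[3] - box_b[1]   # lengths (heights) of the two target frames
    wid_a, wid_b = box_a[2] - box_a[0], box_b[2] - box_b[0]   # widths of the two target frames
    return [ax - bx, ay - by, len_a / len_b, wid_a / wid_b]

print(geometric_relation((0, 0, 10, 20), (5, 5, 25, 45)))   # [-10.0, -15.0, 0.5, 0.5]
```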
S500, combining each first visual sub-feature sequence in v1 with its corresponding target relation sub-feature sequence in r1 into a first combined sequence corresponding to that first visual sub-feature sequence. The target relation sub-feature sequence is the collection of first relation sub-feature sequences corresponding to at least one joint frame overlapping the target frame corresponding to the first visual sub-feature sequence.
Specifically, in this step at least one target relation sub-feature sequence corresponding to each first visual sub-feature sequence is obtained according to the above condition, so as to generate the corresponding first combined sequence. That is, each first visual sub-feature sequence in v1 and its corresponding first relation sub-feature sequences in r1 form a new input sequence on which self-attention calculation is to be performed. For example, the 1st visual sub-feature sequence of v1 and the 3rd and 6th first relation sub-feature sequences of r1 form a corresponding first combined sequence.
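The pairing of visual sub-feature sequences with overlapping joint frames can be sketched as follows; overlap is taken here simply as a non-empty box intersection, and the list-based data layout and function names are assumptions made purely for illustration.

```python
def boxes_overlap(box_a, box_b):
    # True if two (x1, y1, x2, y2) boxes have a non-empty intersection.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    return ix2 > ix1 and iy2 > iy1

def first_combined_sequences(v1, target_boxes, r1, joint_boxes):
    # v1: first visual sub-feature sequences (one per target frame);
    # r1: first relation sub-feature sequences (one per joint frame).
    # Each visual element is grouped with the relation elements whose joint frame overlaps its target frame.
    combined = []
    for vis_feat, t_box in zip(v1, target_boxes):
        partners = [rel_feat for rel_feat, j_box in zip(r1, joint_boxes) if boxes_overlap(t_box, j_box)]
        combined.append([vis_feat] + partners)   # first combined sequence for this visual sub-feature
    return combined

print(first_combined_sequences(["v1_1"], [(0, 0, 10, 10)],
                               ["r1_1", "r1_2"], [(5, 5, 20, 20), (50, 50, 60, 60)]))
# [['v1_1', 'r1_1']]
```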
S501, performing first cross-attention encoding on each first combined sequence to generate a second visual sequence feature v2. The self-attention of the i-th first combined sequence in the first cross-attention encoding, i.e. the self-attention of the i-th first visual sub-feature sequence, satisfies a condition defined over: the i-th first visual sub-feature sequence; the target relation sub-feature sequence corresponding to it; the image area selected by its corresponding target frame; the image area selected by each of its corresponding joint frames; and, in the first cross-attention encoding corresponding to the i-th first combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence.
For example, for the 1st first visual sub-feature sequence, the geometric relationship features between its target frame and the joint frames corresponding to the 3rd and 6th first relation sub-feature sequences can be obtained.
The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values used for the self-attention calculation, and the adjustment is likewise prior art and is not described in detail here.
S600, combining each first relation sub-feature sequence in r1 with its corresponding target visual sub-feature sequence in v1 into a second combined sequence corresponding to that first relation sub-feature sequence. The target visual sub-feature sequence is the collection of first visual sub-feature sequences corresponding to at least one target frame that overlaps the joint frame corresponding to the first relation sub-feature sequence with an overlap area larger than the area threshold λ.
The principle of forming the second combined sequence in this step is similar to that in S500, except that the screening condition is different, and is not repeated here.
In practical use, to avoid introducing too many associated target frames and thereby too much noise, preferably λ = 0.3.
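The screening condition for the second combined sequences can be sketched in the same spirit; it is assumed here that "overlap area larger than the area threshold λ" is measured as the intersection area divided by the target-frame area, which is only one possible reading.

```python
AREA_THRESHOLD = 0.3   # the preferred value of the area threshold stated above

def overlap_ratio(joint_box, target_box):
    # Assumption: overlap measured as intersection area / target-frame area.
    ix1, iy1 = max(joint_box[0], target_box[0]), max(joint_box[1], target_box[1])
    ix2, iy2 = min(joint_box[2], target_box[2]), min(joint_box[3], target_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    target_area = (target_box[2] - target_box[0]) * (target_box[3] - target_box[1])
    return inter / target_area if target_area > 0 else 0.0

def second_combined_sequences(r1, joint_boxes, v1, target_boxes, thr=AREA_THRESHOLD):
    # Each relation element is grouped with the visual elements whose target frame
    # overlaps its joint frame with a ratio above the threshold.
    combined = []
    for rel_feat, j_box in zip(r1, joint_boxes):
        partners = [vis_feat for vis_feat, t_box in zip(v1, target_boxes)
                    if overlap_ratio(j_box, t_box) > thr]
        combined.append([rel_feat] + partners)
    return combined
```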
S601, performing second cross-attention encoding on each second combined sequence to generate a second relation sequence feature r2. The self-attention of the i-th second combined sequence in the second cross-attention encoding satisfies a condition defined over: the i-th first relation sub-feature sequence; the target visual sub-feature sequence corresponding to it; the image area selected by its corresponding joint frame; the image area selected by each of its corresponding target frames; and, in the second cross-attention encoding corresponding to the i-th second combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence.
The calculation in this step is the same as that in S501, except that the input values for the cross-attention calculation are different.
The calculation formula of the self-attention weight in this embodiment is prior art; the physical meaning of each parameter is the same as in the prior art and is not repeated here. Correspondingly, the value of each parameter can be adaptively adjusted according to the input values used for the self-attention calculation, and the adjustment is likewise prior art. The attention operators are expanded in a multi-head attention manner: each operator performs its calculation in its own feature subspace, and the results are concatenated to form the final output.
S100-S600 are mainly implemented by a dual-stream encoder, i.e. a co-representation learning encoder.
After the processing in S100-S300, each relation feature in the relation sequence carries environmental information about the surroundings of its target, while the features in the visual sequence mainly express the specific details of a particular target. Thus, through the following S500-S700, the two can mutually compensate for each other's information by means of cross-attention during the execution of the attention mechanism.
The present invention divides the visual and relational attention calculation process into two stages. In the first, self-attention stage, the visual sequence and the relation sequence each execute their own attention operator, so that the model first learns intra-modal interactions on the visual and relational modalities respectively. In the second, cross-attention stage, the visual sequence and the relation sequence interact and execute attention operators on each other, so that the visual features and the relation features can exploit each other to further improve their representations.
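The two-stage scheme can be sketched with standard multi-head attention as follows. This is a simplified illustration: the geometry-aware weighting and the per-query screening into combined sequences described above are omitted, every query simply attends to the full sequence of the other modality, and the layer names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DualStreamEncoderLayer(nn.Module):
    # Stage 1: intra-modal self-attention on v and r; Stage 2: cross-attention between them.
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_r = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_v = nn.MultiheadAttention(dim, heads, batch_first=True)  # visual queries over relation keys/values
        self.cross_r = nn.MultiheadAttention(dim, heads, batch_first=True)  # relation queries over visual keys/values

    def forward(self, v, r):
        v1, _ = self.self_v(v, v, v)        # first visual sequence feature
        r1, _ = self.self_r(r, r, r)        # first relation sequence feature
        v2, _ = self.cross_v(v1, r1, r1)    # second visual sequence feature
        r2, _ = self.cross_r(r1, v1, v1)    # second relation sequence feature
        return v2, r2

layer = DualStreamEncoderLayer()
v = torch.randn(1, 10, 512)   # 10 visual sub-feature sequences
r = torch.randn(1, 10, 512)   # 10 relation sub-feature sequences
v2, r2 = layer(v, r)
print(v2.shape, r2.shape)     # torch.Size([1, 10, 512]) torch.Size([1, 10, 512])
```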
S700, generating, according to the text sequence S generated before the current time t, the fusion weight β corresponding to the current time t. β indicates the fusion ratio to be applied at the current time t. Here, S is the image description information of the initial image correspondingly generated at time t-1.
S800, generating, according to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t.
Steps S700-S800 are mainly implemented by a dual-stream decoder, also referred to as a co-representation learning decoder. Specifically, in the process in which the visual-relation dual-stream decoder generates the description sentence word by word, when the description word of the initial image at time t is generated, the word category (visual vocabulary or relational vocabulary) that needs to be generated at the current time t can be determined according to the previously generated content S. Then, the ratio in which the visual sequence and the relation sequence are fused, i.e. the fusion weight β, is determined according to the part-of-speech category. Finally, according to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t is generated.
Thus, the corresponding fusion weight β is determined according to S, and then the proportions in which v2 and r2 are respectively fused to generate the target fusion feature sequence F are determined according to β. The input values of the dual-stream decoder corresponding to each moment can thereby be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
The invention jointly generates the description text of the initial image by setting the two types of characteristics of the visual sequence and the relation sequence. By adding the relation sequence between the target objects, the receptive field corresponding to the features can be increased, and further the machine can be helped to obtain the interrelationship between the target objects more clearly. Therefore, finally, according to the visual sequence and the relation sequence, the generated image description information of the initial image is more accurate and fine, and the precision is higher.
In addition, in the invention, the self-attention calculation in the independent modes is respectively carried out on the visual sequence and the relation sequence, and the attention calculation between the modes is also carried out, so that the receptive fields of the obtained second visual sequence characteristics and the second relation sequence characteristics can be further increased, and further, the relation information between the richer characteristics can be contained. Correspondingly, the image description information of the initial image, which can be generated during decoding, is more accurate and fine, and the accuracy is higher.
As a possible embodiment of the present invention, S301, acquiring a joint frame corresponding to each target object in the initial target features, comprises:
S311, counting the co-occurrence value set of each category in the MSCOCO dataset, A1, A2, …, Ai, …, Az, where Ai = (Ai1, Ai2, …, Aim, …, Aiz). Ai is the co-occurrence value set for the i-th category, and Aim is the co-occurrence value between the i-th category and the m-th category, i.e. the total number of co-occurrences of targets of the i-th category and targets of the m-th category in all images of the MSCOCO dataset. z is the total number of categories in the MSCOCO dataset, and i = 1, 2, …, z.
S321, according to A1, A2, …, Ai, …, Az, determining a joint object corresponding to each target object from the initial target features. The joint object is the other target object in the initial target features having the maximum co-occurrence value with the target object.
S331, generating a joint frame corresponding to each target object according to each target object and its corresponding joint object. The image area selected by the joint frame comprises the image areas corresponding to the target object and the joint object respectively.
If, in S301, joint frames were generated for all pairwise relationships in the initial image and used directly to learn the relation feature map, a significant amount of time and computing resources would be consumed due to the large amount of data.
Meanwhile, among all possible pairwise relationships there are erroneous relationships that do not accord with common sense. For example, suppose the target objects include a boat, a person, water and a hat. The relationships "person-boat" and "water-boat" frequently occur together in reality and are therefore reasonable, correct relationships, whereas the relationships "hat-boat" and "hat-water" hardly ever occur together in reality and are therefore unreasonable, erroneous relationships. It is thus necessary to remove the noise relationships (erroneous relationships) among all the pairwise relationships, thereby improving the generalization and effectiveness of the resulting joint frames.
In this embodiment, the denoising is performed according to the co-occurrence values, so as to solve the above problems.
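A toy sketch of the co-occurrence-based selection of joint objects and joint frames follows. It assumes the class co-occurrence counts have been precomputed from the MSCOCO annotations and that the joint frame is the minimum enclosing box of the two target frames, which is consistent with, but not stated verbatim in, the description above; all names are hypothetical.

```python
def union_box(box_a, box_b):
    # Minimum enclosing box containing both target frames (assumed form of the joint frame).
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def build_joint_boxes(detections, cooccurrence):
    # detections: list of (category_id, box); cooccurrence[i][m]: co-occurrence count of
    # categories i and m over the dataset (assumed precomputed from MSCOCO annotations).
    joint_boxes = []
    for idx, (cat, box) in enumerate(detections):
        best_box, best_score = None, -1
        for jdx, (other_cat, other_box) in enumerate(detections):
            if jdx == idx:
                continue
            if cooccurrence[cat][other_cat] > best_score:
                best_box, best_score = other_box, cooccurrence[cat][other_cat]
        if best_box is not None:
            joint_boxes.append(union_box(box, best_box))   # joint frame for this target object
    return joint_boxes

cooc = [[0, 5, 1], [5, 0, 2], [1, 2, 0]]                  # toy 3-category co-occurrence table
dets = [(0, (0, 0, 10, 10)), (1, (20, 20, 30, 30)), (2, (40, 40, 50, 50))]
print(build_joint_boxes(dets, cooc))
# [(0, 0, 30, 30), (0, 0, 30, 30), (20, 20, 50, 50)]
```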
As a possible embodiment of the invention, S700, generating, from the text sequence S generated before the current time t, the fusion weight β corresponding to the current time t, comprises:
S701, inputting S into a first multi-layer perceptron (MLP) to generate a part-of-speech probability corresponding to each piece of image description information in S. The part-of-speech probability is the probability of the part-of-speech category corresponding to that image description information.
Further, S701 comprises:
S711, inputting S into a first fully connected layer to generate a part-of-speech feature corresponding to each piece of image description information in S. The part-of-speech feature represents the likelihood that the corresponding image description information is a relational vocabulary word.
S721, normalizing all the part-of-speech features with a first sigmoid activation function to generate a part-of-speech probability corresponding to each part-of-speech feature. Each part-of-speech probability lies within the preset value interval [0,1].
S702, inputting all the part-of-speech probabilities corresponding to S into a second multi-layer perceptron to generate the fusion weight β corresponding to S.
Further, the second multi-layer perceptron comprises a second fully connected layer. The second fully connected layer is used for performing weighted average processing on all the part-of-speech probabilities to generate a fusion weight feature.
Further, the second multi-layer perceptron further comprises a second sigmoid activation function. The second sigmoid activation function is used for generating β according to the fusion weight feature.
In this embodiment, the fusion weight β is generated by composing a gate function from the two multi-layer perceptrons.
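A minimal sketch of this gate is given below, under the assumptions that the input is a sequence of feature vectors for the words generated so far and that the "weighted average processing" can be approximated by mean pooling followed by a linear layer; the hidden sizes and class names are hypothetical.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    # First MLP: per-word part-of-speech probability (likelihood of a relational word).
    # Second MLP: pooled probabilities -> fusion weight beta.
    def __init__(self, dim=512):
        super().__init__()
        self.pos_fc = nn.Linear(dim, 1)    # first fully connected layer -> part-of-speech feature
        self.fuse_fc = nn.Linear(1, 1)     # second fully connected layer -> fusion weight feature

    def forward(self, generated_word_feats):                         # (T-1, dim) features of the words in S
        pos_prob = torch.sigmoid(self.pos_fc(generated_word_feats))  # first sigmoid, probabilities in [0, 1]
        fused = self.fuse_fc(pos_prob.mean(dim=0, keepdim=True))     # mean pooling stands in for the weighted average
        return torch.sigmoid(fused)                                  # second sigmoid -> fusion weight beta

gate = FusionGate()
print(gate(torch.randn(4, 512)).item())   # a scalar beta in (0, 1)
```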
As a possible embodiment of the invention, S800, generating, according to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t, comprises:
S801, fusing v2 and r2 according to β to generate a target fusion feature sequence F, wherein F satisfies a fusion condition defined over β, v2 and r2.
S802, decoding according to S and F to generate the image description information of the initial image corresponding to the current time t.
In this embodiment, the corresponding fusion weight β is determined according to S, and then the proportions in which v2 and r2 are respectively fused to generate the target fusion feature sequence F are determined according to β. The input values of the dual-stream decoder corresponding to each moment can thereby be dynamically adjusted, further improving the accuracy of the finally generated image description information of the initial image.
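The fusion and decoding step can be sketched as follows. The exact condition that F satisfies is given in the patent as a formula that is not reproduced here; the convex combination below is only an assumed reading consistent with β being a fusion ratio, and the Transformer decoder layer is a stand-in for the dual-stream decoder, not the patent's implementation.

```python
import torch
import torch.nn as nn

def fuse_features(v2, r2, beta):
    # Assumed form of the target fusion feature sequence F: a beta-weighted combination
    # of the second relation and second visual sequence features.
    return beta * r2 + (1.0 - beta) * v2

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
v2 = torch.randn(1, 10, 512)                   # second visual sequence feature
r2 = torch.randn(1, 10, 512)                   # second relation sequence feature
beta = torch.tensor(0.4)                       # fusion weight produced by the gate for time t
F_seq = fuse_features(v2, r2, beta)            # target fusion feature sequence F
prev_words = torch.randn(1, 5, 512)            # embeddings of the text sequence S generated before t
step_out = decoder_layer(prev_words, F_seq)    # decoder output used to predict the word at time t
print(step_out.shape)                          # torch.Size([1, 5, 512])
```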
Fig. 2 and Fig. 3 show the test data of various performance indicators after the model of the method of the present invention was evaluated with two conventional test protocols. Specifically, Fig. 2 shows the results of the online test (MSCOCO online test), and Fig. 3 shows the results of the offline test (MSCOCO Karpathy split).
Among the listed results, the entries corresponding to the proposed model are the test results of the method of the present invention; the other entries are the test results of the existing related models.
According to the test results, essentially all indicators of the method of the present invention are improved to a certain extent, and the method achieves better performance than the existing methods.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium that may be disposed in an electronic device to store at least one instruction or at least one program for implementing the method embodiments, the at least one instruction or the at least one program being loaded and executed by a processor to implement the methods provided by the embodiments described above.
Embodiments of the present invention also provide an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.
Embodiments of the present invention also provide a computer program product comprising program code for causing an electronic device to carry out the steps of the method according to the various exemplary embodiments of the invention described in the present specification when the program product is run on the electronic device.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (8)

1. An image description generation method based on fusion of a relation sequence and a visual sequence is characterized by comprising the following steps:
acquiring an initial image, wherein the initial image comprises images corresponding to N target objects; N ∈ [10,100];
performing target extraction on the initial image by using Fast-RCNN to generate initial target characteristics; the initial target features comprise image features corresponding to N target objects; wherein each target object corresponds to a target frame;
performing sequence coding on the initial target features to generate a visual sequence v; wherein v comprises a visual sub-feature sequence corresponding to each target frame;
acquiring a joint frame corresponding to each target object in the initial target features; the image area selected by the joint frame is larger than the image area selected by the corresponding target frame; the image region contained in the joint frame may represent a relationship between two target objects;
extracting features of the image area selected by each joint frame by using a ResNet152 network; generating a relation feature corresponding to each joint frame;
performing sequence coding on all the relation features to generate a relation sequence r; wherein r comprises a relation sub-feature sequence corresponding to each joint frame;
performing self-attention coding on v and r respectively to generate a corresponding first visual sequence feature v1 and a first relation sequence feature r1; wherein v1 comprises a first visual sub-feature sequence corresponding to each target frame; r1 comprises a first relation sub-feature sequence corresponding to each joint frame;
the self-attention weight W in the self-attention coding satisfies a geometry-aware condition that depends on the geometric relationship feature between the two objects performing the self-attention computation;
combining each first visual sub-feature sequence in v1 with its corresponding target relation sub-feature sequence in r1 into a first combined sequence corresponding to each first visual sub-feature sequence; the target relation sub-feature sequence is a collection of first relation sub-feature sequences corresponding to at least one joint frame overlapping the target frame corresponding to the first visual sub-feature sequence;
performing first cross-attention coding on each first combined sequence to generate a second visual sequence feature v2; the self-attention of the i-th first combined sequence in the first cross-attention coding satisfies a condition defined over the i-th first visual sub-feature sequence, its corresponding target relation sub-feature sequence, the image area selected by its corresponding target frame, the image area selected by each of its corresponding joint frames, and, in the first cross-attention coding corresponding to the i-th first combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence;
combining each first relation sub-feature sequence in r1 with its corresponding target visual sub-feature sequence in v1 into a second combined sequence corresponding to each first relation sub-feature sequence; the target visual sub-feature sequence is a collection of first visual sub-feature sequences corresponding to at least one target frame that overlaps the joint frame corresponding to the first relation sub-feature sequence with an overlap area larger than the area threshold λ;
performing second cross-attention coding on each second combined sequence to generate a second relation sequence feature r2; the self-attention of the i-th second combined sequence in the second cross-attention coding satisfies a condition defined over the i-th first relation sub-feature sequence, its corresponding target visual sub-feature sequence, the image area selected by its corresponding joint frame, the image area selected by each of its corresponding target frames, and, in the second cross-attention coding corresponding to the i-th second combined sequence, the geometric relationship feature between the two objects for which self-attention is computed and the dimension of each feature sequence in the corresponding key sequence;
generating, from a text sequence S generated before the current time t, a fusion weight β corresponding to the current time t; β is used for indicating the corresponding fusion ratio; wherein S is the image description information of the initial image correspondingly generated at time t-1;
generating, according to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t;
generating, from the text sequence S generated before the current time t, the fusion weight β corresponding to the current time t comprises:
inputting S into a first multi-layer perceptron to generate a part-of-speech probability corresponding to each piece of image description information in S; the part-of-speech probability is the probability of the part-of-speech category corresponding to that image description information;
inputting all the part-of-speech probabilities corresponding to S into a second multi-layer perceptron to generate the fusion weight β corresponding to S;
generating, according to v2, r2 and the fusion weight β corresponding to the current time t, the image description information of the initial image corresponding to the current time t comprises:
fusing v2 and r2 according to β to generate a target fusion feature sequence F; F satisfies a fusion condition defined over β, v2 and r2;
decoding according to S and F to generate the image description information of the initial image corresponding to the current time t.
2. The method according to claim 1, wherein acquiring a joint frame corresponding to each target object in the initial target features comprises:
counting the co-occurrence value set of each category in the MSCOCO dataset, A1, A2, …, Ai, …, Az, where Ai = (Ai1, Ai2, …, Aim, …, Aiz); wherein Ai is the co-occurrence value set for the i-th category; Aim is the co-occurrence value between the i-th category and the m-th category, i.e. the total number of co-occurrences of targets of the i-th category and targets of the m-th category in all images of the MSCOCO dataset; z is the total number of categories in the MSCOCO dataset; i = 1, 2, …, z;
according to A1, A2, …, Ai, …, Az, determining a joint object corresponding to each target object from the initial target features; the joint object is the other target object in the initial target features having the maximum co-occurrence value with the target object;
generating a joint frame corresponding to each target object according to each target object and the corresponding joint object; the image areas selected by the joint frame comprise image areas corresponding to the target object and the joint object respectively.
3. The method of claim 1, wherein inputting S into the first multi-layer perceptron to generate part-of-speech probabilities for each image description in S comprises:
inputting S into a first full connection layer; generating part-of-speech features corresponding to each piece of image description information in S;
normalizing all the part-of-speech features by using a first sigmoid activation function to generate part-of-speech probability corresponding to each part-of-speech feature; each part-of-speech probability is within a preset numerical interval.
4. The method of claim 1, wherein the second multi-layer perceptron comprises a second fully-connected layer;
and the second full-connection layer is used for carrying out weighted average processing on all part-of-speech probabilities to generate fusion weight characteristics.
5. The method according to claim 4, wherein the second multi-layer perceptron further comprises a second sigmoid activation function;
the second sigmoid activation function is used for generating β according to the fusion weight feature.
6. The method according to claim 1, wherein the area threshold λ = 0.3.
7. a non-transitory computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a method of generating an image description based on fusion of a relational sequence with a visual sequence according to any one of claims 1 to 6.
8. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements a method of generating an image description based on fusion of a relational sequence with a visual sequence according to any one of claims 1 to 6.
CN202211642392.2A 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence Active CN116012685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211642392.2A CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211642392.2A CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Publications (2)

Publication Number Publication Date
CN116012685A CN116012685A (en) 2023-04-25
CN116012685B true CN116012685B (en) 2023-06-16

Family

ID=86029043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211642392.2A Active CN116012685B (en) 2022-12-20 2022-12-20 Image description generation method based on fusion of relation sequence and visual sequence

Country Status (1)

Country Link
CN (1) CN116012685B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444968A (en) * 2020-03-30 2020-07-24 哈尔滨工程大学 Image description generation method based on attention fusion
WO2021223323A1 (en) * 2020-05-06 2021-11-11 首都师范大学 Image content automatic description method based on construction of chinese visual vocabulary list
CN113609326A (en) * 2021-08-25 2021-11-05 广西师范大学 Image description generation method based on external knowledge and target relation
CN115311598A (en) * 2022-07-29 2022-11-08 复旦大学 Video description generation system based on relation perception

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
跨层多模型特征融合与因果卷积解码的图像描述 (Image captioning with cross-layer multi-model feature fusion and causal convolution decoding); 罗会兰 (Luo Huilan); 岳亮亮 (Yue Liangliang); 中国图象图形学报 (Journal of Image and Graphics), no. 8, pp. 96-109 *

Also Published As

Publication number Publication date
CN116012685A (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN109447242B (en) Image description regeneration system and method based on iterative learning
CN111368993B (en) Data processing method and related equipment
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN111291212A (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN110598191B (en) Complex PDF structure analysis method and device based on neural network
US20220215159A1 (en) Sentence paraphrase method and apparatus, and method and apparatus for training sentence paraphrase model
CN114926835A (en) Text generation method and device, and model training method and device
CN114612767B (en) Scene graph-based image understanding and expressing method, system and storage medium
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN108804544A (en) Internet video display multi-source data fusion method and device
CN113868451B (en) Cross-modal conversation method and device for social network based on up-down Wen Jilian perception
CN117875395A (en) Training method, device and storage medium of multi-mode pre-training model
CN117392488A (en) Data processing method, neural network and related equipment
CN116012685B (en) Image description generation method based on fusion of relation sequence and visual sequence
WO2023116572A1 (en) Word or sentence generation method and related device
CN116704066A (en) Training method, training device, training terminal and training storage medium for image generation model
CN116758558A (en) Cross-modal generation countermeasure network-based image-text emotion classification method and system
CN113779244B (en) Document emotion classification method and device, storage medium and electronic equipment
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN115311598A (en) Video description generation system based on relation perception
CN115346132A (en) Method and device for detecting abnormal events of remote sensing images by multi-modal representation learning
CN114564568A (en) Knowledge enhancement and context awareness based dialog state tracking method and system
CN110442706B (en) Text abstract generation method, system, equipment and storage medium
Wang et al. TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant