CN112836709A - Automatic image description method based on spatial attention enhancement mechanism - Google Patents
- Publication number
- CN112836709A CN112836709A CN202110168114.7A CN202110168114A CN112836709A CN 112836709 A CN112836709 A CN 112836709A CN 202110168114 A CN202110168114 A CN 202110168114A CN 112836709 A CN112836709 A CN 112836709A
- Authority
- CN
- China
- Prior art keywords
- image
- attention
- candidate
- loss
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention provides an automatic image description method based on a spatial attention enhancement mechanism. The method extracts the potential target regions in an image and sets them as the image regions to be processed, acquires the spatial features and position information of a plurality of image regions, and extracts the image features of those regions; from the extracted image regions, it selects regions rich in positioning information as candidate frames according to the information of the entity data set, and obtains cluster-based attention feature labels; it calculates the attention intensity of each image candidate region at each moment from the extracted image features; it calculates the cross-entropy loss on the descriptive content and the significance loss on the cluster-based attention feature labels, and computes the total loss; finally, it calculates the loss between the real value label and the initial predicted value, judges the gap between the initial predicted value and the real result, lets the image description model perform self-learning according to that gap, and inputs the image features into the self-learned image description model to obtain the final predicted value. The invention improves the performance of automatic image description methods.
Description
Technical Field
The invention relates to the technical field of image description, in particular to an automatic image description method based on a spatial attention enhancement mechanism.
Background
Image description generation is a comprehensive problem combining computer vision and natural language processing. The task is very easy for human beings, but machines are limited by the heterogeneous characteristics of data from different modalities: a machine must understand the content of a picture and describe it in natural language, generating sentences that are fluent and understandable to humans while representing the complete image content.
Inspired by the application of attention mechanisms in machine translation, some researchers have introduced attention mechanisms into the traditional "encode-decode" framework, significantly improving the performance of the automatic image description task. The attention mechanism focuses on key visual content in the image, providing more discriminative visual information to guide sentence generation while the image context vector is fed into the "encode-decode" framework. Although the attention mechanism can effectively improve the performance of automatic image description methods, current methods still suffer from problems such as insufficiently accurate attention, so that the generated descriptions mention objects that do not appear in the image.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an automatic image description method based on a spatial attention enhancement mechanism, which improves the attention accuracy.
To achieve this purpose, the invention is realized by the following technical scheme. An automatic image description method based on a spatial attention enhancement mechanism comprises the following steps: after the image to be described is obtained, the potential target regions in the image are extracted and set as the image regions to be processed, the spatial features and position information of a plurality of image regions are acquired, and the image features of the image regions are extracted; from the extracted image regions, regions rich in positioning information are selected as candidate frames according to the information of the entity data set, and cluster-based attention feature labels are obtained; the attention intensity of each image candidate region at each moment is calculated from the extracted image features; the cross-entropy loss on the descriptive content and the significance loss on the cluster-based attention feature labels are calculated, and the total loss is computed; and the loss between the real value label and the initial predicted value is calculated, the gap between the initial predicted value and the real result is judged, the image description model performs self-learning according to this gap, and the image features are input into the self-learned image description model to obtain the final predicted value.
Preferably, the acquiring the spatial features and the position information of the plurality of image regions comprises: extracting, by using a target detection algorithm pre-trained on the Visual Genome data set, the bottom-up features in the image and the position information of the corresponding target bounding boxes in the image.
Preferably, the selecting the image region rich in the positioning information as the candidate frame based on the information of the physical data set includes: and describing positioning nouns based on the content of the entity data set, matching the spatial features and the position information of the image area with the nouns in the entity data set, and selecting a candidate frame rich in positioning information by using a cluster information screening method.
Preferably, the selecting the candidate frame rich in the positioning information by using the cluster information screening method comprises: and combining the spatial features and the position information of the image area with nouns in the entity data set by using a cluster information screening method, and selecting a candidate frame rich in positioning information according to an intersection ratio criterion and an overlap ratio criterion.
Preferably, the selecting the candidate boxes rich in positioning information according to the intersection ratio criterion and the overlap ratio criterion comprises: calculating the intersection ratio of the target noun rectangular frame G and the candidate frame B, wherein the calculation formula of the intersection ratio is as follows:

IoU = area(G ∩ B) / area(G ∪ B)

wherein G ∩ B represents the intersection region of the candidate frame and the target noun rectangular frame; when the intersection ratio is greater than a first threshold, the candidate frame is retained, and the intersection ratio of the candidate frame is marked as positive;

calculating the overlapping ratio of the target noun rectangular frame G and the candidate frame B, wherein the calculation formula of the overlapping ratio is as follows:

IoP = area(G ∩ B) / area(B)

when the overlapping ratio is larger than a preset second threshold value, the candidate frame is retained, and the overlapping ratio of the candidate frame is marked as positive.
Preferably, a candidate frame B whose intersection ratio with the target noun rectangular frame G is smaller than the first threshold is marked as negative, and a candidate frame B whose overlap ratio with the target noun rectangular frame G is smaller than the second threshold is marked as negative.
Preferably, the calculating the attention intensity of the image candidate region at each moment according to the extracted image features comprises: inputting the spatial features and the position information of the image region into a feature mapping module, extracting semantic features from the feature regions of the N objects, the semantic features being denoted as K; and inputting the extracted semantic features into an attention module to obtain the attention weight α_t at time t.
Preferably, the calculating the cross-entropy loss with respect to the descriptive content and the cluster-based attention feature label significance loss, and calculating the total loss, comprises calculating with the following formula:

L(θ) = λ·L_grd(θ) + L_XE(θ)

wherein L is the total loss, L_grd and L_XE are respectively the significance loss of the attention feature labels and the cross-entropy loss, θ denotes the parameters of the image description model, y_t and y_{<t} respectively represent the word vector at time t and the word vectors before time t, p represents the conditional probability, N_P represents the set of positive candidate boxes, N is the total number of all candidate boxes, B_n is a negative candidate box, α_i represents the attention weight of the i-th candidate box, and λ represents the weight ratio of the cluster-based significance loss in the total loss function.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides an automatic image description method based on a spatial attention enhancement mechanism, which uses an attention label based on a cluster to provide better reference for the attention weight in the description generation process, thereby generating more accurate description and improving the performance of the automatic image description method. The method of the invention achieves superior results by performing extensive experiments on mainstream datasets such as Flickr30k and COCO, and comparing with the most advanced methods. The method has practical significance for the scene of the visually impaired people to which the automatic image description method is applied.
Drawings
FIG. 1 is a block diagram of the structure used in an embodiment of the automatic image description method based on the spatial attention enhancement mechanism of the present invention;
FIG. 2 is a flow chart of an embodiment of the automatic image description method based on the spatial attention enhancement mechanism of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The automatic image description method based on a spatial attention enhancement mechanism of the invention can be realized by a computer device. For example, the computer device comprises a processor and a memory, the memory stores a computer program, and executing the computer program realizes the automatic image description method based on the spatial attention enhancement mechanism.
The method of the invention is applied to the system shown in fig. 1: the image 10 to be described passes through the target detection algorithm module 11, which extracts the image features 13; the image features 13 are input to the attention module 14, and the attention weights are obtained by calculation. Meanwhile, the image features 13 are also matched against the nouns 23 from the entity data set, and the attention weight 15 is calculated using the cluster information 24; the attention weight 15 yields the image description information 17 through the decoder 16, and the description tag 25 is also used in obtaining the image description information 17. The positioning tag 21 can be obtained from the image 10 to be described, and the described positioning nouns 23 can be obtained through the noun filtering 22.
Referring to fig. 2, the embodiment first executes step S1 to obtain the image to be described, for example by inputting it into the image description model, and then executes step S2 to extract the potential target regions in the image, which are the image regions to be processed. Then, the spatial features and position information of the image regions are acquired, and the image features are extracted; these features serve as the input of the subsequent steps. For example, a target detection algorithm pre-trained on the Visual Genome data set is used to extract the bottom-up features of the image I to be described and the corresponding target bounding boxes; the extraction of image features can be implemented with known techniques such as region proposal networks and region-of-interest pooling, and the target bounding boxes determine the positions of the target regions in the image.
Next, step S3 is executed to extract the cluster-based attention feature labels. In this embodiment, a cluster information screening method is used to select the candidate frames rich in positioning information: the spatial features and position information of the image regions are combined with the nouns in the entity data set, and the candidate frames are selected according to the intersection ratio criterion and the overlap ratio criterion.
For example, according to the sentence division in the entity data set, the nouns carrying positioning information in a sentence are found; each such positioning area is a target noun rectangular frame G, and the candidate frames are the bounding boxes corresponding to the bottom-up features in the image, i.e. the positions in the image that correspond to the previously obtained image features. In this embodiment, the entity data set is a preset data set that provides position labels of noun phrases for the sentences describing COCO or Flickr images.
Then, the candidate frames are screened. For example, a candidate box rich in positioning information is selected according to the intersection ratio criterion and the overlap ratio criterion.
When the intersection ratio criterion is applied, the intersection ratio (IoU) of the target noun rectangular frame G and the candidate frame B is calculated, and the calculation formula of the intersection ratio is as follows:

IoU = area(G ∩ B) / area(G ∪ B)

wherein G ∩ B represents the intersection region of the candidate frame and the target noun rectangular frame. When the intersection ratio is greater than a first threshold, preferably 0.5, the candidate frame is retained and the intersection ratio of the candidate frame is marked as positive. Therefore, the present embodiment retains the candidate frames B having a high intersection ratio with the target noun rectangular frame G.

When the overlap ratio criterion is applied, the overlap ratio (IoP) of the target noun rectangular frame G and the candidate frame B is calculated, and the calculation formula of the overlap ratio is as follows:

IoP = area(G ∩ B) / area(B)

When the overlap ratio is greater than a preset second threshold, preferably 0.9, the candidate frame is retained and the overlap ratio of the candidate frame is marked as positive. Therefore, the present embodiment retains the candidate frames B having a high overlap ratio with the target noun rectangular frame G.
Further, a candidate frame B whose intersection ratio with the target noun rectangular frame G is smaller than the first threshold is marked as negative, and a candidate frame B whose overlap ratio with the target noun rectangular frame G is smaller than the second threshold is marked as negative. Thus, according to the positive and negative marks, the image features can be divided into two clusters, positive and negative, and this cluster division constitutes the attention feature labels of this embodiment.
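The screening and labelling step above can be sketched as follows. This is an illustrative sketch only: the box format (x1, y1, x2, y2), the helper names, and the OR-combination of the two criteria into a single positive/negative label are assumptions; the thresholds 0.5 and 0.9 are the embodiment's preferred values.

```python
# Illustrative sketch of the cluster-label step. Boxes are (x1, y1, x2, y2);
# the helper names and the OR-combination of the two criteria are assumptions.

def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def intersection(g, b):
    # Area of the overlap region G ∩ B (0 when the boxes are disjoint).
    x1, y1 = max(g[0], b[0]), max(g[1], b[1])
    x2, y2 = min(g[2], b[2]), min(g[3], b[3])
    return area((x1, y1, x2, y2))

def iou(g, b):
    # Intersection ratio: area(G ∩ B) / area(G ∪ B).
    inter = intersection(g, b)
    return inter / (area(g) + area(b) - inter)

def iop(g, b):
    # Overlap ratio: intersection over the candidate box's own area.
    return intersection(g, b) / area(b)

def label_candidates(g, candidates, iou_thr=0.5, iop_thr=0.9):
    # Mark a candidate positive if either criterion passes, negative otherwise;
    # the resulting positive/negative split plays the role of the
    # cluster-based attention feature labels.
    return ["positive" if iou(g, b) > iou_thr or iop(g, b) > iop_thr else "negative"
            for b in candidates]
```

For a target noun frame G = (0, 0, 10, 10), a tightly nested candidate (1, 1, 9, 9) is kept (IoU 0.64, IoP 1.0), while a disjoint candidate is marked negative.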
Next, step S4 is executed to calculate the attention intensity of the image candidate regions at each moment. For example, according to the extracted image features, the spatial features and position information of the image regions are input into a feature mapping module, semantic features are extracted from the feature regions of the N objects and denoted as K, and the extracted semantic features are input into an attention module to obtain the attention weight α_t at time t.

Specifically, the attention module combines the extracted semantic features K with the semantic information S contained in the currently generated word to produce the attention weight α_t at a certain time t; a higher intensity means that the corresponding candidate region is attended to more strongly. The semantic information S is the word generated at the previous moment, and from this word and the semantic features K the attention weight α_t of the current moment t can be obtained. The attention weight α_t at time t is calculated as follows:

a = (S·W_s)(K·W_k)^T / √d

α_t = softmax(a) (formula 4)

where S is the text sequence at the previous moment, W_s and W_k are respectively the mapping matrices that map S and K into a uniform mapping space, d is the scale of the mapping space, a_i denotes the i-th component of a, and e, the base of the natural logarithm, enters through softmax(a)_i = e^(a_i) / Σ_j e^(a_j).
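The attention computation can be sketched numerically as follows, under the assumption that the score a is the scaled dot product of the mapped query and keys (this is consistent with the symbols W_s, W_k and d above, but the patent's exact score form is an assumption here):

```python
import numpy as np

def softmax(a):
    # softmax(a)_i = e^{a_i} / sum_j e^{a_j}; subtracting the max is only
    # for numerical stability and does not change the result.
    e = np.exp(a - a.max())
    return e / e.sum()

def attention_weights(S, K, W_s, W_k):
    # S: semantic vector of the previously generated word.
    # K: semantic features of the N candidate regions, shape (N, feat_dim).
    d = W_s.shape[1]              # scale of the uniform mapping space
    q = S @ W_s                   # mapped query, shape (d,)
    keys = K @ W_k                # mapped regional features, shape (N, d)
    a = keys @ q / np.sqrt(d)     # one attention score per candidate region
    return softmax(a)             # alpha_t: non-negative, sums to 1 over regions
```

The resulting α_t assigns each candidate region a non-negative weight summing to one; a larger component means that region is attended to more strongly at time t.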
Then, step S5 is performed, the cross entropy loss with respect to the descriptive contents and the significance loss of the cluster-based attention feature label are calculated, and the total loss is calculated. Specifically, the following formula is adopted for calculation:
L(θ) = λ·L_grd(θ) + L_XE(θ)

wherein L is the total loss, L_grd and L_XE are respectively the significance loss of the attention feature labels and the cross-entropy loss, θ denotes the parameters of the image description model, y_t and y_{<t} respectively represent the word vector at time t and the word vectors before time t, p represents the conditional probability, N_P represents the set of positive candidate boxes, N is the total number of all candidate boxes, B_n is a negative candidate box, α_i represents the attention weight of the i-th candidate box, α_j represents the attention weight of the j-th candidate box, and λ represents the weight ratio of the cluster-based significance loss in the total loss function.
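A minimal numeric sketch of the total loss follows. The patent's exact expression for the significance loss is not reproduced in the text, so the form below, which penalizes attention mass falling outside the positive candidate boxes, is an assumption, as are the function names and the default λ.

```python
import numpy as np

def significance_loss(alpha, positive_idx):
    # alpha: attention weights over all N candidate boxes at one time step.
    # Penalize small attention mass on the positive (cluster-label) boxes;
    # this particular form is an illustrative assumption.
    return -np.log(np.sum(alpha[positive_idx]) + 1e-12)

def cross_entropy_loss(word_probs):
    # word_probs: p(y_t | y_<t) for each ground-truth word of the description.
    return -np.sum(np.log(np.asarray(word_probs) + 1e-12))

def total_loss(alpha, positive_idx, word_probs, lam=0.5):
    # L(theta) = lambda * L_grd(theta) + L_XE(theta)
    return lam * significance_loss(alpha, positive_idx) + cross_entropy_loss(word_probs)
```

Raising the attention mass on the positive boxes lowers the significance term, which is exactly the behaviour the cluster-based attention feature labels are meant to enforce during training.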
Then, step S6 is executed to calculate the loss between the true value tag and the initial predicted value calculated by the image description model, and to determine the difference between the initial predicted value and the true result, based on which the image description model performs self-learning.
Finally, step S7 is executed: the image features are input into the image description model after self-learning is completed, and the model obtains the final predicted value from the input image features; this final predicted value is the final image description sentence sought by this embodiment.
To verify the feasibility of this embodiment, it was evaluated on the COCO data set and the Flickr30k data set. The COCO data set contains more than one hundred and twenty thousand images and the Flickr30k data set contains about thirty thousand images; in both data sets, each image has at least five manually annotated image description sentences, called true value tags. In the experiments, the original training and validation sets of the COCO data set and the original Flickr30k data set are divided into a training set, a validation set and a test set using the Karpathy split, and the results on the test set are used for verification. The invention uses five evaluation criteria to quantitatively evaluate the performance of image description methods: BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), METEOR (Metric for Evaluation of Translation with Explicit ORdering), CIDEr (Consensus-based Image Description Evaluation), and SPICE (Semantic Propositional Image Caption Evaluation). Among these, CIDEr best reflects semantic accuracy, and a good automatic image description method has a higher CIDEr value.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (8)
1. An automatic image description method based on a spatial attention enhancement mechanism is characterized by comprising the following steps:
after an image to be described is obtained, potential target areas in the image are extracted, the target areas are set as image areas to be processed, spatial features and position information of the image areas are obtained, and image features of the image areas are extracted;
selecting an image area rich in positioning information as a candidate frame according to the information of the entity data set from the extracted image area, and obtaining an attention characteristic label based on a cluster;
calculating the attention intensity of the image candidate region at each moment according to the extracted image features;
calculating cross entropy loss and cluster-based attention feature tag significance loss with respect to the descriptive content, and calculating total loss;
and calculating the loss between the real value label and the initial predicted value, judging the difference between the initial predicted value and the real result, carrying out self-learning by the image description model according to the difference, and inputting the image characteristics into the self-learned image description model to obtain the final predicted value.
2. The method for automatic image description based on the spatial attention enhancement mechanism according to claim 1, wherein:

acquiring the spatial features and the position information of the plurality of image regions comprises:

extracting, by using a target detection algorithm pre-trained on the Visual Genome data set, the bottom-up features in the image and the position information of the corresponding target bounding boxes in the image.
3. The method for automatic image description based on the spatial attention enhancement mechanism according to claim 1, wherein:
selecting an image region enriched with positioning information as a candidate frame based on information of the entity data set includes:
and describing positioning nouns based on the content of the entity data set, matching the spatial features and the position information of the image area with the nouns in the entity data set, and selecting a candidate frame rich in positioning information by using a cluster information screening method.
4. The method according to claim 3, wherein the method comprises:
selecting a candidate frame rich in positioning information by using a cluster information screening method comprises the following steps:
and combining the spatial features and the position information of the image area with nouns in the entity data set by using a cluster information screening method, and selecting a candidate frame rich in positioning information according to a cross-over ratio criterion and an overlap-over ratio criterion.
5. The method according to claim 4, wherein selecting the candidate boxes rich in positioning information according to the intersection ratio criterion and the overlap ratio criterion comprises:

calculating the intersection ratio of the target noun rectangular frame G and the candidate frame B, wherein the calculation formula of the intersection ratio is as follows:

IoU = area(G ∩ B) / area(G ∪ B)

wherein G ∩ B represents the intersection region of the candidate frame and the target noun rectangular frame; when the intersection ratio is greater than a first threshold, the candidate frame is retained, and the intersection ratio of the candidate frame is marked as positive;

calculating the overlapping ratio of the target noun rectangular frame G and the candidate frame B, wherein the calculation formula of the overlapping ratio is as follows:

IoP = area(G ∩ B) / area(B)

when the overlapping ratio is larger than a preset second threshold value, the candidate frame is retained, and the overlapping ratio of the candidate frame is marked as positive.
6. The method according to claim 5, wherein:

a candidate frame B whose intersection ratio with the target noun rectangular frame G is smaller than the first threshold is marked as negative, and a candidate frame B whose overlap ratio with the target noun rectangular frame G is smaller than the second threshold is marked as negative.
7. The method for automatic image description based on spatial attention enhancement mechanism according to any one of claims 1 to 6, characterized in that:

calculating the attention intensity of the image candidate region at each moment according to the extracted image features comprises the following steps:

inputting the spatial features and the position information of the image region into a feature mapping module, extracting semantic features from the feature regions of the N objects, the semantic features being denoted as K; and

inputting the extracted semantic features into an attention module to obtain the attention weight α_t at time t.
8. The method for automatic image description based on spatial attention enhancement mechanism according to any one of claims 1 to 6, characterized in that:

calculating the cross-entropy loss with respect to the descriptive content and the cluster-based attention feature label significance loss, and calculating the total loss, comprises:

calculating with the following formula:

L(θ) = λ·L_grd(θ) + L_XE(θ)

wherein L is the total loss, L_grd and L_XE are respectively the significance loss of the attention feature labels and the cross-entropy loss, θ denotes the parameters of the image description model, y_t and y_{<t} respectively represent the word vector at time t and the word vectors before time t, p represents the conditional probability, N_P represents the set of positive candidate boxes, N is the total number of all candidate boxes, B_n is a negative candidate box, α_i represents the attention weight of the i-th candidate box, and λ represents the weight ratio of the cluster-based significance loss in the total loss function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110168114.7A CN112836709A (en) | 2021-02-07 | 2021-02-07 | Automatic image description method based on spatial attention enhancement mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112836709A true CN112836709A (en) | 2021-05-25 |
Family
ID=75932647
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114359741A (en) * | 2022-03-19 | 2022-04-15 | 江西财经大学 | Regional feature-based image description model attention mechanism evaluation method and system |
CN114359741B (en) * | 2022-03-19 | 2022-06-17 | 江西财经大学 | Regional feature-based image description model attention mechanism evaluation method and system |
CN116152118A (en) * | 2023-04-18 | 2023-05-23 | 中国科学技术大学 | Image description method based on contour feature enhancement |
CN116152118B (en) * | 2023-04-18 | 2023-07-14 | 中国科学技术大学 | Image description method based on contour feature enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||