CN112037239B - Text guidance image segmentation method based on multi-level explicit relation selection - Google Patents

Text guidance image segmentation method based on multi-level explicit relation selection

Info

Publication number
CN112037239B
CN112037239B (application CN202010882340.7A)
Authority
CN
China
Prior art keywords
text
vector
feature
relationship
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010882340.7A
Other languages
Chinese (zh)
Other versions
CN112037239A (en)
Inventor
刘宇
李新宇
徐凯平
冯毅强
张海洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202010882340.7A
Publication of CN112037239A
Application granted
Publication of CN112037239B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/12 Edge-based segmentation
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a text-guided image segmentation method based on multi-level explicit relationship selection, which guides the segmentation from multiple angles and levels, such as the entity relationships in the image semantics and multi-scale text, so that accurate results are obtained even for rich and complex language descriptions. The method mainly comprises the following steps: feature extraction, pyramid pooling, spatial entity relationship capture, and multi-layer image-text relationship reinforcement. A convolutional neural network extracts semantic features from the picture; pyramid pooling with bins of different sizes produces picture features carrying global information; a self-attention mechanism captures the relationships between the entities in the picture space, which effectively improves the accuracy of entity localization when a sentence describes several entities; finally, the relationship between the image and the language is reinforced in a loop using natural language text vectors of different scales, correcting the intermediate result several times to obtain a more robust segmentation.

Description

Text guidance image segmentation method based on multi-level explicit relation selection
Technical Field
The invention belongs to the technical field at the intersection of computer vision and natural language processing, and relates to a text-guided image segmentation method based on multi-level explicit relationship selection that is designed for complex natural language texts describing multiple entities.
Background
With the advent of the artificial intelligence era, the demand for interaction between humans, computers and intelligent machines keeps increasing. How to make a machine understand complex natural language, share the human visual perspective, observe the world as humans do, and act according to human intentions has become a hot topic in the field. Image segmentation is a traditional research area of computer vision that continues to attract attention; in recent years it has found wide application in automatic driving, human-computer interaction, virtual reality, medical imaging and other fields. Combining natural language with image processing can therefore promote the development of human-computer interaction and enable barrier-free communication between machines and humans.
Text-based image segmentation is a research branch of the segmentation task that is closer to practical application requirements: it segments the region of a picture specified by a natural language description. Compared with ordinary segmentation, the method must understand rich and varied natural language expressions, reason about the multi-entity relationships in the picture according to the object relationships mentioned in the language, localize the described object correctly, and segment the localized region accurately. Most existing text-based image segmentation methods simply concatenate language features with image features and classify the final result pixel by pixel; they lack an explicit, language-guided generation of the segmentation result and a process that captures and reasons about the relationships between entities in the image, which easily leads to inaccurate segmentation regions and imprecise boundary contours.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a text-guided image segmentation method based on multi-level explicit relationship selection, which captures the relationships between the entities in the picture, reasons explicitly over this global information, and uses the natural language text to guide the generation of the segmentation result. The method can handle complex natural language texts containing several entity descriptions and effectively improves the accuracy of the segmentation result for such inputs.
To achieve this, the invention adopts the following technical scheme:
a text guidance image segmentation method based on multi-level explicit relationship selection comprises the following steps:
(1) feature extraction:
Feature extraction is performed on the input RGB picture and the natural language text. For the RGB picture, a convolutional neural network extracts the semantic features; since the method belongs to the image segmentation family, the pre-trained parameters of a DeepLab semantic segmentation model are used as the initial parameters of the convolutional neural network, which effectively reduces training time and improves the generalization ability of the network. For the natural language text, each word is represented as a one-hot vector, embedded into a low-dimensional vector, and fed into an LSTM (long short-term memory) network; the final hidden state serves as the vector representation of the whole natural language text, i.e. the low-dimensional word vectors are fed into the LSTM step by step and the sentence representation is obtained after the last step.
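For illustration only, the following minimal TensorFlow sketch shows how such a visual feature extractor could be set up. The patent only states that DeepLab pre-trained parameters initialize the convolutional network; the ResNet-101 trunk, the ImageNet weights and the chosen intermediate layer are assumptions made here so that a 320×320 input yields a 40×40 feature map.

    import tensorflow as tf

    def build_visual_backbone(input_size=320):
        # The patent initialises its CNN with DeepLab pre-trained weights; the exact
        # backbone is not specified, so a ResNet-101 trunk (commonly used inside
        # DeepLab) with ImageNet weights stands in here as an assumption.
        base = tf.keras.applications.ResNet101(
            include_top=False,
            weights="imagenet",
            input_shape=(input_size, input_size, 3),
        )
        # An intermediate stage with output stride 8 keeps the spatial resolution at
        # 40 x 40 for a 320 x 320 input, matching w = h = 40 used in the experiments.
        feat = base.get_layer("conv3_block4_out").output
        return tf.keras.Model(base.input, feat)

    backbone = build_visual_backbone()
    image = tf.random.uniform((1, 320, 320, 3))   # stand-in for an input RGB picture
    visual_feat = backbone(image)                 # shape (1, 40, 40, 512)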
(2) Pyramid pooling:
Since text-based image segmentation requires reasoning over the whole image according to the language, global information is needed in the image features, and a pyramid pooling method is therefore used to add it. First, the picture features from step (1) are concatenated with the natural language text vector and with a regular spatial position vector generated from the spatial position of each pixel, yielding a mixed feature; a pyramid pooling method then turns this into a mixed feature with global information.
Specifically, the mixed feature is copied and split into four parts along the channel dimension; the four feature maps are divided into 1×1, 2×2, 3×3 and 6×6 bins respectively, each bin is average-pooled, and the pooling results are concatenated back onto the original feature map, providing global information at different scales.
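As a non-limiting sketch, the TensorFlow code below shows one way to build the mixed feature and apply the pyramid pooling described above. The use of area-interpolation resizing as the per-bin average pooling, the bilinear resizing that broadcasts the pooled values back to the full map before concatenation, and the coordinate range [-1, 1] are implementation assumptions; the patent itself only specifies the four channel groups and the 1×1, 2×2, 3×3 and 6×6 bins.

    import tensorflow as tf

    def mixed_feature(visual_feat, text_vec):
        # Concatenate the visual features with the tiled sentence vector and two
        # normalised coordinate channels (the "regular spatial position vector").
        b, h, w, _ = visual_feat.shape
        text_map = tf.tile(text_vec[:, None, None, :], [1, h, w, 1])
        ys, xs = tf.meshgrid(tf.linspace(-1.0, 1.0, h),
                             tf.linspace(-1.0, 1.0, w), indexing="ij")
        coords = tf.tile(tf.stack([xs, ys], axis=-1)[None], [b, 1, 1, 1])
        return tf.concat([visual_feat, text_map, coords], axis=-1)

    def pyramid_pooling(feat, bin_sizes=(1, 2, 3, 6)):
        # Split the channels into four groups (channel count assumed divisible by 4),
        # average-pool each group into 1x1 / 2x2 / 3x3 / 6x6 bins (area resize
        # approximates adaptive average pooling), broadcast the pooled maps back to
        # full size and append them to the original feature map.
        h, w = feat.shape[1], feat.shape[2]
        groups = tf.split(feat, num_or_size_splits=4, axis=-1)
        outputs = [feat]
        for g, s in zip(groups, bin_sizes):
            pooled = tf.image.resize(g, (s, s), method="area")
            outputs.append(tf.image.resize(pooled, (h, w), method="bilinear"))
        return tf.concat(outputs, axis=-1)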
(3) Spatial entity relationship capture:
In order to obtain the spatial entity relationships inside the mixed feature generated in step (2), a self-attention mechanism is introduced. Self-attention is widely accepted in natural language processing and has gradually been applied to computer vision in recent years; it can effectively capture long-range relationships and global information. This step uses self-attention to obtain the relationships between the entities at different spatial positions of the picture feature. For any two mixed spatial feature vectors, the larger their product, the greater their similarity and the stronger their correlation.
In addition, because the mixed feature contains the natural language text vector, the language can guide the capture and generation of the relevant entity relationships in the image, which also helps the subsequent multi-layer image-text relationship reinforcement to localize the described entities explicitly. The added regular spatial position resolves absolute positional relationships in the natural language description, such as "the upper left corner of the picture" or "the right side of the picture".
(4) Multi-layer image-text relationship reinforcement:
The image-text relationship is reinforced by computing the similarity between the natural language text vector from step (1) and the self-attention result Attention(Q, K, V) generated in step (3); this reinforcement is repeated in a loop with natural language text vectors of different scales, guiding the generation of the text-guided segmentation result.
The more similar a spatial vector is to the natural language text vector, the larger its weight and the more likely it belongs to the final segmentation result. The scale of the natural language text vector matches the number of image channels, so the text vector is reduced together with the number of channels during the upsampling stage of the network.
The invention has the beneficial effects that:
compared with the prior art, the text-based image segmentation method can adapt to the complex natural language scene with a plurality of description entities, effectively captures the relationships between the entities and between the languages and the images in the picture, and correctly positions the description areas.
The method can be applied to various fields such as man-machine interaction and the like.
Drawings
FIG. 1 is a diagram illustrating an overall architecture of a text-guided image segmentation method based on multi-level explicit relationship selection according to the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
Fig. 1 shows the framework of the text-based image segmentation method of the invention; the main process is as follows:
all pictures are first resized to 320 x 320. The image features are extracted by using the feature extraction network of deplab pre-training, and the pre-training network can effectively save a large amount of training time and computing resources. Initializing word vectors in a random mode for natural language, embedding one-hot word vectors into 1000-dimensional vectors, and obtaining vector representation of sentences through an LSTM long-time memory network. The longest word number of the LSTM text is 20, and the specific calculation process of the long-time and short-time memory network is shown as a formula:
h_t = LSTM(x_t, h_{t-1})
where h_t is the LSTM output vector, x_t is the LSTM input vector, and h_{t-1} is the hidden state output at the previous step.
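A minimal sketch of this text encoder is given below. The 1000-dimensional embedding and the 20-word limit come from the description; the vocabulary size of 10000 and the LSTM hidden size of 500 (chosen here to match d_K = 500 used later) are assumptions, and the padded word ids in the example are hypothetical.

    import tensorflow as tf

    MAX_WORDS = 20     # longest text handled by the LSTM (from the description)
    EMBED_DIM = 1000   # dimension the one-hot word vectors are embedded into
    HIDDEN_DIM = 500   # assumed hidden size, matching d_K = 500 used later

    def build_text_encoder(vocab_size=10000):   # vocab_size is an assumption
        tokens = tf.keras.Input(shape=(MAX_WORDS,), dtype=tf.int32)
        x = tf.keras.layers.Embedding(vocab_size, EMBED_DIM, mask_zero=True)(tokens)
        # h_t = LSTM(x_t, h_{t-1}); return_sequences=False keeps only the final
        # hidden state, which represents the whole sentence.
        h_last = tf.keras.layers.LSTM(HIDDEN_DIM)(x)
        return tf.keras.Model(tokens, h_last)

    encoder = build_text_encoder()
    sentence = tf.constant([[12, 7, 931, 4] + [0] * 16])   # padded word ids (hypothetical)
    text_vec = encoder(sentence)                            # shape (1, 500)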
Pyramid pooling is then used to compensate for the limited receptive field of the convolutional network and its lack of global information. Pooling with 1×1, 2×2, 3×3 and 6×6 bins captures the entity relationships in the picture and produces picture features with global information, which are passed to the spatial entity relationship capture step.
In order to acquire the entity relationships in the picture and realize spatial entity relationship capture, the invention adopts a self-attention mechanism to capture the correlations between the entities in the picture; the vector similarity of the self-attention mechanism is computed as:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)
where head_i = Attention(Q_i, K_i, V_i)
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / sqrt(d_K)) V_i
wherein Q_i = M W_i^Q, K_i = M W_i^K, V_i = M W_i^V. M is the mixed picture spatial feature with global information output by the pyramid pooling layer; W_i^Q, W_i^K and W_i^V are the learnable embedding weight matrices of Q, K and V respectively, none of the weights are shared and all have the same output dimension; i ∈ {1, 2, …, w×h} denotes the i-th spatial vector in the feature map of width w and height h. d_K denotes the dimension of the picture features. In the experiments, w = 40, h = 40 and d_K = 500. The multi-head self-attention mechanism lets the model learn related information in different representation subspaces and improves its accuracy. Repeated experiments showed that setting the number of heads h to 5 preserves the quality of the prediction while consuming fewer computing resources.
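For illustration, the sketch below implements the multi-head self-attention above for the 40×40 mixed feature map with h = 5 heads and d_K = 500. Building the projection layers inside the function is a simplification for readability; in a trainable model they would be persistent layers with their own variables.

    import tensorflow as tf

    NUM_HEADS = 5            # h = 5 heads, as chosen in the experiments
    D_K = 500                # per-head dimension, taken from d_K above
    W_FEAT = H_FEAT = 40     # spatial size of the mixed feature map

    def spatial_self_attention(mixed_feat):
        # Flatten the map so that every spatial position becomes one vector of M,
        # then apply Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / sqrt(d_K)) V_i
        # per head and concatenate the heads.
        d_model = mixed_feat.shape[-1]
        M = tf.reshape(mixed_feat, (-1, H_FEAT * W_FEAT, d_model))
        heads = []
        for _ in range(NUM_HEADS):
            # Non-shared learnable projections W_i^Q, W_i^K, W_i^V.
            Q = tf.keras.layers.Dense(D_K, use_bias=False)(M)
            K = tf.keras.layers.Dense(D_K, use_bias=False)(M)
            V = tf.keras.layers.Dense(D_K, use_bias=False)(M)
            scores = tf.matmul(Q, K, transpose_b=True) / (D_K ** 0.5)
            heads.append(tf.matmul(tf.nn.softmax(scores, axis=-1), V))
        out = tf.concat(heads, axis=-1)              # Concat(head_1, ..., head_h)
        return tf.reshape(out, (-1, H_FEAT, W_FEAT, NUM_HEADS * D_K))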
However, spatial entity relationship capture obtains the entity relationships of the picture on a low-resolution feature map, and for an image segmentation task that must produce an accurate boundary contour, the low-resolution map blurs boundary information and reduces prediction accuracy. Therefore, bilinear upsampling is used during the upsampling stage, and the features generated by the convolutional network during feature extraction are reused. After the upsampled feature and the reused feature are concatenated, the image-text relationship is reinforced several times, and the multi-scale language vectors re-confirm the position of the described entity in the image to improve localization accuracy. The described entity is re-localized by computing the similarity between the language text vector and each picture spatial feature vector: the larger the similarity of the two vectors, the higher the probability that the spatial vector belongs to the entity pixels described by the text, and the larger its weight. The calculation is as follows:
S = ReLU(W h_t · [G; V_{i-1}])
V_i = S [G; V_{i-1}]
where W is the learnable embedding weight matrix of h_t, and h_t is the language vector, which is compressed by a linear transformation into the same dimension as the number of picture channels. G is the corresponding reused feature generated during picture feature extraction; [;] denotes concatenation; V_{i-1} and V_i are the feature results output by the image-text relationship reinforcement, where i-1 denotes the output of the previous reinforcement layer, i denotes the output of the current layer, and V_0 is the spatial entity relationship capture result. The invention adjusts the result with several rounds of image-text relationship reinforcement. Finally, the feature map produced by the last reinforcement layer is compressed into a one-channel feature map by a 1×1 convolution, and pixel-by-pixel classification through a sigmoid activation function generates the final segmentation result.
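A minimal sketch of one reinforcement round and of the final prediction head is given below. Treating S as a per-pixel scalar weight obtained from the dot product between the projected language vector and each spatial vector of [G; V_{i-1}] is one reading of the formulas above, and the bilinear upsampling between rounds (tf.image.resize) is omitted for brevity.

    import tensorflow as tf

    def text_reinforce(V_prev, G, h_t):
        # One round of reinforcement:
        #   S   = ReLU(W h_t . [G ; V_{i-1}])   per-pixel similarity weight
        #   V_i = S [G ; V_{i-1}]
        # V_prev: previous result (B, H, W, C1); G: reused backbone feature at the
        # same resolution (B, H, W, C2); h_t: sentence vector (B, D).
        fused = tf.concat([G, V_prev], axis=-1)                  # [G ; V_{i-1}]
        channels = fused.shape[-1]
        # W: learnable projection compressing the language vector to the channel width.
        w_ht = tf.keras.layers.Dense(channels, use_bias=False)(h_t)
        sim = tf.einsum("bhwc,bc->bhw", fused, w_ht)             # dot product per pixel
        S = tf.nn.relu(sim)[..., None]
        return S * fused                                         # V_i

    def segmentation_head(V_last):
        # Final 1x1 convolution to one channel, then sigmoid for per-pixel classification.
        logits = tf.keras.layers.Conv2D(filters=1, kernel_size=1)(V_last)
        return tf.sigmoid(logits)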
Examples
In this example, the deep learning framework TensorFlow is used on a GTX 1080 (8 GB) graphics card.
Data set: the experimental evaluation is performed on the standard public data set G-Ref. The data set contains 26711 pictures and 104560 natural language sentences with an average length of 8.43 words, the longest natural language texts among the text-based image segmentation data sets.
Ablation experiment: to demonstrate the effectiveness of each step of the text-guided image segmentation method based on multi-level explicit relationship selection, the IoU metric is evaluated on the G-Ref data set. The results are shown in Table 1. The ablation experiments show that the method effectively improves the accuracy of the results.
Table 1 Segmentation results of the ablation experiments with different step combinations
(Table 1 is provided as an image in the original publication.)
Compared with the prior art, the method localizes the regions described by complex multi-entity natural language texts more accurately and robustly.

Claims (1)

1. The text guidance image segmentation method based on multi-level explicit relationship selection is characterized by comprising the following steps of:
(1) feature extraction:
performing feature extraction on the input RGB picture and the natural language text; semantic features of the RGB picture are extracted by a convolutional neural network, and the pre-trained parameters of a DeepLab semantic segmentation model are used as the initial parameters of the convolutional neural network; for the natural language text, each word is represented as a one-hot vector, the obtained vector is embedded into a low-dimensional vector and input into an LSTM long short-term memory network, and the final hidden state is taken as the vector representation of the whole natural language text;
(2) pyramid pooling:
firstly, the picture features from step (1) are connected with the natural language text vector and with the regular spatial position vector generated from the spatial position of each pixel to form a mixed feature; then a pyramid pooling method generates a mixed feature with global information, specifically: the mixed feature is copied and divided into four parts according to the number of channels, the four feature maps are divided into 1×1, 2×2, 3×3 and 6×6 bins respectively, each bin is average-pooled, and the pooling results are connected to the original feature map to obtain global information of different sizes;
(3) spatial entity relationship capture:
in order to obtain the spatial entity relationships in the mixed feature generated in step (2), the relationships between the entities at different spatial positions of the picture feature are obtained by a self-attention mechanism; for any two mixed spatial feature vectors, the larger the product of the two vectors, the greater their similarity and the stronger their correlation;
the vector similarity of the self-attention mechanism is calculated as:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)
where head_i = Attention(Q_i, K_i, V_i)
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / sqrt(d_K)) V_i
wherein Q_i = M W_i^Q, K_i = M W_i^K, V_i = M W_i^V; M is the mixed picture spatial feature with global information output by the pyramid pooling layer; W_i^Q, W_i^K and W_i^V are the learnable embedding weight matrices of Q, K and V respectively, none of the weights are shared and all have the same output dimension; i ∈ {1, 2, …, w×h} denotes the i-th spatial vector in the feature map of width w and height h; d_K denotes the dimension of the picture features;
(4) multi-layer image-text relationship reinforcement:
the image-text relationship is reinforced by computing the similarity between the natural language text vector from step (1) and the self-attention result Attention(Q, K, V) generated in step (3), and the reinforcement is performed in a loop with natural language text vectors of different scales to guide the generation of the text-guided image segmentation result; specifically:
during upsampling, bilinear upsampling is adopted and the features generated by the convolutional network during feature extraction are reused; after the upsampled feature and the reused feature are connected, the image-text relationship is reinforced several times, and the multi-scale language vectors re-confirm the position of the described entity in the image; the described entity is re-localized by computing the similarity between the language text vector and the picture spatial feature vectors: the larger the similarity of the two vectors, the higher the probability that the spatial vector belongs to the entity pixels described by the text and the larger its weight; the calculation is as follows:
S = ReLU(W h_t · [G; V_{i-1}])
V_i = S [G; V_{i-1}]
where W is the learnable embedding weight matrix of h_t; h_t is the language vector, which is compressed by a linear transformation into the same dimension as the number of picture channels; G is the corresponding reused feature generated during picture feature extraction; [;] denotes concatenation; V_{i-1} and V_i are the feature results output by the image-text relationship reinforcement, where i-1 denotes the output of the previous reinforcement layer, i denotes the output of the current layer, and V_0 is the spatial entity relationship capture result;
the result is adjusted with multiple rounds of image-text relationship reinforcement; finally, the feature map produced by the last image-text reinforcement layer is compressed into a one-channel feature map by a 1×1 convolution, and pixel-by-pixel classification through a sigmoid activation function generates the final segmentation result.
CN202010882340.7A 2020-08-28 2020-08-28 Text guidance image segmentation method based on multi-level explicit relation selection Active CN112037239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010882340.7A CN112037239B (en) 2020-08-28 2020-08-28 Text guidance image segmentation method based on multi-level explicit relation selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010882340.7A CN112037239B (en) 2020-08-28 2020-08-28 Text guidance image segmentation method based on multi-level explicit relation selection

Publications (2)

Publication Number Publication Date
CN112037239A CN112037239A (en) 2020-12-04
CN112037239B (en) 2022-09-13

Family

ID=73586770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010882340.7A Active CN112037239B (en) 2020-08-28 2020-08-28 Text guidance image segmentation method based on multi-level explicit relation selection

Country Status (1)

Country Link
CN (1) CN112037239B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651982A (en) * 2021-01-12 2021-04-13 杭州智睿云康医疗科技有限公司 Image segmentation method and system based on image and non-image information
CN113269021B (en) * 2021-03-18 2024-03-01 北京工业大学 Non-supervision video target segmentation method based on local global memory mechanism

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033321B (en) * 2018-07-18 2021-12-17 成都快眼科技有限公司 Image and natural language feature extraction and keyword-based language indication image segmentation method
CN110598713B (en) * 2019-08-06 2022-05-06 厦门大学 Intelligent image automatic description method based on deep neural network
CN110674850A (en) * 2019-09-03 2020-01-10 武汉大学 Image description generation method based on attention mechanism
CN110619313B (en) * 2019-09-20 2023-09-12 西安电子科技大学 Remote sensing image discriminant description generation method

Also Published As

Publication number Publication date
CN112037239A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
RU2691214C1 (en) Text recognition using artificial intelligence
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN110782420A (en) Small target feature representation enhancement method based on deep learning
US20160364633A1 (en) Font recognition and font similarity learning using a deep neural network
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN110414344B (en) Character classification method based on video, intelligent terminal and storage medium
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN111563502A (en) Image text recognition method and device, electronic equipment and computer storage medium
CN111626297A (en) Character writing quality evaluation method and device, electronic equipment and recording medium
CN114037674B (en) Industrial defect image segmentation detection method and device based on semantic context
CN112037239B (en) Text guidance image segmentation method based on multi-level explicit relation selection
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN112070174A (en) Text detection method in natural scene based on deep learning
CN113837366A (en) Multi-style font generation method
CN113762269A (en) Chinese character OCR recognition method, system, medium and application based on neural network
CN111985525A (en) Text recognition method based on multi-mode information fusion processing
CN113936195A (en) Sensitive image recognition model training method and device and electronic equipment
Shi et al. CloudU-Netv2: A cloud segmentation method for ground-based cloud images based on deep learning
CN112836702A (en) Text recognition method based on multi-scale feature extraction
Fan et al. A novel sonar target detection and classification algorithm
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN113837290A (en) Unsupervised unpaired image translation method based on attention generator network
Ling et al. A facial expression recognition system for smart learning based on YOLO and vision transformer
CN116485860A (en) Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features
CN113421314B (en) Multi-scale bimodal text image generation method based on generation countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant