CN112037239B - Text guidance image segmentation method based on multi-level explicit relation selection - Google Patents

Text guidance image segmentation method based on multi-level explicit relation selection

Info

Publication number
CN112037239B
CN112037239B (application CN202010882340.7A)
Authority
CN
China
Prior art keywords
text
vector
feature
relationship
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010882340.7A
Other languages
Chinese (zh)
Other versions
CN112037239A (en)
Inventor
刘宇
李新宇
徐凯平
冯毅强
张海洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202010882340.7A
Publication of CN112037239A
Application granted
Publication of CN112037239B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/12 Edge-based segmentation
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a text-guided image segmentation method based on multi-level explicit relationship selection, which guides the segmentation from multiple angles and levels, such as the entity relationships in the image semantics and multi-scale text, so that accurate results are obtained even for rich and complex language descriptions. The method mainly comprises the following steps: feature extraction, pyramid pooling, spatial entity relationship capture, and multi-layer image-text relationship reinforcement. A convolutional neural network extracts semantic features from the picture; pyramid pooling with bins of different sizes produces picture features carrying global information; a self-attention mechanism captures the relationships between the entities in the picture space, which effectively improves the accuracy of entity localization when a sentence describes several entities; finally, the relationship between the image and the language is reinforced in a loop using natural language text vectors of different scales, correcting the intermediate result several times to obtain a more robust segmentation.

Description

Text guidance image segmentation method based on multi-level explicit relation selection
Technical Field
The invention belongs to the technical field at the intersection of computer vision and natural language processing, and relates to a text-guided image segmentation method based on multi-level explicit relationship selection that is designed for complex natural language texts describing multiple entities.
Background
With the advent of the artificial intelligence era, the demand for interaction between humans, computers and intelligent machines keeps increasing. How to make a machine understand complex natural language, share the human visual perspective, observe the world as humans do, and act according to human intentions has become a hot topic in the field. Image segmentation is a traditional research area of computer vision that continues to attract attention; in recent years it has found wide application in automatic driving, human-computer interaction, virtual reality, medical imaging and other fields. Combining natural language with image processing can therefore promote the development of human-computer interaction and enable barrier-free communication between machines and humans.
Text-based image segmentation is a research branch of the segmentation task that is closer to practical application requirements: it segments the region of a picture specified by a natural language description. Compared with ordinary segmentation, the method must understand rich and varied natural language expressions, reason about the multi-entity relationships in the picture according to the object relationships mentioned in the language, localize the described object correctly, and segment the localized region accurately. Most existing text-based image segmentation methods simply concatenate language features with image features and classify the final result pixel by pixel; they lack an explicit, language-guided generation of the segmentation result and a process that captures and reasons about the relationships between entities in the image, which easily leads to inaccurate segmentation regions and imprecise boundary contours.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a text-guided image segmentation method based on multi-level explicit relationship selection, which captures the relationships between the entities in the picture, reasons explicitly over this global information, and uses the natural language text to guide the generation of the segmentation result. The method can handle complex natural language texts containing several entity descriptions and effectively improves the accuracy of the segmentation result for such inputs.
To achieve this, the invention adopts the following technical scheme:
a text guidance image segmentation method based on multi-level explicit relationship selection comprises the following steps:
(1) feature extraction:
Feature extraction is performed on the input RGB picture and the natural language text. For the RGB picture, a convolutional neural network extracts the semantic features; since the method belongs to the image segmentation family, the pre-trained parameters of a DeepLab semantic segmentation model are used as the initial parameters of the convolutional neural network, which effectively reduces training time and improves the generalization ability of the network. For the natural language text, each word is represented as a one-hot vector, embedded into a low-dimensional vector, and fed into an LSTM (long short-term memory) network; the final hidden state serves as the vector representation of the whole natural language text, i.e. the low-dimensional word vectors are fed into the LSTM step by step and the sentence representation is obtained after the last step.
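For illustration only, the following minimal TensorFlow sketch shows how such a visual feature extractor could be set up. The patent only states that DeepLab pre-trained parameters initialize the convolutional network; the ResNet-101 trunk, the ImageNet weights and the chosen intermediate layer are assumptions made here so that a 320×320 input yields a 40×40 feature map.

    import tensorflow as tf

    def build_visual_backbone(input_size=320):
        # The patent initialises its CNN with DeepLab pre-trained weights; the exact
        # backbone is not specified, so a ResNet-101 trunk (commonly used inside
        # DeepLab) with ImageNet weights stands in here as an assumption.
        base = tf.keras.applications.ResNet101(
            include_top=False,
            weights="imagenet",
            input_shape=(input_size, input_size, 3),
        )
        # An intermediate stage with output stride 8 keeps the spatial resolution at
        # 40 x 40 for a 320 x 320 input, matching w = h = 40 used in the experiments.
        feat = base.get_layer("conv3_block4_out").output
        return tf.keras.Model(base.input, feat)

    backbone = build_visual_backbone()
    image = tf.random.uniform((1, 320, 320, 3))   # stand-in for an input RGB picture
    visual_feat = backbone(image)                 # shape (1, 40, 40, 512)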
(2) Pyramid pooling:
Since text-based image segmentation requires reasoning over the whole image according to the language, global information is needed in the image features, and a pyramid pooling method is therefore used to add it. First, the picture features from step (1) are concatenated with the natural language text vector and with a regular spatial position vector generated from the spatial position of each pixel, yielding a mixed feature; a pyramid pooling method then turns this into a mixed feature with global information.
Specifically, the mixed feature is copied and split into four parts along the channel dimension; the four feature maps are divided into 1×1, 2×2, 3×3 and 6×6 bins respectively, each bin is average-pooled, and the pooling results are concatenated back onto the original feature map, providing global information at different scales.
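As a non-limiting sketch, the TensorFlow code below shows one way to build the mixed feature and apply the pyramid pooling described above. The use of area-interpolation resizing as the per-bin average pooling, the bilinear resizing that broadcasts the pooled values back to the full map before concatenation, and the coordinate range [-1, 1] are implementation assumptions; the patent itself only specifies the four channel groups and the 1×1, 2×2, 3×3 and 6×6 bins.

    import tensorflow as tf

    def mixed_feature(visual_feat, text_vec):
        # Concatenate the visual features with the tiled sentence vector and two
        # normalised coordinate channels (the "regular spatial position vector").
        b, h, w, _ = visual_feat.shape
        text_map = tf.tile(text_vec[:, None, None, :], [1, h, w, 1])
        ys, xs = tf.meshgrid(tf.linspace(-1.0, 1.0, h),
                             tf.linspace(-1.0, 1.0, w), indexing="ij")
        coords = tf.tile(tf.stack([xs, ys], axis=-1)[None], [b, 1, 1, 1])
        return tf.concat([visual_feat, text_map, coords], axis=-1)

    def pyramid_pooling(feat, bin_sizes=(1, 2, 3, 6)):
        # Split the channels into four groups (channel count assumed divisible by 4),
        # average-pool each group into 1x1 / 2x2 / 3x3 / 6x6 bins (area resize
        # approximates adaptive average pooling), broadcast the pooled maps back to
        # full size and append them to the original feature map.
        h, w = feat.shape[1], feat.shape[2]
        groups = tf.split(feat, num_or_size_splits=4, axis=-1)
        outputs = [feat]
        for g, s in zip(groups, bin_sizes):
            pooled = tf.image.resize(g, (s, s), method="area")
            outputs.append(tf.image.resize(pooled, (h, w), method="bilinear"))
        return tf.concat(outputs, axis=-1)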
(3) Spatial entity relationship capture:
In order to obtain the spatial entity relationships inside the mixed feature generated in step (2), a self-attention mechanism is introduced. Self-attention is widely accepted in natural language processing and has gradually been applied to computer vision in recent years; it can effectively capture long-range relationships and global information. This step uses self-attention to obtain the relationships between the entities at different spatial positions of the picture feature. For any two mixed spatial feature vectors, the larger their product, the greater their similarity and the stronger their correlation.
In addition, because the mixed feature contains the natural language text vector, the language can guide the capture and generation of the relevant entity relationships in the image, which also helps the subsequent multi-layer image-text relationship reinforcement to localize the described entities explicitly. The added regular spatial position resolves absolute positional relationships in the natural language description, such as "the upper left corner of the picture" or "the right side of the picture".
(4) Multi-layer image-text relationship reinforcement:
The image-text relationship is reinforced by computing the similarity between the natural language text vector from step (1) and the self-attention result Attention(Q, K, V) generated in step (3); this reinforcement is repeated in a loop with natural language text vectors of different scales, guiding the generation of the text-guided segmentation result.
The more similar a spatial vector is to the natural language text vector, the larger its weight and the more likely it belongs to the final segmentation result. The scale of the natural language text vector matches the number of image channels, so the text vector is reduced together with the number of channels during the upsampling stage of the network.
The invention has the beneficial effects that:
compared with the prior art, the text-based image segmentation method can adapt to the complex natural language scene with a plurality of description entities, effectively captures the relationships between the entities and between the languages and the images in the picture, and correctly positions the description areas.
The method can be applied to various fields such as man-machine interaction and the like.
Drawings
FIG. 1 is a diagram illustrating an overall architecture of a text-guided image segmentation method based on multi-level explicit relationship selection according to the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
Fig. 1 shows the framework of the text-based image segmentation method of the invention; the main process is as follows:
all pictures are first resized to 320 x 320. The image features are extracted by using the feature extraction network of deplab pre-training, and the pre-training network can effectively save a large amount of training time and computing resources. Initializing word vectors in a random mode for natural language, embedding one-hot word vectors into 1000-dimensional vectors, and obtaining vector representation of sentences through an LSTM long-time memory network. The longest word number of the LSTM text is 20, and the specific calculation process of the long-time and short-time memory network is shown as a formula:
h_t = LSTM(x_t, h_{t-1})
where h_t is the LSTM output vector, x_t is the LSTM input vector, and h_{t-1} is the hidden state output at the previous step.
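A minimal sketch of this text encoder is given below. The 1000-dimensional embedding and the 20-word limit come from the description; the vocabulary size of 10000 and the LSTM hidden size of 500 (chosen here to match d_K = 500 used later) are assumptions, and the padded word ids in the example are hypothetical.

    import tensorflow as tf

    MAX_WORDS = 20     # longest text handled by the LSTM (from the description)
    EMBED_DIM = 1000   # dimension the one-hot word vectors are embedded into
    HIDDEN_DIM = 500   # assumed hidden size, matching d_K = 500 used later

    def build_text_encoder(vocab_size=10000):   # vocab_size is an assumption
        tokens = tf.keras.Input(shape=(MAX_WORDS,), dtype=tf.int32)
        x = tf.keras.layers.Embedding(vocab_size, EMBED_DIM, mask_zero=True)(tokens)
        # h_t = LSTM(x_t, h_{t-1}); return_sequences=False keeps only the final
        # hidden state, which represents the whole sentence.
        h_last = tf.keras.layers.LSTM(HIDDEN_DIM)(x)
        return tf.keras.Model(tokens, h_last)

    encoder = build_text_encoder()
    sentence = tf.constant([[12, 7, 931, 4] + [0] * 16])   # padded word ids (hypothetical)
    text_vec = encoder(sentence)                            # shape (1, 500)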
Pyramid pooling is then used to compensate for the limited receptive field of the convolutional network and its lack of global information. Pooling with 1×1, 2×2, 3×3 and 6×6 bins captures the entity relationships in the picture and produces picture features with global information, which are passed to the spatial entity relationship capture step.
In order to acquire the entity relationships in the picture and realize spatial entity relationship capture, the invention adopts a self-attention mechanism to capture the correlations between the entities in the picture; the vector similarity of the self-attention mechanism is computed as:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)
where head_i = Attention(Q_i, K_i, V_i)
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / sqrt(d_K)) V_i
wherein Q_i = M W_i^Q, K_i = M W_i^K, V_i = M W_i^V. M is the mixed picture spatial feature with global information output by the pyramid pooling layer; W_i^Q, W_i^K and W_i^V are the learnable embedding weight matrices of Q, K and V respectively, none of the weights are shared and all have the same output dimension; i ∈ {1, 2, …, w×h} denotes the i-th spatial vector in the feature map of width w and height h. d_K denotes the dimension of the picture features. In the experiments, w = 40, h = 40 and d_K = 500. The multi-head self-attention mechanism lets the model learn related information in different representation subspaces and improves its accuracy. Repeated experiments showed that setting the number of heads h to 5 preserves the quality of the prediction while consuming fewer computing resources.
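For illustration, the sketch below implements the multi-head self-attention above for the 40×40 mixed feature map with h = 5 heads and d_K = 500. Building the projection layers inside the function is a simplification for readability; in a trainable model they would be persistent layers with their own variables.

    import tensorflow as tf

    NUM_HEADS = 5            # h = 5 heads, as chosen in the experiments
    D_K = 500                # per-head dimension, taken from d_K above
    W_FEAT = H_FEAT = 40     # spatial size of the mixed feature map

    def spatial_self_attention(mixed_feat):
        # Flatten the map so that every spatial position becomes one vector of M,
        # then apply Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / sqrt(d_K)) V_i
        # per head and concatenate the heads.
        d_model = mixed_feat.shape[-1]
        M = tf.reshape(mixed_feat, (-1, H_FEAT * W_FEAT, d_model))
        heads = []
        for _ in range(NUM_HEADS):
            # Non-shared learnable projections W_i^Q, W_i^K, W_i^V.
            Q = tf.keras.layers.Dense(D_K, use_bias=False)(M)
            K = tf.keras.layers.Dense(D_K, use_bias=False)(M)
            V = tf.keras.layers.Dense(D_K, use_bias=False)(M)
            scores = tf.matmul(Q, K, transpose_b=True) / (D_K ** 0.5)
            heads.append(tf.matmul(tf.nn.softmax(scores, axis=-1), V))
        out = tf.concat(heads, axis=-1)              # Concat(head_1, ..., head_h)
        return tf.reshape(out, (-1, H_FEAT, W_FEAT, NUM_HEADS * D_K))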
However, spatial entity relationship capture obtains the entity relationships of the picture on a low-resolution feature map, and for an image segmentation task that must produce an accurate boundary contour, the low-resolution map blurs boundary information and reduces prediction accuracy. Therefore, bilinear upsampling is used during the upsampling stage, and the features generated by the convolutional network during feature extraction are reused. After the upsampled feature and the reused feature are concatenated, the image-text relationship is reinforced several times, and the multi-scale language vectors re-confirm the position of the described entity in the image to improve localization accuracy. The described entity is re-localized by computing the similarity between the language text vector and each picture spatial feature vector: the larger the similarity of the two vectors, the higher the probability that the spatial vector belongs to the entity pixels described by the text, and the larger its weight. The calculation is as follows:
S = ReLU(W h_t · [G; V_{i-1}])
V_i = S [G; V_{i-1}]
where W is the learnable embedding weight matrix of h_t, and h_t is the language vector, which is compressed by a linear transformation into the same dimension as the number of picture channels. G is the corresponding reused feature generated during picture feature extraction; [;] denotes concatenation; V_{i-1} and V_i are the feature results output by the image-text relationship reinforcement, where i-1 denotes the output of the previous reinforcement layer, i denotes the output of the current layer, and V_0 is the spatial entity relationship capture result. The invention adjusts the result with several rounds of image-text relationship reinforcement. Finally, the feature map produced by the last reinforcement layer is compressed into a one-channel feature map by a 1×1 convolution, and pixel-by-pixel classification through a sigmoid activation function generates the final segmentation result.
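A minimal sketch of one reinforcement round and of the final prediction head is given below. Treating S as a per-pixel scalar weight obtained from the dot product between the projected language vector and each spatial vector of [G; V_{i-1}] is one reading of the formulas above, and the bilinear upsampling between rounds (tf.image.resize) is omitted for brevity.

    import tensorflow as tf

    def text_reinforce(V_prev, G, h_t):
        # One round of reinforcement:
        #   S   = ReLU(W h_t . [G ; V_{i-1}])   per-pixel similarity weight
        #   V_i = S [G ; V_{i-1}]
        # V_prev: previous result (B, H, W, C1); G: reused backbone feature at the
        # same resolution (B, H, W, C2); h_t: sentence vector (B, D).
        fused = tf.concat([G, V_prev], axis=-1)                  # [G ; V_{i-1}]
        channels = fused.shape[-1]
        # W: learnable projection compressing the language vector to the channel width.
        w_ht = tf.keras.layers.Dense(channels, use_bias=False)(h_t)
        sim = tf.einsum("bhwc,bc->bhw", fused, w_ht)             # dot product per pixel
        S = tf.nn.relu(sim)[..., None]
        return S * fused                                         # V_i

    def segmentation_head(V_last):
        # Final 1x1 convolution to one channel, then sigmoid for per-pixel classification.
        logits = tf.keras.layers.Conv2D(filters=1, kernel_size=1)(V_last)
        return tf.sigmoid(logits)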
Examples
In this example, the deep learning framework TensorFlow is used on a GTX 1080 (8 GB) graphics card.
Data set: the experimental evaluation is performed on the standard public data set G-Ref. The data set contains 26711 pictures and 104560 natural language sentences with an average length of 8.43 words, the longest natural language texts among the text-based image segmentation data sets.
Ablation experiment: to demonstrate the effectiveness of each step of the text-guided image segmentation method based on multi-level explicit relationship selection, the IoU metric is evaluated on the G-Ref data set. The results are shown in Table 1. The ablation experiments show that the method effectively improves the accuracy of the results.
Table 1 Segmentation results of the ablation experiments with different step combinations
(Table 1 is provided as an image in the original publication.)
Compared with the prior art, the method localizes the regions described by complex multi-entity natural language texts more accurately and robustly.

Claims (1)

1. The text guidance image segmentation method based on multi-level explicit relationship selection is characterized by comprising the following steps of:
(1) feature extraction:
performing feature extraction on the input RGB picture and the natural language text; semantic features of the RGB picture are extracted by a convolutional neural network, and the pre-trained parameters of a DeepLab semantic segmentation model are used as the initial parameters of the convolutional neural network; for the natural language text, each word is represented as a one-hot vector, the obtained vector is embedded into a low-dimensional vector and input into an LSTM long short-term memory network, and the final hidden state is taken as the vector representation of the whole natural language text;
(2) pyramid pooling:
firstly, the picture features from step (1) are connected with the natural language text vector and with the regular spatial position vector generated from the spatial position of each pixel to form a mixed feature; then a pyramid pooling method generates a mixed feature with global information, specifically: the mixed feature is copied and divided into four parts according to the number of channels, the four feature maps are divided into 1×1, 2×2, 3×3 and 6×6 bins respectively, each bin is average-pooled, and the pooling results are connected to the original feature map to obtain global information of different sizes;
(3) spatial entity relationship capture:
in order to obtain the spatial entity relationships in the mixed feature generated in step (2), the relationships between the entities at different spatial positions of the picture feature are obtained by a self-attention mechanism; for any two mixed spatial feature vectors, the larger the product of the two vectors, the greater their similarity and the stronger their correlation;
the vector similarity of the self-attention mechanism is calculated as:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)
where head_i = Attention(Q_i, K_i, V_i)
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / sqrt(d_K)) V_i
wherein Q_i = M W_i^Q, K_i = M W_i^K, V_i = M W_i^V; M is the mixed picture spatial feature with global information output by the pyramid pooling layer; W_i^Q, W_i^K and W_i^V are the learnable embedding weight matrices of Q, K and V respectively, none of the weights are shared and all have the same output dimension; i ∈ {1, 2, …, w×h} denotes the i-th spatial vector in the feature map of width w and height h; d_K denotes the dimension of the picture features;
(4) multi-layer image-text relationship reinforcement:
the image-text relationship is reinforced by computing the similarity between the natural language text vector from step (1) and the self-attention result Attention(Q, K, V) generated in step (3), and the reinforcement is performed in a loop with natural language text vectors of different scales to guide the generation of the text-guided image segmentation result; specifically:
during upsampling, bilinear upsampling is adopted and the features generated by the convolutional network during feature extraction are reused; after the upsampled feature and the reused feature are connected, the image-text relationship is reinforced several times, and the multi-scale language vectors re-confirm the position of the described entity in the image; the described entity is re-localized by computing the similarity between the language text vector and the picture spatial feature vectors: the larger the similarity of the two vectors, the higher the probability that the spatial vector belongs to the entity pixels described by the text and the larger its weight; the calculation is as follows:
S = ReLU(W h_t · [G; V_{i-1}])
V_i = S [G; V_{i-1}]
where W is the learnable embedding weight matrix of h_t; h_t is the language vector, which is compressed by a linear transformation into the same dimension as the number of picture channels; G is the corresponding reused feature generated during picture feature extraction; [;] denotes concatenation; V_{i-1} and V_i are the feature results output by the image-text relationship reinforcement, where i-1 denotes the output of the previous reinforcement layer, i denotes the output of the current layer, and V_0 is the spatial entity relationship capture result;
the result is adjusted with multiple rounds of image-text relationship reinforcement; finally, the feature map produced by the last image-text reinforcement layer is compressed into a one-channel feature map by a 1×1 convolution, and pixel-by-pixel classification through a sigmoid activation function generates the final segmentation result.
CN202010882340.7A 2020-08-28 2020-08-28 Text guidance image segmentation method based on multi-level explicit relation selection Active CN112037239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010882340.7A CN112037239B (en) 2020-08-28 2020-08-28 Text guidance image segmentation method based on multi-level explicit relation selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010882340.7A CN112037239B (en) 2020-08-28 2020-08-28 Text guidance image segmentation method based on multi-level explicit relation selection

Publications (2)

Publication Number Publication Date
CN112037239A CN112037239A (en) 2020-12-04
CN112037239B (en) 2022-09-13

Family

ID=73586770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010882340.7A Active CN112037239B (en) 2020-08-28 2020-08-28 Text guidance image segmentation method based on multi-level explicit relation selection

Country Status (1)

Country Link
CN (1) CN112037239B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651982A (en) * 2021-01-12 2021-04-13 杭州智睿云康医疗科技有限公司 Image segmentation method and system based on image and non-image information
CN113269021B (en) * 2021-03-18 2024-03-01 北京工业大学 Non-supervision video target segmentation method based on local global memory mechanism

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033321B (en) * 2018-07-18 2021-12-17 成都快眼科技有限公司 Image and natural language feature extraction and keyword-based language indication image segmentation method
CN110598713B (en) * 2019-08-06 2022-05-06 厦门大学 Intelligent image automatic description method based on deep neural network
CN110674850A (en) * 2019-09-03 2020-01-10 武汉大学 Image description generation method based on attention mechanism
CN110619313B (en) * 2019-09-20 2023-09-12 西安电子科技大学 Remote sensing image discriminant description generation method

Also Published As

Publication number Publication date
CN112037239A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
RU2691214C1 (en) Text recognition using artificial intelligence
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN110782420A (en) Small target feature representation enhancement method based on deep learning
US20160364633A1 (en) Font recognition and font similarity learning using a deep neural network
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN110414344B (en) Character classification method based on video, intelligent terminal and storage medium
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN111563502A (en) Image text recognition method and device, electronic equipment and computer storage medium
CN111626297A (en) Character writing quality evaluation method and device, electronic equipment and recording medium
CN114037674B (en) Industrial defect image segmentation detection method and device based on semantic context
CN112037239B (en) Text guidance image segmentation method based on multi-level explicit relation selection
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN112070174A (en) Text detection method in natural scene based on deep learning
CN113837366A (en) Multi-style font generation method
CN113762269A (en) Chinese character OCR recognition method, system, medium and application based on neural network
CN111985525A (en) Text recognition method based on multi-mode information fusion processing
CN113936195A (en) Sensitive image recognition model training method and device and electronic equipment
Shi et al. CloudU-Netv2: A cloud segmentation method for ground-based cloud images based on deep learning
CN112836702A (en) Text recognition method based on multi-scale feature extraction
Fan et al. A novel sonar target detection and classification algorithm
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN113837290A (en) Unsupervised unpaired image translation method based on attention generator network
Ling et al. A facial expression recognition system for smart learning based on YOLO and vision transformer
CN116485860A (en) Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features
CN113421314B (en) Multi-scale bimodal text image generation method based on generation countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant