CN109033321B - Image and natural language feature extraction and keyword-based language indication image segmentation method - Google Patents


Info

Publication number
CN109033321B
CN109033321B (application CN201810790480.4A; also published as CN109033321A)
Authority
CN
China
Prior art keywords
image
word
features
language
natural language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810790480.4A
Other languages
Chinese (zh)
Other versions
CN109033321A (en
Inventor
李宏亮 (Li Hongliang)
石恒璨 (Shi Hengcan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Kuaiyan Technology Co ltd
Original Assignee
Chengdu Kuaiyan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Kuaiyan Technology Co ltd filed Critical Chengdu Kuaiyan Technology Co ltd
Priority to CN201810790480.4A priority Critical patent/CN109033321B/en
Publication of CN109033321A publication Critical patent/CN109033321A/en
Application granted granted Critical
Publication of CN109033321B publication Critical patent/CN109033321B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image and natural language feature extraction method and a keyword-based language indication image segmentation method. On the basis of image feature extraction and natural language feature extraction, for an input image and an input natural language, according to the keywords contained in the natural language, three features are concatenated: the feature f_i of the image region i corresponding to the keywords, the keyword-weighted sentence feature q_i, and the corresponding keyword-based visual context feature c_i. The concatenated features are input into a multilayer perceptron for classification to obtain the segmentation result. Compared with the prior art, extracting the features of the image and the natural language makes it convenient to realize the keyword-based language indication image segmentation method; the segmentation method reduces the difficulty of processing long sentences, improves the accuracy of object localization and recognition, and thereby improves the precision of language indication image segmentation.

Description

Image and natural language feature extraction and keyword-based language indication image segmentation method
Technical Field
The invention relates to a method for image and natural language feature extraction and keyword-based language indication image segmentation, and belongs to the fields of image processing, computer vision, image segmentation, and joint processing of language and images.
Background
With the advent of the big-data era, massive data of different types flow through networks, and combining different types of data is a new requirement of this era. Among such combinations, the joint processing of images and natural language has received wide attention. Language indication image segmentation segments the object described by a natural language expression out of an image, and is a key step in the joint processing of language and images.
The existing techniques for language indication image segmentation mainly use deep neural networks to extract natural language and image features separately, and then combine them into new features with which to segment the image. These methods fall into two classes: sentence-based and word-based language indication image segmentation. The sentence-based methods extract the features of the whole sentence and combine them with the image features; the word-based methods extract the features of each word and combine each of them with the image features separately. These methods have two major drawbacks:
1. they ignore the differences in importance between words and treat every word equally, which makes long sentences difficult to process;
2. they do not consider the contextual relationships, such as appearance and position, between different regions inside the image, and these visual contextual relationships are often crucial for finding the object described by the natural language in the image.
Disclosure of Invention
The invention provides an image and natural language feature extraction method, which makes it convenient to realize a keyword-based language indication image segmentation method.
The invention further provides a keyword-based language indication image segmentation method, which reduces the difficulty of processing long sentences and improves the accuracy of object localization and recognition.
The invention provides an image and natural language feature extraction method, which comprises an image feature extraction method and a natural language feature extraction method; wherein,
the image feature extraction method comprises: extracting an image feature F from the input image using a deep convolutional neural network; the image feature is a two-dimensional feature map in which each feature vector f_i encodes the features of the corresponding region i in the image; and providing, according to the natural language features, the object position information required by the image segmentation task;
the natural language feature extraction method comprises: for the input natural language, encoding each word into a one-hot feature vector and then reducing its dimensionality by word embedding; the dimension-reduced words are input into a recurrent neural network in the order in which they appear in the original sentence; for the t-th word in the sentence, the recurrent neural network learns the word feature q_t; the feature q_t encodes the semantic information of word t itself and the contextual relationship between word t and the whole sentence; and the feature vectors of the different words form a matrix Q representing the features of the whole sentence.
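The encoding pipeline just described (one-hot vector → word embedding → recurrent network → word features q_t) can be sketched as follows. The toy vocabulary, dimensions, and random weights are illustrative assumptions, and a plain tanh recurrent cell stands in for whatever recurrent unit is used:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; sizes are illustrative, not the patent's.
vocab = {"the": 0, "red": 1, "cup": 2, "on": 3, "table": 4}
vocab_size, embed_dim, hidden_dim = len(vocab), 8, 16

E = rng.normal(size=(vocab_size, embed_dim))     # word-embedding matrix
W_xh = rng.normal(size=(embed_dim, hidden_dim))  # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) # hidden-to-hidden weights

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def sentence_features(words):
    """Encode each word as one-hot, embed it, and run a minimal RNN.

    Returns the matrix Q whose row t is the feature q_t of the t-th word.
    """
    h = np.zeros(hidden_dim)
    q = []
    for w in words:
        x = one_hot(vocab[w], vocab_size) @ E  # one-hot -> dense embedding
        h = np.tanh(x @ W_xh + h @ W_hh)       # recurrent update over the sentence
        q.append(h)
    return np.stack(q)                         # Q: (num_words, hidden_dim)

Q = sentence_features(["the", "red", "cup"])
print(Q.shape)  # (3, 16)
```

Each q_t depends on all earlier words through the hidden state h, which is how the contextual relationship with the rest of the sentence enters the word feature.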
The method of providing the object position information required by the image segmentation task according to the natural language features comprises extracting the relative position coordinates of each image region and concatenating them with the feature F to obtain the final visual feature V of each image region.
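A minimal sketch of this concatenation, assuming a 4 × 4 grid of regions with 8-dimensional features and two normalized coordinates per region (all sizes illustrative):

```python
import numpy as np

# Image features F: one feature vector f_i per region of a 4x4 grid.
H = W = 4
feat_dim = 8
F = np.random.default_rng(1).normal(size=(H * W, feat_dim))

# Relative coordinates of each region, normalized to [0, 1].
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
coords = np.stack([ys.ravel() / (H - 1), xs.ravel() / (W - 1)], axis=1)

# Final visual feature V: features concatenated with position coordinates.
V = np.concatenate([F, coords], axis=1)
print(V.shape)  # (16, 10)
```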
The keyword-based language indication image segmentation method of the invention is realized on the basis of the above image and natural language feature extraction method, and comprises the following steps:
for an input image and an input natural language, according to the keywords contained in the natural language, concatenating three features: the feature f_i of the image region i corresponding to the keywords, the keyword-weighted sentence feature q_i, and the corresponding keyword-based visual context feature c_i; and inputting the concatenated features into a multilayer perceptron for classification to obtain the segmentation result;
the multilayer perceptron is composed of two layers of neural networks, wherein the first layer comprises a ReLU activation function, and the second layer comprises a sigmoid activation function;
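The two-layer perceptron can be sketched as follows; the dimensions and random weights are illustrative, and the input stands in for the concatenation [f_i, q_i, c_i]:

```python
import numpy as np

rng = np.random.default_rng(2)
in_dim, hid_dim = 12, 6

W1, b1 = rng.normal(size=(in_dim, hid_dim)), np.zeros(hid_dim)
W2, b2 = rng.normal(size=(hid_dim, 1)), np.zeros(1)

def mlp(z):
    """Two-layer perceptron: ReLU in the first layer, sigmoid in the second.

    Input z is the concatenation [f_i, q_i, c_i]; the sigmoid output is the
    score that region i belongs to the described object.
    """
    h = np.maximum(0.0, z @ W1 + b1)              # first layer: ReLU
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # second layer: sigmoid

z = rng.normal(size=in_dim)  # stand-in for the concatenated features
p = mlp(z)
print(0.0 < p[0] < 1.0)  # True
```

The sigmoid in the second layer keeps each region's output in (0, 1), so thresholding it per region yields a segmentation mask.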
wherein the image regions i corresponding to the keywords are acquired by training a model to extract the keywords corresponding to each image region i; the training and extraction process comprises:
extracting keywords using a language attention model according to the obtained features of each word; the language attention model is composed of two neural network layers, the first with a tanh activation function and the second with no activation function; for each image region i, each word t is first concatenated with the features of that image region and then input into the language attention model to obtain an attention score; the attention scores are normalized so that each normalized score lies between 0 and 1, where the closer the score is to 1, the more critical word t is for image region i, and the closer it is to 0, the less important word t is for image region i;
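A sketch of this attention scoring, assuming illustrative dimensions and random weights, with softmax as the normalization (as the method further specifies):

```python
import numpy as np

rng = np.random.default_rng(3)
word_dim = region_dim = 5
num_words = 4

Wa = rng.normal(size=(word_dim + region_dim, 8))  # first layer (tanh)
wa = rng.normal(size=8)                           # second layer (no activation)

def attention_scores(region_feat, word_feats):
    """Score each word for one image region and softmax-normalize.

    Each word feature is concatenated with the region feature, passed
    through a tanh layer and a linear layer, then normalized to (0, 1).
    """
    raw = np.array([np.tanh(np.concatenate([q_t, region_feat]) @ Wa) @ wa
                    for q_t in word_feats])
    e = np.exp(raw - raw.max())  # softmax normalization over the words
    return e / e.sum()

f_i = rng.normal(size=region_dim)
Q = rng.normal(size=(num_words, word_dim))  # word features q_t
a = attention_scores(f_i, Q)
```

Scores close to 1 mark the words that region i treats as critical; softmax guarantees all scores lie strictly between 0 and 1 and sum to 1.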
modifying the sentence features with the attention scores, so that the influence of keywords in the sentence is increased and the influence of non-keywords is reduced: each normalized attention score is multiplied with the corresponding word feature q_t to weight the word features, and the weighted features of all words are then summed to generate, for image region i, the sentence feature q_i of the whole sentence;
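The weighting and summation can be expressed compactly; the scores and word features here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)
num_words, word_dim = 4, 5
Q = rng.normal(size=(num_words, word_dim))  # word features q_t
a = np.array([0.6, 0.25, 0.1, 0.05])        # normalized attention scores

# Weight each word feature by its score, then sum over words to obtain
# the sentence feature q_i for image region i.
q_i = (a[:, None] * Q).sum(axis=0)
print(q_i.shape)  # (5,)
```

Because the scores sum to 1, q_i is a convex combination of the word features, dominated by the region's keywords.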
setting a keyword screening threshold, where a normalized attention score larger than the threshold indicates that image region i regards word t as a keyword;
for each word t, finding all image regions that regard word t as a keyword and learning the contextual relationship of these regions: the features of these regions are first averaged to integrate the region information, and a fully connected layer is then used to learn the context feature g_t from the averaged feature;
after the visual context feature g_t corresponding to each keyword has been learned, integrating these features into the visual context feature corresponding to the whole sentence: for image region i, the visual context features g_t of its corresponding keywords are summed to generate the visual context feature c_i of the whole sentence;
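The two steps above (screening keywords by threshold, building per-word context features g_t, then summing them into per-region features c_i) can be sketched as follows; the sizes, random features, and the linear fully connected layer are illustrative assumptions (the activation of that layer is not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(5)
num_regions, num_words, dim = 6, 3, 5
V = rng.normal(size=(num_regions, dim))        # visual features per region
A = rng.random(size=(num_regions, num_words))  # attention scores per (region, word)
A /= A.sum(axis=1, keepdims=True)              # normalized per region
Wg = rng.normal(size=(dim, dim))               # fully connected layer (linear here)
Thr = 0.05                                     # keyword screening threshold

# For each word t: average the features of all regions that regard t as a
# keyword, and pass the average through the fully connected layer to get g_t.
g = np.zeros((num_words, dim))
for t in range(num_words):
    regions = A[:, t] > Thr
    if regions.any():
        g[t] = V[regions].mean(axis=0) @ Wg

# For each region i: sum the g_t of its keywords to obtain the
# sentence-level visual context feature c_i.
c = np.zeros((num_regions, dim))
for i in range(num_regions):
    keywords = A[i] > Thr
    c[i] = g[keywords].sum(axis=0)
print(c.shape)  # (6, 5)
```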
The method further comprises normalizing the attention scores with softmax.
The method further comprises setting a keyword screening threshold value to 0.05.
Compared with the prior art, extracting the features of the image and the natural language makes it convenient to realize the keyword-based language indication image segmentation method; the segmentation method reduces the difficulty of processing long sentences, improves the accuracy of object localization and recognition, and thereby improves the precision of language indication image segmentation.
Drawings
Fig. 1 is a schematic diagram of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Any feature disclosed in this specification (including any accompanying drawings) may be replaced by alternative features serving equivalent or similar purposes, unless expressly stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.
An image and natural language feature extraction method comprises an image feature extraction method and a natural language feature extraction method; wherein,
the image feature extraction method comprises: extracting an image feature F from the input image using a deep convolutional neural network (CNN); the image feature is a two-dimensional feature map in which each feature vector f_i encodes the features of the corresponding region i in the image; and providing, according to the natural language features, the object position information required by the image segmentation task;
the natural language feature extraction method comprises: for the input natural language, encoding each word into a one-hot feature vector and then reducing its dimensionality by word embedding; the dimension-reduced words are input into a recurrent neural network (RNN) in the order in which they appear in the original sentence; for the t-th word in the sentence, the recurrent neural network learns the word feature q_t; the feature q_t encodes the semantic information of word t itself and the contextual relationship between word t and the whole sentence; and the feature vectors of the different words form a matrix Q representing the features of the whole sentence.
Based on the image feature extraction method and the natural language feature extraction method, the language indication image segmentation method based on the keywords is convenient to realize.
As an embodiment of the present invention, the method of providing the object position information required by the image segmentation task according to the natural language features comprises extracting the relative position coordinates of each image region and concatenating them with the feature F to obtain the final visual feature V of each image region.
As shown in FIG. 1, a keyword-based language indication image segmentation method is realized on the basis of the above image and natural language feature extraction method, and comprises:
for an input image and an input natural language, according to the keywords contained in the natural language, concatenating three features: the feature f_i of the image region i corresponding to the keywords, the keyword-weighted sentence feature q_i, and the corresponding keyword-based visual context feature c_i; and inputting the concatenated features into a multilayer perceptron (MLP) for classification to obtain the segmentation result;
the multilayer perceptron is composed of two layers of neural networks, wherein the first layer comprises a ReLU activation function, and the second layer comprises a sigmoid activation function;
wherein the image regions i corresponding to the keywords are acquired by training a model to extract the keywords corresponding to each image region i; the training and extraction process comprises:
extracting keywords using a language attention model according to the obtained features of each word; the language attention model is composed of two neural network layers, the first with a tanh activation function and the second with no activation function; for each image region i, each word t is first concatenated with the features of that image region and then input into the language attention model to obtain an attention score; the attention scores are normalized so that each normalized score lies between 0 and 1, where the closer the score is to 1, the more critical word t is for image region i, and the closer it is to 0, the less important word t is for image region i;
modifying the sentence features with the attention scores, so that the influence of keywords in the sentence is increased and the influence of non-keywords is reduced: each normalized attention score is multiplied with the corresponding word feature q_t to weight the word features, and the weighted features of all words are then summed to generate, for image region i, the sentence feature q_i of the whole sentence;
setting a keyword screening threshold, where a normalized attention score larger than the threshold indicates that image region i regards word t as a keyword;
for each word t, finding all image regions that regard word t as a keyword and learning the contextual relationship of these regions: the features of these regions are first averaged to integrate the region information, and a fully connected layer is then used to learn the context feature g_t from the averaged feature;
after the visual context feature g_t corresponding to each keyword has been learned, integrating these features into the visual context feature corresponding to the whole sentence: for image region i, the visual context features g_t of its corresponding keywords are summed to generate the visual context feature c_i of the whole sentence.
In the prior art, on the one hand, every word in a sentence is processed equally, which makes long sentences difficult to process; on the other hand, the contextual relationships, such as appearance and position, between different regions within the image are not taken into account, although they are crucial for localizing and recognizing the objects described in natural language in the image. The invention provides a keyword-based language indication image segmentation algorithm: by extracting the keywords in the natural language, the difficulty of processing long sentences is reduced; and by learning the keyword-based visual context relationships, the accuracy of object localization and recognition is improved, further improving the precision of language indication image segmentation.
As an embodiment of the present invention, the method further includes performing normalization processing on the attention score using softmax.
As an embodiment of the present invention, the method further includes setting a keyword screening threshold to 0.05.
The following is a further detailed description of an embodiment.
Database determination: a language indication image segmentation database is selected, for example the Google RefExp (referring expressions) database.
Data preprocessing: the database is preprocessed to extract the original images, the natural language, and the segmentation ground truth. The relative position coordinates of each point are extracted from the original image, and each word of a natural language sentence is converted into a one-hot vector.
Deep network model construction: the convolutional neural network (CNN) is DeepLab-101, which outputs 60 × 60 image regions; the feature f_i of each region is set to 1000 dimensions. A long short-term memory (LSTM) unit is chosen as the recurrent neural network (RNN); the maximum number of words per sentence is set to 20, and the feature q_t of each word is set to 1000 dimensions.
A keyword threshold is determined. The keyword threshold Thr is set to 0.05.
Model initialization: the convolutional neural network (CNN) parameters are initialized with a model pre-trained on ImageNet; the rest of the model is initialized randomly.
Learning rate and gradient descent strategy: the learning rates of the convolutional neural network (CNN), the language attention model, the keyword-based visual context learning model, and the multilayer perceptron (MLP) are set to 0.0001, and the learning rate of the recurrent neural network (RNN) is set to 0.001. The ADAM gradient descent strategy is adopted for optimization.
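The hyperparameters of this embodiment can be gathered into a configuration sketch; the values are those stated in this embodiment, while the dictionary layout itself is illustrative:

```python
# Training configuration from this embodiment (dictionary layout illustrative).
config = {
    "lr": {
        "cnn": 1e-4,                 # convolutional network
        "language_attention": 1e-4,  # language attention model
        "visual_context": 1e-4,      # keyword-based visual context model
        "mlp": 1e-4,                 # multilayer perceptron
        "rnn": 1e-3,                 # recurrent network (LSTM)
    },
    "optimizer": "adam",
    "keyword_threshold": 0.05,       # keyword screening threshold Thr
    "epochs": 5,
    "max_words": 20,                 # maximum words per sentence
}
# The RNN learning rate is ten times the rate of the other modules.
assert config["lr"]["rnn"] == 10 * config["lr"]["cnn"]
```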
Model training: after the model is built and initialized and the learning rate and gradient descent strategy are determined, training is performed; the training-set data in the database are input into the model in sequence, and the model is trained for 5 epochs.
Model testing: after training is finished, the images and sentences of the test set in the database are input to obtain the language indication image segmentation results.

Claims (4)

1. A keyword-based language indication image segmentation method, realized on the basis of an image and natural language feature extraction method, comprising:
for an input image and an input natural language, according to the keywords contained in the natural language, concatenating three features: the feature f_i of the image region i corresponding to the keywords, the keyword-weighted sentence feature q_i, and the corresponding keyword-based visual context feature c_i; and inputting the concatenated features into a multilayer perceptron for classification to obtain the segmentation result;
the multilayer perceptron is composed of two layers of neural networks, wherein the first layer comprises a ReLU activation function, and the second layer comprises a sigmoid activation function;
wherein the image regions i corresponding to the keywords are acquired by training a model to extract the keywords corresponding to each image region i; the training and extraction process comprises:
extracting keywords using a language attention model according to the obtained features of each word; the language attention model is composed of two neural network layers, the first with a tanh activation function and the second with no activation function; for each image region i, each word t is first concatenated with the features of that image region and then input into the language attention model to obtain an attention score; the attention scores are normalized so that each normalized score lies between 0 and 1, where the closer the score is to 1, the more critical word t is for image region i, and the closer it is to 0, the less important word t is for image region i;
modifying the sentence features with the attention scores, so that the influence of keywords in the sentence is increased and the influence of non-keywords is reduced: each normalized attention score is multiplied with the corresponding word feature q_t to weight the word features, and the weighted features of all words are then summed to generate, for image region i, the sentence feature q_i of the whole sentence;
setting a keyword screening threshold, where a normalized attention score larger than the threshold indicates that image region i regards word t as a keyword;
for each word t, finding all image regions that regard word t as a keyword and learning the contextual relationship of these regions: the features of these regions are first averaged to integrate the region information, and a fully connected layer is then used to learn the context feature g_t from the averaged feature;
after the visual context feature g_t corresponding to each keyword has been learned, integrating these features into the visual context feature corresponding to the whole sentence: for image region i, the visual context features g_t of its corresponding keywords are summed to generate the visual context feature c_i of the whole sentence;
the image and natural language feature extraction method comprises an image feature extraction method and a natural language feature extraction method; wherein,
the image feature extraction method comprises: extracting an image feature F from the input image using a deep convolutional neural network; the image feature is a two-dimensional feature map in which each feature vector f_i encodes the features of the corresponding region i in the image; and providing, according to the natural language features, the object position information required by the image segmentation task;
the natural language feature extraction method comprises: for the input natural language, encoding each word into a one-hot feature vector and then reducing its dimensionality by word embedding; the dimension-reduced words are input into a recurrent neural network in the order in which they appear in the original sentence; for the t-th word in the sentence, the recurrent neural network learns the word feature q_t; the feature q_t encodes the semantic information of word t itself and the contextual relationship between word t and the whole sentence; and the feature vectors of the different words form a matrix Q representing the features of the whole sentence.
2. The keyword-based language indication image segmentation method according to claim 1, wherein providing the object position information required by the image segmentation task according to the natural language features comprises extracting the relative position coordinates of each image region and concatenating them with the feature F to obtain the final visual feature V of each image region.
3. The keyword-based language indication image segmentation method according to claim 1, further comprising normalizing the attention scores with softmax.
4. The keyword-based language indication image segmentation method according to claim 1 or 3, further comprising setting the keyword screening threshold to 0.05.
CN201810790480.4A 2018-07-18 2018-07-18 Image and natural language feature extraction and keyword-based language indication image segmentation method Active CN109033321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810790480.4A CN109033321B (en) 2018-07-18 2018-07-18 Image and natural language feature extraction and keyword-based language indication image segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810790480.4A CN109033321B (en) 2018-07-18 2018-07-18 Image and natural language feature extraction and keyword-based language indication image segmentation method

Publications (2)

Publication Number Publication Date
CN109033321A CN109033321A (en) 2018-12-18
CN109033321B true CN109033321B (en) 2021-12-17

Family

ID=64643921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810790480.4A Active CN109033321B (en) 2018-07-18 2018-07-18 Image and natural language feature extraction and keyword-based language indication image segmentation method

Country Status (1)

Country Link
CN (1) CN109033321B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711463B (en) * 2018-12-25 2023-04-07 广东顺德西安交通大学研究院 Attention-based important object detection method
CN111598155A (en) * 2020-05-13 2020-08-28 北京工业大学 Fine-grained image weak supervision target positioning method based on deep learning
CN112037239B (en) * 2020-08-28 2022-09-13 大连理工大学 Text guidance image segmentation method based on multi-level explicit relation selection
CN114299348A (en) * 2022-02-21 2022-04-08 山东力聚机器人科技股份有限公司 Image classification method and device based on restoration self-supervision task

Citations (9)

Publication number Priority date Publication date Assignee Title
CN103559193A (en) * 2013-09-10 2014-02-05 浙江大学 Topic modeling method based on selected cell
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method for multi-stage connection recurrent neural network
CN106778835A (en) * 2016-11-29 2017-05-31 武汉大学 The airport target by using remote sensing image recognition methods of fusion scene information and depth characteristic
CN107391709A (en) * 2017-07-28 2017-11-24 深圳市唯特视科技有限公司 A kind of method that image captions generation is carried out based on new attention model
CN107688821A (en) * 2017-07-11 2018-02-13 西安电子科技大学 View-based access control model conspicuousness and across the modality images natural language description methods of semantic attribute
US9939272B1 (en) * 2017-01-06 2018-04-10 TCL Research America Inc. Method and system for building personalized knowledge base of semantic image segmentation via a selective random field approach
CN107909115A (en) * 2017-12-04 2018-04-13 上海师范大学 A kind of image Chinese subtitle generation method
CN108009154A (en) * 2017-12-20 2018-05-08 哈尔滨理工大学 A kind of image Chinese description method based on deep learning model
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN106227851B (en) * 2016-07-29 2019-10-01 汤一平 The image search method of depth of seam division search based on depth convolutional neural networks
CN107608943B (en) * 2017-09-08 2020-07-28 中国石油大学(华东) Image subtitle generating method and system fusing visual attention and semantic attention

Patent Citations (9)

Publication number Priority date Publication date Assignee Title
CN103559193A (en) * 2013-09-10 2014-02-05 浙江大学 Topic modeling method based on selected cell
CN106778835A (en) * 2016-11-29 2017-05-31 武汉大学 The airport target by using remote sensing image recognition methods of fusion scene information and depth characteristic
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method for multi-stage connection recurrent neural network
US9939272B1 (en) * 2017-01-06 2018-04-10 TCL Research America Inc. Method and system for building personalized knowledge base of semantic image segmentation via a selective random field approach
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN107688821A (en) * 2017-07-11 2018-02-13 西安电子科技大学 View-based access control model conspicuousness and across the modality images natural language description methods of semantic attribute
CN107391709A (en) * 2017-07-28 2017-11-24 深圳市唯特视科技有限公司 A kind of method that image captions generation is carried out based on new attention model
CN107909115A (en) * 2017-12-04 2018-04-13 上海师范大学 A kind of image Chinese subtitle generation method
CN108009154A (en) * 2017-12-20 2018-05-08 哈尔滨理工大学 A kind of image Chinese description method based on deep learning model

Non-Patent Citations (3)

Title
Attention to Scale: Scale-aware Semantic Image Segmentation;Liang-Chieh Chen et al.;《2016 IEEE Conference on Computer Vision and Pattern Recognition》;20161212;3640-3649 *
Segmentation from Natural Language;Ronghang Hu et al.;《European Conference on Computer Vision》;20160917;第9905卷;108-124 *
A Survey of Semantic Mapping Methods in Image Retrieval (图像检索中语义映射方法综述); 李志欣 (Li Zhixin) et al.; Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报); 20080815; vol. 20, no. 8; 1085-1096 *

Also Published As

Publication number Publication date
CN109033321A (en) 2018-12-18

Similar Documents

Publication Publication Date Title
US20200285896A1 (en) Method for person re-identification based on deep model with multi-loss fusion training strategy
CN107688821B (en) Cross-modal image natural language description method based on visual saliency and semantic attributes
CN107506761B (en) Brain image segmentation method and system based on significance learning convolutional neural network
CN109033321B (en) Image and natural language feature extraction and keyword-based language indication image segmentation method
WO2021135193A1 (en) Visual object guidance-based social media short text named entity identification method
CN103984943B (en) A kind of scene text recognition methods based on Bayesian probability frame
CN110490081B (en) Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
CN109446333A (en) A kind of method that realizing Chinese Text Categorization and relevant device
CN109190521B (en) Construction method and application of face recognition model based on knowledge purification
CN109684928B (en) Chinese document identification method based on internet retrieval
CN111325237B (en) Image recognition method based on attention interaction mechanism
CN112560710B (en) Method for constructing finger vein recognition system and finger vein recognition system
WO2019167784A1 (en) Position specifying device, position specifying method, and computer program
CN111862260B (en) Cross-domain dual generation type countermeasure network-based bias eliminating method and device
Alon et al. Deep-hand: a deep inference vision approach of recognizing a hand sign language using american alphabet
CN109145704A (en) A kind of human face portrait recognition methods based on face character
CN113076905B (en) Emotion recognition method based on context interaction relation
Setyono et al. Recognizing word gesture in sign system for Indonesian language (SIBI) Sentences using DeepCNN and BiLSTM
Drishya et al. Cyberbully image and text detection using convolutional neural networks
CN109657710B (en) Data screening method and device, server and storage medium
Jindal et al. Sign Language Detection using Convolutional Neural Network (CNN)
Revelli et al. Automate extraction of braille text to speech from an image
CN115497105A (en) Multi-modal hate cause detection method based on multi-task learning network
CN111160219B (en) Object integrity evaluation method and device, electronic equipment and storage medium
Paul et al. An Adam based CNN and LSTM approach for sign language recognition in real time for deaf people

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant