CN109033321A - Image and natural language feature extraction and keyword-based language-guided image segmentation method


Info

Publication number
CN109033321A
CN109033321A
Authority
CN
China
Prior art keywords
feature
image
keyword
word
language
Prior art date
Legal status
Granted
Application number
CN201810790480.4A
Other languages
Chinese (zh)
Other versions
CN109033321B (en)
Inventor
李宏亮 (Li Hongliang)
石恒璨 (Shi Hengcan)
Current Assignee
Chengdu Quick Eye Technology Co Ltd
Original Assignee
Chengdu Quick Eye Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Quick Eye Technology Co Ltd
Priority to CN201810790480.4A
Publication of CN109033321A
Application granted
Publication of CN109033321B
Legal status: Active

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06N: Computing arrangements based on specific computational models
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The present invention provides an image and natural language feature extraction method and a keyword-based language-guided image segmentation method. On the basis of image feature extraction and natural language feature extraction, for an input image and an input natural language expression, according to the keywords contained in the natural language, three features are concatenated: the feature f_i of the image region i corresponding to the keyword, the keyword-weighted sentence feature q_i, and the corresponding keyword-based visual context feature c_i. The concatenated feature is fed into a multi-layer perceptron for classification, yielding the segmentation result. Compared with the prior art, the image and natural language feature extraction readily enables the keyword-based language-guided image segmentation method; the language-guided image segmentation method of the invention reduces the difficulty of processing long sentences, improves the accuracy of object localization and recognition, and thereby improves the precision of language-guided image segmentation.

Description

Image and natural language feature extraction and keyword-based language-guided image segmentation method
Technical field
The present invention relates to an image and natural language feature extraction method and a keyword-based language-guided image segmentation method, and relates to the fields of image processing, computer vision, image segmentation, and joint language-image processing.
Background art
With the arrival of the big data era, massive data of different types circulates on the network, and combining different types of data is a new demand of this era. Among such combinations, image processing combined with natural language has received widespread attention. Language-guided image segmentation refers to segmenting, within an image, the object described by a natural language expression; it is a key step in joint language-image processing.
Current techniques for language-guided image segmentation mainly use deep neural networks to extract natural language features and image features separately, and then combine the natural language features with the image features into a new feature on which the image is segmented. They fall into two classes: sentence-based methods and word-based methods. Sentence-based language-guided image segmentation methods extract a feature for the entire sentence and combine it with the image features; word-based methods extract a feature for each word and combine each of them with the image features. These methods have two main defects:
1. They ignore the differing importance of words; treating every word equally makes long sentences difficult to handle;
2. They do not consider the visual context relations inside the image, such as appearance and position relations between different regions, even though these visual contexts are often crucial for finding the object described by the natural language in the image.
Summary of the invention
The present invention provides an image and natural language feature extraction method, characterized in that it readily enables a keyword-based language-guided image segmentation method.
The present invention also provides a keyword-based language-guided image segmentation method, characterized in that it reduces the difficulty of processing long sentences and improves the accuracy of object localization and recognition.
An image and natural language feature extraction method provided by the present invention comprises an image feature extraction method and a natural language feature extraction method, wherein:
The image feature extraction method comprises: for an input image, extracting an image feature F using a deep convolutional neural network; the image feature is a two-dimensional feature map in which each feature vector f_i encodes the feature of the corresponding region i in the image; the language-guided image segmentation task requires the location information of objects in accordance with the natural language features;
The natural language feature extraction method comprises: for the input natural language, encoding each word as a one-hot feature vector and then reducing its dimensionality with a word embedding; the reduced word vectors are fed into a recurrent neural network in their original order in the sentence; for the t-th word in the sentence, the recurrent neural network learns a word feature q_t; the word feature q_t encodes both the semantic information of word t itself and the context information of word t within the entire sentence; the feature vectors of all words form a matrix Q, representing the feature of the entire sentence.
The method by which the language-guided image segmentation task obtains the location information of objects comprises extracting the relative position coordinates of each image region and concatenating them with the feature F to obtain the final visual feature V of each image region.
A keyword-based language-guided image segmentation method provided by the present invention is implemented on the basis of the above image and natural language feature extraction method; the specific method comprises:
For the input image and the input natural language expression, according to the keywords contained in the natural language, three features are concatenated: the feature f_i of the image region i corresponding to the keyword, the keyword-weighted sentence feature q_i, and the corresponding keyword-based visual context feature c_i; the concatenated feature is fed into a multi-layer perceptron for classification to obtain the segmentation result;
The multi-layer perceptron consists of two neural network layers; the first layer uses a ReLU activation function and the second layer uses a sigmoid activation function;
The image region i corresponding to a keyword is obtained as follows: keyword extraction is trained for each image region i, and the training and extraction process comprises:
For each obtained word feature, keywords are extracted with a language attention model; the language attention model consists of two neural network layers, the first with a tanh activation function and the second without an activation function; for each image region i, the feature of each word t is first concatenated with the feature of that image region and then fed into the language attention model to obtain an attention score; the attention scores are normalized so that each normalized score lies between 0 and 1: the closer the score is to 1, the more critical word t is for image region i; conversely, the closer it is to 0, the less important word t is for image region i;
The attention scores are used to revise the sentence feature, increasing the influence of keywords in the sentence and reducing the influence of non-keywords; the normalized attention score is multiplied with the corresponding word feature q_t to weight the word feature; the weighted features of all words are then summed to produce the sentence feature q_i of the entire sentence with respect to image region i;
A keyword screening threshold is set; if the normalized attention score exceeds the threshold, image region i regards word t as a keyword;
For each word t, all image regions that regard it as a keyword are found, and the context relations among these regions are learned; the features of these regions are first averaged to aggregate the region information; a fully connected layer then learns a context feature g_t from the averaged feature;
After the visual context feature g_t corresponding to each keyword is learned, these features are aggregated into the visual context feature corresponding to the entire sentence; for image region i, the visual context features g_t of its keywords are summed to produce the visual context feature c_i corresponding to the entire sentence.
The method further comprises normalizing the attention scores with softmax.
The method further comprises setting the keyword screening threshold to 0.05.
Compared with the prior art, the image and natural language feature extraction readily enables the keyword-based language-guided image segmentation method; the language-guided image segmentation method of the invention reduces the difficulty of processing long sentences, improves the accuracy of object localization and recognition, and thereby improves the precision of language-guided image segmentation.
Brief description of the drawings
Fig. 1 is a schematic diagram of one embodiment of the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Unless specifically stated otherwise, any feature disclosed in this specification (including the abstract and drawings) may be replaced by other equivalent features or by alternative features serving a similar purpose; that is, unless specifically stated otherwise, each feature is only one example of a series of equivalent or similar features.
An image and natural language feature extraction method comprises an image feature extraction method and a natural language feature extraction method, wherein:
The image feature extraction method comprises: for an input image, extracting an image feature F using a deep convolutional neural network (CNN); the image feature is a two-dimensional feature map in which each feature vector f_i encodes the feature of the corresponding region i in the image; the language-guided image segmentation task requires the location information of objects in accordance with the natural language features;
The natural language feature extraction method comprises: for the input natural language, encoding each word as a one-hot feature vector and then reducing its dimensionality with a word embedding; the reduced word vectors are fed into a recurrent neural network (RNN) in their original order in the sentence, as sketched below; for the t-th word in the sentence, the recurrent neural network learns a word feature q_t; the word feature q_t encodes both the semantic information of word t itself and the context information of word t within the entire sentence; the feature vectors of all words form a matrix Q, representing the feature of the entire sentence.
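A minimal sketch of this encoder is given below in PyTorch (an assumption; the patent does not name a framework, and the vocabulary size and identifier names are illustrative). The embedding lookup is mathematically the one-hot-vector-times-matrix dimensionality reduction, and the LSTM produces one feature q_t per word, which stack into the sentence matrix Q:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10000   # hypothetical vocabulary size
EMBED_DIM = 1000     # word feature dimension used in the embodiment below

class SentenceEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # An embedding lookup is equivalent to multiplying a one-hot vector
        # by a learned matrix, i.e. the word-embedding dimensionality reduction.
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, EMBED_DIM, batch_first=True)

    def forward(self, word_ids):          # word_ids: (batch, T) integer word ids
        e = self.embed(word_ids)          # (batch, T, EMBED_DIM)
        Q, _ = self.lstm(e)               # (batch, T, EMBED_DIM): one q_t per word
        return Q                          # the matrix Q for the entire sentence
```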
Based on the above image feature extraction method and natural language feature extraction method, a keyword-based language-guided image segmentation method can readily be implemented.
As one embodiment of the present invention, the method by which the language-guided image segmentation task obtains the location information of objects comprises extracting the relative position coordinates of each image region and concatenating them with the feature F to obtain the final visual feature V of each image region (see the sketch below).
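A sketch of this step, assuming coordinates normalized to [-1, 1] (the patent does not fix a normalization, and the function name is illustrative):

```python
import torch

def append_relative_coords(F):
    # F: (batch, C, H, W) CNN feature map; each spatial cell is a region i.
    b, _, h, w = F.shape
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1).expand(b, 1, h, w)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w).expand(b, 1, h, w)
    # V carries each region's appearance feature f_i plus its relative position.
    return torch.cat([F, xs, ys], dim=1)  # V: (batch, C + 2, H, W)
```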
As shown in Fig. 1, a keyword-based language-guided image segmentation method is implemented on the basis of the above image and natural language feature extraction methods; the specific method comprises:
For the input image and the input natural language expression, according to the keywords contained in the natural language, three features are concatenated: the feature f_i of the image region i corresponding to the keyword, the keyword-weighted sentence feature q_i, and the corresponding keyword-based visual context feature c_i; the concatenated feature is fed into a multi-layer perceptron (MLP) for classification to obtain the segmentation result;
The multi-layer perceptron consists of two neural network layers; the first layer uses a ReLU activation function and the second layer uses a sigmoid activation function (a sketch follows);
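A sketch of this classification head; only the activations and the three-way concatenation come from the text, while the hidden width of 500 is an assumption:

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    def __init__(self, feat_dim=1000, hidden=500):   # hidden width assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden), nn.ReLU(),   # first layer: ReLU
            nn.Linear(hidden, 1), nn.Sigmoid(),           # second layer: sigmoid
        )

    def forward(self, f_i, q_i, c_i):     # each: (num_regions, feat_dim)
        x = torch.cat([f_i, q_i, c_i], dim=-1)   # cascade the three features
        return self.mlp(x).squeeze(-1)    # per-region foreground score in [0, 1]
```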
The image region i corresponding to a keyword is obtained as follows: keyword extraction is trained for each image region i, and the training and extraction process comprises:
For each obtained word feature, keywords are extracted with a language attention model; the language attention model consists of two neural network layers, the first with a tanh activation function and the second without an activation function; for each image region i, the feature of each word t is first concatenated with the feature of that image region and then fed into the language attention model to obtain an attention score; the attention scores are normalized so that each normalized score lies between 0 and 1: the closer the score is to 1, the more critical word t is for image region i; conversely, the closer it is to 0, the less important word t is for image region i;
The attention scores are used to revise the sentence feature, increasing the influence of keywords in the sentence and reducing the influence of non-keywords; the normalized attention score is multiplied with the corresponding word feature q_t to weight the word feature; the weighted features of all words are then summed to produce the sentence feature q_i of the entire sentence with respect to image region i (see the sketch below);
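The two steps above (scoring each word-region pair and re-weighting the word features) might look as follows, assuming the region feature has already been projected to the word-feature dimensionality; the hidden width and all names are assumptions, and the softmax normalization follows the embodiment below:

```python
import torch
import torch.nn as nn

class LanguageAttention(nn.Module):
    def __init__(self, dim=1000, hidden=500):   # hidden width assumed
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.Tanh(),   # first layer: tanh
            nn.Linear(hidden, 1),                    # second layer: no activation
        )

    def forward(self, Q, v_i):
        # Q: (T, dim) word features q_t; v_i: (dim,) visual feature of region i
        T = Q.size(0)
        pair = torch.cat([Q, v_i.unsqueeze(0).expand(T, -1)], dim=-1)
        a = torch.softmax(self.score(pair).squeeze(-1), dim=0)  # scores in (0, 1)
        q_i = (a.unsqueeze(-1) * Q).sum(dim=0)   # keyword-weighted sentence feature
        return q_i, a                            # a also drives keyword screening
```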
A keyword screening threshold is set; if the normalized attention score exceeds the threshold, image region i regards word t as a keyword;
For each word t, all image regions that regard it as a keyword are found, and the context relations among these regions are learned; the features of these regions are first averaged to aggregate the region information; a fully connected layer then learns a context feature g_t from the averaged feature;
After the visual context feature g_t corresponding to each keyword is learned, these features are aggregated into the visual context feature corresponding to the entire sentence; for image region i, the visual context features g_t of its keywords are summed to produce the visual context feature c_i corresponding to the entire sentence, as in the sketch below.
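A sketch of these three steps (screening, averaging, and summing), with shapes and names as assumptions:

```python
import torch
import torch.nn as nn

def keyword_visual_context(A, V, fc, thr=0.05):
    # A: (N, T) normalized attention scores of T words over N image regions
    # V: (N, D) visual features of the regions; fc: the fully connected layer
    mask = (A > thr).float()                 # 1 where region i regards word t as a keyword
    counts = mask.sum(dim=0).clamp(min=1)    # number of regions selecting each word
    mean_feat = (mask.t() @ V) / counts.unsqueeze(-1)  # (T, D) averaged region features
    g = fc(mean_feat)                        # (T, D): context feature g_t per word
    return mask @ g                          # (N, D): c_i sums g_t over region i's keywords

fc = nn.Linear(1000, 1000)                   # dimensions follow the embodiment below
```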
The prior art, on the one hand, treats every word in a sentence equally, which makes long sentences difficult to handle; on the other hand, it does not consider the visual context relations inside the image, such as appearance and position relations between different regions, even though these context relations are crucial for locating and recognizing the object described by the natural language in the image. The present invention proposes a keyword-based language-guided image segmentation algorithm that extracts the keywords in the natural language, thereby reducing the difficulty of processing long sentences, and learns keyword-based visual context relations, improving the accuracy of object localization and recognition and, in turn, the precision of language-guided image segmentation.
As one embodiment of the present invention, the method further comprises normalizing the attention scores with softmax.
As one embodiment of the present invention, the method further comprises setting the keyword screening threshold to 0.05.
A specific embodiment is described in more detail below.
Determine the database: choose a language-guided image segmentation database, such as the Google Referit database.
Data preprocessing: preprocess the database, extracting the original images, the natural language expressions, and the segmentation ground truth. For each original image, the relative position coordinates of each point are extracted; for the natural language, each word in a sentence is converted into a one-hot vector.
Build the deep network model: the convolutional neural network (CNN) is DeepLab101, outputting 60 × 60 image regions, with each region feature f_i set to 1000 dimensions; the recurrent neural network (RNN) is a long short-term memory (LSTM) network, the maximum number of words per sentence is set to 20, and each word feature q_t is set to 1000 dimensions.
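For reference, the dimensions just listed, gathered into one hypothetical constants block (the names are illustrative):

```python
GRID_H = GRID_W = 60   # DeepLab101 output: 60 x 60 image regions
FEAT_DIM = 1000        # dimensions of region feature f_i and word feature q_t
MAX_WORDS = 20         # maximum number of words per sentence for the LSTM
KEYWORD_THR = 0.05     # keyword screening threshold Thr (set in the next step)
```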
Determine keyword threshold value.Keyword threshold value Thr is set as 0.05.
Model initialization, the model initialization parameter of convolutional neural networks (CNN) pre-training on ImageNet.Model its Remaining part divides random initializtion.
Set the learning rates and the gradient descent strategy: the learning rates of the convolutional neural network (CNN), the language attention model, the keyword-based visual context relation learning model, and the multi-layer perceptron (MLP) are set to 0.0001, and the learning rate of the recurrent neural network (RNN) is set to 0.001; optimization uses the ADAM gradient descent strategy, for example as sketched below.
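A sketch of this optimizer setup with per-module learning rates; the modules are placeholders standing in for the networks described above (all names and the placeholder architectures are assumptions):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the actual networks of the embodiment.
cnn = nn.Conv2d(3, 1000, kernel_size=3)      # stands in for DeepLab101
rnn = nn.LSTM(1000, 1000, batch_first=True)  # the LSTM word encoder
attention = nn.Sequential(nn.Linear(2000, 500), nn.Tanh(), nn.Linear(500, 1))
context_fc = nn.Linear(1000, 1000)           # visual context relation layer
mlp = nn.Sequential(nn.Linear(3000, 500), nn.ReLU(), nn.Linear(500, 1), nn.Sigmoid())

optimizer = torch.optim.Adam([
    {"params": cnn.parameters(),        "lr": 1e-4},
    {"params": attention.parameters(),  "lr": 1e-4},
    {"params": context_fc.parameters(), "lr": 1e-4},
    {"params": mlp.parameters(),        "lr": 1e-4},
    {"params": rnn.parameters(),        "lr": 1e-3},  # RNN uses the larger rate
])
```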
Train the model: once the model is built and initialized and the learning rate and gradient descent strategy are determined, training begins; the training-set data in the database are fed in sequentially, and the model is trained for 5 epochs.
Test the model: after training, the images and sentences of the test set in the database are fed in to obtain the language-guided image segmentation results.

Claims (5)

1. An image and natural language feature extraction method, comprising an image feature extraction method and a natural language feature extraction method, wherein:
the image feature extraction method comprises: for an input image, extracting an image feature F using a deep convolutional neural network, the image feature being a two-dimensional feature map in which each feature vector f_i encodes the feature of the corresponding region i in the image, the language-guided image segmentation task requiring the location information of objects in accordance with the natural language features;
the natural language feature extraction method comprises: for the input natural language, encoding each word as a one-hot feature vector and then reducing its dimensionality with a word embedding; feeding the reduced word vectors into a recurrent neural network in their original order in the sentence; for the t-th word in the sentence, the recurrent neural network learning a word feature q_t, the word feature q_t encoding both the semantic information of word t itself and the context information of word t within the entire sentence; and the feature vectors of all words forming a matrix Q representing the feature of the entire sentence.
2. The image and natural language feature extraction method according to claim 1, wherein the method by which the language-guided image segmentation task obtains the location information of objects comprises extracting the relative position coordinates of each image region and concatenating them with the feature F to obtain the final visual feature V of each image region.
3. A keyword-based language-guided image segmentation method, implemented on the basis of the image and natural language feature extraction method of claim 1 or 2, the specific method comprising:
for the input image and the input natural language expression, according to the keywords contained in the natural language, concatenating three features: the feature f_i of the image region i corresponding to the keyword, the keyword-weighted sentence feature q_i, and the corresponding keyword-based visual context feature c_i; and feeding the concatenated feature into a multi-layer perceptron for classification to obtain the segmentation result;
wherein the multi-layer perceptron consists of two neural network layers, the first layer using a ReLU activation function and the second layer using a sigmoid activation function;
wherein the image region i corresponding to a keyword is obtained by training keyword extraction for each image region i, the training and extraction process comprising:
for each obtained word feature, extracting keywords with a language attention model, the language attention model consisting of two neural network layers, the first with a tanh activation function and the second without an activation function; for each image region i, first concatenating the feature of each word t with the feature of that image region and then feeding the result into the language attention model to obtain an attention score; and normalizing the attention scores so that each normalized score lies between 0 and 1, where the closer the score is to 1, the more critical word t is for image region i, and the closer it is to 0, the less important word t is for image region i;
revising the sentence feature with the attention scores, increasing the influence of keywords in the sentence and reducing the influence of non-keywords: multiplying the normalized attention score with the corresponding word feature q_t to weight the word feature, and summing the weighted features of all words to produce the sentence feature q_i of the entire sentence with respect to image region i;
setting a keyword screening threshold, wherein if the normalized attention score exceeds the threshold, image region i regards word t as a keyword;
for each word t, finding all image regions that regard it as a keyword and learning the context relations among these regions: first averaging the features of these regions to aggregate the region information, then learning a context feature g_t from the averaged feature with a fully connected layer;
and, after the visual context feature g_t corresponding to each keyword is learned, aggregating these features into the visual context feature corresponding to the entire sentence: for image region i, summing the visual context features g_t of its keywords to produce the visual context feature c_i corresponding to the entire sentence.
4. The language-guided image segmentation method according to claim 3, further comprising normalizing the attention scores with softmax.
5. The language-guided image segmentation method according to claim 3 or 4, further comprising setting the keyword screening threshold to 0.05.
CN201810790480.4A 2018-07-18 2018-07-18 Image and natural language feature extraction and keyword-based language-guided image segmentation method Active CN109033321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810790480.4A CN109033321B (en) 2018-07-18 2018-07-18 Image and natural language feature extraction and keyword-based language-guided image segmentation method

Publications (2)

Publication Number Publication Date
CN109033321A 2018-12-18
CN109033321B CN109033321B (en) 2021-12-17

Family

ID=64643921

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559193A (en) * 2013-09-10 2014-02-05 浙江大学 Topic modeling method based on selection units
CN106227851A (en) * 2016-07-29 2016-12-14 汤平 End-to-end image retrieval method based on deep convolutional neural networks
CN106778835A (en) * 2016-11-29 2017-05-31 武汉大学 Airport target recognition method for remote sensing images fusing scene information and deep features
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method using a multi-stage connected recurrent neural network
US9939272B1 (en) * 2017-01-06 2018-04-10 TCL Research America Inc. Method and system for building personalized knowledge base of semantic image segmentation via a selective random field approach
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 Method, apparatus, and electronic device for implementing image-text matching
CN107688821A (en) * 2017-07-11 2018-02-13 西安电子科技大学 Cross-modal image natural language description method based on visual saliency and semantic attributes
CN107391709A (en) * 2017-07-28 2017-11-24 深圳市唯特视科技有限公司 Image caption generation method based on a novel attention model
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Image caption generation method and system fusing visual attention and semantic attention
CN107909115A (en) * 2017-12-04 2018-04-13 上海师范大学 Image Chinese caption generation method
CN108009154A (en) * 2017-12-20 2018-05-08 哈尔滨理工大学 Image Chinese description method based on a deep learning model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIANG-CHIEH CHEN ET AL.: "Attention to Scale: Scale-aware Semantic Image Segmentation", 2016 IEEE Conference on Computer Vision and Pattern Recognition *
RONGHANG HU ET AL.: "Segmentation from Natural Language Expressions", European Conference on Computer Vision *
LI ZHIXIN ET AL.: "A Survey of Semantic Mapping Methods in Image Retrieval", Journal of Computer-Aided Design & Computer Graphics *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711463A (en) * 2018-12-25 2019-05-03 广东顺德西安交通大学研究院 Important object detection method based on attention
CN109711463B (en) * 2018-12-25 2023-04-07 广东顺德西安交通大学研究院 Attention-based important object detection method
CN111598155A (en) * 2020-05-13 2020-08-28 北京工业大学 Weakly supervised object localization method for fine-grained images based on deep learning
CN112037239A (en) * 2020-08-28 2020-12-04 大连理工大学 Text-guided image segmentation method based on multi-level explicit relation selection
CN114299348A (en) * 2022-02-21 2022-04-08 山东力聚机器人科技股份有限公司 Image classification method and device based on a restoration self-supervision task

Also Published As

Publication number Publication date
CN109033321B (en) 2021-12-17

Similar Documents

Publication Publication Date Title
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
CN109033321A Image and natural language feature extraction and keyword-based language-guided image segmentation method
CN111177446B (en) Method for searching footprint image
CN111126069B (en) Social media short text named entity identification method based on visual object guidance
CN107273864B (en) Face detection method based on deep learning
CN106960206A Character recognition method and character recognition system
RU2707147C1 (en) Neural network training by means of specialized loss functions
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN107491729B (en) Handwritten digit recognition method based on cosine similarity activated convolutional neural network
CN110781897A (en) Semantic edge detection method based on deep learning
CN112560710B (en) Method for constructing finger vein recognition system and finger vein recognition system
CN109885796A Internet news image-text matching detection method based on deep learning
Yu et al. Exemplar-based recursive instance segmentation with application to plant image analysis
CN111126155B Pedestrian re-identification method based on a semantic-constrained generative adversarial network
CN109508640A Crowd sentiment analysis method, apparatus, and storage medium
CN117149944A Multi-modal situational emotion recognition method and system based on a wide time range
CN115170403A Font restoration method and system based on deep meta-learning and a generative adversarial network
CN110503090A Character detection network training method, character detection method, and character detection device based on a limited attention model
CN113449776A (en) Chinese herbal medicine identification method and device based on deep learning and storage medium
CN116385832A (en) Bimodal biological feature recognition network model training method
EP4288910A1 (en) Continual learning neural network system training for classification type tasks
WO2020224244A1 (en) Method and apparatus for obtaining depth-of-field image
CN113569867A (en) Image processing method and device, computer equipment and storage medium
CN113129399A (en) Pattern generation
CN113792703B (en) Image question-answering method and device based on Co-Attention depth modular network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant