CN111523534B - Image description method - Google Patents

Image description method

Info

Publication number
CN111523534B
Authority
CN
China
Prior art keywords
image
feature
training
description
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010240856.1A
Other languages
Chinese (zh)
Other versions
CN111523534A (en)
Inventor
王俊豪
罗雪妮
罗轶凤
钱卫宁
周傲英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202010240856.1A
Publication of CN111523534A
Application granted
Publication of CN111523534B
Active legal status (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Abstract

The invention discloses an image description method in which a bilinear encoder and a multimodal decoder improve image description with fine-grained regional object features. In the encoder, bilinear pooling encodes fine-grained region image features, a simple Transformer encoder encodes the region-of-interest features of the image, and all encoded features are fused through a gate structure to form the overall encoding features of the image. In the decoder, multimodal features are extracted from the fine-grained region image features and the category features, fused with the overall encoding features, and the semantic information is decoded to generate the description. Compared with the prior art, the method provides a new, simple and efficient solution for image description and its applications.

Description

Image description method
Technical Field
The invention relates to the field of computer vision, and in particular to a method that enriches image description by fusing a multi-level Transformer model with fine-grained features.
Background
Image description (image captioning) generates a natural language description for an image and uses the generated description to help applications understand the semantics expressed in the image's visual scene. For example, image description can turn image search into text search, which is useful for classifying images and improving image retrieval results. A person usually needs only a quick glance to describe the details of an image's visual scene, whereas automatically adding descriptions to images is a comprehensive and difficult computer vision task, because the complex information contained in an image must be converted into a natural language description. In contrast to common computer vision tasks, image captioning must not only identify objects in an image but also associate the identified objects with natural-language semantics and describe them in natural language. Image description therefore requires extracting deep features of an image, associating them with semantic features, and transforming them to generate the description.
Early image description methods based on traditional machine learning typically extracted objects and attributes from the image and then filled predefined sentence templates with the obtained objects and attributes. With the rise of deep learning, modern image description methods mainly follow an encoder-decoder architecture, in which a convolutional neural network (CNN) is generally used as the encoder for feature extraction and a recurrent neural network (RNN) as the decoder for generating descriptions. The encoder-decoder architecture can generate descriptive sentences that go beyond predefined templates, greatly increasing the diversity of the generated sentences.
Conventional encoder-decoder image description models typically generate an image description from global features extracted from the image. Even when an attention mechanism is integrated into the encoder-decoder architecture to extract region-of-interest features from the global features and focus on salient image regions, much of the detailed information in the image's visual scene is lost during generation. The encoder-decoder model with an attention mechanism therefore faces two challenges: 1) when an image contains complex objects and attributes, the region features extracted from the global image feature map do not represent the semantics of the objects well; 2) the inherently sequential nature of RNNs makes parallel optimization difficult, so model training is excessively time-consuming.
Disclosure of Invention
To address the shortcomings of the prior art, the invention aims to provide an image description method that adopts a novel encoder-decoder model: the fine-grained region features of detected objects are extracted, and a Transformer model encodes and decodes the semantic information contained in the image, so as to improve the quality of the image description. Specifically, a pre-trained ResNet model extracts image features of the object regions detected in an image. In the encoder, a bilinear pooling feature extractor encodes fine-grained semantic features from the object-region image features, a multi-layer simple Transformer encoder encodes the region-of-interest features extracted from the image's global features bottom-up into multi-layer image features, each layer of image features is fused with the fine-grained semantic features into refined features, and the refined features of all layers are fused through a gate structure to form the overall encoding features of the image. In the decoder, multimodal features are extracted from the fine-grained region object features and fused with the overall encoding features to decode the semantic information for description generation, providing a new solution for image description.
The purpose of the invention is realized as follows:
a method for image description, which comprises the following steps:
step 1, finding an open-source image description data set annotated with descriptions, and splitting the data set into a training set, a validation set and a test set;
step 2, for the descriptions in step 1, encoding each word of each image description with a BERT tool to obtain fixed-length word vectors and build the corresponding vocabulary;
step 3, extracting the feature vectors of the image regions of interest with a Faster R-CNN tool and identifying the image entity region frames and image entity categories;
step 4, encoding the entity category names of the image entity categories from step 3 with a BERT tool to obtain fixed-length category feature word vectors;
step 5, for the image entity region frames from step 3, encoding the picture inside each entity region with a ResNet tool to obtain fixed-length image entity feature vectors;
step 6, extracting image features of the training-set and validation-set images from step 1 using steps 3-5, and training an Ml-Transformer model on the region-of-interest feature vectors, category feature word vectors and image entity feature vectors of the training-set images to obtain an image description model that can describe images; during training, the validation-set image features are used to verify the training effect, and the validation set does not participate in training;
and step 7, extracting the test-set image features from the test-set images of step 1 using steps 3-5, and feeding the test-set region-of-interest feature vectors, category feature word vectors and image entity feature vectors into the image description model generated in step 6 to describe the test-set images and obtain the test-set accuracy.
The invention also has the following features: the image description data set in step 1 is the MSCOCO 2014 data set, which is split into 113,287 training-set pictures, 5,000 validation-set pictures and 5,000 test-set pictures.
In step 2, each word of the sentence descriptions is encoded with a BERT tool; each dimension of the resulting word vector represents a word feature, and the dimension is 1024.
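As an illustration only, such 1024-dimensional word vectors could be obtained with a pre-trained BERT-large model, for example through the HuggingFace transformers library (the library and the exact model name are assumptions; the patent only specifies a BERT tool and the 1024 dimension):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")   # hidden size 1024
bert = BertModel.from_pretrained("bert-large-uncased").eval()

def word_vectors(sentence):
    """Return one 1024-dimensional vector per token of a description sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)   # shape: (num_tokens, 1024)

vectors = word_vectors("a man riding a wave on top of a surfboard")
print(vectors.shape)   # (number of tokens, 1024)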
In step 3, the image region-of-interest feature vectors are extracted with a Faster R-CNN tool, and the image entity region frames and image entity categories are identified. Each dimension of the obtained region-of-interest feature vector represents an image feature, the dimension is 2048, and each image is fixed to 50 regions of interest. The numbers of image entity region frames and image entity categories equal the number of entities identified in the image, and the image entity categories comprise 80 classes such as person, car and dog.
In step 4, the entity category names are encoded with a BERT tool; each dimension of the category feature word vector represents a word feature, and the dimension is 1024. The number of category feature word vectors is fixed to 5: if fewer than 5 entities are detected in the image, the missing vectors are zero-padded; if more than 5 are detected, only 5 are kept.
In step 5, the picture inside each entity region is encoded with a ResNet tool; each dimension of the image entity feature vector represents an image feature, and the dimension is 2048. The number of image entity feature vectors is fixed to 5: if fewer than 5 entities are detected in the image, the missing vectors are zero-padded; if more than 5 are detected, only 5 are kept.
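For illustration, the fixed-size feature group described in steps 3-5 could be assembled as in the following sketch (the helper name and the random placeholder tensors are illustrative, not part of the patent): 50 region-of-interest features of dimension 2048, and at most 5 category feature word vectors (1024) and 5 image entity feature vectors (2048), zero-padded when fewer entities are detected.

import torch

def pad_or_truncate(features, fixed_len):
    """Zero-pad rows up to fixed_len, or keep only the first fixed_len rows."""
    n, dim = features.shape
    if n >= fixed_len:
        return features[:fixed_len]
    padding = torch.zeros(fixed_len - n, dim, dtype=features.dtype)
    return torch.cat([features, padding], dim=0)

# Placeholder tensors standing in for the extracted features of one image:
roi_features    = pad_or_truncate(torch.randn(38, 2048), 50)  # Faster R-CNN regions of interest
category_vecs   = pad_or_truncate(torch.randn(3, 1024), 5)    # BERT vectors of the category names
entity_features = pad_or_truncate(torch.randn(3, 2048), 5)    # ResNet features of the entity crops
feature_group = (roi_features, category_vecs, entity_features)  # shapes (50,2048), (5,1024), (5,2048)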
In step 6, a ResNet model inside the Ml-Transformer model extracts the image features of the object regions detected in the image; a bilinear pooling feature extractor encodes fine-grained semantic features from the object-region image features; a multi-layer simple Transformer encoder encodes the region-of-interest feature vectors; each layer of image features is fused with the fine-grained semantic features into refined features; and the refined features of all layers are fused through a gate structure to form the overall encoding features of the image. In the decoder, multimodal features are extracted from the fine-grained region object features and fused with the overall encoding features to decode the semantic information for description generation. The specific steps of training the model are as follows:
Step 6.1: model training consists of two processes, a training process based on cross-entropy loss and a training process based on reinforcement learning; both use the MSCOCO 2014 training set and its corresponding annotations. In the training set, one image is characterized by a feature group formed from the extracted region-of-interest feature vectors, category feature word vectors and image entity feature vectors, and corresponds to 5 sentences of description. The words of each description sentence are mapped to a distributed representation and embedded according to pre-trained word vectors. The feature group of each image together with one corresponding sentence in distributed representation forms a training sample pair; the sample pairs are first used in the cross-entropy-loss training process of the Ml-Transformer model and then in its reinforcement-learning optimization process.
Step 6.2: the Ml-Transformer model consists of an encoder and a decoder. The encoder consists of a bilinear pooling feature extractor, a multi-layer simple image feature encoder, a self-attention feature extractor (multi-head attention) and a simple feed-forward network (position-wise feed-forward network). The decoder consists of a multimodal bilinear pooling feature extractor, a masked multi-head attention feature extractor, a multi-head attention feature extractor and a simple feed-forward network.
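As a sketch of the bilinear pooling feature extractor named above, one possible low-rank (factorized) formulation is given below; the patent specifies only the input dimension (1024) and output dimension (8192), so the factor size and the exact operations (elementwise product of two projections, sum pooling, signed square root and L2 normalization) are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearPooling(nn.Module):
    """Assumed low-rank bilinear pooling: project both inputs, multiply elementwise,
    sum-pool over the factor dimension, then apply signed-sqrt and L2 normalization."""
    def __init__(self, in_dim=1024, out_dim=8192, k=4):
        super().__init__()
        self.out_dim, self.k = out_dim, k
        self.proj_x = nn.Linear(in_dim, out_dim * k)
        self.proj_y = nn.Linear(in_dim, out_dim * k)

    def forward(self, x, y=None):
        y = x if y is None else y                        # self-bilinear for second-order image features
        joint = self.proj_x(x) * self.proj_y(y)          # (..., out_dim * k)
        joint = joint.view(*joint.shape[:-1], self.out_dim, self.k).sum(-1)  # sum-pool over k
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)           # signed square root
        return F.normalize(joint, dim=-1)                # L2 normalization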
Step 6.3: in the encoder, during the cross-entropy-loss training process, the image entity feature vectors of the feature group are first fed into the encoder-side bilinear pooling feature extractor to extract second-order fine features of the image, and the region-of-interest feature vectors of the feature group are then fed into the multi-layer simple encoder to encode multi-layer image information. One simple encoder layer consists of a multi-head attention extractor and a simple feed-forward network. The image information of each layer is fused with the second-order fine features of the image through a multi-head attention feature extractor and then passed through a simple feed-forward network to obtain that layer's refined fused image information. Each layer's refined fused image information is multiplied elementwise with the gate value obtained by passing it through a sigmoid function, giving that layer's gated component; the components are finally summed to obtain the encoder-side result.
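The gate-structure fusion at the end of step 6.3 could look like the following sketch (the use of a single shared linear layer to compute the sigmoid gate is an assumption; the patent only states that each layer's refined fused information is weighted by its sigmoid gate value and that the layer components are summed):

import torch
import torch.nn as nn

class GatedLayerFusion(nn.Module):
    """Assumed gate structure: weight each layer's refined features by a sigmoid gate
    computed from those features, then sum the gated components over all layers."""
    def __init__(self, dim=1024):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, layer_features):
        # layer_features: list of per-layer refined features, each of shape (batch, regions, dim)
        fused = torch.zeros_like(layer_features[0])
        for h in layer_features:
            g = torch.sigmoid(self.gate(h))   # per-layer gate value
            fused = fused + g * h             # elementwise product, summed over layers
        return fused                          # overall encoding features of the image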
Step 6.4: in the decoder, during the cross-entropy-loss training process, position vector information is first added to the description in distributed representation, which is then fed into a masked multi-head attention feature extractor to obtain sequence features in which the information of words after the current word is masked; the encoder output and the sequence features are fused into multimodal image features through a multi-head attention feature extractor. The category feature word vectors and image entity feature vectors of the feature group are fed into the multimodal bilinear pooling feature extractor to extract refined multimodal features. The multimodal image features and the refined multimodal features are fused through a multi-head attention feature extractor and then passed through a simple feed-forward network to obtain the sequence feature input of the next decoder layer. After cycling through multiple layers, the simple feed-forward network output of the last decoder layer is taken as the final result of the decoder side. The result is passed through a softmax layer to obtain the probabilities of the output sequence, and the cross-entropy loss against the ground-truth description in the sample pair is computed. After each training epoch, the fitting state of the current model is verified on the validation set; no backward iteration is performed during verification.
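The masking and cross-entropy computation in step 6.4 can be illustrated by the following sketch (tensor names and the padding index are illustrative):

import torch
import torch.nn.functional as F

def causal_mask(seq_len):
    """Lower-triangular mask that hides information about words after the current position."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def caption_cross_entropy(logits, target_ids, pad_id=0):
    """logits: (batch, seq_len, vocab) decoder output before softmax;
    target_ids: (batch, seq_len) ground-truth word indices of the description."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # softmax over the vocabulary is applied internally
        target_ids.reshape(-1),
        ignore_index=pad_id,
    )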
Step 6.5: in the reinforcement-learning training process, CIDEr-D is used as the reward function. The feature group and the description in distributed representation first flow through the encoder and decoder sides to obtain the simple feed-forward network output of the last decoder layer. Sentences are then obtained in two ways: a sentence composed of the words with the maximum probability, and a sentence composed by Monte Carlo sampling. Reward scores against the ground-truth descriptions are computed for both, their difference is taken as the reward coefficient, the final loss is obtained, and a backward iteration is performed. After each training epoch, the fitting state of the current model is verified on the validation set; no backward iteration is performed during verification.
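A sketch of the reinforcement-learning loss in step 6.5, in the self-critical style implied by taking the difference between the sampled-sentence reward and the maximum-probability-sentence reward (variable names are illustrative):

import torch

def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """sample_logprobs: (batch, seq_len) log-probabilities of the Monte Carlo sampled words;
    sample_reward / greedy_reward: (batch,) CIDEr-D scores of the sampled sentence and of the
    sentence built from the maximum-probability words, both computed against the ground truth."""
    advantage = (sample_reward - greedy_reward).unsqueeze(1)  # reward coefficient
    return -(advantage * sample_logprobs).mean()              # minimized by backward iteration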
The specific steps of testing the model in step 7 are as follows:
Step 7.1: the evaluation uses the MSCOCO 2014 data set, split into 113,287 training-set pictures, 5,000 validation-set pictures and 5,000 test-set pictures. Each picture carries several task annotations, including object detection box annotations, keypoint detection annotations, segmentation annotations and image description annotations. The image description annotations used here include 5 sentences of description per picture.
And 7.2, performing model training by using the training set image, and performing model fitting convergence judgment by using the verification set image in the training process. And after the training is finished, performing model effect testing by using the test set data set. In the cross entropy loss training process, a self-attenuation strategy is adopted for learning rate, the learning rate is automatically changed along with the training process, and an Adam optimizer is used for optimizing parameters participating in the training. During reinforcement learning training, the learning rate is set to 0.0000004, and an Adam optimizer is used to optimize the parameters involved in the training. The input dimensionality of the model bilinear pooling feature extractor is 1024, the output dimensionality is 8192, the input dimensionality of the multi-mode bilinear pooling feature extractor is 1024, the output dimensionality is 8192, the input dimensionality of the multi-head attribution feature extractor is 1024, the output dimensionality is 1024, the number of heads is 8, the processing dimensionality of each head is 128, the input dimensionality of a simple feedforward network is 1024, the output dimensionality is 1024, a 3-layer simple encoder is adopted to extract image features, a 3-layer decoder is adopted to decode the image features, and description is generated. The batch size during training is 64. The word vectors used in the experiment were pre-trained and extracted by the BERT tool with the dimensions set to 1024. BLEU, METEOR, ROUGE-L and CIDER-D were used as performance evaluation indicators.
Compared with the prior art, the invention provides a new, simple and efficient solution for image description and its applications, with the following beneficial technical effects:
(1) the Ml-Transformer model generates image descriptions with fine semantics mainly by extracting object features; it is the first to propose the concept of fine-grained image description and the first to use a bilinear pooling model to fuse encoding features for image description.
(2) The Ml-Transformer model encodes and decodes image features with Transformers.
(3) Experiments on the MSCOCO 2014 data set compare the image description model of the invention with other state-of-the-art image description models to evaluate its performance.
Drawings
FIG. 1 is a schematic diagram of an image description model according to the present invention;
FIG. 2 is a schematic view of a region of interest of the present invention;
FIG. 3 is a diagram of image entity region frames and image entity categories according to the present invention;
FIG. 4 is a data set annotation diagram of the present invention;
FIG. 5 is a graph comparing experimental results on the MSCOCO test data set of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Example 1
Referring to FIG. 1, the invention performs image description with a multi-level Transformer fusing fine-grained features according to the following steps:
Step 1: find an open-source image description data set annotated with descriptions and split it into a training set, a validation set and a test set; the image description data set is the MSCOCO 2014 data set, split into 113,287 training-set images, 5,000 validation-set images and 5,000 test-set images.
Step 2: encode each word of each image description with a BERT tool, obtain fixed-length word vectors and build the corresponding vocabulary; each dimension of the obtained word vector represents a word feature, and the dimension is 1024;
Step 3: referring to FIG. 2, extract the image region-of-interest feature vectors with a Faster R-CNN tool; each dimension of the obtained region-of-interest feature vector represents an image feature, the dimension is 2048, and each image is fixed to 50 regions of interest;
referring to FIG. 3, the Faster R-CNN tool also identifies the image entity region frames and image entity categories; their numbers equal the number of entities identified in the image, and the image entity categories comprise 80 classes such as person, car and dog;
Step 4: encode the entity categories from step 3 with a BERT tool to obtain fixed-length category feature word vectors; each dimension represents a word feature, and the dimension is 1024. The number of category feature word vectors is fixed to 5: if fewer than 5 entities are detected in the image, the missing vectors are zero-padded; if more than 5 are detected, only 5 are kept;
Step 5: for the image entity region frames from step 3, encode the picture inside each entity region with a ResNet tool to obtain fixed-length image entity feature vectors; each dimension represents an image feature, and the dimension is 2048. The number of image entity feature vectors is fixed to 5: if fewer than 5 entities are detected in the image, the missing vectors are zero-padded; if more than 5 are detected, only 5 are kept;
Step 6: extract image features of the training-set and validation-set images from step 1 using steps 3-5, and train an Ml-Transformer model on the training set to obtain an image description model; during training, the validation-set image features are used to verify the training effect, and no backward iteration is performed on the validation set. In the Ml-Transformer model, a ResNet model extracts the image features of the object regions detected in the image, a bilinear pooling feature extractor encodes fine-grained semantic features from the object-region image features, a multi-layer simple Transformer encoder encodes the region-of-interest feature vectors, each layer of image features is fused with the fine-grained semantic features into refined features, and the refined features of all layers are fused through a gate structure to form the overall encoding features of the image. In the decoder, multimodal features are extracted from the fine-grained region object features and fused with the overall encoding features to decode the semantic information for description generation. The specific steps of training the model are as follows:
a. Model training consists of two processes, a training process based on cross-entropy loss and a training process based on reinforcement learning; both use the MSCOCO 2014 training set and its corresponding annotations. In the training set, one image is characterized by a feature group formed from the extracted region-of-interest feature vectors, category feature word vectors and image entity feature vectors, and corresponds to 5 sentences of description. The words of each description sentence are mapped to a distributed representation and embedded according to pre-trained word vectors. The feature group of each image together with one corresponding sentence in distributed representation forms a training sample pair; the sample pairs are first used in the cross-entropy-loss training process of the Ml-Transformer model and then in its reinforcement-learning optimization process;
b. The Ml-Transformer model consists of an encoder and a decoder. The encoder consists of a bilinear pooling feature extractor, a multi-layer simple image feature encoder, a self-attention feature extractor (multi-head attention) and a simple feed-forward network (position-wise feed-forward network). The decoder consists of a multi-layer multimodal bilinear pooling feature extractor, a masked multi-head attention feature extractor, a multi-head attention feature extractor and a simple feed-forward network.
c. In the encoder, during the cross-entropy-loss training process, the image entity feature vectors of the feature group are first fed into the encoder-side bilinear pooling feature extractor to extract second-order fine features of the image, and the region-of-interest feature vectors of the feature group are then fed into the multi-layer simple encoder to encode multi-layer image information. One simple encoder layer consists of a multi-head attention extractor and a simple feed-forward network. The image information of each layer is fused with the second-order fine features of the image through a multi-head attention feature extractor and then passed through a simple feed-forward network to obtain that layer's refined fused image information. Each layer's refined fused image information is multiplied elementwise with the gate value obtained by passing it through a sigmoid function, giving that layer's gated component; the components are finally summed to obtain the encoder-side result.
d. In the decoder, during the cross-entropy-loss training process, position vector information is first added to the description in distributed representation, which is then fed into a masked multi-head attention feature extractor to obtain sequence features in which the information of words after the current word is masked; the encoder output and the sequence features are fused into multimodal image features through a multi-head attention feature extractor. The category feature word vectors and image entity feature vectors of the feature group are fed into the multimodal bilinear pooling feature extractor to extract refined multimodal features. The multimodal image features and the refined multimodal features are fused through a multi-head attention feature extractor and then passed through a simple feed-forward network to obtain the sequence feature input of the next decoder layer. After cycling through multiple layers, the simple feed-forward network output of the last decoder layer is taken as the final result of the decoder side. The result is passed through a softmax layer to obtain the probabilities of the output sequence, and the cross-entropy loss against the ground-truth description in the sample pair is computed. After each training epoch, the fitting state of the current model is verified on the validation set; no backward iteration is performed during verification.
e. In the reinforcement-learning training process, CIDEr-D is used as the reward function. The feature group and the description in distributed representation first flow through the encoder and decoder sides to obtain the simple feed-forward network output of the last decoder layer. Sentences are then obtained in two ways: a sentence composed of the words with the maximum probability, and a sentence composed by Monte Carlo sampling. Reward scores against the ground-truth descriptions are computed for both, their difference is taken as the reward coefficient, the final loss is obtained, and a backward iteration is performed. After each training epoch, the fitting state of the current model is verified on the validation set; no backward iteration is performed during verification.
For example, given a picture A and its descriptions: the region-of-interest feature vectors, category feature word vectors and image entity feature vectors of A are extracted, and the descriptions of A are expressed in distributed representation. The image entity feature vectors of A are fed into the encoder-side bilinear pooling feature extractor to extract refined features; the region-of-interest feature vectors of A are fed into the simple encoder to obtain multi-layer image features; the refined features are combined with the image features of each layer, the multi-layer combined features are fused through the gate unit, and the result is sent to the decoder side. On the decoder side, position vector information is added to the distributed representation of A's description, which is fed into the masked multi-head attention feature extractor to extract sequence information; the sequence information is combined with the encoder output to form multimodal image features; the category feature word vectors and image entity feature vectors of A are fused into multimodal fine features; the multimodal image features and multimodal fine features are fused into the output. After several decoder layers, the final decoded description is obtained, the loss against the distributed representation of A's description is computed, and backward iterative training is performed.
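The epoch-wise training and validation cycle described in steps d and e, and illustrated with picture A above, can be driven by a loop such as the following sketch (the model, data loaders and loss function are assumed placeholders):

import torch

def run_training(model, train_loader, val_loader, optimizer, loss_fn, num_epochs=30):
    for epoch in range(num_epochs):
        model.train()
        for feature_group, caption_ids in train_loader:
            logits = model(feature_group, caption_ids)
            loss = loss_fn(logits, caption_ids)
            optimizer.zero_grad()
            loss.backward()              # backward iteration only during training
            optimizer.step()

        model.eval()
        with torch.no_grad():            # no backward iteration while verifying the fit
            val_loss = sum(loss_fn(model(f, c), c).item() for f, c in val_loader)
        print(f"epoch {epoch}: validation loss {val_loss / max(len(val_loader), 1):.4f}")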
Step 7: extract the test-set image features from the test-set images of step 1 using steps 3-5, and feed the test-set region-of-interest feature vectors, category feature word vectors and image entity feature vectors into the image description model generated in step 6 to describe the test-set images and obtain the test-set accuracy. The specific steps are as follows:
a. Referring to FIG. 4, the MSCOCO 2014 data set is split into 113,287 training-set pictures, 5,000 validation-set pictures and 5,000 test-set pictures. Each picture carries several task annotations, including object detection box annotations, keypoint detection annotations, segmentation annotations and image description annotations. The image description annotations used here include 5 sentences of description per picture.
b. Model training is performed with the training-set images, and model fitting and convergence are judged with the validation-set images during training. After training, the model is evaluated on the test set. In the cross-entropy-loss training process, a self-decaying learning-rate strategy is adopted, so that the learning rate changes automatically as training proceeds, and an Adam optimizer optimizes the trainable parameters. In the reinforcement-learning training process, the learning rate is set to 0.0000004 and an Adam optimizer optimizes the trainable parameters. The input dimension of the bilinear pooling feature extractor is 1024 and its output dimension is 8192; the input dimension of the multimodal bilinear pooling feature extractor is 1024 and its output dimension is 8192; the input dimension of the multi-head attention feature extractor is 1024, its output dimension is 1024, the number of heads is 8 and each head processes 128 dimensions; the input dimension of the simple feed-forward network is 1024 and its output dimension is 1024. A 3-layer simple encoder extracts the image features, and a 3-layer decoder decodes them and generates the description. The batch size during training is 64. The word vectors used in the experiments are pre-trained and extracted with the BERT tool, with the dimension set to 1024. BLEU, METEOR, ROUGE-L and CIDEr-D are used as performance evaluation metrics.
For example, given the trained model and the MSCOCO 2014 test set, predictions are made on the test set with the model, and BLEU, METEOR, ROUGE-L and CIDEr-D are computed from the predicted results and the ground-truth results.
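For illustration, the metrics could be computed with the pycocoevalcap package (an assumed tooling choice; the patent names only the metrics, and the Cider scorer shown here is the closely related CIDEr variant). gts maps an image id to its reference descriptions and res to the generated one; the ids and captions below are purely illustrative:

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor   # requires a local Java runtime
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

gts = {"391895": ["a man riding a wave on a surfboard", "a surfer rides a large wave"]}
res = {"391895": ["a man is surfing on a wave"]}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE-L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)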
Referring to FIG. 5, the experimental results on the MSCOCO test set show that the model of the invention performs best: considering the results of the cross-entropy training process alone, BLEU-4, METEOR, ROUGE-L and CIDEr-D are the highest, and after reinforcement-learning optimization, BLEU-1, BLEU-4, METEOR and CIDEr-D are the highest.
The above description is only the best mode of the invention, but the protection scope of the invention is not limited thereto. Any substitution or change of the technical solution and inventive concept of the invention made by a person skilled in the art within the scope disclosed by the invention, and any equivalent or modified technical solution, fall within the protection scope of the invention.

Claims (7)

1. A method of image description, characterized in that the method performs the image description by the following steps:
step 1: finding an open-source image description data set annotated with descriptions, and splitting the data set into a training set, a validation set and a test set;
step 2: for the descriptions in step 1, encoding each word of each sentence with a BERT tool to obtain fixed-length word vectors and form the corresponding vocabulary;
step 3: extracting the feature vectors of the image regions of interest with a Faster R-CNN tool and identifying the image entity region frames and image entity categories;
step 4: encoding the entity category names of the image entity categories in step 3 with a BERT tool to obtain fixed-length category feature word vectors;
step 5: for the image entity region frames in step 3, encoding the picture inside each entity region with a ResNet tool to obtain fixed-length image entity feature vectors;
step 6: extracting image features of the training-set and validation-set images in step 1 using steps 3-5, and training an Ml-Transformer model on the region-of-interest feature vectors, category feature word vectors and image entity feature vectors of the training-set images to obtain an image description model; during training, the validation-set image features are used to verify the training effect, and the validation set does not participate in the training;
step 7: extracting the test-set image features from the test-set images in step 1 using steps 3-5, and inputting the test-set region-of-interest feature vectors, category feature word vectors and image entity feature vectors into the image description model generated in step 6 to describe the test-set images and obtain the test-set accuracy; wherein:
the step 6 specifically includes:
a. the model training comprises two processes, a training process based on cross-entropy loss and a training process based on reinforcement learning, both of which use the MSCOCO 2014 training set and its corresponding annotations; in the training set, one image is characterized by a feature group formed from the extracted region-of-interest feature vectors, category feature word vectors and image entity feature vectors, and corresponds to 5 sentences of description; the words of each description sentence are mapped to a distributed representation and embedded according to pre-trained word vectors; the feature group of each image together with one corresponding sentence in distributed representation forms a training sample pair, which is first used in the cross-entropy-loss training process of the Ml-Transformer model and then in its reinforcement-learning optimization process;
b. the Ml-Transformer model consists of an encoder and a decoder; the encoder consists of a bilinear pooling feature extractor, a multi-layer simple image feature encoder, a self-attention feature extractor (multi-head attention) and a simple feed-forward network (position-wise feed-forward network); the decoder consists of a multimodal bilinear pooling feature extractor, a masked multi-head attention feature extractor, a multi-head attention feature extractor and a simple feed-forward network;
c. in the encoder, during the cross-entropy-loss training process, the image entity feature vectors of the feature group are first fed into the encoder-side bilinear pooling feature extractor to extract second-order fine features of the image, and the region-of-interest feature vectors of the feature group are then fed into the multi-layer simple encoder to encode multi-layer image information; one simple encoder layer consists of a multi-head attention extractor and a simple feed-forward network; the image information of each layer is fused with the second-order fine features of the image through a multi-head attention feature extractor and then passed through a simple feed-forward network to obtain that layer's refined fused image information; each layer's refined fused image information is multiplied elementwise with the gate value obtained by passing it through a sigmoid function, giving that layer's gated component, and the components are finally summed to obtain the encoder-side result;
d. in the decoder, during the cross-entropy-loss training process, position vector information is first added to the description in distributed representation, which is then fed into a masked multi-head attention feature extractor to obtain sequence features in which the information of words after the current word is masked; the encoder output and the sequence features are fused into multimodal image features through a multi-head attention feature extractor; the category feature word vectors and image entity feature vectors of the feature group are fed into the multimodal bilinear pooling feature extractor to extract refined multimodal features; the multimodal image features and the refined multimodal features are fused through a multi-head attention feature extractor and then passed through a simple feed-forward network to obtain the sequence feature input of the next decoder layer; after cycling through multiple layers, the simple feed-forward network output of the last decoder layer is taken as the final result of the decoder side; the result is passed through a softmax layer to obtain the probabilities of the output sequence, and the cross-entropy loss against the ground-truth description in the sample pair is computed; after each training epoch, the fitting state of the current model is verified on the validation set, and no backward iteration is performed during verification;
e. in the reinforcement-learning training process, CIDEr-D is used as the reward function; the feature group and the description in distributed representation first flow through the encoder and decoder sides to obtain the simple feed-forward network output of the last decoder layer; sentences are then obtained in two ways: a sentence composed of the words with the maximum probability, and a sentence composed by Monte Carlo sampling; reward scores against the ground-truth descriptions are computed for both, their difference is taken as the reward coefficient, the final loss is obtained, and a backward iteration is performed; after each training epoch, the fitting state of the current model is verified on the validation set, and no backward iteration is performed during verification.
2. The method for image description according to claim 1, wherein the image description data set in step 1 is the MSCOCO 2014 data set, which is split into 113,287 training-set pictures, 5,000 validation-set pictures and 5,000 test-set pictures.
3. The method of claim 1, wherein the dimension of the word vector in step 2 is 1024, each dimension of the word vector represents a word feature, and the vocabulary size is 10201.
4. The method of claim 1, wherein the dimension of the image region-of-interest feature vector in step 3 is 2048, each dimension of the image region-of-interest feature vector represents an image feature, each image includes 50 regions of interest, the number of image entity region frames and the number of image entity classes are the number of entities identified in the image, and the image entity classes include 80 classes.
5. The method of image description according to claim 1, wherein the dimension of the category feature word vectors in step 4 is 1024, and each dimension represents a word feature; the number of category feature word vectors is fixed to 5: images with fewer than 5 entities are zero-padded, and for images with more than 5 entities only 5 are selected.
6. The method of claim 1, wherein the dimension of the image entity feature vectors in step 5 is 2048, and each dimension represents an image feature; the number of image entity feature vectors is fixed to 5: images with fewer than 5 entities are zero-padded, and for images with more than 5 entities only 5 are selected.
7. The method of image description according to claim 1, wherein the step 7 of describing the test-set images and obtaining the test-set accuracy comprises the following specific steps:
a. the evaluation uses the MSCOCO 2014 data set, split into 113,287 training-set pictures, 5,000 validation-set pictures and 5,000 test-set pictures; each picture carries several task annotations, including object detection box annotations, keypoint detection annotations, segmentation annotations and image description annotations; the image description annotations used include 5 sentences of description per picture;
b. model training is performed with the training-set images, and model fitting and convergence are judged with the validation-set images during training; after training, the model is evaluated on the test set; in the cross-entropy-loss training process, a self-decaying learning-rate strategy is adopted, so that the learning rate changes automatically as training proceeds, and an Adam optimizer optimizes the trainable parameters; in the reinforcement-learning training process, the learning rate is set to 0.0000004 and an Adam optimizer optimizes the trainable parameters; the input dimension of the bilinear pooling feature extractor is 1024 and its output dimension is 8192; the input dimension of the multimodal bilinear pooling feature extractor is 1024 and its output dimension is 8192; the input dimension of the multi-head attention feature extractor is 1024, its output dimension is 1024, the number of heads is 8 and each head processes 128 dimensions; the input dimension of the simple feed-forward network is 1024 and its output dimension is 1024; a 3-layer simple encoder extracts the image features, and a 3-layer decoder decodes them and generates the description; the batch size during training is 64.
CN202010240856.1A 2020-03-31 2020-03-31 Image description method Active CN111523534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010240856.1A CN111523534B (en) 2020-03-31 2020-03-31 Image description method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010240856.1A CN111523534B (en) 2020-03-31 2020-03-31 Image description method

Publications (2)

Publication Number Publication Date
CN111523534A CN111523534A (en) 2020-08-11
CN111523534B true CN111523534B (en) 2022-04-05

Family

ID=71901170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010240856.1A Active CN111523534B (en) 2020-03-31 2020-03-31 Image description method

Country Status (1)

Country Link
CN (1) CN111523534B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112116074B (en) * 2020-09-18 2022-04-15 西北工业大学 Image description method based on two-dimensional space coding
CN112712066B (en) * 2021-01-19 2023-02-28 腾讯科技(深圳)有限公司 Image recognition method and device, computer equipment and storage medium
CN112819012B (en) * 2021-01-29 2022-05-03 厦门大学 Image description generation method based on multi-source cooperative features
CN113204666B (en) * 2021-05-26 2022-04-05 杭州联汇科技股份有限公司 Method for searching matched pictures based on characters
CN113298151A (en) * 2021-05-26 2021-08-24 中国电子科技集团公司第五十四研究所 Remote sensing image semantic description method based on multi-level feature fusion
CN113591770B (en) * 2021-08-10 2023-07-18 中国科学院深圳先进技术研究院 Multi-mode fusion obstacle detection method and device based on artificial intelligence blind guiding
CN113689328A (en) * 2021-09-13 2021-11-23 中国海洋大学 Image harmony system based on self-attention transformation
CN114399628B (en) * 2021-12-21 2024-03-08 四川大学 Insulator high-efficiency detection system under complex space environment
CN114332544B (en) * 2022-03-14 2022-06-07 之江实验室 Image block scoring-based fine-grained image classification method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033008A (en) * 2019-04-29 2019-07-19 同济大学 A kind of iamge description generation method concluded based on modal transformation and text
CN110349229A (en) * 2019-07-09 2019-10-18 北京金山数字娱乐科技有限公司 A kind of Image Description Methods and device
CN110348462A (en) * 2019-07-09 2019-10-18 北京金山数字娱乐科技有限公司 A kind of characteristics of image determination, vision answering method, device, equipment and medium
CN110704547A (en) * 2019-09-26 2020-01-17 北京明略软件系统有限公司 Relation extraction data generation method, model and training method based on neural network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
"Deep residual learning for image recognition"; He K M et al.; arXiv; 20151210; full text *
"Attention is all you need"; Vaswani A et al.; arXiv; 20171206; full text *
"Bilinear CNN models for fine-grained visual recognition"; Lin T Y et al.; arXiv; 20170601; full text *
"Bottom-up and top-down attention for image captioning and visual question answering"; Anderson P et al.; arXiv; 20180314; full text *
"Fine-grained image classification method based on multi-view fusion" (基于多视角融合的细粒度图像分类方法); Huang Weifeng et al.; Signal Processing (信号处理), No. 9; full text *
"Self-attention mechanism based on truncated Gaussian distance for natural language inference" (面向自然语言推理的基于截断高斯距离的自注意力机制); Zhang Pengfei et al.; Computer Science (计算机科学), No. 4; full text *

Also Published As

Publication number Publication date
CN111523534A (en) 2020-08-11

Similar Documents

Publication Publication Date Title
CN111523534B (en) Image description method
CN107608943B (en) Image subtitle generating method and system fusing visual attention and semantic attention
CN115471851B (en) Burmese image text recognition method and device integrating dual attention mechanisms
CN112231472B (en) Judicial public opinion sensitive information identification method integrated with domain term dictionary
CN113486669B (en) Semantic recognition method for emergency rescue input voice
CN112597841B (en) Emotion analysis method based on door mechanism multi-mode fusion
CN113423004B (en) Video subtitle generating method and system based on decoupling decoding
CN115577161A (en) Multi-mode emotion analysis model fusing emotion resources
CN113516152A (en) Image description method based on composite image semantics
CN114663915A (en) Image human-object interaction positioning method and system based on Transformer model
CN114627162A (en) Multimodal dense video description method based on video context information fusion
CN116168324A (en) Video emotion recognition method based on cyclic interaction transducer and dimension cross fusion
CN115630145A (en) Multi-granularity emotion-based conversation recommendation method and system
CN110929476B (en) Task type multi-round dialogue model construction method based on mixed granularity attention mechanism
Zhu et al. Multiscale temporal network for continuous sign language recognition
CN115408488A (en) Segmentation method and system for novel scene text
CN111340006A (en) Sign language identification method and system
CN112307778B (en) Translation model training method, translation method and translation system for sign language video of specific scene
CN115809438B (en) Multi-mode emotion analysis method, system, equipment and storage medium
CN117033558A (en) BERT-WWM and multi-feature fused film evaluation emotion analysis method
CN115374281B (en) Session emotion analysis method based on multi-granularity fusion and graph convolution network
CN116975161A (en) Entity relation joint extraction method, equipment and medium of power equipment partial discharge text
CN113901172B (en) Case-related microblog evaluation object extraction method based on keyword structural coding
CN114896969A (en) Method for extracting aspect words based on deep learning
CN115270917A (en) Two-stage processing multi-mode garment image generation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant