CN114549888A

CN114549888A - Image semantic understanding analysis method based on global interaction

Info

Publication number: CN114549888A
Application number: CN202011253160.9A
Authority: CN
Inventors: 库涛; 熊艳彬; 杨琦瑞; 南琳; 刘金鑫; 林乐新; 王海; 张志东; 马岩
Original assignee: Shenyang Institute of Automation of CAS
Current assignee: Shenyang Institute of Automation of CAS
Priority date: 2020-11-11
Filing date: 2020-11-11
Publication date: 2022-05-27

Abstract

The invention constructs a global interactive image semantic analysis method based on machine vision, is applied to the generation of image semantic titles, and specifically comprises the following steps: 1) selecting a target image feature extraction model, and performing feature extraction and coding on image data; 2) constructing a globally interactive bidirectional recurrent neural network to analyze the image characteristics; 3) carrying out standard regularization processing on the extracted image characteristic information, and sending the image characteristic information into a semantic analysis model in real time in a global information mode for model training; 4) and carrying out semantic analysis on the new target image through the trained model. The image semantics generated by the image semantic understanding model and the algorithm established by the invention have the characteristics of strong logicality, rich semantics, high model convergence speed, high semantic analysis precision and the like, and are successfully applied to the fields of content-based image retrieval, medical image analysis, auxiliary blind guiding, early education of children and the like.

Description

Image semantic understanding analysis method based on global interaction

Technical Field

The invention relates to an analytic method for constructing an image semantic understanding model based on global interaction, which is applied to the fields of content-based image retrieval, medical image analysis, auxiliary blind guiding, early education of children and the like. The method is characterized in that a picture is given to generate the text description of the picture, and belongs to the technical fields of target detection, semantic analysis algorithm and the like.

Background

The image semantic understanding is based on image recognition, and integrates cross discipline research of multiple disciplines such as computer science, psychology, linguistics and the like, and also makes an important contribution to cross-modal interaction research between images and texts. The image semantic understanding technology is intended to understand the whole target image rationally or perceptually and generate natural language description conforming to human habits, and not only needs to extract and identify scenes, objects and attributes contained in the target image, but also needs to analyze the interrelationship among the objects and attributes, including the action, form and human psychology and emotion of each object, and generate text description of the image according to the information, so that the technology is a very complex and challenging task.

At present, the image semantic understanding technology is widely applied to various fields such as image retrieval, medical image analysis, auxiliary blind guiding, early education of children, news automation, space military security and the like, and plays an increasingly important role in each field.

The traditional image semantic understanding method is mainly a template-based method and a transfer generation method, and has the limitations that the whole model is excessively dependent on a certain grammar template or a reference image text database, and the process of flexibly analyzing an image and generating a brand-new text by a language model is omitted, so the output result of the model is not satisfactory. In recent years, this task has made a drastic advancement and achievement with the application of encoder-decoder based neural network models in the field of image semantic understanding.

The invention mainly creates innovation on how to effectively improve and improve the image semantic understanding effect of a neural network model based on an encoder-decoder, and mainly solves the problem that a bidirectional cyclic neural network model is used for semantic analysis of images; regularization processing is carried out on image data and text data on the basis of a bidirectional cyclic neural network model, and image information is introduced into the bidirectional cyclic neural network in a global interaction mode; and a word2vec text mapping mode is adopted to express text information to solve the problems of data sparseness and skewness. The model has the characteristics of strong logicality of the generated image semantic understanding result, high accuracy of image content identification, high model running speed and the like.

Disclosure of Invention

Aiming at the problems of low model convergence speed, poor generated semantic analysis logicality and the like of a NIC (Neural Image Caption Generator) which is a representative model of a coding and decoding structure, the invention provides that global Image information is sent to a semantic analysis model in real time in the semantic analysis stage of an Image semantic task for guiding semantic generation, and a bidirectional recurrent Neural network model is adopted to carry out semantic analysis on an Image, so that Image semantic description with stronger logicality and higher precision is obtained. The general structure diagram of the model is shown in fig. 1.

The technical scheme adopted by the invention is as follows: an image semantic understanding model based on global interaction is used for image semantic generation, namely, a bidirectional recurrent neural network model is adopted to carry out semantic analysis in the process of generating a semantic text corresponding to an image, so that the model can pay attention to context information before and after the model pays attention to the semantic analysis process in real time, and semantic consistency and logicality are ensured; in the semantic analysis process, the global information of the image is paid attention to in real time to guide semantic generation; regularizing the extracted image characteristic data and the text data, and representing the text information by adopting a word2vec text mapping mode, thereby reducing the influence of data noise and solving the problems of high-dimensional data sparseness and data skewness.

An image semantic understanding analysis method based on global interaction comprises the following steps:

1) performing feature extraction on an input image through an image feature extraction encoder to obtain high-dimensional image feature information, and sending the feature vector serving as global information of the image to a decoding end;

2) the decoding end comprises a double-layer GRU structure, and the decoding end analyzes according to the global information of the image to obtain the text description corresponding to the input image.

In the decoding end, the analysis is carried out according to the global information of the image, and the method comprises the following steps:

the double-layer GRU network comprises a forward GRU and a backward GRU; after receiving the global information of the image at each moment, the forward GRU and the backward GRU respectively and independently generate the updating states at the moment t; the forward GRU and the backward GRU respectively output the updating state of the time t to the previous forward GRU and the next backward GRU, linear superposition is carried out on the GRU output in the two directions, and the image text corresponding to the input image at the current time is predicted by utilizing the GRU linear superposition results in the two directions.

When the image feature extraction encoder and decoder are trained in advance through the steps 1) to 2), in the step 2), text information corresponding to an input image is respectively input to a forward GRU and a backward GRU after being overlapped and weighted with the input image.

And performing semantic analysis by using the bidirectional GRU network, and inputting the global information of the image into a forward GRU and a backward GRU in the bidirectional GRU network at each cycle time t to guide the generation of image semantics.

The image feature extraction encoder adopts a convolutional neural network VGG-16 model.

The invention has the following advantages:

the semantic description logicality of the generated target image is strong; the accuracy of identifying the image content is high; the algorithm model has the advantages of high convergence rate, difficult overfitting and the like. The method has a very obvious effect on improving the accuracy of semantic parsing of the image.

Drawings

FIG. 1 is a general structural diagram of an image semantic understanding model based on global interaction;

FIG. 2 is a schematic diagram of a GRU unit structure;

FIG. 3 is a diagram of a globally interactive GRU unit structure;

FIG. 4 is a schematic diagram of a bi-directional GRU recurrent neural network in time sequence;

FIG. 5 is a graph of loss function fluctuation for a NIC baseline model training process;

FIG. 6 is a graph of loss function fluctuation of an image semantic understanding model training process based on global interaction;

FIG. 7 is a diagram of an example of image semantic understanding model semantic understanding test based on global interaction.

Detailed description of the preferred embodiments

The present invention will be described in further detail below.

Step 1: image feature information extraction and encoding

1.1) image feature extraction and encoding

The image feature extraction encoder in the model adopts a convolutional neural network VGG-16 model to extract features of an input image, 4096-dimensional high-dimensional image feature information is obtained at the output end of the network, and the feature vector is used as global information of the image and is sent to a decoding end to carry out cross-mode interaction.

Step 2: image feature information decoding

2.1) gated cycle cell

In the global interaction model, in order to improve the accuracy and richness of language description, a bidirectional circulation neural network model is adopted, and in order to avoid the problem that the parameter scale is increased rapidly because a Long Short-term memory Unit (LSTM) is directly adopted, a Gated circulation Unit (GRU) is adopted as a logic Unit of the circulation neural network, so that the purposes of reducing the model parameter scale, improving the model convergence speed, reducing the overfitting degree of the model and the like are achieved.

The GRU unit combines and changes an input gate and a forgetting gate of the LSTM unit into an updating gate, linear self-updating is not established on an additional memory unit, but is established on a hidden state after direct linear accumulation and is regulated and controlled by the gate. After the gate control logic assembly inside the LSTM unit is simplified by the GRU unit, the scale of the whole parameter is reduced by about thirty percent, the convergence speed of the model training process is accelerated, and the possibility of model overfitting is reduced. The working principle and the updating formula of the GRU unit gating logic component are as follows:

r_t＝δ(W_rx_t+U_rh_t-1)

Reset Gate：r_tresponsible for determining the hidden unit h at the previous moment_t-1For New Memory

Of importance if r_tApproximately equal to 0, h_t-1Will not be passed to New Memory

Wherein, W_rAnd U_rIs the Reset Gate update parameter.

z_t＝δ(W_zx_t+U_zh_t-1)

Update Gate：z_tResponsible for deciding how much hidden state h of the previous moment is transferred_t-1Giving hidden state h at the current moment_t。W_zAnd U_zIs the Update Gate Update parameter. If z is_tH is equal to 1_t-1Almost directly copy to h_tOn the contrary, if z_tAbout 0, New Memory

Is directly transmitted to h_tAs shown in formula (3.5):

new Memory: novel memory

Is for the current input x_tAnd the last time Hidden Stateh_t-1To summarize (a). W and U are New Memory update parameters. Calculating the summarized new vector

Containing the above information and the new input x_t。

Hidden State：h_tFrom h_t-1And

the weights of both are controlled by Update Gate z (t).

Where δ represents a sigmoid function, and an indicates a bitwise AND operation. A schematic diagram of the structure of the GRU unit is shown in fig. 2.

2.2) gated-cycle cells under Global interaction

In the decoding-end recurrent neural network of the NIC baseline model, the generation of the image semantic information mainly depends on the input information at the current time point and the hidden state at the previous moment (implying the image information input at the beginning), and the process is gradually carried out until the end mark of a sentence is met. However, as this process continues, the image information that is initially entered into the language model becomes weaker and weaker, and a phenomenon of blurring or losing part of the image information occurs in the whole semantic generation process, resulting in that the semantic description cannot express the image content richly and comprehensively. Thus, for images that need to be described with a long sentence, the model executes almost "blindly" to the end of the sentence at the end stage of generating the semantics. In order to solve the problem, a global interaction mechanism of image information is introduced into the model, namely, the global image information is introduced into each time sequence in the semantic generation process of the recurrent neural network. A structural diagram of a globally interactive GRU unit is shown in fig. 3.

The V in the figure within the green dashed box represents the feature vector fed into the language model representing the global information of the image. The global image information interaction unit update formula is as follows:

r_t＝δ(W_rx_t+U_rh_t-1+G_rV_t-1)

z_t＝δ(W_zx_t+U_zh_t-1+G_zV_t-1)

the structure and the working principle of the gating logic component of the GRU unit after the image global information is added are the same as those of the GRU unit before the image global information is added, but the input of the image global information is added in the updating formula. Wherein r is_tUpdate formula for ResetGate, W_r、U_r、G_rUpdating parameters corresponding to the formula; z is a radical of_tUpdate the formula for Update Gate, W_z、U_z、G_zUpdating parameters corresponding to the formula;

updating a formula for the New Memory, wherein W, U is an updating parameter corresponding to the formula; h is_tUpdating a formula for the Hidden State; wherein x in the formula_t、h_t-1、V_t-1Respectively representing the current external input, the hidden state at the previous moment and the image global state at the previous momentInformation, δ represents a sigmoid function, and an indicates a bitwise AND operation.

2.3) bidirectional gated round robin network under Global interaction

In the generation training process of the semantic description of the target image, if the unidirectional cyclic neural network is adopted, each word only pays attention to the text information in front of the word sequence in the generation process, but does not pay attention to the text information behind the word, but when a word is decoded in the whole analysis process, the information around the word, namely the front context information and the back context information, is generally required to be known, and the introduction of the bidirectional cyclic neural network model solves the problem to a great extent. The unidirectional circulation neural network only outputs information in one direction at the same time t, and the main structure of the bidirectional circulation neural network is formed by superposing two unidirectional circulation neural networks in opposite directions. At each time sequence t, two recurrent neural networks in opposite directions receive extrinsic inputs simultaneously. Each layer network can independently update parameters without influencing each other, independently generate the updating state and the output result at the moment t, and output information in two directions. And finally, directly linearly superposing the output results of the two unidirectional circulation neural networks in opposite directions to serve as the final output of the bidirectional circulation network. The basic structure of the two recurrent neural networks is completely symmetrical except for different directions, and completely follows the updating rules in 2.1 and 2.2. The New Memory update formula is exemplified as follows:

wherein the content of the first and second substances,

for the forward propagating New Memory calculation at time t of the gated round robin unit,

as a result of the New Memory calculation at time t of the backward-propagating gated-loop cell, parameter W, U is a corresponding updated parameter of the equation, which still represents a bit-wise AND operation. To distinguish between forward and backward update formulas, only the superscript F, B is used to represent the forward and backward update formulas, respectively. The schematic diagram of the bidirectional GRU recurrent neural network developed in time sequence is shown in fig. 4.

2.4) Global interactive image semantic understanding model integral framework

The model overall architecture follows the encoding-decoding basic structure of input images to output text. Global interaction is mainly embodied in two aspects: the method includes the steps that firstly, a bidirectional GRU unit network is introduced for image semantic generation, namely context information and relation before and after semantic information generation are concerned in real time, and unidirectional semantic information is not concerned any more; secondly, the image global information is introduced into the GRU unit on the basis of the bidirectional GRU unit, the global information of the image is concerned in real time to guide semantic generation in the process of generating the text, and the phenomenon that the image characteristic information begins to analyze t of the image information only in the recurrent neural network is avoided₀And sending the image information into a cyclic neural network at any time to cause the problem of fuzzy image information at the end of semantic analysis. The overall structure diagram of the image semantic understanding model based on the global interaction is shown in FIG. 1.

2.5) training the model

In the encoding stage, namely the image feature extraction stage, the VGG-16 model weight pre-trained on ImageNet is directly used as an initial value and is led into a text model to accelerate the training process, the main parameters of the encoder model are not updated, and only the parameters of the output layer of the model are updated.

The decoding stage adopts a global interaction mechanism which is characterized in that the image information is not just before the semantic parsing begins, i.e. t_-1The language model is input at the time (t is 0, which is the first time sequence of the analytic model), but the image information in each time sequence is sent to the GRU unit without being influenced by the time sequence, and the image information is interacted with the semantic information in real time in the whole process of image semantic generation so as to guide the semantic generation. And not only a single concern in generating image semanticsThe semantic information of the direction, but the context information before and after paying attention to the current word at the same time, so that the generated semantic description is more consistent with the described image content.

By comparing the NIC baseline model with the image semantic understanding model loss fluctuation curve based on the global interaction, the convergence speed of the image semantic understanding model loss function curve based on the global interaction is obviously accelerated; the fluctuation amplitude is greatly reduced compared with that of the NIC baseline model, and the phenomenon of overfitting of the NIC baseline model along with continuous deepening of the training depth is effectively avoided by the loss curve fluctuation smoothing based on the image semantic understanding model of global interaction. The NIC model training process loss function fluctuation curve is shown in fig. 5, and the image semantic understanding model training process loss function fluctuation curve based on global interaction is shown in fig. 6.

And 4, step 4: model testing

The experimental results of the image semantic understanding model of the global interaction between the baseline model and the invention are shown in table 1:

TABLE 1

Wherein Baseline represents a NIC Baseline model and MMD-1 represents an image semantic understanding model of global interaction. As can be seen from the comparison of the results in the table, the image semantic understanding model effect of the global interaction is better than the result of the NIC baseline model under 4 BLEU indexes.

And acquiring new third-party data to verify the model effect on the premise that all the image data are based on the assumption of independent and same distribution. And performing multiple rounds of model training in the experimental process, and storing the model result of each round of training. An example of semantic understanding of an image semantic understanding model based on global interaction under third-party image data is as follows, and a diagram such as that shown in fig. 7.

(1) a white dog and a black dog ear running on the grass field (a white dog and a black dog chasing on grass)

(2) a motorcycle in black coat is designing the motorcycle on the road (a motorcyclist wearing a black coat is riding a motorcycle on the road)

(3) a little girl in a white shirt bathing on the grass hanging a flowers (a small girl wearing a white shirt is sitting on the grass, holding several flowers in the hand)

(4) a man in white shirts and shorts is playing society (a man wearing a white shirt and shorts is playing a football).

Claims

1. An image semantic understanding analysis method based on global interaction is characterized by comprising the following steps:

2. The image semantic understanding parsing method based on global interaction as claimed in claim 1, wherein in a decoding end, parsing is performed according to global information of an image, comprising the following steps:

3. The method according to claim 1, wherein when the image feature extraction encoder and decoder are trained in advance through steps 1) -2), in step 2), text information corresponding to an input image is respectively input to the forward GRU and the backward GRU after being superimposed and weighted with the input image.

4. The image semantic understanding and parsing method based on global interaction as claimed in claim 3, wherein a bidirectional GRU network is used for semantic parsing, and global information of an image is input to a forward GRU and a backward GRU in the bidirectional GRU network at each cycle time t while the bidirectional GRU network is used for semantic parsing so as to guide generation of image semantics.

5. The image semantic understanding parsing method based on global interaction as claimed in claim 1, wherein the image feature extraction encoder employs convolutional neural network VGG-16 model.