CN111144410A - Cross-modal image semantic extraction method, system, device and medium - Google Patents

Cross-modal image semantic extraction method, system, device and medium

Info

Publication number
CN111144410A
CN111144410A (application CN201911368306.1A)
Authority
CN
China
Prior art keywords
image
semantic
model
extracted
attention vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911368306.1A
Other languages
Chinese (zh)
Other versions
CN111144410B (en)
Inventor
杨振宇
刘侨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN201911368306.1A priority Critical patent/CN111144410B/en
Publication of CN111144410A publication Critical patent/CN111144410A/en
Application granted granted Critical
Publication of CN111144410B publication Critical patent/CN111144410B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal image semantic extraction method, system, device and medium. The method comprises the following steps: acquiring an image whose semantics are to be extracted and inputting it into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder connected with each other; the encoder extracts a semantic attention vector and a visual attention vector from the image; the decoder performs a weighted summation of the semantic attention vector and the visual attention vector to obtain a final attention vector; and the decoder processes the final attention vector to obtain a final caption.

Description

Cross-modal image semantic extraction method, system, device and medium
Technical Field
The present disclosure relates to the field of image semantic extraction technologies, and in particular, to a cross-modality image semantic extraction method, system, device, and medium.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
The image captioning task requires a computer to accurately recognize the information in an image and to express that information correctly in natural language. Image captioning is a cross-modal task, from image to text. It combines the two research fields of computer vision and natural language processing and therefore draws on knowledge from both. Image captioning has many applications: it can provide diagnostic assistance to non-medical professionals and junior doctors, and it can help visually impaired people understand the content of an image.
In the course of implementing the present disclosure, the inventors found that the following technical problems exist in the prior art:
early conventional methods fall into retrieval-based methods and template-based methods, which approach the image captioning task from different perspectives.
The retrieval-based method, as shown in fig. 1, is given a retrieval data set containing images and their corresponding descriptions. To generate a caption, the method first retrieves images similar to the image to be described from the retrieval data set and then collects the captions of those similar images. Finally, one of these captions, or a caption obtained by summarizing and recombining them, is used as the caption of the image to be described. The advantage of this method is that the generated captions are fluent and natural and contain no syntax errors.
The template-based method, as shown in fig. 2, first detects scenes, objects, object attributes and interactions between objects in an image by means of object detection, attribute classification and similar techniques, and then fills the words corresponding to this information into a preset template with fixed rules. The advantage of this method is that the generated captions closely fit the image information.
In recent years, the image captioning task has also advanced greatly, benefiting from the development of deep learning networks and high-performance computing devices. Meanwhile, the successful application of deep learning to machine translation has provided great inspiration for image captioning. The image captioning task can be understood as a special machine translation task: traditional machine translation translates one language (e.g., Chinese) into another (e.g., English), while image captioning translates an image into text. As shown in fig. 3, the convolutional neural network (CNN) has been very successful in image processing, and the long short-term memory network (LSTM) has also achieved very good results in natural language processing. These two deep neural networks were therefore introduced into image captioning: the convolutional neural network serves as an encoder that extracts and encodes the information in the image, and the long short-term memory network serves as a decoder that decodes the information provided by the encoder and generates the caption.
The basic encoder-decoder image captioning framework inputs the image information only at the initial moment of the decoding end, which easily causes the information to be forgotten. Inspired by machine translation, researchers introduced the attention mechanism: at each caption-generation step, a probability distribution over the image regions is calculated from the hidden state of the LSTM at the previous moment and the encoded image information. Since then, attention mechanisms have been continuously improved and applied to image captioning, and the performance of image captioning has improved accordingly.
An image contains both visual information and semantic information. The visual information is the spatial position information in the image, and the semantic information consists of semantic concepts such as the objects, attributes and relationships contained in the image. How to select semantic information and visual information effectively during caption generation therefore becomes an important issue. If too much attention is paid to the visual information during caption generation, the details of the image are captured well, but often only individual regions of the image are described, and the resulting caption tends to be one-sided. If too much attention is paid to the semantic information, the semantic concepts in the image are extracted well, but the spatial position information between some semantic concepts is easily ignored, which sometimes leads to erroneous descriptions.
Generated image captions, like any other sentences, inherently contain a syntactic structure. Existing models ignore this when generating captions, which results in captions with poor syntactic readability.
Disclosure of Invention
In order to solve the deficiencies of the prior art, the present disclosure provides a cross-modality image semantic extraction method, system, device, and medium;
in a first aspect, the present disclosure provides a cross-modality image semantic extraction method;
a cross-modal image semantic extraction method comprises the following steps:
acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
and the decoder processes the final attention vector to obtain a final subtitle.
In a second aspect, the present disclosure also provides a cross-modality image semantic extraction system;
a cross-modality image semantic extraction system, comprising:
an input module configured to: acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
a vector extraction module configured to: the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
a balancing module configured to: the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
a grammar optimization module configured to: and the decoder processes the final attention vector to obtain a final subtitle.
In a third aspect, the present disclosure also provides an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
Compared with the prior art, the present disclosure has the following beneficial effects:
aiming at the problems of image information selection and grammar readability, an image captioning framework based on an attention balancing mechanism and a syntax optimization module is designed. In this framework, the attention balancing mechanism balances semantic information and visual information so that the information in the image is selected effectively, and the syntax optimization module optimizes the grammar of the generated caption and improves its readability.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of a semantic extraction method based on a search method in the prior art;
FIG. 2 is a flow chart of a method for semantic extraction based on a template method in the prior art;
FIG. 3 is a flow chart of a semantic extraction method based on deep learning in the prior art;
FIG. 4 is a flowchart of a semantic extraction method for an ATT-B-SOM-based model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a double-layer LSTM structure according to an embodiment of the present disclosure;
FIG. 6(a)-FIG. 6(k) are schematic diagrams of the spatial attention visualization of the baseline model proposed in an embodiment of the present disclosure;
FIG. 7(a)-FIG. 7(o) are schematic diagrams of the spatial attention visualization of the model proposed in an embodiment of the present disclosure;
FIG. 8(a) shows captions generated by the Baseline model according to the first embodiment of the present disclosure;
FIG. 8(b) shows captions generated by the model according to the first embodiment of the present disclosure.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Embodiment one provides a cross-modal image semantic extraction method;
a cross-modal image semantic extraction method comprises the following steps:
acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
and the decoder processes the final attention vector to obtain a final subtitle.
Further, the encoder includes: a convolutional neural network model for image target extraction, a pre-trained VGGNet19, a pre-trained ResNet101 network structure, a semantic attention mechanism model and a visual attention mechanism model;
further, the decoder includes: the system comprises a balance unit, a first layer LSTM model and a second layer LSTM model.
The input end of the semantic attention mechanism model is respectively connected with the output end of a convolutional neural network model for image target extraction, the output end of a pre-trained VGGNet19 and the output end of a first layer of LSTM model; the output end of the semantic attention mechanism model is connected with the input end of the balancing unit;
the input end of the visual attention mechanism model is respectively connected with the output ends of the pre-trained ResNet101 network structure and the first layer LSTM model; the output end of the visual attention mechanism model is connected with the input end of the balancing unit;
the output end of the balancing unit is connected with the input end of the first LSTM model; the output end of the first LSTM model is connected with the input end of the second LSTM model; the input end of the first LSTM model is connected with the output end of the second LSTM model; the output of the second LSTM model is used to output the final subtitles.
Further, the trained semantic extraction model is trained on the MS COCO 2014 data set.
Further, after the step of inputting the image with the semantics to be extracted into the trained semantic extraction model, before the step of extracting the semantic attention vector and the visual attention vector from the image with the semantics to be extracted by the encoder, the method further comprises:
the encoder performs image target extraction on the image whose semantics are to be extracted to obtain the image targets; performs image topic extraction to obtain the image topic; and performs image visual feature extraction to obtain the image visual features.
Further, the encoder extracts a semantic attention vector from the image to be subjected to semantic extraction, including:
and inputting the image target, the image theme and the hidden state of the first layer of the LSTM model at the previous moment into a semantic attention mechanism model, and outputting a semantic attention vector.
Further, the encoder extracts a visual attention vector for an image of the to-be-extracted semantics, comprising:
and inputting the image visual characteristics and the hidden state of the first layer of the LSTM model at the previous moment into a visual attention mechanism model, and outputting a visual attention vector.
Further, the decoder processes the final attention vector to obtain a final subtitle; the method comprises the following steps:
inputting the final attention vector, the hidden state of the first layer of LSTM model at the previous moment and the word generated at the second layer of LSTM model at the previous moment into the first layer of LSTM model;
and inputting the hidden state of the first layer of LSTM model at the current moment and the hidden state of the second layer of LSTM model at the previous moment into the second layer of LSTM model, and outputting the final caption.
Further, at the initial time, the hidden state of the first layer LSTM at the previous time is the set value.
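To make the data flow of the two-layer decoder concrete, the following is a minimal sketch of a single decoding step, written in Python/PyTorch. It is illustrative only: all class names, variable names and dimension choices (embed_dim, hidden, att_dim), and the plain LSTMCell standing in for the second-layer ON-LSTM, are assumptions rather than code from the patent.

```python
import torch
import torch.nn as nn

class TwoLayerDecoderStep(nn.Module):
    """One decoding step: layer 1 consumes the balanced attention vector and the
    previous word; layer 2 consumes layer 1's current hidden state."""
    def __init__(self, embed_dim=512, hidden=512, att_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # First layer: conventional LSTM cell.  Its input at time t is the final
        # (balanced) attention vector plus the word emitted by layer 2 at t-1;
        # its own hidden state h1_{t-1} enters through the recurrent state.
        self.lstm1 = nn.LSTMCell(embed_dim + att_dim, hidden)
        # Second layer: stand-in for the ON-LSTM syntax-optimization unit.
        # It receives h1_t as input and carries h2_{t-1} in its recurrent state.
        self.lstm2 = nn.LSTMCell(hidden, hidden)
        self.fc_out = nn.Linear(hidden, vocab_size)

    def forward(self, prev_word, att_balanced, state1, state2):
        x1 = torch.cat([self.embed(prev_word), att_balanced], dim=1)
        h1, c1 = self.lstm1(x1, state1)      # hidden state of layer 1 at time t
        h2, c2 = self.lstm2(h1, state2)      # layer 2 uses h1(t) and its own h2(t-1)
        logits = self.fc_out(h2)             # distribution over the caption vocabulary
        return logits, (h1, c1), (h2, c2)
```

At the initial time step, state1 and state2 would be set to the preset values mentioned above (e.g. zeros or topic-conditioned initial states).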
Further, extracting an image target of the image with the semantic to be extracted through an encoder to obtain the image target; the method comprises the following specific steps:
and constructing a convolutional neural network model for image target extraction by adopting a multi-example learning weak supervision mode, and extracting the image target of the semantic to be extracted based on the convolutional neural network model for image target extraction to obtain the image target.
As will be appreciated, image target extraction: an image contains many semantic concepts: objects (e.g., cars, computers), attributes (e.g., wooden, black) and relationships (e.g., riding, lying). The present disclosure considers the object information an image contains to be its most important semantic concept, and therefore adds target information to the caption-generation process. A weakly supervised multiple-instance learning method is adopted to construct the target extraction model, and 568 target words are finally obtained.
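The patent states only that the target-extraction model is a CNN trained with weakly supervised multiple-instance learning. The sketch below shows one common way such a word detector can be set up, using noisy-OR pooling over image regions; the noisy-OR choice, the region dimension and all names are assumptions made for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

class MILWordDetector(nn.Module):
    """Weakly supervised target-word detector: each image region is an instance,
    and an image is positive for a word if any region triggers that word."""
    def __init__(self, region_dim=2048, num_words=568):
        super().__init__()
        self.word_scores = nn.Linear(region_dim, num_words)   # per-region word logits

    def forward(self, regions):                    # regions: (batch, m, region_dim)
        p_region = torch.sigmoid(self.word_scores(regions))   # (batch, m, num_words)
        # noisy-OR over regions: P(word in image) = 1 - prod_i (1 - p_i)
        p_image = 1.0 - torch.prod(1.0 - p_region, dim=1)
        return p_image                             # image-level word probabilities

# Training would use only image-level labels (words appearing in the reference
# captions), e.g. nn.BCELoss between p_image and a multi-hot target vector.
```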
Further, extracting image themes of the image with the semantic to be extracted to obtain image themes; the method comprises the following specific steps:
and (3) performing image theme extraction on the image to be subjected to semantic extraction by adopting pre-trained VGGNet19 to obtain an image theme.
Further, the pre-trained VGGNet19 is further trained on an image-topic pair data set.
Further, the image-topic pair data set is constructed by extracting topics from a known caption data set using latent Dirichlet allocation and then mapping the extracted topics to the corresponding images.
As will be appreciated, image topic extraction: each image has its own primary meaning, namely its topic. Adding the topic information of the image to the caption-generation process helps align the generated caption with the semantics of the original image. The present disclosure extracts topics from the caption data set using latent Dirichlet allocation (a document topic generation model), constructs an image-topic pair data set from the extracted topics and the images, and uses this data set to train a classifier. A pre-trained VGGNet19 is used as the classifier, and its network structure is adapted to the data set of the present disclosure by fine-tuning.
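A minimal sketch of this topic pipeline follows, assuming gensim for LDA and torchvision for VGGNet19. The function names, the dominant-topic heuristic and the choice of 80 topics as the classifier output are taken from the surrounding text; everything else is an illustrative assumption rather than the patent's code.

```python
from gensim import corpora, models
import torch.nn as nn
import torchvision

# 1) Topic extraction on the caption data set (captions pre-tokenized into word lists).
def build_lda(tokenized_captions, num_topics=80):
    dictionary = corpora.Dictionary(tokenized_captions)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_captions]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    return lda, dictionary

def dominant_topic(lda, dictionary, caption_tokens):
    bow = dictionary.doc2bow(caption_tokens)
    # pick the topic with the highest probability as the image's topic label
    return max(lda.get_document_topics(bow), key=lambda t: t[1])[0]

# 2) Fine-tune VGGNet19 as the image-topic classifier (80 classes).
def build_topic_classifier(num_topics=80):
    vgg = torchvision.models.vgg19(pretrained=True)
    vgg.classifier[6] = nn.Linear(4096, num_topics)   # replace the final FC layer
    return vgg
```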
Further, the image visual feature extraction is carried out on the image with the semantic to be extracted, and the specific steps of obtaining the image visual feature comprise:
and adopting a ResNet101 network structure pre-trained by an ImageNet data set to extract image visual features of the image with the semantic to be extracted, and acquiring the image visual features.
Further, using a ResNet101 network structure pre-trained on the ImageNet data set, carrying out image visual feature extraction on an image to be subjected to semantic extraction, wherein the feature extracted from the last convolutional layer before the full connection layer of the pre-trained ResNet101 network structure is used as an image visual feature.
It should be understood that the role of visual feature extraction is to extract visual information from an image. Given an image I, the visual extraction module extracts a set of visual feature vectors V = {v1, v2, v3, ..., vm} from the image to express its visual information, where each feature vector corresponds to one region of the image. These feature vectors provide the decoder with the visual information needed to generate the caption. The present disclosure uses a ResNet101 network structure pre-trained on the ImageNet data set to extract the visual features of the image. Specifically, the features extracted by the last convolutional layer before the fully connected layer are used as the visual information:
V = CNNs(I)    (1)
It should be understood that the present disclosure designs a two-layer LSTM structure at the decoding end (as shown in fig. 5), which includes an attention balancing unit and a syntax optimization unit. First, the semantic attention vector and the visual attention vector are calculated; the two attention vectors are then scored by the scoring function of the attention balancing unit, and finally the weighted attention is output through a gate mechanism.
Further, inputting the image target, the image theme and the hidden state of the first layer of the LSTM model at the previous moment into a semantic attention mechanism model, and outputting a semantic attention vector; the method comprises the following specific steps:
The target-topic attention is calculated using a fully connected neural network with Softmax. Its inputs are the hidden state of the first-layer LSTM at the previous moment, the topic of the current image and the target information obtained from the data set, and its output is the probability distribution α = {α_1, α_2, ..., α_n} over the target information.
The calculation formula is as follows:
e_{i,t} = w_e^T · tanh(W_eT·T ⊕ W_eO·O_i ⊕ W_eh·h^1_{t-1})    (2)
α_{i,t} = softmax(e_{i,t})    (3)
wherein W_eT, W_eO, W_eh and w_e are parameters to be learned, ⊕ represents the vector-matrix addition operation, T represents the topic word vector of the current image, O represents the target word vector, and h^1_{t-1} represents the hidden state of the first-layer LSTM at the previous moment; for convenience of description, the bias terms are omitted. Finally, the attention vector of the targets is obtained from the attention weights:
A_t = Σ_{i=1}^{n} α_{i,t}·O_i    (4)
Here, a fully connected neural network with Softmax means a fully connected neural network with a Softmax layer added at the end.
Further, inputting the visual characteristics of the image and the hidden state of the first layer of the LSTM model at the previous moment into a visual attention mechanism model, and outputting a visual attention vector; the method comprises the following specific steps:
Visual attention calculates the probability distribution β = {β_1, β_2, ..., β_m} over the image regions for the current moment; a fully connected neural network with Softmax is again selected to calculate the visual attention.
The calculation formula is as follows:
va_{i,t} = w_a^T · tanh(W_av·v_i ⊕ W_ah·h^1_{t-1})    (5)
β_{i,t} = softmax(va_{i,t})    (6)
wherein W_av, W_ah and w_a are parameters to be learned, and v_i represents the i-th region of the image.
Finally, the attention vector of the image regions is obtained:
V_t = Σ_{i=1}^{m} β_{i,t}·v_i    (7)
where m is the total number of regions in the image.
It should be understood that the visual attention mechanism model, again, is a fully connected neural network with Softmax.
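A companion sketch for equations (5)-(7): the same kind of fully connected + Softmax scorer, applied to the m region vectors v_i instead of the target words. Shapes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, att_dim=512):
        super().__init__()
        self.W_av = nn.Linear(feat_dim, att_dim, bias=False)   # region projection
        self.W_ah = nn.Linear(hidden,  att_dim, bias=False)    # hidden-state projection
        self.w_a  = nn.Linear(att_dim, 1, bias=False)          # scoring vector

    def forward(self, V, h1_prev):
        # V: (b, m, feat_dim) region features; h1_prev: (b, hidden)
        va = self.w_a(torch.tanh(self.W_av(V)
                                 + self.W_ah(h1_prev).unsqueeze(1))).squeeze(-1)  # (b, m), eq. (5)
        beta = torch.softmax(va, dim=1)                           # eq. (6)
        V_t = (beta.unsqueeze(-1) * V).sum(dim=1)                 # eq. (7)
        return V_t, beta
```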
Further, carrying out weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector; the method comprises the following specific steps:
the visual attention vector and the semantic attention vector are scored through a scoring function, and the specific process is represented as follows:
s_{A,t} = G(A_t) = tanh(W_GO·A_t)    (8)
s_{V,t} = G(V_t) = tanh(W_Gv·V_t)    (9)
φ_t = σ(s_{A,t} ⊕ s_{V,t})    (10)
B_t = φ_t·V_t + (1 − φ_t)·A_t    (11)
wherein φ_t ∈ [0,1] represents the relative importance of the visual attention vector and the semantic attention vector at the current moment, σ represents the sigmoid function, tanh() is the activation function, G is the scoring function used to assess the importance of each attention vector, and W_GO and W_Gv are parameters that need to be learned. B denotes the output after the semantic attention and the visual attention are balanced.
It should be understood that each moment of caption generation may focus on different information. When generating a word that expresses spatial position information in the image, the model needs to pay more attention to the visual information; in this case the weight of the visual attention should be greater than that of the semantic attention. When generating a word that expresses a semantic concept, the model should focus more on the semantic information; in this case the weight of the semantic attention should be greater than that of the visual attention. Therefore, in order for the model to select semantic information and visual information effectively during caption generation, the present disclosure proposes an attention balancing mechanism.
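The sketch below shows one plausible reading of the balancing unit: the two attention vectors are scored by a learned scoring function G and fused by a sigmoid gate φ_t into a single weighted sum B_t. The exact form of G is not recoverable from this text, so the gate construction, the common dimension of both attention vectors, and all names are assumptions.

```python
import torch
import torch.nn as nn

class AttentionBalance(nn.Module):
    """Scores semantic attention A_t and visual attention V_t, then gates them.
    Both vectors are assumed to have been projected to a common dimension."""
    def __init__(self, dim=512):
        super().__init__()
        self.W_GO = nn.Linear(dim, dim, bias=False)   # scores the semantic attention A_t
        self.W_Gv = nn.Linear(dim, dim, bias=False)   # scores the visual attention V_t
        self.gate = nn.Linear(2 * dim, 1)             # produces the scalar gate phi_t

    def forward(self, A_t, V_t):
        s_A = torch.tanh(self.W_GO(A_t))              # G(A_t)
        s_V = torch.tanh(self.W_Gv(V_t))              # G(V_t)
        phi = torch.sigmoid(self.gate(torch.cat([s_A, s_V], dim=1)))  # phi_t in [0, 1]
        B_t = phi * V_t + (1.0 - phi) * A_t           # balanced attention output
        return B_t, phi
```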
Further, the first-layer LSTM model refers to a conventional long short-term memory (LSTM) network.
Further, the second-layer LSTM model refers to an ordered-neuron long short-term memory network (ON-LSTM).
The structure of the ordered-neuron long short-term memory network (ON-LSTM) is as follows:
f_t = σ(W_f·x_t + U_f·h^2_{t-1} + b_f)    (12)
i_t = σ(W_i·x_t + U_i·h^2_{t-1} + b_i)    (13)
o_t = σ(W_o·x_t + U_o·h^2_{t-1} + b_o)    (14)
ĉ_t = tanh(W_c·x_t + U_c·h^2_{t-1} + b_c)    (15)
c_t = f_t ∘ c_{t-1} + i_t ∘ ĉ_t    (16)
h^2_t = o_t ∘ tanh(c_t)    (17)
wherein ∘ represents the multiplication of corresponding elements of the matrices; f_t, i_t and o_t respectively represent the forget gate, the input gate and the output gate of the ON-LSTM; ĉ_t and c_t respectively represent the candidate state and the cell state of the second-layer LSTM; and W_f, W_i, W_o, W_c, U_f, U_i, U_o and U_c are all parameters that need to be learned.
h^2_{t-1} denotes the hidden state of the second-layer LSTM at the previous moment. When the hierarchy level of the current input ĉ_t is higher than that of the historical information, c_t adopts the following updating mode:
c_t = f̃_t ∘ c_{t-1} + ĩ_t ∘ ĉ_t    (18)
wherein f̃_t and ĩ_t denote the master forget gate and the master input gate of the ON-LSTM, d_f represents the hierarchy level of the historical information c_{t-1}, and d_i represents the hierarchy level of the current input ĉ_t. d_f > d_i means that the historical information and the currently input information do not intersect; in that case the segment [d_i : d_f] is empty, and the part of c_t that does not need to be updated is set to 0. In this way, the present disclosure learns the hierarchical information in the caption sequence without supervision, so as to optimize the caption grammar.
Text sequence information contains a latent hierarchical structure: the higher the level of a piece of information, the coarser its granularity and the larger its span in the sentence. The ordered-neuron LSTM identifies and learns this hierarchical information from the sequence by updating between partitions and softening within segments.
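For concreteness, the following is a sketch of a single ON-LSTM cell, the structure used for the second decoder layer. The master gates implement the "update between partitions / soften within segments" behaviour via a cumulative softmax ("cumax") operator; the chunk-free formulation, layer sizes and names are simplifications and assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

def cumax(x, dim=-1):
    # cumulative softmax: a soft version of a binary gate that switches on at some level
    return torch.cumsum(torch.softmax(x, dim=dim), dim=dim)

class ONLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # 4 ordinary gates + 2 master gates, all computed from [x_t, h_{t-1}]
        self.linear = nn.Linear(input_size + hidden_size, 6 * hidden_size)

    def forward(self, x, state):
        h_prev, c_prev = state
        g = self.linear(torch.cat([x, h_prev], dim=1))
        f, i, o, c_hat, mf, mi = g.chunk(6, dim=1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c_hat = torch.tanh(c_hat)
        f_master = cumax(mf)            # master forget gate: protects high-level info
        i_master = 1.0 - cumax(mi)      # master input gate: writes low-level info
        omega = f_master * i_master     # overlap of the two partitions
        f_gate = f * omega + (f_master - omega)
        i_gate = i * omega + (i_master - omega)
        c = f_gate * c_prev + i_gate * c_hat   # cell update respects the learned hierarchy
        h = o * torch.tanh(c)
        return h, c
```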
The present disclosure was trained and tested on the MS COCO2014 data set. Specific information of the data set is shown in table 1:
TABLE 1 Data set
MS COCO 2014 (Karpathy split): 113287 training images, 5000 validation images, 5000 test images
In the formal experiments, the data set was partitioned according to the Karpathy (2015) split for comparison with other models: 113287 images are used for training, 5000 for validation and 5000 for testing. For each image, only 5 caption annotations are used for training. The vocabulary is constructed from the caption data set, and each word in the vocabulary occurs at least 5 times in the caption data set. LDA is used to obtain topics on the given caption data set, finally yielding 80 topics; the images are then classified over these 80 topics to construct the image-topic pair data set. The target concept data set is constructed through the multiple-instance weakly supervised method, and 568 target words are finally obtained.
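As a concrete illustration of the vocabulary rule above (minimum frequency of 5), the helper below builds such a vocabulary; the special tokens are assumptions, not specified by the patent.

```python
from collections import Counter

def build_vocab(tokenized_captions, min_freq=5):
    """Keep only words that occur at least min_freq times in the caption data set."""
    counts = Counter(w for caption in tokenized_captions for w in caption)
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}   # assumed special tokens
    for word, count in counts.items():
        if count >= min_freq:
            vocab[word] = len(vocab)
    return vocab
```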
The evaluation criteria used in this disclosure are: BLEU-1 and BLEU-4, METEOR, ROUGE, SPICE, and CIDEr, all computed with the COCO evaluation tool. Some of these metrics come from the field of image captioning, and some come from machine translation, text summarization and other fields. Specifically, BLEU-4 and METEOR are commonly used in machine translation tasks and compute scores over n-grams; ROUGE comes from text summarization; and SPICE and CIDEr are designed specifically for evaluating image captions. As with accuracy, higher scores are better for all of these criteria.
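For reference, the sketch below shows how such scores are commonly computed, assuming the pycocoevalcap packaging of the COCO caption evaluation tool mentioned above; the wrapper function and data layout are illustrative, and SPICE/METEOR additionally require a Java runtime.

```python
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

def evaluate(gts, res):
    # gts: {img_id: [{"caption": str}, ...]}  (5 references per image)
    # res: {img_id: [{"caption": str}]}       (one generated caption per image)
    tokenizer = PTBTokenizer()
    gts, res = tokenizer.tokenize(gts), tokenizer.tokenize(res)
    scorers = [(Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
               (Meteor(), "METEOR"), (Rouge(), "ROUGE-L"),
               (Cider(), "CIDEr"), (Spice(), "SPICE")]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(gts, res)
        if isinstance(name, list):
            results.update(dict(zip(name, score)))   # Bleu returns one score per order
        else:
            results[name] = score
    return results
```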
To demonstrate the effectiveness of the disclosed model, the present disclosure compares its experimental results with the following baseline models:
① Hard-Attention (Hard-ATT) first applied the attention mechanism to the image captioning task, proposing visual attention and dividing it into "hard" attention and "soft" attention.
② Semantic Attention (ATT-FCN) develops a semantic attention for image captioning tasks, divided into input attention and output attention according to the state.
③ Adaptive Attention (AdaATT) designed an adaptive attention with a "sentinel" mechanism that can decide whether to use historical information or visual information when a word is generated.
④ simNet is an integrated network that integrates a topic attention mechanism and a visual attention mechanism in the generation of captions.
⑤ CNN-Model proposes to use a convolutional neural network instead of a long short-term memory network as the decoder.
⑥ ATT-B-SOM is the model proposed by the present disclosure.
At the same time, the present disclosure disassembles the model into two models that use visual attention and semantic attention, respectively.
Experimental setup
For the encoder portion:
① The ResNet101 model provided by Torchvision and pre-trained on ImageNet is used to extract the visual features V from the 2048-dimensional last convolutional layer of the ResNet101 structure, ultimately yielding a 2048×14×14 feature map.
② The VGGNet19 network structure is trained on the image-topic pair data set, with the fully connected layer of VGGNet19 fine-tuned to 80 outputs (80 topics are extracted from the caption data set).
③ The target extraction model finally yields 568 target words.
For the decoding end:
① In the two-layer LSTM structure, the first-layer LSTM model is implemented by a conventional LSTM with a hidden state size of 512. The topic of the image is used as the initial input of the first-layer LSTM, which gives the LSTM a preview of the image content.
② the second layer LSTM model is implemented by ON-LSTM, which can recognize and learn the sequence hierarchy unsupervised, and is the same size as the first layer LSTM model.
Parameter settings: the learning rate of the encoder is 5e-4, the learning rate of the language model is 1e-5, the momentum is 0.8, and the weight decay is 0.999. A cross-entropy loss function is used to compute the loss of the model.
The present disclosure sets the batch size to 80 and the iteration cycle to 50.
ResNet is pre-trained on a classification task and therefore cannot fully adapt to the image captioning task of the present disclosure. Therefore, after 30 iterations, the model is fine-tuned so that the network adapts better to the image captioning task. The model of the present disclosure runs on a 16GB Tesla V100-PCIE GPU.
Table 2 Experimental results

Model (COCO data set)   BLEU-1  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
Hard-ATT                71.8    25.0    23.04   -        -      -
ATT-FCN                 70.9    30.4    24.3    -        -      -
Ada-ATT                 74.8    33.6    26.4    55.0     104.2  -
CNN-Model               71.1    28.1    28.7    52.2     91.2   17.5
Sim-Net                 -       33.2    28.3    56.4     113.5  22.0
Ours (Sem-ATT)          72.5    30.4    26.3    54.9     112.3  19.6
Ours (Spa-ATT)          73.0    31.3    27.2    55.7     113.4  20.0
Ours (ATT-B-SOM)        75.3    32.9    28.0    56.2     115.5  23.2
The tests were carried out with the above settings, and the results are shown in Table 2. From the table it can be seen that, compared with the other models, the ATT-B-SOM model of the present disclosure achieves the best results on BLEU-1, CIDEr and SPICE, reaching 75.3, 115.5 and 23.2, respectively.
The Sem-ATT Model and Spa-ATT Model of the present disclosure are also improved over the Hard-ATT, ATT-FCN, and CNN-Model models. It is noted that the convolutional neural networks used by the different models are also different for the extraction of image features.
Hard-ATT uses the VggNet and AlexNet models, ATT-FCN uses GoogleNet, CNN-Model uses VggNet, and Ada-ATT and Sim-Net use the ResNet152 network structure. The model of the present disclosure uses the ResNet101 network structure.
First, the visualization shows that the model of the present disclosure focuses on the area associated with the current word when generating each word. For example, when generating "boat", the white area shows that the current point of attention is on the hull; by contrast, when the Baseline model generates "boat", its description of the hull is blurred. Because semantic attention is considered in the model of the present disclosure, its attention area for the hull is more accurate. Second, the ATT-B-SOM model of the present disclosure is able to capture more detail from the image, so the captions it generates are more refined. As can be seen from fig. 7(a)-7(o), the model of the present disclosure captures the coat-color information of the dog, which is described by the word "brown"; this detail is absent in the baseline model. This example demonstrates that the attention balancing mechanism of the present disclosure benefits the image captioning task. Fig. 6(a)-6(k) are schematic diagrams of the spatial attention visualization of the baseline model proposed in an embodiment of the present disclosure.
The model designed by the present disclosure can identify and learn the hierarchical structure of the sequence information, so it can generate captions whose syntactic structure is more readable. To demonstrate this, three images and their generated captions are compared with the baseline model. In fig. 8(a), the Baseline model not only describes the relationship between "person" and "skateboard" incorrectly, but the syntactic structure of its caption is also less readable than that of the model of the present disclosure. The "on ... on ..." structure produced by the Baseline model hardly appears in everyday expression, and such a grammatical structure obviously reads poorly. Compared with the Baseline model, the captions generated by the model of the present disclosure are more readable and better match human logical expression. In fig. 8(b), it can be seen that the model of the present disclosure captures more information than the Baseline model and generates a more detailed caption: it captures "tree" as the background together with its spatial position information, as well as "banana". This is more helpful for understanding the image. Some images can be described by simple captions that do not require complex syntactic structures; for such cases the model of the present disclosure performs similarly to other models. However, when images that require complex caption descriptions appear, the model of the present disclosure shows its advantage.
The second embodiment further provides a cross-modality image semantic extraction system;
a cross-modality image semantic extraction system, comprising:
an input module configured to: acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
a vector extraction module configured to: the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
a balancing module configured to: the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
a grammar optimization module configured to: and the decoder processes the final attention vector to obtain a final subtitle.
To face these challenges, the present disclosure proposes a model named ATT-B-SOM based on the encoder-decoder architecture. The present disclosure presents a specific model diagram (as shown in fig. 4) to facilitate an intuitive view of the composition of the model. The encoder consists of a visual extraction module, an image theme extraction module and an image target extraction module, and the decoder consists of a balance module and a grammar optimization module. The encoder is used for extracting and encoding visual information and semantic information in the image.
In a third embodiment, the present embodiment further provides an electronic device, which includes a memory, a processor, and computer instructions stored in the memory and executed on the processor, where the computer instructions, when executed by the processor, implement the steps of the method in the first embodiment.
In a fourth embodiment, the present embodiment further provides a computer-readable storage medium for storing computer instructions, and the computer instructions, when executed by a processor, perform the steps of the method in the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A cross-modal image semantic extraction method is characterized by comprising the following steps:
acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
and the decoder processes the final attention vector to obtain a final subtitle.
2. The method of claim 1, wherein the encoder comprises: a convolutional neural network model for image target extraction, a pre-trained VGGNet19, a pre-trained ResNet101 network structure, a semantic attention mechanism model and a visual attention mechanism model;
the decoder, comprising: the system comprises a balance unit, a first layer of LSTM model and a second layer of LSTM model;
the input end of the semantic attention mechanism model is respectively connected with the output end of a convolutional neural network model for image target extraction, the output end of a pre-trained VGGNet19 and the output end of a first layer of LSTM model; the output end of the semantic attention mechanism model is connected with the input end of the balancing unit;
the input end of the visual attention mechanism model is respectively connected with the output ends of the pre-trained ResNet101 network structure and the first layer LSTM model; the output end of the visual attention mechanism model is connected with the input end of the balancing unit;
the output end of the balancing unit is connected with the input end of the first LSTM model; the output end of the first LSTM model is connected with the input end of the second LSTM model; the input end of the first LSTM model is connected with the output end of the second LSTM model; the output of the second LSTM model is used to output the final subtitles.
3. The method as claimed in claim 1, wherein after the step of inputting the image to be extracted with semantics into the trained semantic extraction model, before the step of the encoder extracting the semantic attention vector and the visual attention vector from the image to be extracted with semantics, the method further comprises:
extracting an image target of an image with semantic to be extracted through an encoder to obtain the image target; extracting image themes of the image with the semantic to be extracted to obtain image themes; and extracting image visual features of the image with the semantic to be extracted to obtain the image visual features.
4. The method of claim 1, wherein the encoder extracts a semantic attention vector for the image to be semantically extracted, comprising:
inputting an image target, an image theme and a hidden state of a first layer of LSTM model at the previous moment into a semantic attention mechanism model, and outputting a semantic attention vector;
or,
the encoder extracts a visual attention vector from an image of a semantic to be extracted, and comprises the following steps:
and inputting the image visual characteristics and the hidden state of the first layer of the LSTM model at the previous moment into a visual attention mechanism model, and outputting a visual attention vector.
5. The method of claim 1, wherein the decoder processes the final attention vector to obtain a final subtitle; the method comprises the following steps:
inputting the final attention vector, the hidden state of the first layer of LSTM model at the previous moment and the word generated at the second layer of LSTM model at the previous moment into the first layer of LSTM model;
and inputting the hidden state of the first layer of LSTM model at the current moment and the hidden state of the second layer of LSTM model at the previous moment into the second layer of LSTM model, and outputting the final caption.
6. The method as claimed in claim 3, wherein the image object is obtained by the encoder extracting the image object of the image whose semantic meaning is to be extracted; the method comprises the following specific steps:
and constructing a convolutional neural network model for image target extraction by adopting a multi-example learning weak supervision mode, and extracting the image target of the semantic to be extracted based on the convolutional neural network model for image target extraction to obtain the image target.
7. The method as claimed in claim 3, wherein the image subject extraction is performed on the image of the semantic to be extracted to obtain the image subject; the method comprises the following specific steps:
extracting image themes of the image with the semantic to be extracted by adopting pre-trained VGGNet19 to obtain the image themes;
or,
the method comprises the following steps of extracting visual features of an image with semantics to be extracted, and acquiring the visual features of the image:
and adopting a ResNet101 network structure pre-trained by an ImageNet data set to extract image visual features of the image with the semantic to be extracted, and acquiring the image visual features.
8. A cross-modal image semantic extraction system, comprising:
an input module configured to: acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
a vector extraction module configured to: the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
a balancing module configured to: the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
a grammar optimization module configured to: and the decoder processes the final attention vector to obtain a final subtitle.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
CN201911368306.1A 2019-12-26 2019-12-26 Cross-modal image semantic extraction method, system, equipment and medium Active CN111144410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911368306.1A CN111144410B (en) 2019-12-26 2019-12-26 Cross-modal image semantic extraction method, system, equipment and medium


Publications (2)

Publication Number Publication Date
CN111144410A true CN111144410A (en) 2020-05-12
CN111144410B CN111144410B (en) 2023-08-04

Family

ID=70520499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911368306.1A Active CN111144410B (en) 2019-12-26 2019-12-26 Cross-modal image semantic extraction method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN111144410B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102810158A (en) * 2011-05-31 2012-12-05 中国科学院电子学研究所 High-resolution remote sensing target extraction method based on multi-scale semantic model
CN107563498A (en) * 2017-09-08 2018-01-09 中国石油大学(华东) View-based access control model is combined the Image Description Methods and system of strategy with semantic notice
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN109559799A (en) * 2018-10-12 2019-04-02 华南理工大学 The construction method and the model of medical image semantic description method, descriptive model
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110414012A (en) * 2019-07-29 2019-11-05 腾讯科技(深圳)有限公司 A kind of encoder construction method and relevant device based on artificial intelligence

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
K. XU, J. L. BA, R. KIROS, K. CHO: "Show attend and tell: Neural image caption generation with visual attention" *
王兵 et al.: "Image subject region extraction and its application in image retrieval" *
金汉均; 段贝贝: "Research on cross-media retrieval based on deep visual feature regularization" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112862727A (en) * 2021-03-16 2021-05-28 上海壁仞智能科技有限公司 Cross-mode image conversion method and device
CN113569932A (en) * 2021-07-18 2021-10-29 湖北工业大学 Image description generation method based on text hierarchical structure
CN113569932B (en) * 2021-07-18 2023-07-18 湖北工业大学 Image description generation method based on text hierarchical structure
CN116665012A (en) * 2023-06-09 2023-08-29 匀熵智能科技(无锡)有限公司 Automatic generation method and device for image captions and storage medium
CN116665012B (en) * 2023-06-09 2024-02-09 匀熵智能科技(无锡)有限公司 Automatic generation method and device for image captions and storage medium
CN116912629A (en) * 2023-09-04 2023-10-20 小舟科技有限公司 General image text description generation method and related device based on multi-task learning
CN116912629B (en) * 2023-09-04 2023-12-29 小舟科技有限公司 General image text description generation method and related device based on multi-task learning

Also Published As

Publication number Publication date
CN111144410B (en) 2023-08-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant