CN111144410A - Cross-modal image semantic extraction method, system, device and medium - Google Patents

Cross-modal image semantic extraction method, system, device and medium

Info

Publication number
CN111144410A
CN111144410A (application CN201911368306.1A)
Authority
CN
China
Prior art keywords
image
semantic
model
extracted
attention vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911368306.1A
Other languages
Chinese (zh)
Other versions
CN111144410B (en)
Inventor
杨振宇
刘侨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN201911368306.1A priority Critical patent/CN111144410B/en
Publication of CN111144410A publication Critical patent/CN111144410A/en
Application granted granted Critical
Publication of CN111144410B publication Critical patent/CN111144410B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal image semantic extraction method, system, device and medium. The method comprises the following steps: acquiring an image whose semantics are to be extracted and inputting it into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder connected with each other; the encoder extracts a semantic attention vector and a visual attention vector from the image; the decoder performs a weighted summation of the semantic attention vector and the visual attention vector to obtain a final attention vector; and the decoder processes the final attention vector to obtain a final caption.

Description

Cross-modal image semantic extraction method, system, device and medium
Technical Field
The present disclosure relates to the field of image semantic extraction technologies, and in particular, to a cross-modality image semantic extraction method, system, device, and medium.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
The image captioning task requires a computer to accurately recognize the information in an image and to express that information correctly in natural language. Image captioning is a cross-modal task, from image to text. It combines the two research fields of computer vision and natural language processing and therefore draws on knowledge from both. Image captioning has many applications: it can provide diagnostic assistance to non-medical professionals and junior doctors, and it can help visually impaired people understand the content of an image.
In the course of implementing the present disclosure, the inventors found that the following technical problems exist in the prior art:
early conventional methods fall into retrieval-based methods and template-based methods, which approach the image captioning task from different perspectives.
The retrieval-based method, as shown in fig. 1, is given a retrieval data set containing images and their corresponding descriptions. To generate a caption, the method first retrieves images similar to the image to be described from the retrieval data set and then collects the captions of those similar images. Finally, one of these captions, or a caption obtained by summarizing and recombining them, is used as the caption of the image to be described. The advantage of this method is that the generated captions are fluent and natural and contain no syntax errors.
The template-based method, as shown in fig. 2, first detects scenes, objects, object attributes and interactions between objects in an image by means of object detection, attribute classification and similar techniques, and then fills the words corresponding to this information into a preset template with fixed rules. The advantage of this method is that the generated captions closely fit the image information.
In recent years, the image captioning task has also advanced greatly, benefiting from the development of deep learning networks and high-performance computing devices. Meanwhile, the successful application of deep learning to machine translation has provided great inspiration for image captioning. The image captioning task can be understood as a special machine translation task: traditional machine translation translates one language (e.g., Chinese) into another (e.g., English), while image captioning translates an image into text. As shown in fig. 3, the convolutional neural network (CNN) has been very successful in image processing, and the long short-term memory network (LSTM) has also achieved very good results in natural language processing. These two deep neural networks were therefore introduced into image captioning: the convolutional neural network serves as an encoder that extracts and encodes the information in the image, and the long short-term memory network serves as a decoder that decodes the information provided by the encoder and generates the caption.
The basic encoder-decoder image captioning framework inputs the image information only at the initial moment of the decoding end, which easily causes the information to be forgotten. Inspired by machine translation, researchers introduced the attention mechanism: at each caption-generation step, a probability distribution over the image regions is calculated from the hidden state of the LSTM at the previous moment and the encoded image information. Since then, attention mechanisms have been continuously improved and applied to image captioning, and the performance of image captioning has improved accordingly.
An image contains both visual information and semantic information. The visual information is the spatial position information in the image, and the semantic information consists of semantic concepts such as the objects, attributes and relationships contained in the image. How to select semantic information and visual information effectively during caption generation therefore becomes an important issue. If too much attention is paid to the visual information during caption generation, the details of the image are captured well, but often only individual regions of the image are described, and the resulting caption tends to be one-sided. If too much attention is paid to the semantic information, the semantic concepts in the image are extracted well, but the spatial position information between some semantic concepts is easily ignored, which sometimes leads to erroneous descriptions.
Generated image captions, like any other sentences, inherently contain a syntactic structure. Existing models ignore this when generating captions, which results in captions with poor syntactic readability.
Disclosure of Invention
In order to solve the deficiencies of the prior art, the present disclosure provides a cross-modality image semantic extraction method, system, device, and medium;
in a first aspect, the present disclosure provides a cross-modality image semantic extraction method;
a cross-modal image semantic extraction method comprises the following steps:
acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
and the decoder processes the final attention vector to obtain a final subtitle.
In a second aspect, the present disclosure also provides a cross-modality image semantic extraction system;
a cross-modality image semantic extraction system, comprising:
an input module configured to: acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
a vector extraction module configured to: the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
a balancing module configured to: the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
a grammar optimization module configured to: and the decoder processes the final attention vector to obtain a final subtitle.
In a third aspect, the present disclosure also provides an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
Compared with the prior art, the present disclosure has the following beneficial effects:
aiming at the problems of image information selection and grammar readability, an image captioning framework based on an attention balancing mechanism and a syntax optimization module is designed. In this framework, the attention balancing mechanism balances semantic information and visual information so that the information in the image is selected effectively, and the syntax optimization module optimizes the grammar of the generated caption and improves its readability.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of a semantic extraction method based on a search method in the prior art;
FIG. 2 is a flow chart of a method for semantic extraction based on a template method in the prior art;
FIG. 3 is a flow chart of a semantic extraction method based on deep learning in the prior art;
FIG. 4 is a flowchart of a semantic extraction method for an ATT-B-SOM-based model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a double-layer LSTM structure according to an embodiment of the present disclosure;
FIG. 6(a)-FIG. 6(k) are schematic diagrams of the spatial attention visualization of the baseline model proposed in an embodiment of the present disclosure;
FIG. 7(a)-FIG. 7(o) are schematic diagrams of the spatial attention visualization of the model proposed in an embodiment of the present disclosure;
FIG. 8(a) shows captions generated by the Baseline model according to the first embodiment of the present disclosure;
FIG. 8(b) shows captions generated by the model according to the first embodiment of the present disclosure.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Embodiment one provides a cross-modal image semantic extraction method;
a cross-modal image semantic extraction method comprises the following steps:
acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
and the decoder processes the final attention vector to obtain a final subtitle.
Further, the encoder includes: a convolutional neural network model for image target extraction, a pre-trained VGGNet19, a pre-trained ResNet101 network structure, a semantic attention mechanism model and a visual attention mechanism model;
further, the decoder includes: the system comprises a balance unit, a first layer LSTM model and a second layer LSTM model.
The input end of the semantic attention mechanism model is respectively connected with the output end of a convolutional neural network model for image target extraction, the output end of a pre-trained VGGNet19 and the output end of a first layer of LSTM model; the output end of the semantic attention mechanism model is connected with the input end of the balancing unit;
the input end of the visual attention mechanism model is respectively connected with the output ends of the pre-trained ResNet101 network structure and the first layer LSTM model; the output end of the visual attention mechanism model is connected with the input end of the balancing unit;
the output end of the balancing unit is connected with the input end of the first LSTM model; the output end of the first LSTM model is connected with the input end of the second LSTM model; the input end of the first LSTM model is connected with the output end of the second LSTM model; the output of the second LSTM model is used to output the final subtitles.
Further, the trained semantic extraction model is trained on the MS COCO 2014 data set.
Further, after the step of inputting the image with the semantics to be extracted into the trained semantic extraction model, before the step of extracting the semantic attention vector and the visual attention vector from the image with the semantics to be extracted by the encoder, the method further comprises:
the encoder performs image target extraction on the image whose semantics are to be extracted to obtain the image targets; performs image topic extraction to obtain the image topic; and performs image visual feature extraction to obtain the image visual features.
Further, the encoder extracts a semantic attention vector from the image to be subjected to semantic extraction, including:
and inputting the image target, the image theme and the hidden state of the first layer of the LSTM model at the previous moment into a semantic attention mechanism model, and outputting a semantic attention vector.
Further, the encoder extracts a visual attention vector for an image of the to-be-extracted semantics, comprising:
and inputting the image visual characteristics and the hidden state of the first layer of the LSTM model at the previous moment into a visual attention mechanism model, and outputting a visual attention vector.
Further, the decoder processes the final attention vector to obtain a final subtitle; the method comprises the following steps:
inputting the final attention vector, the hidden state of the first layer of LSTM model at the previous moment and the word generated at the second layer of LSTM model at the previous moment into the first layer of LSTM model;
and inputting the hidden state of the first layer of LSTM model at the current moment and the hidden state of the second layer of LSTM model at the previous moment into the second layer of LSTM model, and outputting the final caption.
Further, at the initial time, the hidden state of the first layer LSTM at the previous time is the set value.
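To make the data flow of the two-layer decoder concrete, the following is a minimal sketch of a single decoding step, written in Python/PyTorch. It is illustrative only: all class names, variable names and dimension choices (embed_dim, hidden, att_dim), and the plain LSTMCell standing in for the second-layer ON-LSTM, are assumptions rather than code from the patent.

```python
import torch
import torch.nn as nn

class TwoLayerDecoderStep(nn.Module):
    """One decoding step: layer 1 consumes the balanced attention vector and the
    previous word; layer 2 consumes layer 1's current hidden state."""
    def __init__(self, embed_dim=512, hidden=512, att_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # First layer: conventional LSTM cell.  Its input at time t is the final
        # (balanced) attention vector plus the word emitted by layer 2 at t-1;
        # its own hidden state h1_{t-1} enters through the recurrent state.
        self.lstm1 = nn.LSTMCell(embed_dim + att_dim, hidden)
        # Second layer: stand-in for the ON-LSTM syntax-optimization unit.
        # It receives h1_t as input and carries h2_{t-1} in its recurrent state.
        self.lstm2 = nn.LSTMCell(hidden, hidden)
        self.fc_out = nn.Linear(hidden, vocab_size)

    def forward(self, prev_word, att_balanced, state1, state2):
        x1 = torch.cat([self.embed(prev_word), att_balanced], dim=1)
        h1, c1 = self.lstm1(x1, state1)      # hidden state of layer 1 at time t
        h2, c2 = self.lstm2(h1, state2)      # layer 2 uses h1(t) and its own h2(t-1)
        logits = self.fc_out(h2)             # distribution over the caption vocabulary
        return logits, (h1, c1), (h2, c2)
```

At the initial time step, state1 and state2 would be set to the preset values mentioned above (e.g. zeros or topic-conditioned initial states).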
Further, extracting an image target of the image with the semantic to be extracted through an encoder to obtain the image target; the method comprises the following specific steps:
and constructing a convolutional neural network model for image target extraction by adopting a multi-example learning weak supervision mode, and extracting the image target of the semantic to be extracted based on the convolutional neural network model for image target extraction to obtain the image target.
As will be appreciated, image target extraction: an image contains many semantic concepts: objects (e.g., cars, computers), attributes (e.g., wooden, black) and relationships (e.g., riding, lying). The present disclosure considers the object information an image contains to be its most important semantic concept, and therefore adds target information to the caption-generation process. A weakly supervised multiple-instance learning method is adopted to construct the target extraction model, and 568 target words are finally obtained.
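The patent states only that the target-extraction model is a CNN trained with weakly supervised multiple-instance learning. The sketch below shows one common way such a word detector can be set up, using noisy-OR pooling over image regions; the noisy-OR choice, the region dimension and all names are assumptions made for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

class MILWordDetector(nn.Module):
    """Weakly supervised target-word detector: each image region is an instance,
    and an image is positive for a word if any region triggers that word."""
    def __init__(self, region_dim=2048, num_words=568):
        super().__init__()
        self.word_scores = nn.Linear(region_dim, num_words)   # per-region word logits

    def forward(self, regions):                    # regions: (batch, m, region_dim)
        p_region = torch.sigmoid(self.word_scores(regions))   # (batch, m, num_words)
        # noisy-OR over regions: P(word in image) = 1 - prod_i (1 - p_i)
        p_image = 1.0 - torch.prod(1.0 - p_region, dim=1)
        return p_image                             # image-level word probabilities

# Training would use only image-level labels (words appearing in the reference
# captions), e.g. nn.BCELoss between p_image and a multi-hot target vector.
```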
Further, extracting image themes of the image with the semantic to be extracted to obtain image themes; the method comprises the following specific steps:
and (3) performing image theme extraction on the image to be subjected to semantic extraction by adopting pre-trained VGGNet19 to obtain an image theme.
Further, the pre-trained VGGNet19 is further trained on an image-topic pair data set.
Further, the image-topic pair data set is constructed by extracting topics from a known caption data set using latent Dirichlet allocation and then mapping the extracted topics to the corresponding images.
As will be appreciated, image topic extraction: each image has its own primary meaning, namely its topic. Adding the topic information of the image to the caption-generation process helps align the generated caption with the semantics of the original image. The present disclosure extracts topics from the caption data set using latent Dirichlet allocation (a document topic generation model), constructs an image-topic pair data set from the extracted topics and the images, and uses this data set to train a classifier. A pre-trained VGGNet19 is used as the classifier, and its network structure is adapted to the data set of the present disclosure by fine-tuning.
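A minimal sketch of this topic pipeline follows, assuming gensim for LDA and torchvision for VGGNet19. The function names, the dominant-topic heuristic and the choice of 80 topics as the classifier output are taken from the surrounding text; everything else is an illustrative assumption rather than the patent's code.

```python
from gensim import corpora, models
import torch.nn as nn
import torchvision

# 1) Topic extraction on the caption data set (captions pre-tokenized into word lists).
def build_lda(tokenized_captions, num_topics=80):
    dictionary = corpora.Dictionary(tokenized_captions)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_captions]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    return lda, dictionary

def dominant_topic(lda, dictionary, caption_tokens):
    bow = dictionary.doc2bow(caption_tokens)
    # pick the topic with the highest probability as the image's topic label
    return max(lda.get_document_topics(bow), key=lambda t: t[1])[0]

# 2) Fine-tune VGGNet19 as the image-topic classifier (80 classes).
def build_topic_classifier(num_topics=80):
    vgg = torchvision.models.vgg19(pretrained=True)
    vgg.classifier[6] = nn.Linear(4096, num_topics)   # replace the final FC layer
    return vgg
```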
Further, the image visual feature extraction is carried out on the image with the semantic to be extracted, and the specific steps of obtaining the image visual feature comprise:
and adopting a ResNet101 network structure pre-trained by an ImageNet data set to extract image visual features of the image with the semantic to be extracted, and acquiring the image visual features.
Further, using a ResNet101 network structure pre-trained on the ImageNet data set, carrying out image visual feature extraction on an image to be subjected to semantic extraction, wherein the feature extracted from the last convolutional layer before the full connection layer of the pre-trained ResNet101 network structure is used as an image visual feature.
It should be understood that the role of visual feature extraction is to extract visual information from an image. Given an image I, the visual extraction module extracts a set of visual feature vectors V = {v1, v2, v3, ..., vm} from the image to express its visual information, where each feature vector corresponds to one region of the image. These feature vectors provide the decoder with the visual information needed to generate the caption. The present disclosure uses a ResNet101 network structure pre-trained on the ImageNet data set to extract the visual features of the image. Specifically, the features extracted by the last convolutional layer before the fully connected layer are used as the visual information:
V = CNNs(I)    (1)
It should be understood that the present disclosure designs a two-layer LSTM structure at the decoding end (as shown in fig. 5), which includes an attention balancing unit and a syntax optimization unit. First, the semantic attention vector and the visual attention vector are calculated; the two attention vectors are then scored by the scoring function of the attention balancing unit, and finally the weighted attention is output through a gate mechanism.
Further, inputting the image target, the image theme and the hidden state of the first layer of the LSTM model at the previous moment into a semantic attention mechanism model, and outputting a semantic attention vector; the method comprises the following specific steps:
The target-topic attention is calculated using a fully connected neural network with Softmax. Its inputs are the hidden state of the first-layer LSTM at the previous moment, the topic of the current image and the target information obtained from the data set, and its output is the probability distribution α = {α_1, α_2, ..., α_n} over the target information.
The calculation formula is as follows:
e_{i,t} = w_e^T · tanh(W_eT·T ⊕ W_eO·O_i ⊕ W_eh·h^1_{t-1})    (2)
α_{i,t} = softmax(e_{i,t})    (3)
wherein W_eT, W_eO, W_eh and w_e are parameters to be learned, ⊕ represents the vector-matrix addition operation, T represents the topic word vector of the current image, O represents the target word vector, and h^1_{t-1} represents the hidden state of the first-layer LSTM at the previous moment; for convenience of description, the bias terms are omitted. Finally, the attention vector of the targets is obtained from the attention weights:
A_t = Σ_{i=1}^{n} α_{i,t}·O_i    (4)
Here, a fully connected neural network with Softmax means a fully connected neural network with a Softmax layer added at the end.
Further, inputting the visual characteristics of the image and the hidden state of the first layer of the LSTM model at the previous moment into a visual attention mechanism model, and outputting a visual attention vector; the method comprises the following specific steps:
Visual attention calculates the probability distribution β = {β_1, β_2, ..., β_m} over the image regions for the current moment; a fully connected neural network with Softmax is again selected to calculate the visual attention.
The calculation formula is as follows:
va_{i,t} = w_a^T · tanh(W_av·v_i ⊕ W_ah·h^1_{t-1})    (5)
β_{i,t} = softmax(va_{i,t})    (6)
wherein W_av, W_ah and w_a are parameters to be learned, and v_i represents the i-th region of the image.
Finally, the attention vector of the image regions is obtained:
V_t = Σ_{i=1}^{m} β_{i,t}·v_i    (7)
where m is the total number of regions in the image.
It should be understood that the visual attention mechanism model, again, is a fully connected neural network with Softmax.
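A companion sketch for equations (5)-(7): the same kind of fully connected + Softmax scorer, applied to the m region vectors v_i instead of the target words. Shapes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, att_dim=512):
        super().__init__()
        self.W_av = nn.Linear(feat_dim, att_dim, bias=False)   # region projection
        self.W_ah = nn.Linear(hidden,  att_dim, bias=False)    # hidden-state projection
        self.w_a  = nn.Linear(att_dim, 1, bias=False)          # scoring vector

    def forward(self, V, h1_prev):
        # V: (b, m, feat_dim) region features; h1_prev: (b, hidden)
        va = self.w_a(torch.tanh(self.W_av(V)
                                 + self.W_ah(h1_prev).unsqueeze(1))).squeeze(-1)  # (b, m), eq. (5)
        beta = torch.softmax(va, dim=1)                           # eq. (6)
        V_t = (beta.unsqueeze(-1) * V).sum(dim=1)                 # eq. (7)
        return V_t, beta
```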
Further, carrying out weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector; the method comprises the following specific steps:
the visual attention vector and the semantic attention vector are scored through a scoring function, and the specific process is represented as follows:
s_{A,t} = G(A_t) = tanh(W_GO·A_t)    (8)
s_{V,t} = G(V_t) = tanh(W_Gv·V_t)    (9)
φ_t = σ(s_{A,t} ⊕ s_{V,t})    (10)
B_t = φ_t·V_t + (1 − φ_t)·A_t    (11)
wherein φ_t ∈ [0,1] represents the relative importance of the visual attention vector and the semantic attention vector at the current moment, σ represents the sigmoid function, tanh() is the activation function, G is the scoring function used to assess the importance of each attention vector, and W_GO and W_Gv are parameters that need to be learned. B denotes the output after the semantic attention and the visual attention are balanced.
It should be understood that each moment of caption generation may focus on different information. When generating a word that expresses spatial position information in the image, the model needs to pay more attention to the visual information; in this case the weight of the visual attention should be greater than that of the semantic attention. When generating a word that expresses a semantic concept, the model should focus more on the semantic information; in this case the weight of the semantic attention should be greater than that of the visual attention. Therefore, in order for the model to select semantic information and visual information effectively during caption generation, the present disclosure proposes an attention balancing mechanism.
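The sketch below shows one plausible reading of the balancing unit: the two attention vectors are scored by a learned scoring function G and fused by a sigmoid gate φ_t into a single weighted sum B_t. The exact form of G is not recoverable from this text, so the gate construction, the common dimension of both attention vectors, and all names are assumptions.

```python
import torch
import torch.nn as nn

class AttentionBalance(nn.Module):
    """Scores semantic attention A_t and visual attention V_t, then gates them.
    Both vectors are assumed to have been projected to a common dimension."""
    def __init__(self, dim=512):
        super().__init__()
        self.W_GO = nn.Linear(dim, dim, bias=False)   # scores the semantic attention A_t
        self.W_Gv = nn.Linear(dim, dim, bias=False)   # scores the visual attention V_t
        self.gate = nn.Linear(2 * dim, 1)             # produces the scalar gate phi_t

    def forward(self, A_t, V_t):
        s_A = torch.tanh(self.W_GO(A_t))              # G(A_t)
        s_V = torch.tanh(self.W_Gv(V_t))              # G(V_t)
        phi = torch.sigmoid(self.gate(torch.cat([s_A, s_V], dim=1)))  # phi_t in [0, 1]
        B_t = phi * V_t + (1.0 - phi) * A_t           # balanced attention output
        return B_t, phi
```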
Further, the first-layer LSTM model refers to a conventional long short-term memory (LSTM) network.
Further, the second-layer LSTM model refers to an ordered-neuron long short-term memory network (ON-LSTM).
The structure of the ordered-neuron long short-term memory network (ON-LSTM) is as follows:
f_t = σ(W_f·x_t + U_f·h^2_{t-1} + b_f)    (12)
i_t = σ(W_i·x_t + U_i·h^2_{t-1} + b_i)    (13)
o_t = σ(W_o·x_t + U_o·h^2_{t-1} + b_o)    (14)
ĉ_t = tanh(W_c·x_t + U_c·h^2_{t-1} + b_c)    (15)
c_t = f_t ∘ c_{t-1} + i_t ∘ ĉ_t    (16)
h^2_t = o_t ∘ tanh(c_t)    (17)
wherein ∘ represents the multiplication of corresponding elements of the matrices; f_t, i_t and o_t respectively represent the forget gate, the input gate and the output gate of the ON-LSTM; ĉ_t and c_t respectively represent the candidate state and the cell state of the second-layer LSTM; and W_f, W_i, W_o, W_c, U_f, U_i, U_o and U_c are all parameters that need to be learned.
h^2_{t-1} denotes the hidden state of the second-layer LSTM at the previous moment. When the hierarchy level of the current input ĉ_t is higher than that of the historical information, c_t adopts the following updating mode:
c_t = f̃_t ∘ c_{t-1} + ĩ_t ∘ ĉ_t    (18)
wherein f̃_t and ĩ_t denote the master forget gate and the master input gate of the ON-LSTM, d_f represents the hierarchy level of the historical information c_{t-1}, and d_i represents the hierarchy level of the current input ĉ_t. d_f > d_i means that the historical information and the currently input information do not intersect; in that case the segment [d_i : d_f] is empty, and the part of c_t that does not need to be updated is set to 0. In this way, the present disclosure learns the hierarchical information in the caption sequence without supervision, so as to optimize the caption grammar.
Text sequence information contains a latent hierarchical structure: the higher the level of a piece of information, the coarser its granularity and the larger its span in the sentence. The ordered-neuron LSTM identifies and learns this hierarchical information from the sequence by updating between partitions and softening within segments.
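For concreteness, the following is a sketch of a single ON-LSTM cell, the structure used for the second decoder layer. The master gates implement the "update between partitions / soften within segments" behaviour via a cumulative softmax ("cumax") operator; the chunk-free formulation, layer sizes and names are simplifications and assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

def cumax(x, dim=-1):
    # cumulative softmax: a soft version of a binary gate that switches on at some level
    return torch.cumsum(torch.softmax(x, dim=dim), dim=dim)

class ONLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # 4 ordinary gates + 2 master gates, all computed from [x_t, h_{t-1}]
        self.linear = nn.Linear(input_size + hidden_size, 6 * hidden_size)

    def forward(self, x, state):
        h_prev, c_prev = state
        g = self.linear(torch.cat([x, h_prev], dim=1))
        f, i, o, c_hat, mf, mi = g.chunk(6, dim=1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c_hat = torch.tanh(c_hat)
        f_master = cumax(mf)            # master forget gate: protects high-level info
        i_master = 1.0 - cumax(mi)      # master input gate: writes low-level info
        omega = f_master * i_master     # overlap of the two partitions
        f_gate = f * omega + (f_master - omega)
        i_gate = i * omega + (i_master - omega)
        c = f_gate * c_prev + i_gate * c_hat   # cell update respects the learned hierarchy
        h = o * torch.tanh(c)
        return h, c
```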
The present disclosure was trained and tested on the MS COCO2014 data set. Specific information of the data set is shown in table 1:
TABLE 1 Data set
MS COCO 2014 (Karpathy split): 113287 training images, 5000 validation images, 5000 test images
In the formal experiments, the data set was partitioned according to the Karpathy (2015) split for comparison with other models: 113287 images are used for training, 5000 for validation and 5000 for testing. For each image, only 5 caption annotations are used for training. The vocabulary is constructed from the caption data set, and each word in the vocabulary occurs at least 5 times in the caption data set. LDA is used to obtain topics on the given caption data set, finally yielding 80 topics; the images are then classified over these 80 topics to construct the image-topic pair data set. The target concept data set is constructed through the multiple-instance weakly supervised method, and 568 target words are finally obtained.
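As a concrete illustration of the vocabulary rule above (minimum frequency of 5), the helper below builds such a vocabulary; the special tokens are assumptions, not specified by the patent.

```python
from collections import Counter

def build_vocab(tokenized_captions, min_freq=5):
    """Keep only words that occur at least min_freq times in the caption data set."""
    counts = Counter(w for caption in tokenized_captions for w in caption)
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}   # assumed special tokens
    for word, count in counts.items():
        if count >= min_freq:
            vocab[word] = len(vocab)
    return vocab
```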
The evaluation criteria used in this disclosure are: BLEU-1 and BLEU-4, METEOR, ROUGE, SPICE, and CIDEr, all computed with the COCO evaluation tool. Some of these metrics come from the field of image captioning, and some come from machine translation, text summarization and other fields. Specifically, BLEU-4 and METEOR are commonly used in machine translation tasks and compute scores over n-grams; ROUGE comes from text summarization; and SPICE and CIDEr are designed specifically for evaluating image captions. As with accuracy, higher scores are better for all of these criteria.
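For reference, the sketch below shows how such scores are commonly computed, assuming the pycocoevalcap packaging of the COCO caption evaluation tool mentioned above; the wrapper function and data layout are illustrative, and SPICE/METEOR additionally require a Java runtime.

```python
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

def evaluate(gts, res):
    # gts: {img_id: [{"caption": str}, ...]}  (5 references per image)
    # res: {img_id: [{"caption": str}]}       (one generated caption per image)
    tokenizer = PTBTokenizer()
    gts, res = tokenizer.tokenize(gts), tokenizer.tokenize(res)
    scorers = [(Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
               (Meteor(), "METEOR"), (Rouge(), "ROUGE-L"),
               (Cider(), "CIDEr"), (Spice(), "SPICE")]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(gts, res)
        if isinstance(name, list):
            results.update(dict(zip(name, score)))   # Bleu returns one score per order
        else:
            results[name] = score
    return results
```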
To demonstrate the effectiveness of the disclosed model, the present disclosure compares its experimental results with the following baseline models:
① Hard-Attention (Hard-ATT) first applied the attention mechanism to the image captioning task, proposing visual attention and dividing it into "hard" attention and "soft" attention.
② Semantic Attention (ATT-FCN) develops a semantic attention for image captioning tasks, divided into input attention and output attention according to the state.
③ Adaptive Attention (AdaATT) designed an adaptive attention with a "sentinel" mechanism that can decide whether to use historical information or visual information when a word is generated.
④ simNet is an integrated network that integrates a topic attention mechanism and a visual attention mechanism in the generation of captions.
⑤ CNN-Model proposes to use a convolutional neural network instead of a long short-term memory network as the decoder.
⑥ ATT-B-SOM is the model proposed by the present disclosure.
At the same time, the present disclosure disassembles the model into two models that use visual attention and semantic attention, respectively.
Experimental setup
For the encoder portion:
① The ResNet101 model provided by Torchvision and pre-trained on ImageNet is used to extract the visual features V from the 2048-dimensional last convolutional layer of the ResNet101 structure, ultimately yielding a 2048×14×14 feature map.
② The VGGNet19 network structure is trained on the image-topic pair data set, with the fully connected layer of VGGNet19 fine-tuned to 80 outputs (80 topics are extracted from the caption data set).
③ The target extraction model finally yields 568 target words.
For the decoding end:
① In the two-layer LSTM structure, the first-layer LSTM model is implemented by a conventional LSTM with a hidden state size of 512. The topic of the image is used as the initial input of the first-layer LSTM, which gives the LSTM a preview of the image content.
② the second layer LSTM model is implemented by ON-LSTM, which can recognize and learn the sequence hierarchy unsupervised, and is the same size as the first layer LSTM model.
Parameter settings: the learning rate of the encoder is 5e-4, the learning rate of the language model is 1e-5, the momentum is 0.8, and the weight decay is 0.999. A cross-entropy loss function is used to compute the loss of the model.
The present disclosure sets the batch size to 80 and the iteration cycle to 50.
ResNet is pre-trained on a classification task and therefore cannot fully adapt to the image captioning task of the present disclosure. Therefore, after 30 iterations, the model is fine-tuned so that the network adapts better to the image captioning task. The model of the present disclosure runs on a 16GB Tesla V100-PCIE GPU.
Table 2 Experimental results

Model (COCO data set)   BLEU-1  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
Hard-ATT                71.8    25.0    23.04   -        -      -
ATT-FCN                 70.9    30.4    24.3    -        -      -
Ada-ATT                 74.8    33.6    26.4    55.0     104.2  -
CNN-Model               71.1    28.1    28.7    52.2     91.2   17.5
Sim-Net                 -       33.2    28.3    56.4     113.5  22.0
Ours (Sem-ATT)          72.5    30.4    26.3    54.9     112.3  19.6
Ours (Spa-ATT)          73.0    31.3    27.2    55.7     113.4  20.0
Ours (ATT-B-SOM)        75.3    32.9    28.0    56.2     115.5  23.2
The tests were carried out with the above settings, and the results are shown in Table 2. From the table it can be seen that, compared with the other models, the ATT-B-SOM model of the present disclosure achieves the best results on BLEU-1, CIDEr and SPICE, reaching 75.3, 115.5 and 23.2, respectively.
The Sem-ATT Model and Spa-ATT Model of the present disclosure are also improved over the Hard-ATT, ATT-FCN, and CNN-Model models. It is noted that the convolutional neural networks used by the different models are also different for the extraction of image features.
Hard-ATT uses the VggNet and AlexNet models, ATT-FCN uses GoogleNet, CNN-Model uses VggNet, and Ada-ATT and Sim-Net use the ResNet152 network structure. The model of the present disclosure uses the ResNet101 network structure.
First, the visualization shows that the model of the present disclosure focuses on the area associated with the current word when generating each word. For example, when generating "boat", the white area shows that the current point of attention is on the hull; by contrast, when the Baseline model generates "boat", its description of the hull is blurred. Because semantic attention is considered in the model of the present disclosure, its attention area for the hull is more accurate. Second, the ATT-B-SOM model of the present disclosure is able to capture more detail from the image, so the captions it generates are more refined. As can be seen from fig. 7(a)-7(o), the model of the present disclosure captures the coat-color information of the dog, which is described by the word "brown"; this detail is absent in the baseline model. This example demonstrates that the attention balancing mechanism of the present disclosure benefits the image captioning task. Fig. 6(a)-6(k) are schematic diagrams of the spatial attention visualization of the baseline model proposed in an embodiment of the present disclosure.
The model designed by the present disclosure can identify and learn the hierarchical structure of the sequence information, so it can generate captions whose syntactic structure is more readable. To demonstrate this, three images and their generated captions are compared with the baseline model. In fig. 8(a), the Baseline model not only describes the relationship between "person" and "skateboard" incorrectly, but the syntactic structure of its caption is also less readable than that of the model of the present disclosure. The "on ... on ..." structure produced by the Baseline model hardly appears in everyday expression, and such a grammatical structure obviously reads poorly. Compared with the Baseline model, the captions generated by the model of the present disclosure are more readable and better match human logical expression. In fig. 8(b), it can be seen that the model of the present disclosure captures more information than the Baseline model and generates a more detailed caption: it captures "tree" as the background together with its spatial position information, as well as "banana". This is more helpful for understanding the image. Some images can be described by simple captions that do not require complex syntactic structures; for such cases the model of the present disclosure performs similarly to other models. However, when images that require complex caption descriptions appear, the model of the present disclosure shows its advantage.
The second embodiment further provides a cross-modality image semantic extraction system;
a cross-modality image semantic extraction system, comprising:
an input module configured to: acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
a vector extraction module configured to: the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
a balancing module configured to: the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
a grammar optimization module configured to: and the decoder processes the final attention vector to obtain a final subtitle.
To face these challenges, the present disclosure proposes a model named ATT-B-SOM based on the encoder-decoder architecture. The present disclosure presents a specific model diagram (as shown in fig. 4) to facilitate an intuitive view of the composition of the model. The encoder consists of a visual extraction module, an image theme extraction module and an image target extraction module, and the decoder consists of a balance module and a grammar optimization module. The encoder is used for extracting and encoding visual information and semantic information in the image.
In a third embodiment, the present embodiment further provides an electronic device, which includes a memory, a processor, and computer instructions stored in the memory and executed on the processor, where the computer instructions, when executed by the processor, implement the steps of the method in the first embodiment.
In a fourth embodiment, the present embodiment further provides a computer-readable storage medium for storing computer instructions, and the computer instructions, when executed by a processor, perform the steps of the method in the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A cross-modal image semantic extraction method is characterized by comprising the following steps:
acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
and the decoder processes the final attention vector to obtain a final subtitle.
2. The method of claim 1, wherein the encoder comprises: a convolutional neural network model for image target extraction, a pre-trained VGGNet19, a pre-trained ResNet101 network structure, a semantic attention mechanism model and a visual attention mechanism model;
the decoder, comprising: the system comprises a balance unit, a first layer of LSTM model and a second layer of LSTM model;
the input end of the semantic attention mechanism model is respectively connected with the output end of a convolutional neural network model for image target extraction, the output end of a pre-trained VGGNet19 and the output end of a first layer of LSTM model; the output end of the semantic attention mechanism model is connected with the input end of the balancing unit;
the input end of the visual attention mechanism model is respectively connected with the output ends of the pre-trained ResNet101 network structure and the first layer LSTM model; the output end of the visual attention mechanism model is connected with the input end of the balancing unit;
the output end of the balancing unit is connected with the input end of the first LSTM model; the output end of the first LSTM model is connected with the input end of the second LSTM model; the input end of the first LSTM model is connected with the output end of the second LSTM model; the output of the second LSTM model is used to output the final subtitles.
3. The method as claimed in claim 1, wherein after the step of inputting the image to be extracted with semantics into the trained semantic extraction model, before the step of the encoder extracting the semantic attention vector and the visual attention vector from the image to be extracted with semantics, the method further comprises:
extracting an image target of an image with semantic to be extracted through an encoder to obtain the image target; extracting image themes of the image with the semantic to be extracted to obtain image themes; and extracting image visual features of the image with the semantic to be extracted to obtain the image visual features.
4. The method of claim 1, wherein the encoder extracts a semantic attention vector for the image to be semantically extracted, comprising:
inputting an image target, an image theme and a hidden state of a first layer of LSTM model at the previous moment into a semantic attention mechanism model, and outputting a semantic attention vector;
or,
the encoder extracts a visual attention vector from an image of a semantic to be extracted, and comprises the following steps:
and inputting the image visual characteristics and the hidden state of the first layer of the LSTM model at the previous moment into a visual attention mechanism model, and outputting a visual attention vector.
5. The method of claim 1, wherein the decoder processes the final attention vector to obtain a final subtitle; the method comprises the following steps:
inputting the final attention vector, the hidden state of the first layer of LSTM model at the previous moment and the word generated at the second layer of LSTM model at the previous moment into the first layer of LSTM model;
and inputting the hidden state of the first layer of LSTM model at the current moment and the hidden state of the second layer of LSTM model at the previous moment into the second layer of LSTM model, and outputting the final caption.
6. The method as claimed in claim 3, wherein the image object is obtained by the encoder extracting the image object of the image whose semantic meaning is to be extracted; the method comprises the following specific steps:
and constructing a convolutional neural network model for image target extraction by adopting a multi-example learning weak supervision mode, and extracting the image target of the semantic to be extracted based on the convolutional neural network model for image target extraction to obtain the image target.
7. The method as claimed in claim 3, wherein the image subject extraction is performed on the image of the semantic to be extracted to obtain the image subject; the method comprises the following specific steps:
extracting image themes of the image with the semantic to be extracted by adopting pre-trained VGGNet19 to obtain the image themes;
or,
the method comprises the following steps of extracting visual features of an image with semantics to be extracted, and acquiring the visual features of the image:
and adopting a ResNet101 network structure pre-trained by an ImageNet data set to extract image visual features of the image with the semantic to be extracted, and acquiring the image visual features.
8. A cross-modal image semantic extraction system, comprising:
an input module configured to: acquiring an image of semantics to be extracted, and inputting the image of the semantics to be extracted into a trained semantic extraction model, wherein the trained semantic extraction model comprises an encoder and a decoder which are connected with each other;
a vector extraction module configured to: the encoder extracts a semantic attention vector and a visual attention vector from an image of a semantic to be extracted;
a balancing module configured to: the decoder performs weighted summation on the semantic attention vector and the visual attention vector to obtain a final attention vector;
a grammar optimization module configured to: and the decoder processes the final attention vector to obtain a final subtitle.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
CN201911368306.1A 2019-12-26 2019-12-26 Cross-modal image semantic extraction method, system, equipment and medium Active CN111144410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911368306.1A CN111144410B (en) 2019-12-26 2019-12-26 Cross-modal image semantic extraction method, system, equipment and medium


Publications (2)

Publication Number Publication Date
CN111144410A true CN111144410A (en) 2020-05-12
CN111144410B CN111144410B (en) 2023-08-04

Family

ID=70520499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911368306.1A Active CN111144410B (en) 2019-12-26 2019-12-26 Cross-modal image semantic extraction method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN111144410B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102810158A (en) * 2011-05-31 2012-12-05 中国科学院电子学研究所 High-resolution remote sensing target extraction method based on multi-scale semantic model
CN107563498A (en) * 2017-09-08 2018-01-09 中国石油大学(华东) View-based access control model is combined the Image Description Methods and system of strategy with semantic notice
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN109559799A (en) * 2018-10-12 2019-04-02 华南理工大学 The construction method and the model of medical image semantic description method, descriptive model
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110414012A (en) * 2019-07-29 2019-11-05 腾讯科技(深圳)有限公司 A kind of encoder construction method and relevant device based on artificial intelligence

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
K. XU, J. L. BA, R. KIROS, K. CHO: "Show attend and tell: Neural image caption generation with visual attention" *
王兵 et al.: "Image subject region extraction and its application in image retrieval" *
金汉均; 段贝贝: "Research on cross-media retrieval based on deep visual feature regularization" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112862727A (en) * 2021-03-16 2021-05-28 上海壁仞智能科技有限公司 Cross-mode image conversion method and device
CN113569932A (en) * 2021-07-18 2021-10-29 湖北工业大学 Image description generation method based on text hierarchical structure
CN113569932B (en) * 2021-07-18 2023-07-18 湖北工业大学 Image description generation method based on text hierarchical structure
CN116665012A (en) * 2023-06-09 2023-08-29 匀熵智能科技(无锡)有限公司 Automatic generation method and device for image captions and storage medium
CN116665012B (en) * 2023-06-09 2024-02-09 匀熵智能科技(无锡)有限公司 Automatic generation method and device for image captions and storage medium
CN116912629A (en) * 2023-09-04 2023-10-20 小舟科技有限公司 General image text description generation method and related device based on multi-task learning
CN116912629B (en) * 2023-09-04 2023-12-29 小舟科技有限公司 General image text description generation method and related device based on multi-task learning

Also Published As

Publication number Publication date
CN111144410B (en) 2023-08-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant