WO2021223323A1 - 一种中文视觉词汇表构建的图像内容自动描述方法 - Google Patents

一种中文视觉词汇表构建的图像内容自动描述方法 Download PDF

Info

Publication number
WO2021223323A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
chinese
network
visual vocabulary
vocabulary
Prior art date
Application number
PCT/CN2020/102234
Other languages
English (en)
French (fr)
Inventor
张凯
周建设
刘杰
吕学强
Original Assignee
首都师范大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 首都师范大学 filed Critical 首都师范大学
Publication of WO2021223323A1 publication Critical patent/WO2021223323A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Definitions

  • The present invention relates to image semantic understanding technology and specifically provides an automatic image content description method based on the construction of a multi-channel Chinese visual vocabulary.
  • Image semantic understanding combines the two research directions of computer vision and natural language processing; it is a current research hotspot in the field of artificial intelligence and an effective way to reduce the semantic gap between the low-level features and the high-level semantics of images.
  • Image semantic understanding technology provides machines with the ability to process multi-modal data. Its core is to combine knowledge from computer vision and natural language processing to analyze and understand the content of an image and to return the result in the form of textual semantic information.
  • Among the open problems, the automatic generation of Chinese sentences that convey an understanding of image content is one of the key areas for a breakthrough.
  • On the one hand, an object detection network can usually detect only part of the objects in an image and can provide only object nouns; it cannot provide key information such as the attributes and actions associated with the objects, which makes automatic sentence generation difficult. On the other hand, it is also crucial to mine the usable information in the image description text: the description text is processed by word segmentation, part-of-speech tagging and similar steps to obtain image annotation information, which is then formed into a Chinese visual vocabulary.
  • The vocabulary includes not only nouns but also predicates, adjectives and so on, so the information it carries is richer. This allows more semantic information to be obtained during model training and can therefore be applied more effectively to the automatic description of image content.
  • An automatic image content description method based on the construction of a Chinese visual vocabulary includes the following steps, performed in order (an illustrative sketch of step a follows this list):
  • Step a: Use a Chinese word segmentation tool to segment the several description sentences corresponding to a single picture, selectively retain the nouns, verbs and adjectives in the word list according to their statistical frequency, and form the retained words into the Chinese visual vocabulary.
  • Step b: Predict the Chinese visual vocabulary for an image with the Chinese vocabulary prediction network to obtain the image annotation information.
  • Step c: In the automatic image description model, use an encoder to extract image convolution features, and then use a decoder that takes the image convolution features as its initial input to decode them into a Chinese description sentence.
  • Step d: After step c, optimize the loss function of the description generation network with a model based on label-information matching.
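  • The following is a minimal, illustrative sketch of step a. It assumes Python with jieba's part-of-speech segmentation as a stand-in for the Boson segmenter named in the embodiment (whose API is not specified here); the function names, POS prefixes and frequency threshold are illustrative assumptions, not part of the disclosure.

```python
from collections import Counter
import jieba.posseg as pseg  # stand-in segmenter; the embodiment uses Boson

KEPT_POS_PREFIXES = ("n", "v", "a")  # noun, verb and adjective tag families

def build_visual_vocabulary(captions_per_image, min_freq=20):
    """captions_per_image: one list of Chinese caption strings per image."""
    freq = Counter()
    for captions in captions_per_image:
        for sentence in captions:
            for tok in pseg.cut(sentence):
                if tok.flag.startswith(KEPT_POS_PREFIXES):
                    freq[tok.word] += 1
    # keep words whose overall frequency exceeds the threshold
    vocab = sorted(w for w, c in freq.items() if c > min_freq)
    return vocab, {w: i for i, w in enumerate(vocab)}

def label_vector(captions, word_to_id):
    """Multi-hot annotation vector for one image (supervision for step b)."""
    labels = [0] * len(word_to_id)
    for sentence in captions:
        for tok in pseg.cut(sentence):
            if tok.word in word_to_id:
                labels[word_to_id[tok.word]] = 1
    return labels
```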
  • The Chinese vocabulary prediction network is composed of two parts: a feature extraction network based on a convolutional neural network and a feature classification network.
  • The feature extraction network takes the average-pooling output of a network pre-trained for the label vocabulary dataset as the image feature; this feature is fed to the input layer of the feature classification network, and the output layer of the feature classification network outputs the Chinese label information predicted for the image.
  • When the automatic image description model makes a prediction, the decoder first accepts the image convolution feature and ignores the output at this moment. Then, after a start symbol <Start> and the predicted label feature are input, the decoder outputs a vector composed of the predicted probabilities of the words in the vocabulary, and the word with the highest probability in this vector is taken as the output at the current moment. This word and the predicted label feature are then used as the input at the next moment, and prediction continues until the end symbol <End> is predicted.
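  • The following is a minimal sketch of the prediction procedure above. It assumes a decoder_step(x, l, state) function implementing one decoder step and returning (word probabilities, new state), a word-embedding matrix embed, and a vocab list; all of these names are illustrative assumptions.

```python
import numpy as np

def generate_caption(image_feature, label_feature, decoder_step, embed, vocab,
                     start_id, end_id, max_len=20):
    # first step: feed the image convolution feature and ignore the output
    _, state = decoder_step(image_feature, label_feature, None)

    word_id = start_id            # then feed the <Start> symbol
    caption = []
    for _ in range(max_len):
        probs, state = decoder_step(embed[word_id], label_feature, state)
        word_id = int(np.argmax(probs))   # greedy: highest-probability word
        if word_id == end_id:             # stop once <End> is predicted
            break
        caption.append(vocab[word_id])
    return caption
```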
  • The encoder includes one or more of convolution, down-sampling and activation operations.
  • The feature classification network is a three-layer fully connected network based on residual connections; it includes an input layer, three hidden layers, two residual connections and an output layer.
  • The residual connections are added between the first and third fully connected layers and between the second and third fully connected layers.
  • The decoder can receive the image's predicted label features and use them to guide the generation of the Chinese description sentence.
  • The model optimization based on label-information matching in step d consists of computing the distance between the image vocabulary feature and the cell state of the decoder at the final moment, adding this distance to the loss function as an additional term, and minimizing the distance between the label feature and the cell state as far as possible during model training.
  • The distance is computed with the Manhattan distance or the Euclidean distance.
  • The present invention provides an automatic image content description method based on the construction of a Chinese visual vocabulary; specifically, it uses an automatic image description generation model built on a Chinese visual vocabulary. First, a Chinese visual vocabulary prediction network is designed, and predicting the image vocabulary with this network yields the image annotation information.
  • Adding a residual structure to the Chinese visual vocabulary prediction network effectively alleviates the network degradation caused by deepening the prediction network.
  • An L-LSTM architecture introduces the features of the image's Chinese visual vocabulary into the description generation network. In addition, the loss function of the description generation network is optimized to shorten the distance between the image's Chinese visual vocabulary feature and the cell state of the L-LSTM, so that the generated image description sentences are closer to the image's Chinese visual vocabulary. Finally, the effectiveness of the model is verified in several ways.
  • Figure 1 is a schematic diagram of the overall architecture of the automatic image description model;
  • Figure 2 is the architecture diagram of the L-LSTM model.
  • The current methods for automatic image description generation fall into three categories: template-based methods, similarity-based retrieval methods, and deep-learning-based methods.
  • Template-based methods detect the objects in an image and their attributes through object recognition and then embed this information into pre-designed sentence templates; the resulting sentences are constrained by the templates and tend to be rigid and inflexible.
  • Similarity-based retrieval methods retrieve with the similarity of traditional visual features of the image and take the description text of highly similar images as candidate answers, or map image features and text features into the same feature space and retrieve from it the text most similar to the image to be described as the candidate result.
  • Ordonez et al. proposed retrieving with the global features of an image in a database of a million images and using the description of the most similar image as the description text of the image to be described.
  • Gong et al. used Canonical Correlation Analysis (CCA) to map images and texts into the same feature space, establish correspondences, and retrieve from the database the text most similar to the image.
  • Hodosh et al. proposed the Kernel Canonical Correlation Analysis (KCCA) method to learn a common feature space for the image and text modalities, using a kernel function to map the original features to high-dimensional features and the K-nearest-neighbors method to retrieve. Such methods cannot generate sentences purely from the content of the image, nor can they produce description sentences that do not exist in the database.
  • With the rise of deep learning, deep-learning-based image description methods were proposed. Mao et al. proposed the multimodal recurrent neural network (m-RNN), which encodes the image with a convolutional neural network, extracts image convolution features, and feeds this feature into the multimodal recurrent network at every moment for decoding into description words.
  • Vinyals et al. proposed the Neural Image Caption (NIC) image description generation model based on a convolutional neural network and a long short-term memory network (LSTM). Unlike m-RNN, the NIC model uses an LSTM to build the language model that generates the description sentence and inputs the image convolution features extracted by the CNN into the LSTM only at the initial moment rather than at every moment, which achieved good results. Researchers subsequently improved the NIC model, and the quality of the generated description text improved as well.
  • Xu et al. first introduced two attention mechanisms into the model, Soft-Attention and Hard-Attention, so that the model can capture local information in the image: throughout the generation of the description sentence, the low-level feature maps produced by the CNN are input to the attention mechanism, which selects certain feature maps as visual information for the LSTM.
  • After each description word is generated, the model refocuses on some feature maps of the image and feeds the new visual information into the LSTM.
  • Lu et al. proposed an image description model based on spatial attention, through which the model can decide on its own whether to use image information or language-model information.
  • Jia et al. used semantic information to guide the LSTM in generating descriptions; similar to similarity-based retrieval, the method first retrieves in the image-text vector space the description text features most similar to the image features and then inputs them into the LSTM as guidance information to generate the description sentence.
  • Tang Pengjie et al. used transfer learning to train a scene classification network that captures the scene prior of an image and an object classification network that captures its object-category prior, and then fused the scene prior and the object-category prior into the model to jointly generate the image's description sentence and improve the quality of sentence generation.
  • Liu Chang et al. changed the decoder structure by adding a stacked hidden layer and an ordinary hidden layer to the decoder, which improved the learning ability of the language model.
  • Liu Zeyu and Lan Weiyu studied Chinese image description and also optimized on the basis of the NIC model.
  • Liu Zeyu et al. proposed a method for generating Chinese image abstracts based on multimodal neural networks, introducing a multi-label keyword feature prediction network into the encoder-decoder model: the keyword feature prediction network first extracts the image keyword features, which are then fed into the LSTM in different ways for decoding.
  • Lan Weiyu used a deep model to predict image labels and used the labels to re-rank the decoder results, which improved the quality of sentence generation.
  • The neural network method proposed by Lu et al. can automatically generate a sentence template and bind the empty slots in the template to the objects in the picture; when generating each word, the model decides whether to choose a textual word or a visual word.
  • The invention uses the construction of a Chinese visual vocabulary to realize automatic description of image content; the method uses the Flickr8kc and Flickr30kc Chinese image description datasets. The specific implementation process is described with reference to Figures 1-2.
  • In the image description datasets, each picture corresponds to five description sentences, and each sentence vividly describes the content of the image.
  • The method of the present invention selects the nouns, verbs and adjectives in the sentences as the image content labels to be predicted. More specifically, the Chinese word segmentation tool Boson is first used to segment the description sentences; the nouns, verbs and adjectives in the word list are selectively retained according to their statistical frequency; the retained words form the label vocabulary; and each picture is annotated with label information according to this vocabulary, which yields the training data of the label prediction network.
  • The image label prediction network used in this method consists of two parts: a CNN-based feature extraction network and a feature classification network.
  • The feature extraction network uses the ResNet-152 network pre-trained on the ImageNet dataset; ResNet-152 was the champion model of the ImageNet 2015 image classification competition.
  • The average-pooling output of the ResNet-152 network is used as the image feature for the subsequent feature classification network.
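  • The following is a minimal sketch of this feature extraction step, taking the average-pooling output of a pre-trained ResNet-152 as a 2048-dimensional image feature. torchvision is used here only for brevity and is an assumption; the embodiment described below uses TensorFlow 1.6.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()   # drop the classifier, keep the avg-pool output
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_feature(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return resnet(x).squeeze(0)   # 2048-d average-pooling feature
```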
  • The feature classification network is a three-layer fully connected network based on residual connections; it includes an input layer, three hidden layers, two residual connections and an output layer.
  • The input layer receives the image features extracted by ResNet-152, and the output layer predicts the label information corresponding to the image. Since each image contains more than one label, this is a multi-label classification problem, so the activation function of the output layer is set to the Sigmoid function.
  • A traditional feature classification network in deep learning is just a single-layer fully connected network. As the depth of the feature classification network increases, the expressive power of the model grows, but training becomes difficult and network degradation appears, that is, the accuracy of the model decreases as its depth increases.
  • Inspired by the residual structure of ResNet, residual connections are added between the first and third fully connected layers and between the second and third fully connected layers; this does not increase the complexity of the model and improves its accuracy, as sketched below.
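  • The following is a minimal sketch of the feature classification network with the two residual connections described above, assuming all three hidden layers share one width so the residual additions are well defined; the layer sizes and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LabelClassifier(nn.Module):
    def __init__(self, num_labels, feat_dim=2048, hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, num_labels)
        self.relu = nn.ReLU()

    def forward(self, x):
        h1 = self.relu(self.fc1(x))
        h2 = self.relu(self.fc2(h1))
        # residual connections: layer 1 -> layer 3 and layer 2 -> layer 3
        h3 = self.relu(self.fc3(h2) + h1 + h2)
        return torch.sigmoid(self.out(h3))   # multi-label probabilities

# Training would use a multi-label loss such as nn.BCELoss() against the
# multi-hot annotation vectors built from the Chinese visual vocabulary.
```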
  • The Chinese description generation model proposed in the present invention is composed of two parts: a convolutional neural network (CNN) and a long short-term memory network L-LSTM that fuses label information. The architecture uses the CNN as an encoder to extract image convolution features, and then uses the L-LSTM as a decoder that takes the image convolution features as its initial input and decodes them into the target description sentence.
  • When the model makes a prediction, the L-LSTM first accepts the image convolution feature and ignores the output at this moment; then, after a start symbol <Start> and the predicted label feature are input, the L-LSTM outputs a vector composed of the predicted probabilities of the words in the vocabulary, and the word with the highest probability is selected from this vector as the output at the current moment; this word and the predicted label feature are then used as the input at the next moment, and prediction continues until the end symbol <End> is predicted.
  • The overall structure is shown in Figure 1.
  • The encoder CNN in the automatic image description model is a neural network for processing grid-structured data.
  • The CNN model consists of a series of transformation modules such as convolution, activation and down-sampling. When a deep CNN model is used to extract image features, the image data undergoes repeated convolution, down-sampling and activation operations; the extracted features are more abstract and more expressive, and such models have achieved remarkable results on visual tasks such as image classification and recognition, object detection and scene understanding.
  • The decoder L-LSTM in the automatic image description model is a long short-term memory network, proposed herein, that can fuse label information, as shown in Figure 2.
  • The L-LSTM is the same as a standard LSTM in that information is added to or removed from the cell state c through different gate structures: the forget gate decides which semantic information from the cell state c_{t-1} of the previous moment is retained or discarded; the input gate decides which semantic information is written into the cell state c_t at this moment; and the output gate decides which semantic information is output from the cell state c_t at this moment.
  • The difference is that the L-LSTM can receive the image's predicted label feature l and use it to guide the generation of the description sentence.
  • The specific formulas are as follows:

    f_t = σ_g(W_f x_t + U_f h_{t-1} + V_f l + b_f)    (1)
    i_t = σ_g(W_i x_t + U_i h_{t-1} + V_i l + b_i)    (2)
    o_t = σ_g(W_o x_t + U_o h_{t-1} + V_o l + b_o)    (3)
    g_t = σ_h(W_c x_t + U_c h_{t-1} + V_c l + b_c)    (4)
    c_t = f_t * c_{t-1} + i_t * g_t                   (5)
    h_t = o_t * c_t                                   (6)

  • Here W, U, V and b denote the weights and biases to be trained in the L-LSTM; x_t is the input vector of the L-LSTM; l is the predicted label feature; h_t is the hidden state; c_t is the cell state; f_t, i_t and o_t are the activation vectors of the forget gate, input gate and output gate, respectively; * denotes element-wise multiplication; the subscript t denotes time; σ_g denotes the Sigmoid function; and σ_h denotes the Tanh function.
  • During encoding and decoding, the CIC model must maximize the probability of generating the target description sentence for a given image, expressed by equation (7):

    θ* = arg max_θ Σ_{(I,Y)} log p(Y | I; θ)    (7)

  • Here I denotes the input image; Y denotes a target description sentence of arbitrary length composed of the words Y_0, Y_1, ..., Y_N; and θ denotes the model parameters.
  • W_e is the word embedding matrix, Y_t is a one-hot word vector, and Y_0 and Y_N are the special start symbol <Start> and end symbol <End> that mark the beginning and end of a sentence.
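  • For reference, equation (7) unrolls over the words of the sentence by the chain rule, which is the standard decomposition used by NIC-style models and is consistent with equations (9)-(11) of the training procedure; the explicit form below is supplied as an assumption rather than quoted from the original text.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I,Y)} \log p(Y \mid I; \theta), \qquad
\log p(Y \mid I; \theta) = \sum_{t=0}^{N} \log p\bigl(Y_{t} \mid I, Y_{0}, \ldots, Y_{t-1}; \theta\bigr)
```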
  • The datasets used in the present invention are Flickr8kc and Flickr30kc.
  • The English image description datasets Flickr8k and Flickr30k were translated into Chinese versions by machine translation.
  • The Flickr8kc dataset contains 8000 annotated images and 40,000 Chinese description sentences; the Flickr30kc dataset contains 30,000 annotated images and 150,000 Chinese description sentences.
  • The present invention splits the images of Flickr8kc and Flickr30kc into subsets: Flickr8kc comprises 6000 training images, 1000 validation images and 1000 test images, and Flickr30kc comprises 28,000 training images, 1000 validation images and 1000 test images.
  • The environment configuration is as follows: the operating system is Ubuntu 16.03.1, the development language is Python 2.7, and the deep learning framework is TensorFlow 1.6.
  • The Flickr8kc training set includes 6000 images, 30,000 Chinese description sentences and 7784 words; the Flickr30kc training set includes 28,000 images, 140,000 Chinese description sentences and 19,735 words.
  • To remove the interference of low-frequency words, the nouns, verbs and adjectives that appear at least twice within the five Chinese description sentences of the same picture are retained, and those whose overall frequency exceeds 20 form the label vocabulary.
  • The parameter configuration of the vocabulary prediction network is shown in the following table.
  • Table 1: Parameter configuration of the Chinese visual vocabulary prediction network.
  • The label prediction network is evaluated with precision-i, recall-i and f-i: precision-i is the precision of the top i predicted labels; recall-i is the recall of the top i predicted labels; and f-i is the harmonic mean of the precision and recall of the top i predicted labels.
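  • The following is a minimal sketch of how precision-i, recall-i and f-i can be computed for a single image, assuming the labels are ranked by their predicted Sigmoid scores; the function and variable names are illustrative assumptions.

```python
def top_i_metrics(scores, true_labels, i):
    """scores: dict {label word: predicted score}; true_labels: set of ground-truth words."""
    top_i = sorted(scores, key=scores.get, reverse=True)[:i]
    hits = sum(1 for w in top_i if w in true_labels)
    precision = hits / i
    recall = hits / len(true_labels) if true_labels else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f
```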
  • The training data of the Chinese image description network is the same as that of the vocabulary prediction network; the difference is that words with a frequency greater than 5 are selected as the caption vocabulary. The final Flickr8kc vocabulary includes 2625 words, and the Flickr30kc vocabulary includes 7108 words.
  • The parameter configuration of the automatic image description model based on the Chinese visual vocabulary is given in Table 2.
  • The following two tables show the precision-i, recall-i and f-i results of different vocabulary prediction networks, where one, two and three denote feature classification networks with one, two and three fully connected layers, respectively, and this denotes the vocabulary prediction network used in the present invention. Taking the Flickr8kc results as an example, precision drops by 0.4% and recall by 0.48% from the one-layer to the three-layer network, showing network degradation as depth increases; the proposed network, which adds the residual structure to the three-layer network, raises precision and recall to 33.49% and 39.54%, respectively, indicating that the proposed method alleviates the degradation problem, although considerable room for improvement remains.
  • The present invention uses a model optimization method based on label-information matching to optimize the original loss function of image description generation.
  • The L-LSTM selectively stores input information in the cell state c through the input gate and the forget gate, and uses the output gate to control which information in the cell state c is output.
  • The information in the cell state c affects the generation of the description words to a certain extent, and the label information of an image characterizes the content of the image to a certain extent. The distance between the image vocabulary feature and the cell state c of the L-LSTM at the final moment is therefore computed and added to the loss function as an additional term.
  • Shortening the distance between the label feature and the cell state c as much as possible during model training makes the information stored in the L-LSTM closer to the label information of the image and helps generate higher-quality description sentences.
  • Different ways of computing the distance, such as the Manhattan (city-block) distance and the Euclidean distance, are explored, where c denotes the cell state of the L-LSTM and l denotes the image label feature.
  • The original loss function is the sum, over all time steps, of the negative log-probability of the correct word; the distance between the label feature and the cell state c is added to this original loss as an extra term weighted by a hyper-parameter α.
  • Taking the Manhattan distance as an example, α is selected from the range (0, 1) with a step of 0.1; the experimental results are best when α is 0.2, so α is empirically set to 0.2, as in the sketch below.
  • The automatic image description model based on the Chinese visual vocabulary is abbreviated IADCVV; in this variant the image label feature is introduced only through the L-LSTM and the loss function is not optimized.
  • IADCVV with the loss-function optimization, in which the similarity between the cell state c of the L-LSTM and the image vocabulary feature is computed with the Manhattan distance, is called IADCVV-CB; IADCVV-E and IADCVV-C use the Euclidean distance and the cosine value, respectively. The experimental comparison is as follows.
  • First, IADCVV improves over the baseline Google model by 2.8%, 2.7% and 5.3%, which shows that adding image visual vocabulary features to the network effectively improves the quality of the sentences generated by the image description model.
  • IADCVV-CB and IADCVV-E improve over IADCVV to different degrees, which shows that shortening the distance between the image label feature and the L-LSTM cell state can further improve the quality of the generated sentences.
  • The scores of IADCVV-C are lower than those of IADCVV, which shows that an appropriate distance measure must be chosen when shortening the distance between the image label feature and the L-LSTM cell state.
  • IADCVV-CB achieves the better results, indicating that the Manhattan distance is more suitable than the Euclidean distance for computing the distance between the image label feature and the L-LSTM cell state.
  • The invention uses an automatic image description generation model constructed from a Chinese visual vocabulary.
  • First, a Chinese visual vocabulary prediction network is designed; predicting the image vocabulary with this network yields the image annotation information, and adding a residual structure to the prediction network effectively alleviates the network degradation caused by deepening the network.
  • Second, an L-LSTM architecture is used to introduce the image's Chinese visual vocabulary features into the description generation network.
  • In addition, the loss function of the description generation network is optimized to shorten the distance between the image's Chinese visual vocabulary features and the cell state of the L-LSTM, so that the generated description sentences are closer to the image's Chinese visual vocabulary.
  • Finally, the effectiveness of the model is verified in several ways.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An automatic image content description method based on the construction of a Chinese visual vocabulary, comprising, performed in order: step a, using a Chinese word segmentation tool to segment the several description sentences corresponding to a single picture, selectively retaining the nouns, verbs and adjectives in the word list according to their statistical frequency, and forming the retained words into the Chinese visual vocabulary; step b, predicting the Chinese visual vocabulary with the Chinese vocabulary prediction network to obtain image annotation information; step c, using an encoder of the automatic image description model to extract image convolution features, and then using a decoder that takes the image convolution features as its initial input to decode them into a Chinese description sentence. Predicting the image vocabulary with the vocabulary prediction network yields the image annotation information, and adding a residual structure to the Chinese visual vocabulary prediction network effectively alleviates the network degradation caused by deepening the prediction network.

Description

一种中文视觉词汇表构建的图像内容自动描述方法 技术领域
本发明涉及图像语义理解技术,具体提供一种多通道中文视觉词汇表构建的图像内容自动描述方法。
背景技术
图像语义理解技术融合了计算机视觉和自然语言处理两个研究方向,是目前人工智能领域的一项研究热点,也是缩减图像的低层特征和高层语义之间的语义鸿沟的有效方法。图像语义理解技术为机器提供了处理多模态数据的能力,可以有效地缩减图像的低层特征和高层语义之间的语义鸿沟,其核心技术是结合计算机视觉和自然语言处理的相关知识,对图像的内容进行分析、理解,以文本语义信息的形式反馈。
当前,使用中文对图像描述的语句自动生成质量较低,除了图像处理技术的瓶颈。究其原因,一方面是中文的图像描述数据较少且质量较差,限制了图像内容自动生成的发展,另一方面是中文词语的含义丰富,句子结构复杂,同样也存在着语义理解的难题。
发明内容
鉴于上述现有技术中的存在的难题或缺陷,对图像内容理解的中文句子自动生成是其中一重点突破领域,考虑到图像内容目标检测网络往往只能检测到图中的部分物体信息,并且只能提供物体名词信息,无法提供物体相关的属性和动作等关键的信息,在进行句子自动生成中遇到困难;另一方面,通过挖掘图像的描述文本中可以利用的信息,将图像的描述文本进行分词和词性标注等处理,得到图像的标注信息,进而形成中文视觉词汇表也是一个关键,而且词汇表中不仅包括有名词,还包括了谓词、形容词等,词汇表中信息更加丰富,可以使得在进行模型训练中获得更多的语义信息,进而可以更好的应用到图像内容的自动描述过程。
一种中文视觉词汇表构建的图像内容自动描述方法,包括按顺序进行的如下:
步骤a,使用中文分词工具将单张图片对应的若干个描述语句进行分词处理,并根据统计的词频有选择地保留词表中的名词、动词和形容词,再将保留下来的词语构成中文视觉词汇表;
步骤b,基于中文词汇表预测网络对中文视觉词汇表进行预测获得图像标注信息;
步骤c,基于图像自动描述模型,使用编码器提取出图像卷积特征,再使用解码器将图像卷积特征作为初始输入解码为中文描述语句。
较为优选的,还包括步骤c后的步骤d,基于标签信息匹配的模型对描述生成网络的损失函数进行优化。
较为优选的:所述中文词汇表预测网络由基于卷积神经网络的特征提取网络和特征分类网络两个部分组成,所述特征提取网络将基于标签词表数据集上预训练的数据网络的平均池化的输出当作图像特征,输入所述特征分类网络的输入层,并由所述特征分类网络的输出层输出预测图像所对应的中文标签信息。
较为优选的:所述图像自动描述模型进行预测时,解码器首先接受图像卷积特征,并忽略这一时刻的输出;然后输入一个开始符号<Start>和预测标签特征后,解码器输出一个由词表中词语被预测的概率组成的向量,根据输出向量选取概率最大的词语作为本时刻输出;再把这个词语和预测标签特征作为下一时刻的输入,继续进行预测,直到预测出结束符号<End>:
较为优选的:所述编码器包括卷积、下采样和激活操作中的一种或多种。
较为优选的:所述特征分类网络是基于残差连接的三层全连通网络,其中包括输入层、三个隐藏层、两个残差连接以及输出层,所述残差连接添加在第一层全连接与第三层全连接之间、第二层全连接与第三层全连接之间。
较为优选的:所述解码器能够接收图像预测标签特征,并利用预测标签特征引导中文描述语句的生成。
较为优选的:所述步骤d中的基于标签信息匹配的模型优化具体为计算图像词表特征与解码器最终时刻细胞状态之间的距离,作为一个额外项加入损失函数中,并在模型训练时尽可能缩短标签特征与细胞状态的距离。
较为优选的:所述计算距离的方法采用曼哈顿距离或欧式距离。
有益效果:
本发明提供一种中文视觉词汇表构建的图像内容自动描述方法,具体使用一种中文视觉词汇表构建的图像自动描述生成模型,首先设计了一种中文视觉词汇表预测网络,通过词汇表预测网络对图像词汇表进行预测可以获得图像标注信息,在中文视觉词汇表预测网络中添加残差结构,可以有效地解决随着中文视觉词汇表预测网络层数加深,而 导致的网络退化问题;其次使用了一种L-LSTM架构,可以将图像中文视觉词汇表特征引入描述生成网络中;此外,对描述生成网络的损失函数进行优化,缩短图像中文视觉词汇表特征与L-LSTM的细胞状态之间的距离,使得生成的图像描述语句与图像中文视觉词汇表更加贴近,最后通过各种方法验证模型的有效性。
附图说明
图1为图像自动描述模型整体架构示意图;
图2为L-LSTM模型架构图。
具体实施方式
下面首先对本发明所涉及的图像自动生成描述技术的现状进行分析:
目前的图像自动描述生成方法可总结为三大类别,分别为基于模板的方法、基于相似度检索的方法、基于深度学习的方法。
得益于图像物体识别技术的发展,研究人员提出了基于模板的图像描述生成方法。具体为通过目标识别检测出图像中的物体及其属性信息,然后将这些信息以恰当的方式嵌入到预先设计好的模板中。2010年,Farhadi等人使用检测器检测到图像中的物体去推断<物体,动作,场景>三元组,并使用模板将其转化为描述文本。2011年,Yang等人用隐马尔科夫模型选择可能的对象、动词、介词及场景类型填充句子模板。2013年,Kulkarni等人提出了Baby Talk模型,使用条件随机场(Conditional Random Field,CRF)对检测到的物体、属性、关系进行标注,最终使用模板生成描述语句。此类方法得到的描述语句受到模板的限制,显得内容生硬,不够灵活。
基于相似度检索的方法是利用图像传统视觉特征的相似度进行检索,将相似性高的图像的描述文本作为候选答案,或者将图像特征与文本特征映射到同一特征空间,从中检索出与待描述图像相似高的文本作为候选结果。2011年,Ordonez等人提出利用图像的全局特征在百万图像库中进行检索,并将最相似的图像的描述作为待描述图像的描述文本。2014年,Gong等人则是使用典型关联分析(Canonical Correlation Analysis,CCA),把图像和文本映射到同一特征空间,建立对应关系,并从数据库中检索与图像最相似的文本。2015年,Hodosh等人提出使用核典型关联分析(Kernel Canonical Correlation Analysis,KCCA)方法学习图像和文本两个模态的公共特征空间,利用核 函数将原始特征与高维特征进行映射,使用K近邻方法进行检索。这类方法不能完全根据图像内容产生语句,也无法产生数据库中不存在的描述语句。
随着深度学习的兴起,研究人员们提出了基于深度学习的图像描述方法。2014年,Mao等人提出了多模态循环神经网络(m-RNN),使用卷积神经网络对图像进行编码,提取出图像卷积特征,并将此特征在每一时刻输入多模态循环神经网络中进行解码,生成描述单词。同年,Vinyals等人提出了基于卷积神经网络和长短期记忆网络(Long Short Term Memory,LSTM)的图像描述生成模型(Neural Image Caption,NIC),不同于Mao的是,NIC模型使用LSTM建立语言模型生成描述语句,只将卷积神经网络提取图像卷积特征在开始时刻输入到LSTM中,没有在每一个时刻都进行输入,取得了很好的效果。随后,研究人员们对于NIC模型做出了改进,生成描述文本的质量也得到了提升。2015年,Xu等首次在模型中引入两种注意力机制(Attention Mechanism),即Soft-Attention和Hard-Attention使得模型能够捕捉到图像的局部信息,在生成描述语句的过程中始终将CNN产生的低层特征图(Feature Map)输入到注意力机制中,注意力机制会从中选择某些特征图作为视觉信息输入LSTM。在每一轮生成描述单词后,模型都重新聚焦于图像的某些特征图,得到新的视觉信息输入到LSTM中。2016年Lu等人提出一种基于Spatial Attention的图像描述模型,通过Spatial Attention使得模型可以自主决定使用图像信息还是使用语言模型信息。同年,Jia等使用语义信息指导LSTM生成描述,与基于相似度检索的方法类似,首先在图像-文本向量空间中检索出与图像特征最相似的描述文本特征,再将其作为指导信息输入到LSTM中,生成描述语句。2017年,汤鹏杰等人通过迁移学习的方法,分别训练场景分类网络用以捕捉图像的场景先验信息和物体分类网络用以捕捉图像的物体类别先验信息,再将图像的场景先验信息和物体类别先验信息融入模型中,协同生成图像的描述句子,提高句子生成质量。2018年,刘畅等人改变解码器结构,在解码器中加入栈式隐层和普通隐层,提高了语言模型的学习能力。刘泽宇、蓝玮毓对中文图像描述进行了研究,同样在NIC模型的基础上进行优化,刘泽宇等人提出基于多模态神经网络的图像中文摘要生成方法,在“编码-解码”模型中引入多标签关键词特征预测网络,首先利用关键词特征预测网络提取图像关键词特征,再将关键词特征以不同的方式输入 到LSTM中进行解码,蓝玮毓则是利用深度模型对图像进行标签预测,并使用标签对解码器结果进行重新排序,改善了句子生成质量。同年,Lu等提出的神经网络方法可以自动生成一个句子模板,将模板中的空槽和图片中物体捆绑在一起。在生成每个词语时,模型会决定选择文本词汇还是视觉词汇。
本发明使用了一种中文视觉词汇表的构建来实现对图像内容的自动描述,方法使用Flickr8kc、Flickr30kc中文图像描述数据集。具体实现过程集合附图1-2进行描述:
1.构建图像中文视觉词汇表的预测网络
在图像描述的数据集中,每幅图片对应着五个描述语句,每个句子都可以生动地描述图像的内容。本发明的方法是选择句子中的名词、动词和形容词作为要预测的图像内容标签。更具体地说,首先使用中文分词工具Boson将描述语句进行分词处理,并根据统计的词频有选择地保留词表中的名词、动词和形容词,再将保留下来的词语构成标签词表,并根据标签词表为每一幅图片标注标签信息,这样就获得了标签预测网络的训练数据。
本方法中使用的图像标签预测网络由2个部分组成,一是基于CNN的特征提取网络,二是特征分类网络。
其中,特征提取网络使用的是在ImageNet数据集上预训练的Resnet-152网络,ResNet-152是ImageNet2015图像分类比赛中的冠军模型,将Resnet-152网络的平均池化的输出当作图像特征,用于后续的特征分类网络。特征分类网络是基于残差连接的三层全连通网络,其中包括输入层、三个隐藏层、两个残差连接以及输出层。输入层用于接收由resnet-152提取的图像特征,输出层用于预测图像所对应的标签信息。由于每幅图中的包含的标签不止一个,所以这是一个多标签分类问题,我们将输出层的激活函数设置为Sigmoid函数。传统的深度学习特征分类网络只是一个单层的全连接网络。随着特征分类网络深度的增加,模型的表达能力增强,但模型的训练变得困难,出现了网络退化问题,即随着模型深度的增加,模型的准确率下降。我们受到ResNet残差结构的启发,在第一层全连接与第三层全连接之间、第二层全连接与第三层全连接之间添加了残差连接,这种做法不仅没有增加模型的复杂度,而且提高了模型的准确率。
2.基于中文视觉词汇表的图像自动描述模型
本发明中所提出的中文描述生成模型由两个部分组成,分别为卷积神经网络CNN和融合标签信息的长短期记忆网络L-LSTM,其架构是使用CNN作为编码器,提取出图像卷积特征,再使用L-LSTM作为解码器,将图像卷积特征作为初始输入解码为目标描述语句。
具体地,在模型进行预测时,L-LSTM首先接受图像卷积特征,并忽略这一时刻的输出;然后输入一个开始符号<Start>和预测标签特征后,L-LSTM输出一个由词表中词语被预测的概率组成的向量,根据输出向量选取概率最大的词语作为本时刻输出;再把这个词语和预测标签特征作为下一时刻的输入,继续进行预测,直到预测出结束符号<End>,整体架构如图1所示。
图像自动描述模型中的编码器CNN是一种用于处理网格化数据的神经网络。CNN模型由一系列的变换模块组成,例如卷积、激活、下采样等。用深度CNN模型提取图像特征,图像数据需要经过多次的卷积、下采样和激活等操作,其提取出的特征更加抽象,表达能力更强,在图像分类与识别、目标检测、场景理解等视觉任务上取得了显著的效果。
图像自动描述模型中的解码器L-LSTM是由本文提出的一种能够融合标签信息的长短期记忆网络,如图2所示。L-LSTM与LSTM相同之处在于,通过不同的"门"的结构向细胞状态c中增加或去除信息,其中忘记门(Forget Gate)用于决定从前一时刻的细胞状态c_{t-1}中保留或丢弃哪些语义信息;输入门(Input Gate)用于决定哪些语义信息输入到本时刻的细胞状态c_t中;输出门(Output Gate)用于决定从本时刻的细胞状态c_t中输出哪些语义信息。不同之处在于,L-LSTM能够接收图像预测标签特征l,并利用预测标签特征引导描述语句的生成。具体公式如下:
f_t = σ_g(W_f x_t + U_f h_{t-1} + V_f l + b_f)        (1)
i_t = σ_g(W_i x_t + U_i h_{t-1} + V_i l + b_i)        (2)
o_t = σ_g(W_o x_t + U_o h_{t-1} + V_o l + b_o)        (3)
g_t = σ_h(W_c x_t + U_c h_{t-1} + V_c l + b_c)        (4)
c_t = f_t * c_{t-1} + i_t * g_t                 (5)
h_t = o_t * c_t                        (6)
其中W、U、V、b表示在L-LSTM中需要训练的权重和偏置,x_t表示L-LSTM的输入向量,l表示预测的标签特征,h_t表示L-LSTM的隐藏状态,c_t表示L-LSTM的单元状态,f_t表示L-LSTM的"遗忘门"的激活向量,i_t表示L-LSTM的"输入门"的激活向量,o_t表示L-LSTM的"输出门"的激活向量,*表示点乘,下标t表示时间,σ_g表示Sigmoid函数,σ_h表示Tanh函数。
在“编码-解码”过程中CIC模型需要最大化给定图像生成目标描述语句的概率,由式(7)表示:
θ* = arg max_θ Σ_{(I,Y)} log p(Y|I;θ)        (7)
其中,I表示输入图像,Y表示任意一个不定长度的目标描述语句,由单词Y_0,Y_1,...,Y_N构成,θ表示模型参数。
3.图像自动描述模型训练过程
模型训练的过程如下:
(1)通过卷积神经网络ResNet-152提取图像卷积特征I_c。通过标签预测网络提取图像标签特征l;
(2)将图像卷积特征I_c作为L-LSTM第一时刻的输入:
x_{-1} = I_c          (8)
(3)将目标语句中的单词独热向量Y_t,t∈{0,...,N-1},经过词嵌入后的单词特征向量W_e Y_t和图像标签特征l作为L-LSTM其他时刻的输入,可以得到L-LSTM的隐藏状态h_t以及词语预测概率p_{t+1}:
x_t = W_e Y_t,t∈{0,...,N-1}          (9)
h_t = L-LSTM(x_t,l,h_{t-1},c_{t-1})        (10)
p_{t+1} = Softmax([h_t])                 (11)
(4)最终使用词语预测概率p_{t+1}与最后时刻的细胞状态c计算模型的损失,并使用随机梯度下降进行优化,损失计算方法如式(7)所示。
其中,W_e为词嵌入矩阵,Y_t表示独热向量,Y_0与Y_N分别为特殊的起始符<Start>和结束符<End>,用来表示句子的开始与结束。当L-LSTM预测下一个词为结束符时,表示已经生成了一个完整的句子。
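The following is a minimal sketch of the training-time forward pass of equations (8)-(11), reusing the l_lstm_step sketch given earlier; teacher forcing feeds the ground-truth word Y_t at each step, and the image feature I_c is assumed to be projected to the word-embedding dimension beforehand. All names are illustrative assumptions.

```python
import numpy as np

def forward_pass(I_c, l, Y_ids, W_e, V_softmax, params, hidden_dim):
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    h, c = l_lstm_step(I_c, l, h, c, params)        # x_{-1} = I_c, output ignored   (8)
    correct_word_probs = []
    for t in range(len(Y_ids) - 1):                 # Y_0 = <Start>, ..., Y_N = <End>
        x_t = W_e[Y_ids[t]]                         # x_t = W_e Y_t                  (9)
        h, c = l_lstm_step(x_t, l, h, c, params)    # hidden state h_t               (10)
        logits = V_softmax @ h
        p = np.exp(logits - logits.max())
        p /= p.sum()                                # p_{t+1} = Softmax([h_t])       (11)
        correct_word_probs.append(p[Y_ids[t + 1]])  # probability of the correct next word
    return correct_word_probs, c                    # c feeds the distance term of the loss
```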
4.模型效果验证
数据集
本发明使用的数据集是Flickr8kc和Flickr30kc。采用机器翻译的方法,将英文图 像描述数据集Flickr8k和Flickr30k翻译为中文版本。Flickr8kc数据集中包含8000张标注图像,40000条中文描述语句。Flickr30kc数据集中包含30000张标注图像,150000条中文描述语句。本发明使用图像分割方法对Flickr8kc和Flickr30kc进行分割,其中Flickr8kc包括6000张训练数据,1000张验证数据,1000张测试数据,Flickr30kc包括28000张训练数据,1000张验证数据,1000张测试数据。
系统配置
环境配置如下:操作系统为Ubuntu16.03.1、开发语言为Python2.7,深度学习框架为TensorFlow1.6。Flickr8kc训练集包括6000幅图像、30000个中文描述语句和7784个词语。Flickr30kc训练集包括28000张图片,140000个中文描述语句和19735个词语。为了消除低频词的干扰,保留了同一张图片的5个中文描述语句中至少出现2次的名词、动词以及形容词,并且整体词频大于20次的词语当作词汇表。词汇表预测网络参数配置,如下表所示。
表1 中文视觉词汇表预测网络参数配置
标签预测网络采用的评价标准为precision-i、recall-i和f-i。precision-i表示前k个预测标签的准确率。recall-i表示前i个预测标签的召回率。f-i是前i个预测标签的准确率和召回率的调和平均值。
中文图像描述网络训练集数据与词汇表预测网络相同,与其不同的是筛选出词频大于5的词语当作词表,最终Flickr8kc词表包括2625个词语,Flickr30kc词表包括7108个词语。中文视觉词汇表的图像自动描述模型参数配置如下:
表2 基于中文视觉词汇表的图像自动描述模型网络参数配置
图像视觉词汇表预测网络评估
下面两个表分别显示了不同词汇表预测网络的precision-i、recall-i和f-i的结果,one表示特征分类网络为单层全连接的网络,two表示两层全连接的网络,three表示三层全连接的网络,this代表本发明使用的词汇表预测网络。
我们以表4中的Flickr8kc标签预测网络结果为例,将所提出的this网络与one、two、three进行比较。实验表明,随着网络层次的增加,出现了网络退化的现象,即网络的准确率和召回率下降的问题。one和three之间的准确率下降了0.4%,召回率下降了0.48%。而本发明提出的this是在three的基础上加入残差结构的网络,将准确率和召回率分别提高到了33.49%,39.54%,说明了发明提出方法的可以解决网络退化问题。但从整体来看,中文词表预测网络的准确率与召回率仍有很大的提高的空间。
表3 标签预测网络1层结果比较
表4 标签预测网络5层结果比较
优化损失函数
本发明使用了一种基于标签信息匹配的模型优化方法,对图像描述生成的原始损失函数进行优化。观察L-LSTM的内部结构就可以了解到L-LSTM通过“输入门”与“遗忘 门”将输入信息选择性地保存在细胞状态c中,并利用“输出门”控制细胞状态c中的信息进行输出,细胞状态c中的信息在一定程度上影响了描述词语的生成。图像的标签信息可以在一定程度上对图像内容进行刻画。计算图像词表特征与L-LSTM最终时刻细胞状态c之间的距离,作为一个额外项加入损失函数中。在模型训练时尽可能缩短标签特征与细胞状态c的距离,可以使得L-LSTM的中保存的信息更加贴近于图像的标签信息,有助于生成质量更高的描述语句。探索不同的计算距离的方法,例如曼哈顿距离、欧式距离等,具体如下。
CityBlock Distance(l,c) = Σ_i |l_i - c_i|        (12)
Euclidean Distance(l,c) = (Σ_i (l_i - c_i)²)^(1/2)        (13)
Cosine(l,c) = (l·c)/(‖l‖‖c‖)        (14)
其中,c表示L-LSTM的细胞状态,l表示图像标签特征。原始损失函数为每个时刻输出正确单词概率的负对数和,将标签特征与细胞状态c的距离添加到原始损失函数中,如下式所示。
Loss = -Σ_t log p_t(Y_t) + α·Distance(l,c)
以曼哈顿距离为例,对超参数α进行选择。α取值范围为(0,1),步长为0.1。当α的值为0.2为实验效果最佳。故此,将α值经验地设定为0.2。
本发明将中文视觉词汇表的图像自动描述模型简称为IADCVV,表示网络中只是通过L-LSTM引入图像标签特征,并没有优化损失函数。在IADCVV的基础上使用损失函数优化方法并通过曼哈顿距离计算L-LSTM的细胞状态c与图像词表特征之间的相似度称为IADCVV-CB。而IADCVV-E、IADCVV-C是使用欧式距离、余弦值来衡量相似度。通过实验对比如下:
以下表统计为例,首先,可以看出IADCVV比基线Google模型的实验效果提高了2.8%,2.7%,5.3%,说明了在网络中加入图像视觉词汇表特征的方法可以有效地提高图像描述模型生成语句的质量。其次,IADCVV-CB和IADCVV-E与IADCVV相比又有不同程度的提升,说明了缩短图像标签特征与L-LSTM细胞状态之间的距离可以进一步优化图像描述模型生成语句的质量。IADCVV-C与IADCVV相比值有所降低,说明了需要选择合适距离计算方法来缩短图像标签特征与L-LSTM细胞状态之间的距离。IADCVV-CB和IADCVV-E的效果更好,说明曼哈顿距离比欧氏距离更加适合计算图像标签特征与L-LSTM细胞状态之间的距离。
IADCVV在Flickr8kc上结果对比
IADCVV在Flickr30kc上结果对比
通过上表分析,展示了所有模型实验结果的对比。可以看出本发明所提模型效果与已知现有模型相比有了较大的提升。
本发明使用一种中文视觉词汇表构建的图像自动描述生成模型。首先设计了一种中文视觉词汇表预测网络,通过词汇表预测网络对图像词汇表进行预测可以获得图像标注信息,在中文视觉词汇表预测网络中添加残差结构,可以有效地解决随着中文视觉词汇表预测网络层数加深,而导致的网络退化问题。
其次使用了一种L-LSTM架构,可以将图像中文视觉词汇表特征引入描述生成网络中。此外,对描述生成网络的损失函数进行优化,缩短图像中文视觉词汇表特征与L-LSTM的细胞状态之间的距离,使得生成的图像描述语句与图像中文视觉词汇表更加贴近。最后通过各种方法验证模型的有效性。
本发明的实施例公布的是较佳的实施例,但并不局限于此,本领域的普通技术人员,极易根据上述实施例,领会本发明的精神,并做出不同的引申和变化,但只要不脱离本发明的精神,都在本发明的保护范围内。

Claims (9)

  1. 一种中文视觉词汇表构建的图像内容自动描述方法,其特征在于,包括按顺序进行的如下:
    步骤a,使用中文分词工具将单张图片对应的若干个描述语句进行分词处理,并根据统计的词频有选择地保留词表中的名词、动词和形容词,再将保留下来的词语构成中文视觉词汇表;
    步骤b,基于中文词汇表预测网络对中文视觉词汇表进行预测获得图像标注信息;
    步骤c,基于图像自动描述模型,使用编码器提取出图像卷积特征,再使用解码器将图像卷积特征作为初始输入解码为中文描述语句。
  2. 如权利要求1所述的中文视觉词汇表构建的图像内容自动描述方法,其特征在于:还包括步骤c后的步骤d,基于标签信息匹配的模型对描述生成网络的损失函数进行优化。
  3. 如权利要求1所述的中文视觉词汇表构建的图像内容自动描述方法,其特征在于:所述中文词汇表预测网络由基于卷积神经网络的特征提取网络和特征分类网络两个部分组成,所述特征提取网络将基于标签词表数据集上预训练的数据网络的平均池化的输出当作图像特征,输入所述特征分类网络的输入层,并由所述特征分类网络的输出层输出预测图像所对应的中文标签信息。
  4. 如权利要求1所述的中文视觉词汇表构建的图像内容自动描述方法,其特征在于:所述图像自动描述模型进行预测时,解码器首先接受图像卷积特征,并忽略这一时刻的输出;然后输入一个开始符号<Start>和预测标签特征后,解码器输出一个由词表中词语被预测的概率组成的向量,根据输出向量选取概率最大的词语作为本时刻输出;再把这个词语和预测标签特征作为下一时刻的输入,继续进行预测,直到预测出结束符号<End>:
  5. 如权利要求1所述的中文视觉词汇表构建的图像内容自动描述方法,其特征在于:所述编码器包括卷积、下采样和激活操作中的一种或多种。
  6. 如权利要求3所述的中文视觉词汇表构建的图像内容自动描述方法,其特征在于:所述特征分类网络是基于残差连接的三层全连通网络,其中包括输入层、三个隐藏层、两个残差连接以及输出层,所述残差连接添加在第一层全连接与第三层全连接之间、第二层全连接与第三层全连接之间。
  7. 如权利要求4所述的中文视觉词汇表构建的图像内容自动描述方法,其特征在于:所述解码器能够接收图像预测标签特征,并利用预测标签特征引导中文描述语句的生成。
  8. 如权利要求2所述的中文视觉词汇表构建的图像内容自动描述方法,其特征在于:所述步骤d中的基于标签信息匹配的模型优化具体为计算图像词表特征与解码器最终时刻细胞状态之间的距离,作为一个额外项加入损失函数中,并在模型训练时尽可能缩短标签特征与细胞状态的距离。
  9. 如权利要求8所述的中文视觉词汇表构建的图像内容自动描述方法,其特征在于:所述计算距离的方法采用曼哈顿距离或欧式距离。
PCT/CN2020/102234 2020-05-06 2020-07-16 一种中文视觉词汇表构建的图像内容自动描述方法 WO2021223323A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010374110.X 2020-05-06
CN202010374110.XA CN111581961B (zh) 2020-05-06 2020-05-06 一种中文视觉词汇表构建的图像内容自动描述方法

Publications (1)

Publication Number Publication Date
WO2021223323A1 true WO2021223323A1 (zh) 2021-11-11

Family

ID=72116901

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/102234 WO2021223323A1 (zh) 2020-05-06 2020-07-16 一种中文视觉词汇表构建的图像内容自动描述方法

Country Status (2)

Country Link
CN (1) CN111581961B (zh)
WO (1) WO2021223323A1 (zh)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049501A (zh) * 2021-11-22 2022-02-15 江苏科技大学 融合集束搜索的图像描述生成方法、系统、介质及设备
CN114238563A (zh) * 2021-12-08 2022-03-25 齐鲁工业大学 基于多角度交互的中文句子对语义智能匹配方法和装置
CN114469661A (zh) * 2022-02-24 2022-05-13 沈阳理工大学 一种基于编码解码技术的视觉内容导盲辅助系统及方法
CN114549850A (zh) * 2022-01-24 2022-05-27 西北大学 一种解决模态缺失问题的多模态图像美学质量评价方法
CN114596588A (zh) * 2022-03-11 2022-06-07 中山大学 基于文本辅助特征对齐模型的受损行人图像再识别方法及装置
CN114663915A (zh) * 2022-03-04 2022-06-24 西安交通大学 基于Transformer模型的图像人-物交互定位方法及系统
CN114707523A (zh) * 2022-04-20 2022-07-05 合肥工业大学 基于交互式Transformer的图像-多语言字幕转换方法
CN114781393A (zh) * 2022-04-20 2022-07-22 平安科技(深圳)有限公司 图像描述生成方法和装置、电子设备及存储介质
CN114882488A (zh) * 2022-05-18 2022-08-09 北京理工大学 基于深度学习与注意力机制的多源遥感图像信息处理方法
CN115171889A (zh) * 2022-09-09 2022-10-11 紫东信息科技(苏州)有限公司 一种小样本胃部肿瘤诊断系统
CN115909317A (zh) * 2022-07-15 2023-04-04 广东工业大学 一种三维模型-文本联合表达的学习方法及系统
CN115953779A (zh) * 2023-03-03 2023-04-11 中国科学技术大学 基于文本对抗生成网络的无监督图像描述生成方法
CN116012685A (zh) * 2022-12-20 2023-04-25 中国科学院空天信息创新研究院 一种基于关系序列与视觉序列融合的图像描述生成方法
CN116071641A (zh) * 2023-04-06 2023-05-05 中国石油大学(华东) 一种水下图像中文描述生成方法、装置、设备及存储介质
CN116204674A (zh) * 2023-04-28 2023-06-02 中国科学技术大学 一种基于视觉概念词关联结构化建模的图像描述方法
CN116502092A (zh) * 2023-06-26 2023-07-28 国网智能电网研究院有限公司 多源异构数据的语义对齐方法、装置、设备及存储介质
CN116543289A (zh) * 2023-05-10 2023-08-04 南通大学 一种基于编码器-解码器及Bi-LSTM注意力模型的图像描述方法
CN114596588B (zh) * 2022-03-11 2024-05-31 中山大学 基于文本辅助特征对齐模型的受损行人图像再识别方法及装置

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052906B (zh) * 2020-09-14 2024-02-02 南京大学 一种基于指针网络的图像描述优化方法
CN112328782B (zh) * 2020-11-04 2022-08-09 福州大学 一种融合图像过滤器的多模态摘要生成方法
CN113408430B (zh) * 2021-06-22 2022-09-09 哈尔滨理工大学 基于多级策略和深度强化学习框架的图像中文描述系统及方法
CN113792617B (zh) * 2021-08-26 2023-04-18 电子科技大学 一种结合图像信息和文本信息的图像解译方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599198A (zh) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 一种多级联结循环神经网络的图像描述方法
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN109271628A (zh) * 2018-09-03 2019-01-25 东北大学 一种图像描述生成方法
CN110046226A (zh) * 2019-04-17 2019-07-23 桂林电子科技大学 一种基于分布词向量cnn-rnn网络的图像描述方法
CN110598713A (zh) * 2019-08-06 2019-12-20 厦门大学 基于深度神经网络的智能图像自动描述方法

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10558750B2 (en) * 2016-11-18 2020-02-11 Salesforce.Com, Inc. Spatial attention model for image captioning
CN108830287A (zh) * 2018-04-18 2018-11-16 哈尔滨理工大学 基于残差连接的Inception网络结合多层GRU的中文图像语义描述方法
CN110111399B (zh) * 2019-04-24 2023-06-30 上海理工大学 一种基于视觉注意力的图像文本生成方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599198A (zh) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 一种多级联结循环神经网络的图像描述方法
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN109271628A (zh) * 2018-09-03 2019-01-25 东北大学 一种图像描述生成方法
CN110046226A (zh) * 2019-04-17 2019-07-23 桂林电子科技大学 一种基于分布词向量cnn-rnn网络的图像描述方法
CN110598713A (zh) * 2019-08-06 2019-12-20 厦门大学 基于深度神经网络的智能图像自动描述方法

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049501A (zh) * 2021-11-22 2022-02-15 江苏科技大学 融合集束搜索的图像描述生成方法、系统、介质及设备
CN114238563A (zh) * 2021-12-08 2022-03-25 齐鲁工业大学 基于多角度交互的中文句子对语义智能匹配方法和装置
CN114549850A (zh) * 2022-01-24 2022-05-27 西北大学 一种解决模态缺失问题的多模态图像美学质量评价方法
CN114549850B (zh) * 2022-01-24 2023-08-08 西北大学 一种解决模态缺失问题的多模态图像美学质量评价方法
CN114469661A (zh) * 2022-02-24 2022-05-13 沈阳理工大学 一种基于编码解码技术的视觉内容导盲辅助系统及方法
CN114469661B (zh) * 2022-02-24 2023-10-03 沈阳理工大学 一种基于编码解码技术的视觉内容导盲辅助系统及方法
CN114663915A (zh) * 2022-03-04 2022-06-24 西安交通大学 基于Transformer模型的图像人-物交互定位方法及系统
CN114663915B (zh) * 2022-03-04 2024-04-05 西安交通大学 基于Transformer模型的图像人-物交互定位方法及系统
CN114596588A (zh) * 2022-03-11 2022-06-07 中山大学 基于文本辅助特征对齐模型的受损行人图像再识别方法及装置
CN114596588B (zh) * 2022-03-11 2024-05-31 中山大学 基于文本辅助特征对齐模型的受损行人图像再识别方法及装置
CN114781393B (zh) * 2022-04-20 2023-05-26 平安科技(深圳)有限公司 图像描述生成方法和装置、电子设备及存储介质
CN114707523A (zh) * 2022-04-20 2022-07-05 合肥工业大学 基于交互式Transformer的图像-多语言字幕转换方法
CN114781393A (zh) * 2022-04-20 2022-07-22 平安科技(深圳)有限公司 图像描述生成方法和装置、电子设备及存储介质
CN114707523B (zh) * 2022-04-20 2024-03-08 合肥工业大学 基于交互式Transformer的图像-多语言字幕转换方法
CN114882488A (zh) * 2022-05-18 2022-08-09 北京理工大学 基于深度学习与注意力机制的多源遥感图像信息处理方法
CN115909317A (zh) * 2022-07-15 2023-04-04 广东工业大学 一种三维模型-文本联合表达的学习方法及系统
CN115171889A (zh) * 2022-09-09 2022-10-11 紫东信息科技(苏州)有限公司 一种小样本胃部肿瘤诊断系统
CN116012685A (zh) * 2022-12-20 2023-04-25 中国科学院空天信息创新研究院 一种基于关系序列与视觉序列融合的图像描述生成方法
CN116012685B (zh) * 2022-12-20 2023-06-16 中国科学院空天信息创新研究院 一种基于关系序列与视觉序列融合的图像描述生成方法
CN115953779A (zh) * 2023-03-03 2023-04-11 中国科学技术大学 基于文本对抗生成网络的无监督图像描述生成方法
CN115953779B (zh) * 2023-03-03 2023-06-16 中国科学技术大学 基于文本对抗生成网络的无监督图像描述生成方法
CN116071641A (zh) * 2023-04-06 2023-05-05 中国石油大学(华东) 一种水下图像中文描述生成方法、装置、设备及存储介质
CN116071641B (zh) * 2023-04-06 2023-08-04 中国石油大学(华东) 一种水下图像中文描述生成方法、装置、设备及存储介质
CN116204674A (zh) * 2023-04-28 2023-06-02 中国科学技术大学 一种基于视觉概念词关联结构化建模的图像描述方法
CN116204674B (zh) * 2023-04-28 2023-07-18 中国科学技术大学 一种基于视觉概念词关联结构化建模的图像描述方法
CN116543289B (zh) * 2023-05-10 2023-11-21 南通大学 一种基于编码器-解码器及Bi-LSTM注意力模型的图像描述方法
CN116543289A (zh) * 2023-05-10 2023-08-04 南通大学 一种基于编码器-解码器及Bi-LSTM注意力模型的图像描述方法
CN116502092A (zh) * 2023-06-26 2023-07-28 国网智能电网研究院有限公司 多源异构数据的语义对齐方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN111581961B (zh) 2022-06-21
CN111581961A (zh) 2020-08-25

Similar Documents

Publication Publication Date Title
WO2021223323A1 (zh) 一种中文视觉词汇表构建的图像内容自动描述方法
CN108733792B (zh) 一种实体关系抽取方法
Torfi et al. Natural language processing advancements by deep learning: A survey
CN109918671B (zh) 基于卷积循环神经网络的电子病历实体关系抽取方法
Bai et al. A survey on automatic image caption generation
WO2022037256A1 (zh) 文本语句处理方法、装置、计算机设备和存储介质
WO2023093574A1 (zh) 基于多级图文语义对齐模型的新闻事件搜索方法及系统
Wang et al. Application of convolutional neural network in natural language processing
CN112000818B (zh) 一种面向文本和图像的跨媒体检索方法及电子装置
CN111930942B (zh) 文本分类方法、语言模型训练方法、装置及设备
CN111125406B (zh) 一种基于自适应聚类学习的视觉关系检测方法
CN110969020A (zh) 基于cnn和注意力机制的中文命名实体识别方法、系统及介质
CN108985370B (zh) 图像标注语句自动生成方法
CN111881292B (zh) 一种文本分类方法及装置
CN112818670B (zh) 可分解变分自动编码器句子表示中的切分语法和语义
Cheng et al. A semi-supervised deep learning image caption model based on Pseudo Label and N-gram
CN114580428A (zh) 融合多任务和多标签学习的司法领域深度事件抽取方法
CN110347853B (zh) 一种基于循环神经网络的图像哈希码生成方法
Perez-Martin et al. A comprehensive review of the video-to-text problem
Rasool et al. WRS: a novel word-embedding method for real-time sentiment with integrated LSTM-CNN model
Deorukhkar et al. A detailed review of prevailing image captioning methods using deep learning techniques
CN113378919B (zh) 融合视觉常识和增强多层全局特征的图像描述生成方法
CN113268592B (zh) 基于多层次交互注意力机制的短文本对象情感分类方法
CN110889505A (zh) 一种图文序列匹配的跨媒体综合推理方法和系统
Ludwig et al. Deep embedding for spatial role labeling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20934476

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 030423)

122 Ep: pct application non-entry in european phase

Ref document number: 20934476

Country of ref document: EP

Kind code of ref document: A1