CN111813978A - An image description model generation method, generation device and storage medium - Google Patents
- Publication number
- CN111813978A (application number CN201910295123.5A)
- Authority
- CN
- China
- Prior art keywords
- distribution probability
- image
- model
- sentence
- description
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/535 — Information retrieval of still image data; querying; filtering based on additional data, e.g. user or group profiles
- G06F16/50 — Information retrieval; database structures therefor; file system structures therefor, of still image data
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
- G06N3/045 — Neural networks; combinations of networks
Abstract
The present application discloses a method for generating an image description model, a generating apparatus, and a storage medium. The method includes: acquiring an image sample and a description sentence corresponding to the image sample; inputting the image sample and the description sentence into a first prediction model to generate a first distribution probability set; acquiring a second distribution probability set corresponding to the target words that correspond to the entity objects in the image sample; assigning weight values to the first distribution probability set and the second distribution probability set respectively, so as to obtain a third distribution probability set containing the third distribution probability corresponding to each target word at each decoding time, and outputting a predicted sentence; determining a sentence entity object coverage loss function and a language sequence loss function respectively, based on the first distribution probability set, the second distribution probability set, and the weight values; and determining a model loss function to optimize the image description model until the image description model is generated. The embodiments of the present application improve the accuracy of image description by optimizing the image description model.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to a method for generating an image description model, a generating apparatus, and a storage medium.
Background
Image description describes the visual content of an image in natural language. In recent years, image description technology has been widely used; for example, automatic image description generation can be applied to image retrieval. Existing image description technology mainly extracts a feature vector from the input image data through a convolutional neural network (CNN), and then decodes the extracted features with a recurrent neural network (RNN) or a long short-term memory network (LSTM) to obtain an output sequence and generate the image description.
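The CNN-encoder/RNN-decoder pipeline described above can be illustrated with a minimal, runnable sketch. The weights below are random stand-ins for a trained CNN and LSTM, and the vocabulary, dimensions, and greedy-decoding loop are all illustrative assumptions, not the patent's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<eos>", "a", "dog", "on", "couch", "blanket"]
FEAT_DIM, HID_DIM = 8, 8

def encode_image(image):
    # Stand-in for a CNN encoder: reduce an image to one feature vector.
    return image.mean(axis=(0, 1))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy decoder weights (random; a real model would learn these).
W_x = rng.normal(size=(HID_DIM, FEAT_DIM))
W_h = rng.normal(size=(HID_DIM, HID_DIM))
W_o = rng.normal(size=(len(VOCAB), HID_DIM))

def decode(features, max_len=5):
    # Greedy decoding: at each step emit the most probable word.
    h = np.tanh(W_x @ features)
    words = []
    for _ in range(max_len):
        probs = softmax(W_o @ h)          # distribution over the vocabulary
        word = VOCAB[int(probs.argmax())]
        if word == "<eos>":
            break
        words.append(word)
        h = np.tanh(W_h @ h)              # simplified recurrent update
    return words

image = rng.normal(size=(4, 4, FEAT_DIM))  # stand-in "image"
caption = decode(encode_image(image))
print(caption)
```

A real decoder would also feed the previously emitted word back into the recurrence; that refinement is omitted here for brevity.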
When training an image description generation model, existing technology typically uses images and their corresponding description sentences as the sample data set. The vocabulary is therefore fixed, and the resulting model is generally limited to image data it has already been trained on, which restricts its application in scenarios where new image data is the input. In addition, the recognition accuracy for the objects to be recognized in the description sentences output by such models is not high.
Summary of the Invention
An embodiment of the present application provides a method for generating an image description model, which improves the accuracy of the image description of an input image to be recognized by establishing an image description model.
The method includes:
acquiring an image sample and a description sentence corresponding to the image sample;
inputting the image sample and each target word in the description sentence in turn into a pre-trained first prediction model, and generating a first distribution probability set of the first distribution probabilities corresponding to each target word contained in the description sentence at each decoding time;
acquiring a second distribution probability set of the second distribution probabilities corresponding, at each decoding time, to the target words corresponding to the entity objects in the image sample;
assigning weight values respectively to the first distribution probability corresponding to the target word in the first distribution probability set and the second distribution probability corresponding to the target word in the second distribution probability set, so as to obtain a third distribution probability set containing the third distribution probability corresponding to the target word at each decoding time, and outputting a predicted sentence;
determining a sentence entity object coverage loss function and a language sequence loss function respectively, based on the first distribution probability set, the second distribution probability set, and the weight values;
determining a model loss function according to the sentence entity object coverage loss function and the language sequence loss function, and optimizing the image description model according to the model loss function until the image description model is generated.
Optionally, the generating method further includes:
inputting the image sample into the pre-trained convolutional neural network model, and extracting the feature vector of the image sample;
inputting the feature vector and the target word corresponding to the description sentence at the previous decoding time into the long short-term memory network model.
Optionally, the generating method includes:
inputting the image sample into an object detector pre-trained on a preset image data set, and generating a prediction score set of the prediction scores that each entity object contained in the image sample corresponds to the target word;
determining the target word as the correct semantic word for the entity object based on the prediction score set and the hidden state of the long short-term memory network model at the current decoding time, and generating a second distribution probability set of the second distribution probabilities of the target word at each decoding time.
Optionally, the generating method includes:
calculating the third distribution probability set containing the third distribution probability corresponding to the target word at each decoding time, based on a first weight value assigned to the first distribution probability of the target word and a second weight value assigned to the second distribution probability, where the sum of the first weight value and the second weight value is 1.
Optionally, the generating method includes:
determining the sentence entity object coverage loss function based on the weighted sum, over the decoding times, of the second distribution probabilities corresponding to each target word in the second distribution probability set with the assigned weight values, together with the actual distribution probability of the target word in the description sentence;
determining the language sequence loss function based on the predicted sentence for the image sample output by the first prediction model and the description sentence.
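The two loss functions and their combination can be illustrated as follows. The patent text does not give exact functional forms at this point, so the sketch assumes a negative log-likelihood for the language sequence loss and a squared-error form for the coverage loss; `lam` and all array values are illustrative assumptions:

```python
import numpy as np

def sequence_loss(third_dists, target_ids):
    # Negative log-likelihood of each ground-truth word under the final
    # (third) distribution at its decoding time, averaged over the sentence.
    nll = [-np.log(third_dists[t][w]) for t, w in enumerate(target_ids)]
    return float(np.mean(nll))

def coverage_loss(second_dists, weights, target_presence):
    # Accumulate the weighted copying (second) probabilities over all
    # decoding times and compare them with each entity word's actual
    # presence in the description sentence (squared error is assumed).
    accumulated = sum(w * d for w, d in zip(weights, second_dists))
    return float(np.square(accumulated - target_presence).sum())

def model_loss(third_dists, target_ids, second_dists, weights,
               target_presence, lam=1.0):
    # Combined objective; lam is an assumed trade-off coefficient.
    return (sequence_loss(third_dists, target_ids)
            + lam * coverage_loss(second_dists, weights, target_presence))

# Tiny example: vocabulary of 4 words, sentence of 3 decoding steps.
third = [np.array([0.7, 0.1, 0.1, 0.1]),
         np.array([0.1, 0.6, 0.2, 0.1]),
         np.array([0.1, 0.1, 0.1, 0.7])]
second = [np.array([0.0, 0.9, 0.1, 0.0])] * 3
w = [0.2, 0.5, 0.3]                        # 1 - p_t at each step
presence = np.array([0.0, 1.0, 0.0, 0.0])  # only word 1 is an entity word
loss = model_loss(third, [0, 1, 3], second, w, presence)
```

Minimizing `model_loss` by gradient descent would then drive the optimization of the image description model described above.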
In another embodiment of the present invention, a method for acquiring an image description is provided. The method includes:
acquiring an image to be recognized;
inputting the image to be recognized into the image description model obtained by the steps of the above method for generating an image description model, so as to generate a description sentence for the image to be recognized.
In another embodiment of the present invention, an apparatus for generating an image description model is provided. The generating apparatus includes:
an acquisition module, configured to acquire an image sample and a description sentence corresponding to the image sample;
a first generation module, configured to input the image sample and each target word in the description sentence in turn into a pre-trained first prediction model, and generate a first distribution probability set of the first distribution probabilities corresponding to each target word contained in the description sentence at each decoding time;
a second generation module, configured to acquire a second distribution probability set of the second distribution probabilities corresponding, at each decoding time, to the target words corresponding to the entity objects in the image sample;
a third generation module, configured to assign weight values respectively to the first distribution probability corresponding to the target word in the first distribution probability set and the second distribution probability corresponding to the target word in the second distribution probability set, so as to obtain a third distribution probability set containing the third distribution probability corresponding to the target word at each decoding time, and output a predicted sentence;
a first determination module, configured to determine a sentence entity object coverage loss function and a language sequence loss function respectively, based on the first distribution probability set, the second distribution probability set, and the weight values;
a second determination module, configured to determine a model loss function according to the sentence entity object coverage loss function and the language sequence loss function, and optimize the image description model according to the model loss function until the image description model is generated.
Optionally, the generating apparatus further includes:
an extraction module, configured to input the image sample into the pre-trained convolutional neural network model and extract the feature vector of the image sample;
an input module, configured to input the feature vector and the target word corresponding to the description sentence at the previous decoding time into the long short-term memory network model.
Optionally, the second generation module includes:
an input unit, configured to input the image sample into an object detector pre-trained on a preset image data set, and generate a prediction score set of the prediction scores that each entity object contained in the image sample corresponds to the target word;
a generation unit, configured to determine the target word as the correct semantic word for the entity object based on the prediction score set and the hidden state of the long short-term memory network model at the current decoding time, and generate a second distribution probability set of the second distribution probabilities of the target word at each decoding time.
Optionally, the third generation module is further configured to:
calculate the third distribution probability set containing the third distribution probability corresponding to the target word at each decoding time, based on a first weight value assigned to the first distribution probability of the target word and a second weight value assigned to the second distribution probability, where the sum of the first weight value and the second weight value is 1.
Optionally, the first determination module includes:
a first determination unit, configured to determine the sentence entity object coverage loss function based on the weighted sum, over the decoding times, of the second distribution probabilities corresponding to each target word in the second distribution probability set with the assigned weight values, together with the actual distribution probability of the target word in the description sentence;
a second determination unit, configured to determine the language sequence loss function based on the predicted sentence for the image sample output by the first prediction model and the description sentence.
In another embodiment of the present invention, an apparatus for acquiring an image description is provided. The apparatus includes:
an acquisition module, configured to acquire an image to be recognized;
a generation module, configured to input the image to be recognized into the image description model obtained by the steps of the above method for generating an image description model, so as to generate a description sentence for the image to be recognized.
In another embodiment of the present invention, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium stores instructions that, when executed by a processor, cause the processor to perform the steps of the above method for generating an image description model.
In another embodiment of the present invention, a terminal device is provided, including a processor configured to perform the steps of the above method for generating an image description model.
As can be seen from the above, based on the above embodiment, an image sample and a description sentence corresponding to the image sample are first acquired. Next, the image sample and each target word in the description sentence are input in turn into a pre-trained first prediction model, which generates a first distribution probability set of the first distribution probabilities corresponding to each target word contained in the description sentence at each decoding time. Meanwhile, a second distribution probability set of the second distribution probabilities corresponding, at each decoding time, to the target words corresponding to the entity objects in the image sample is acquired. Further, weight values are assigned respectively to the first distribution probability corresponding to the target word in the first distribution probability set and the second distribution probability corresponding to the target word in the second distribution probability set, so as to obtain a third distribution probability set containing the third distribution probability corresponding to the target word at each decoding time, and a predicted sentence is output. Then, a sentence entity object coverage loss function and a language sequence loss function are determined respectively, based on the first distribution probability set, the second distribution probability set, and the weight values. Finally, a model loss function is determined according to the sentence entity object coverage loss function and the language sequence loss function, and the image description model is optimized according to the model loss function until the image description model is generated. The embodiments of the present application improve the accuracy of image description by determining the model loss function and optimizing the image description model.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and should therefore not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a method for generating an image description model provided in Embodiment 10 of the present application;
FIG. 2 is a schematic diagram of generating the sentence entity object coverage loss function and the language sequence loss function in a method for generating an image description model provided in Embodiment 20 of the present application;
FIG. 3 is a schematic diagram of a specific flow of a method for generating an image description model provided in Embodiment 30 of the present application;
FIG. 4 is a schematic flowchart of a method for acquiring an image description provided in Embodiment 40 of the present application;
FIG. 5 is a schematic diagram of an apparatus for generating an image description model provided in Embodiment 50 of the present application;
FIG. 6 is a schematic diagram of an apparatus for acquiring an image description provided in Embodiment 60 of the present application;
FIG. 7 is a schematic diagram of a terminal device provided in Embodiment 70 of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
To address the problems in the prior art, an embodiment of the present application provides a method for generating an image description model by proposing an LSTM-based pointing mechanism for objects to be recognized, LSTM-P. First, the visual features of a given image are extracted by a CNN. Next, these visual features are input into an LSTM, which outputs the first distribution probabilities of all target words contained in the vocabulary. In addition, a pre-trained object detector recognizes the entity objects in the input image and obtains a prediction score for each entity object; a copying layer then takes the prediction scores of the entity objects and the hidden state of the LSTM at the current decoding time as input, and outputs the second distribution probabilities of taking the target words directly as descriptions of the entity objects. Finally, the first and second distribution probabilities output by the two models are input into the pointing mechanism, which dynamically assigns weights and outputs the final probability distribution of each target word; the corresponding sentence entity object coverage loss function and language sequence loss function are then calculated to determine the model loss function of the entire image description model and to optimize it. The LSTM-P-based image description model expands the range of target words that can be recognized, while improving the accuracy of the description sentences generated for images.
The present application applies mainly to the field of computer technology, in particular to computer vision and natural language processing. FIG. 1 is a schematic diagram of Embodiment 10 of a method for generating an image description model provided by an embodiment of the present application. The detailed steps are as follows:
S11: acquire an image sample and a description sentence corresponding to the image sample.
In this step, multiple image samples and their description sentences are acquired as pairs of input data. For the image sample shown in FIG. 2, the corresponding description sentence is "a large dog laying on a blanket on a couch". The description sentence corresponding to an image sample needs to be preset, and it must be a correct description of the content of the image sample.
S12: input the image sample and the description sentence into the pre-trained first prediction model, and generate a first distribution probability set of the first distribution probabilities corresponding to each target word contained in the description sentence at each decoding time.
In this step, the first prediction model is mainly composed of a CNN and an LSTM. The CNN model is mainly used to extract the feature vector of the image sample: the image sample is input into the CNN model, which outputs the feature vector. Specifically, the image sample is input into the CNN, using a pre-trained network such as VGG-16 or ResNet, and the feature vector is taken from the network layers preceding the softmax classifier that outputs class scores. The CNN, usually called the encoder, encodes the large amount of information contained in the image sample into a feature vector, which after processing serves as the initial input of the subsequent LSTM. The role of the LSTM is mainly to decode the feature vector and convert it into natural language. The image sample is input into the LSTM together with the corresponding description sentence. Specifically, the features extracted by the CNN and the target word of the description sentence at the previous decoding time are input into the LSTM, which generates the predicted sentence word by word in sentence order. The LSTM finally outputs, in the order of the words in the sentence, the first distribution probability set of the first distribution probabilities of the target words at each decoding time. FIG. 2 shows the first distribution probability set composed of the predicted first distribution probabilities obtained after the image sample and the corresponding description sentence (input sentence) are processed by the first prediction model comprising the CNN and the LSTM.
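Step S12 can be sketched as follows. This is a toy stand-in: random weights replace the trained VGG-16/ResNet encoder and the LSTM cell, the recurrence is simplified to a single tanh update, and all names and dimensions are illustrative assumptions. What it does show is the teacher-forcing pattern: at each decoding time the previous ground-truth word is fed in and a distribution over the whole vocabulary comes out:

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB = ["<bos>", "<eos>", "a", "dog", "couch", "blanket"]
IDX = {w: i for i, w in enumerate(VOCAB)}
FEAT, HID, V = 6, 6, len(VOCAB)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy weights standing in for a trained feature-to-state map, word
# embedding, recurrence, and output projection.
W_f = rng.normal(size=(HID, FEAT))
W_e = rng.normal(size=(HID, V))
W_h = rng.normal(size=(HID, HID))
W_o = rng.normal(size=(V, HID))

def first_distribution_set(features, description):
    # Teacher forcing: at decoding time t, the ground-truth word at t-1
    # is fed in and the cell emits a distribution over the vocabulary.
    h = np.tanh(W_f @ features)
    dists, prev = [], "<bos>"
    for word in description:
        x = np.zeros(V)
        x[IDX[prev]] = 1.0            # one-hot embedding of previous word
        h = np.tanh(W_h @ h + W_e @ x)
        dists.append(softmax(W_o @ h))
        prev = word
    return dists                      # the "first distribution probability set"

features = rng.normal(size=FEAT)      # stands in for the CNN feature vector
dists = first_distribution_set(features, ["a", "dog", "<eos>"])
```

Each element of `dists` is one first distribution probability, one per decoding time, matching the set described above.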
S13: acquire a second distribution probability set of the second distribution probabilities corresponding, at each decoding time, to the target words corresponding to the entity objects in the image sample.
In this step, the image sample is first input into a pre-trained object detector, which recognizes the entity objects that may be contained in the image sample. The object detector is trained in advance on an image data set containing a large amount of image data and the corresponding description words. The object detector recognizes each entity object in the image sample and obtains a prediction score set of the prediction scores that each entity object may correspond to a target word. In FIG. 2, the object detector is denoted Object Learner: after the image sample is input, it outputs the prediction scores that the entity objects contained in the image sample may be target words. When the image sample shown in FIG. 2 is input, the Object Learner outputs the predicted probability, i.e., the prediction score, that each entity object may be the related target word, such as dog: 1.00, couch: 0.21, bed: 0.13, blanket: 0.12.
Next, after the prediction score set for the entity objects is obtained, the hidden state of the LSTM at the current decoding step and the prediction score set are input together into the copying layer (Copying Layer). In the copying layer, the target word of each recognized entity object is determined to be the correct semantic word for that entity object in the input image sample, and the second distribution probability set consisting of the second distribution probability of the target word at each decoding step is generated.
S14: Assign weights to the first distribution probability of the target word in the first distribution probability set and to its second distribution probability in the second distribution probability set, respectively, to obtain a third distribution probability set containing the third distribution probability of the target word at each decoding step.
In this step, after the first distribution probability set generated by the first prediction model and the second distribution probability set generated by the copying layer are obtained, both sets are input into the created pointing mechanism model (Pointing Mechanism). The pointing mechanism dynamically assigns weights p_t and 1-p_t to the two probability distribution sets and computes their weighted sum, thereby determining the final third distribution probability set Pr_t, which contains the third distribution probability of each target word in the predicted sentence.
S15: Determine the sentence entity object coverage loss function and the language sequence loss function based on the first distribution probability set, the second distribution probability set, and the weight values.
In this step, the sentence finally generated by the first prediction model is composed of words generated one by one in the order of the target words in the description sentence. The language sequence loss function is therefore expressed through the first distribution probability of the generated sentence, which equals the sum of the log-probabilities of generating each word in the sentence. Minimizing the language sequence loss function preserves the contextual grammatical dependencies among the target words of the output sentence, making the generated sentence grammatically fluent.
In addition, to ensure the final semantic consistency of the predicted sentence, multi-label classification is performed on the target words. Specifically, within the second distribution probability set generated by the copying layer, the sum over decoding steps of the product of the second distribution probability and the assigned weight value is computed, yielding the emission probability of each target word. Finally, the sentence entity object coverage loss function is obtained from these emission probabilities. Minimizing this loss function ensures that each target word in the output sentence matches the correct meaning of the corresponding entity object in the image sample, improving the accuracy of the target words in the generated sentence.
S16: Determine the model loss function from the sentence entity object coverage loss function and the language sequence loss function, and optimize the image description model according to the model loss function until the image description model is generated.
In this step, after the semantic consistency loss function and the language sequence loss function are determined, optimization parameters are selected and the model loss function is determined. The LSTM-P image description model is then repeatedly revised according to the model loss function until the optimal image description model is generated.
Based on the above embodiment of the present application, an image sample and a description sentence corresponding to the image sample are first obtained. Next, the image sample and each target word of the description sentence are input in turn into the pre-trained first prediction model, which generates the first distribution probability set consisting of the first distribution probability, at each decoding step, of every target word contained in the description sentence. At the same time, the second distribution probability set consisting of the second distribution probability, at each decoding step, of the target word corresponding to each entity object in the image sample is obtained. Weight values are then assigned to the first distribution probability of the target word in the first set and to its second distribution probability in the second set, to obtain the third distribution probability set containing the third distribution probability of the target word at each decoding step, and the predicted sentence is output. Based on the first distribution probability set, the second distribution probability set, and the weight values, the sentence entity object coverage loss function and the language sequence loss function are determined. Finally, the model loss function is determined from these two loss functions, and the image description model is optimized according to the model loss function until the image description model is generated.
The embodiment of the present application designs an LSTM-P-based image description model to produce descriptions of images to be recognized. Through the pointing mechanism, the correct target words corresponding to the recognized entity objects are dynamically integrated into the output predicted sentence. The LSTM-P model first fully exploits the contextual relevance among the generated target words through a conventional CNN-RNN language model. At the same time, an object detector trained on a labeled image dataset recognizes the entity objects in the input image sample, and the copying layer copies words directly from the target words corresponding to those recognized objects. The pointing mechanism then dynamically assigns weight values between the two paths: generating the predicted sentence with the first prediction model (composed of the CNN and LSTM), or copying words directly from the recognized target words. The pointing mechanism can thus identify, in the current context, the best moment to copy a recognized object word into the corresponding description sentence. The LSTM-P image description model is trained globally by reducing the error rates of the sentence entity object coverage loss and the language sequence loss of the output predicted sentence. In addition, the lack of sentence-level coverage is used as feedback to further improve the coverage of the entity objects in the image sample by the output predicted sentence.
Figure 3 is a schematic diagram of the specific flow of a method for generating an image description model in Embodiment 30 of the present application. The detailed process is as follows:
S301: Obtain an image sample and a description sentence corresponding to the image sample.
S302: Input the image sample and each target word of the corresponding description sentence, in turn, into the pre-trained first prediction model.
Here, the first prediction model includes a convolutional neural network (CNN) and a long short-term memory network (LSTM). Specifically, the image sample is input into the pre-trained convolutional neural network model to extract the feature vector of the image sample. Then, the feature vector and the target word of the description sentence at the previous decoding step are input into the long short-term memory network model.
S303: Generate the first distribution probability set.
In this step, the first prediction model outputs the first distribution probability set consisting of the first distribution probability, at each decoding step, of every target word contained in the description sentence. For an image sample I, a sentence S is used to describe it, where S = {ω_1, ω_2, ..., ω_Ns} consists of N_s words. I ∈ R^Dv and w_t ∈ R^Dw denote the Dv-dimensional visual feature and the Dw-dimensional textual feature of the t-th word of sentence S, respectively. In the first prediction model stage, the first distribution probability of the target word ω_{t+1} is given by formula (1):
where h_t is the output state of the LSTM at decoding step t.
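The image of formula (1) did not survive extraction here. A common form consistent with the surrounding text is a softmax over the vocabulary of a linear projection of h_t, i.e. Pr(ω_{t+1}) = softmax(W·h_t); the sketch below assumes that form, with toy values for h_t and W rather than trained parameters.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def first_distribution(h_t, W):
    """Hypothetical form of formula (1): softmax of a linear map of h_t,
    one row of W per vocabulary word."""
    scores = [sum(w_i * h_i for w_i, h_i in zip(row, h_t)) for row in W]
    return softmax(scores)

h_t = [0.2, -0.5, 1.0]     # LSTM output state at decoding step t (toy values)
W = [[1.0, 0.0, 0.0],      # toy projection: identity, so scores equal h_t
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
probs = first_distribution(h_t, W)
assert abs(sum(probs) - 1.0) < 1e-9  # a valid probability distribution
```

With the identity projection above, the highest probability falls on the word whose score in h_t is largest.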
S304: Input the image sample into the object detector and generate a prediction score set.
Here, the image sample is input into an object detector pre-trained on a preset image dataset, generating a prediction score set in which each entity object contained in the image sample is scored for its correspondence to a target word.
S305: Generate the second distribution probability set from the prediction score set and the hidden state of the LSTM at the current decoding step.
In this step, the image sample is first input into an object detector pre-trained on a preset image dataset, generating a prediction score set in which each entity object contained in the image sample is scored for its correspondence to a target word. Then, based on the prediction score set and the hidden state of the long short-term memory network model at the current decoding step, the target word is determined to be the correct semantic word of the entity object, and the second distribution probability set consisting of the second distribution probability of the target word at each decoding step is generated. The second distribution probability is computed by formula (2):
where h_t is the output state of the LSTM at decoding step t, I_c is the output of the object detector, and the remaining terms are transformation matrices.
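Formula (2) and its transformation matrices are not legible in this extraction. One plausible sketch, offered purely as an illustration and not as the patent's actual formula, gates each detector score by the current LSTM state through a per-word transformation vector and then normalizes:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def copy_distribution(detector_scores, h_t, U):
    """Hypothetical sketch of formula (2): the copy score of each detected
    word combines its detector score (part of I_c) with the current LSTM
    state h_t via a per-word transformation vector, then is normalized."""
    raw = {}
    for word, score in detector_scores.items():
        gate = sigmoid(sum(u * h for u, h in zip(U[word], h_t)))
        raw[word] = score * gate
    total = sum(raw.values())
    return {word: s / total for word, s in raw.items()}

# Detector output from the Figure 2 example; h_t and U are toy values.
detector_scores = {"dog": 1.00, "couch": 0.21, "bed": 0.13, "blanket": 0.12}
h_t = [0.5, -0.2]
U = {"dog": [1.0, 0.0], "couch": [0.0, 1.0],
     "bed": [0.5, 0.5], "blanket": [-1.0, 0.0]}
pr2 = copy_distribution(detector_scores, h_t, U)
assert abs(sum(pr2.values()) - 1.0) < 1e-9
```

With these numbers, "dog" keeps the highest copy probability, mirroring the detector's confidence in the Figure 2 example.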
S306: The pointing mechanism dynamically assigns weight values to the first distribution probability set and the second distribution probability set.
Here, after the first distribution probability set and the second distribution probability set are input into the pointing mechanism, the pointing mechanism dynamically assigns the weight value. The specific calculation is given by formula (3):
p_t = σ(G_s w_t + G_h h_t + b_p)    (3)
where G_s and G_h denote the transformation matrix of the textual feature and the transformation matrix of the LSTM output, respectively, and b_p is the bias vector.
S307: Determine the third distribution probability set according to the assigned weight values.
Here, based on the first weight value assigned to the first distribution probability of the target word and the second weight value assigned to the second distribution probability, the third distribution probability set containing the third distribution probability of the target word at each decoding step is computed, where the sum of the first weight value and the second weight value is 1. The final distribution probability of the target word ω_{t+1} is obtained by weighting formulas (1) and (2) with the weight value p_t computed by formula (3):
where p_t and 1-p_t denote the first weight value and the second weight value, respectively, and φ denotes a normalization function such as softmax.
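Formulas (3) and (4) can be sketched numerically: a sigmoid gate p_t decides how much weight goes to the generation distribution versus the copy distribution, and the third distribution is their convex combination, which remains a valid probability distribution. The scalar features and parameters below are toy values standing in for the trained vectors and matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pointing_gate(w_t, h_t, G_s, G_h, b_p):
    """Formula (3): p_t = sigmoid(G_s w_t + G_h h_t + b_p), shown with 1-D toy
    features so the matrix products collapse to scalar products."""
    return sigmoid(G_s * w_t + G_h * h_t + b_p)

def third_distribution(pr1, pr2, p_t):
    """Formula (4): convex combination of the generation (pr1) and copy (pr2)
    distributions; a word missing from one source contributes 0 there."""
    vocab = set(pr1) | set(pr2)
    return {w: p_t * pr1.get(w, 0.0) + (1 - p_t) * pr2.get(w, 0.0) for w in vocab}

p_t = pointing_gate(w_t=0.3, h_t=-0.1, G_s=1.0, G_h=2.0, b_p=0.0)
pr1 = {"a": 0.6, "dog": 0.3, "on": 0.1}   # first distribution (LSTM)
pr2 = {"dog": 0.9, "couch": 0.1}           # second distribution (copying layer)
pr3 = third_distribution(pr1, pr2, p_t)
assert 0.0 < p_t < 1.0
assert abs(sum(pr3.values()) - 1.0) < 1e-9  # still sums to 1
```

Because p_t and 1-p_t sum to 1 and each input set sums to 1, the mixture needs no renormalization: this is the sense in which the pointing mechanism "balances" the two sources.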
S308: Determine the language sequence loss function.
In this step, the language sequence loss function is determined from the description sentence and the predicted sentence that the first prediction model outputs for the image sample. Since the first prediction model generates one target word at each decoding step, the chain rule is used to model the distribution probability over all words. The negative log-probability of the entire output predicted sentence equals the sum of the negative log first distribution probabilities of all target words, expressed by formula (5):
Minimizing the language sequence loss function of formula (5) makes the contextual dependencies among the target words of the output predicted sentence more accurate.
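The sum-of-negative-logs structure of formula (5) is easy to verify with toy per-word probabilities: by the chain rule it is identical to the negative log of the product of the word probabilities.

```python
import math

def language_sequence_loss(word_probs):
    """Formula (5): the negative log-probability of the whole sentence is the
    sum of the negative log first-distribution probabilities of its words."""
    return -sum(math.log(p) for p in word_probs)

# Probability assigned to each generated word at its decoding step (toy values).
probs = [0.9, 0.8, 0.7]
loss = language_sequence_loss(probs)
# Identical to -log of the product, by the chain rule:
assert abs(loss - (-math.log(0.9 * 0.8 * 0.7))) < 1e-9
```

Driving this loss toward zero pushes every per-step probability toward 1, i.e., toward the ground-truth word order of the description sentence.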
S309: Determine the sentence entity object coverage loss function.
Here, based on the second distribution probability of the target word given by formula (2) and the weight value determined by formula (3), the coverage of each target word contained in the output predicted sentence is further computed. The specific formula is as follows:
where ω_oi denotes the copied target word and σ is a nonlinear activation function such as the sigmoid function. As shown on the right side of Figure 2, at each decoding step the product of the second distribution probability of each target word and the weight value is obtained, and the sum of these products over the sentence, i.e., the weighted sum, determines the coverage of each target word. The sentence entity object coverage loss function is then determined according to the following formula:
Formula (7) is the final sentence entity object coverage loss function. Minimizing it brings the description sentence closer to the true meaning of the image sample.
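The images of formulas (6) and (7) are not legible here, so the sketch below only follows the prose: per-word coverage is a sigmoid of the weighted sum, over decoding steps, of the copy probability and the copy weight, and the loss is written as a multi-label cross-entropy over the detected object words. The cross-entropy form is an assumption, one standard choice for multi-label classification, not the patent's confirmed formula.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_coverage(copy_probs, copy_weights):
    """Sketch of formula (6): coverage of one target word is a sigmoid of the
    weighted sum, over decoding steps, of its second distribution probability
    and the copy weight (1 - p_t)."""
    return sigmoid(sum(w * p for w, p in zip(copy_weights, copy_probs)))

def coverage_loss(coverages, labels):
    """Sketch of formula (7) as multi-label cross-entropy: label 1 means the
    object word should appear in the description sentence (assumed form)."""
    eps = 1e-12  # guard against log(0)
    return -sum(y * math.log(c + eps) + (1 - y) * math.log(1 - c + eps)
                for c, y in zip(coverages, labels))

# Two object words over three decoding steps (toy numbers).
cov_dog = word_coverage(copy_probs=[0.7, 0.8, 0.9], copy_weights=[0.5, 0.4, 0.6])
cov_bed = word_coverage(copy_probs=[0.05, 0.02, 0.01], copy_weights=[0.5, 0.4, 0.6])
loss = coverage_loss([cov_dog, cov_bed], labels=[1, 0])
assert 0.0 < cov_dog < 1.0 and loss > 0.0
```

Minimizing this loss pushes the coverage of object words that belong in the sentence toward 1 and of absent ones toward 0, matching the stated goal of formula (7).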
S310: Determine the model loss function.
Here, the model loss function is determined from the language sequence loss function of formula (5) and the sentence entity object coverage loss function of formula (7). The model loss function is expressed as follows:
where λ is a trade-off parameter. The purpose of the model loss function is to optimize the image description model so that the description sentence finally predicted by the model is linguistically coherent and accurately describes the input image to be predicted.
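Since formula (8) itself is not legible in this extraction, the sketch below assumes the usual weighted-sum form implied by the trade-off parameter λ: the language sequence loss plus λ times the coverage loss.

```python
def model_loss(language_sequence_loss, coverage_loss, lam):
    """Assumed form of formula (8): total loss is the language sequence loss
    plus the coverage loss scaled by the trade-off parameter lambda."""
    return language_sequence_loss + lam * coverage_loss

# Toy values: lambda balances fluency (sequence loss) against object coverage.
total = model_loss(language_sequence_loss=2.0, coverage_loss=0.5, lam=0.4)
```

A larger λ pushes training toward covering the detected objects; a smaller λ favors grammatical fluency, which is exactly the trade-off the text describes.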
S311: Optimize the image description model until the model is generated.
The embodiment of the present application implements a method for generating an image description model based on the above steps. Image description is realized by constructing an image description model based on the LSTM-P structure. At the same time, a configuration method for vocabulary expansion and a training method for the hybrid network of the image description model are implemented. The object detector is pre-trained on available, known image datasets; a pointing mechanism is then constructed to balance the first distribution probability of the word recognition results generated by the first prediction model (composed of the CNN and LSTM) against copying the extracted target words directly from the data recognized by the object detector. In addition, the coverage of the target words in the predicted sentence is further exploited, improving the accuracy of the image description. Experiments on the COCO image captioning dataset and the ImageNet image recognition dataset fully validate the model and the analysis results.
In addition, Figure 4 shows a method for obtaining an image description provided by Embodiment 40 of the present application: an image to be recognized is input into the image description model produced by the steps of the above generation method, to generate a description sentence of the image.
S41: Obtain the image to be recognized.
S42: Input the image to be recognized into the image description model to generate a description sentence of the image to be recognized.
Based on the same inventive concept, Embodiment 50 of the present application further provides an apparatus for generating an image description model. As shown in Figure 5, the apparatus includes:
an acquisition module 51, configured to obtain an image sample and a description sentence corresponding to the image sample;
a first generation module 52, configured to input the image sample and each target word of the description sentence, in turn, into the pre-trained first prediction model, and to generate the first distribution probability set consisting of the first distribution probability, at each decoding step, of every target word contained in the description sentence;
a second generation module 53, configured to obtain the second distribution probability set consisting of the second distribution probability, at each decoding step, of the target word corresponding to each entity object in the image sample;
a third generation module 54, configured to assign weight values to the first distribution probability of the target word in the first distribution probability set and to its second distribution probability in the second distribution probability set, respectively, to obtain the third distribution probability set containing the third distribution probability of the target word at each decoding step, and to output the predicted sentence;
a first determination module 55, configured to determine the sentence entity object coverage loss function and the language sequence loss function based on the first distribution probability set, the second distribution probability set, and the weight values;
a second determination module 56, configured to determine the model loss function from the sentence entity object coverage loss function and the language sequence loss function, and to optimize the image description model according to the model loss function until the image description model is generated.
In this embodiment, for the specific functions and interactions of the acquisition module 51, the first generation module 52, the second generation module 53, the third generation module 54, the first determination module 55, and the second determination module 56, refer to the description of the embodiment corresponding to Figure 1, which is not repeated here.
Optionally, the generating apparatus further includes:
an extraction module 57, configured to input the image sample into the pre-trained convolutional neural network model and extract the feature vector of the image sample;
an input module 58, configured to input the feature vector and the target word of the description sentence at the previous decoding step into the long short-term memory network model.
Optionally, the second generation module 53 includes:
an input unit, configured to input the image sample into an object detector pre-trained on a preset image dataset and to generate a prediction score set in which each entity object contained in the image sample is scored for its correspondence to a target word;
a generation unit, configured to determine the target word as the correct semantic word of the entity object based on the prediction score set and the hidden state of the long short-term memory network model at the current decoding step, and to generate the second distribution probability set consisting of the second distribution probability of the target word at each decoding step.
Optionally, the third generation module 54 is further configured to:
compute, based on the first weight value assigned to the first distribution probability of the target word and the second weight value assigned to the second distribution probability, the third distribution probability set containing the third distribution probability of the target word at each decoding step, where the sum of the first weight value and the second weight value is 1.
Optionally, the first determination module 55 includes:
a first determination unit, configured to determine the sentence entity object coverage loss function based on the second distribution probability of each target word at each decoding step in the second distribution probability set combined with the assigned weight value, together with the actual distribution probability of the target word in the description sentence;
a second determination unit, configured to determine the language sequence loss function based on the description sentence and the predicted sentence that the first prediction model outputs for the image sample.
Based on the same inventive concept, Embodiment 60 of the present application further provides an apparatus for obtaining an image description. As shown in Figure 6, the apparatus includes:
an acquisition module 61, configured to obtain the image to be recognized;
a generation module 62, configured to input the image to be recognized into the image description model to generate a description sentence of the image to be recognized.
In this embodiment, for the specific functions and interactions of the acquisition module 61 and the generation module 62, refer to the description of the embodiment corresponding to Figure 4, which is not repeated here.
As shown in Figure 7, a further embodiment 70 of the present application provides a terminal device including a processor 70, where the processor 70 is configured to execute the steps of the method described above.
As can also be seen from Figure 7, the terminal device provided by the above embodiment further includes a non-transitory computer-readable storage medium 71 storing a computer program that, when run by the processor 70, executes the steps of the above method for generating an image description model.
Specifically, the storage medium can be a general-purpose storage medium, such as a removable disk, a hard disk, or flash memory. When the computer program on the storage medium is run, the above method for generating an image description model can be executed.
Finally, it should be noted that the above embodiments are only specific implementations of the present application, intended to illustrate rather than limit its technical solutions, and the scope of protection of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person skilled in the art may still, within the technical scope disclosed herein, modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of their technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.
Claims (14)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910295123.5A CN111813978B (en) | 2019-04-12 | 2019-04-12 | Image description model generation method, generation device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111813978A true CN111813978A (en) | 2020-10-23 |
CN111813978B CN111813978B (en) | 2025-01-17 |
Family
ID=72843925
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910295123.5A Active CN111813978B (en) | 2019-04-12 | 2019-04-12 | Image description model generation method, generation device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111813978B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114663650A (en) * | 2022-03-22 | 2022-06-24 | 平安科技(深圳)有限公司 | Image description generation method and device, electronic equipment and readable storage medium |
CN115049877A (en) * | 2022-06-15 | 2022-09-13 | 中国工商银行股份有限公司 | Method and system for generating image description information, electronic device and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016077797A1 (en) * | 2014-11-14 | 2016-05-19 | Google Inc. | Generating natural language descriptions of images |
CN106650756A (en) * | 2016-12-28 | 2017-05-10 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Image text description method based on knowledge transfer multi-modal recurrent neural network |
CN107578062A (en) * | 2017-08-19 | 2018-01-12 | 四川大学 | A Image Caption Method Based on Attribute Probability Vector Guided Attention Patterns |
CN109271521A (en) * | 2018-11-16 | 2019-01-25 | 北京九狐时代智能科技有限公司 | A kind of file classification method and device |
Non-Patent Citations (1)
Title |
---|
YEHAO LI等: "Pointing Novel Objects in Image Captioning", 《ARXIV:1904.11251V1》, 25 April 2019 (2019-04-25) * |
Also Published As
Publication number | Publication date |
---|---|
CN111813978B (en) | 2025-01-17 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |