CN110046656A

CN110046656A - Multi-modal scene recognition method based on deep learning

Info

Publication number: CN110046656A
Application number: CN201910242039.7A
Authority: CN
Inventors: 吴家皋; 刘源; 孙璨; 郑剑刚
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University
Priority date: 2019-03-28
Filing date: 2019-03-28
Publication date: 2019-07-23
Anticipated expiration: 2039-03-28
Also published as: CN110046656B

Abstract

The invention discloses a multi-modal scene recognition method based on deep learning, which includes the following steps: S1, performing word segmentation processing on short text; S2, inputting a group of pictures and short text word segmentation and corresponding labels into respective convolutional neural networks Train in the network; S3, train the short text classification model; S4, train the image classification model; S5, calculate the cross entropy of the fully connected layer output in S3 and S4 and the standard classification result respectively, calculate the average Euclidean distance and use this as the loss value, and then feed it back to the respective convolutional neural networks, and finally obtain a complete multimodal scene recognition model; S6, add the text and image prediction result vectors to obtain the final classification result; S7, add the short text to be recognized and images are respectively input into the trained multi-modal scene recognition model for scene recognition. The present invention proposes a multi-modal scene search method, which provides users with more accurate and convenient scene recognition.

Description

Multimodal scene recognition method based on deep learning

技术领域technical field

本发明涉及一种多模态场景识别方法，具体涉及一种基于深度学习的多模态场景识别方法，属于人工智能、模式识别领域。The invention relates to a multimodal scene recognition method, in particular to a multimodal scene recognition method based on deep learning, and belongs to the fields of artificial intelligence and pattern recognition.

背景技术Background technique

深度学习是机器学习的一个崭新的领域，其目的是让机器学习更加接近人类智能，卷积神经网络是深度学习的代表性算法，具有结构简单、适应性强、训练参数少而连接多等特点，因此，多年来这一网络被广泛地应用在图像处理和模式识别等领域。Deep learning is a new field of machine learning. Its purpose is to make machine learning closer to human intelligence. Convolutional neural network is a representative algorithm of deep learning. It has the characteristics of simple structure, strong adaptability, few training parameters and many connections. Therefore, this network has been widely used in image processing and pattern recognition for many years.

具体而言，卷积神经网络是一种层次模型，其输入是原始数据，通过卷积操作、池化操作、非线性激活函数等一系列操作的层层叠叠，将高层语意信息逐层从原始数据输入层中抽取出来、并逐层抽象。这一过程被称为“前馈运算”。最终，卷积神经网络最后一层输出目标函数，通过设计损失函数，计算预测值和真实值之间的误差损失，再通过反向传播算法，将误差由最后一层逐层向前反馈，更新每层参数，并在更新参数后再次前馈。如此往复，直到网络模型收敛，从而达到模型训练的目的。Specifically, a convolutional neural network is a hierarchical model whose input is raw data. Through a series of operations such as convolution operations, pooling operations, and nonlinear activation functions, the high-level semantic information is transformed from the original data layer by layer. The data input layer is extracted and abstracted layer by layer. This process is called "feedforward operation". Finally, the last layer of the convolutional neural network outputs the objective function. By designing the loss function, the error loss between the predicted value and the real value is calculated, and then the error is fed back layer by layer from the last layer through the back propagation algorithm to update parameters at each layer and feed forward again after updating the parameters. Repeat this until the network model converges, so as to achieve the purpose of model training.

目前常用的模态融合方式主要包括决策融合和特征融合两种方式。At present, the commonly used modal fusion methods mainly include decision fusion and feature fusion.

决策融合是指在获得两个模态分类结果的基础上，对两类结果进行加权综合，得出最终结果。Meng-Ju Han等在研究中提出了一种决策融合策略，这一策略将训练样本与决策平面的平均欧氏距离归一化后作为融合的权重，取得了比单模态高约5％的识别率。决策融合的方法虽然处理过程比较简单，但是其所获得的结果不够客观。Decision fusion refers to the weighted synthesis of the two types of results on the basis of obtaining the two modal classification results to obtain the final result. Meng-Ju Han et al. proposed a decision fusion strategy in their research, which normalized the average Euclidean distance between the training samples and the decision plane as the fusion weight, and achieved about 5% higher than single-modality. Recognition rate. Although the process of decision fusion is relatively simple, the results obtained are not objective enough.

特征融合则是指在将从两个模态提取出来的特征进行融合后再次进行分类。S.Emerich等在研究中对提取的面部表情特征和语音特征进行了特征的融合，融合后的特征识别率和鲁棒性较单模态均有提升。特征融合的方法所得出的结果比较客观，但事其实现方式则过于复杂。Feature fusion refers to classifying again after the features extracted from the two modalities are fused. In their research, S.Emerich et al. performed feature fusion on the extracted facial expression features and speech features, and the fusion feature recognition rate and robustness were both improved compared to single-modality features. The results obtained by the method of feature fusion are relatively objective, but the implementation method is too complicated.

综上所述，如何在现有技术的基础上提出一种全新的多模态场景识别方法，尽可能地保留决策融合和特征融合两种方式各自的优点、克服其各自的不足，也就成为了本领域内技术人员亟待解决的问题。To sum up, how to propose a new multi-modal scene recognition method on the basis of the existing technology, keep the respective advantages of decision fusion and feature fusion as much as possible, and overcome their respective shortcomings, which becomes the It solves the problem to be solved urgently by those skilled in the art.

发明内容SUMMARY OF THE INVENTION

鉴于现有技术存在上述缺陷，本发明的目的是提出一种基于深度学习的多模态场景识别方法，包括如下步骤：In view of the above-mentioned defects in the prior art, the purpose of the present invention is to propose a multi-modal scene recognition method based on deep learning, comprising the following steps:

S1、对短文本进行分词处理；S1. Perform word segmentation on short texts;

S2、将一组图片和短文本分词及相应的标签输入各自的卷积神经网络中进行训练；S2. Input a set of pictures and short text word segmentation and corresponding labels into their respective convolutional neural networks for training;

S3、训练短文本分类模型；S3. Train a short text classification model;

S4、训练图片分类模型；S4, train the image classification model;

S5、将S3与S4中的全连接层输出分别与标准分类结果计算交叉熵，计算平均欧式距离并以此作为损失值，随后再反馈回各自的卷积神经网络，反复进行训练，直至模型收敛，最终得到完整的多模态场景识别模型；S5. Calculate the cross-entropy of the fully connected layer output in S3 and S4 and the standard classification result respectively, calculate the average Euclidean distance and use this as the loss value, and then feed back to the respective convolutional neural network, and repeat the training until the model converges , and finally obtain a complete multimodal scene recognition model;

S6、将经过训练的文本和图像预测结果向量相加，得到最终的分类结果；S6. Add the trained text and image prediction result vectors to obtain the final classification result;

S7、将待识别的短文本和图像分别输入所训练出的所述多模态场景识别模型，进行场景识别。S7. Input the short text and image to be recognized into the trained multimodal scene recognition model, respectively, to perform scene recognition.

优选地，S1具体包括如下步骤：使用结巴分词工具对短文本进行分词处理。Preferably, S1 specifically includes the following steps: using a stuttering word segmentation tool to perform word segmentation processing on the short text.

优选地，S3具体包括如下步骤：Preferably, S3 specifically includes the following steps:

S31、将输入的短文本分词结果量化，输入三个并行卷积层中；S31, quantify the word segmentation result of the input short text, and input it into three parallel convolutional layers;

S32、将所述三个并行卷积层的输出依次送入线性修正单元层和池化层中，得到多个池化输出结果；S32, sending the outputs of the three parallel convolutional layers to the linear correction unit layer and the pooling layer in turn to obtain multiple pooled output results;

S33、将多个所述池化输出结果连接在一起，经过随机丢弃，作为全连接层的输入，最后计算全连接层，得到文本分类预测结果向量输出。S33. Connect a plurality of the pooled output results together, discard them at random, and use them as the input of the fully connected layer, and finally calculate the fully connected layer to obtain a text classification prediction result vector output.

优选地，所述三个并行卷积层包括第一卷积层、第二卷积层以及第三卷积层，所述第一卷积层具有384个3*128大小的卷积核，所述第二卷积层具有256个4*128大小的卷积核，所述第三卷积层具有128个5*128大小的卷积核。Preferably, the three parallel convolutional layers include a first convolutional layer, a second convolutional layer and a third convolutional layer, and the first convolutional layer has 384 convolution kernels with a size of 3*128, so The second convolution layer has 256 convolution kernels of size 4*128, and the third convolution layer has 128 convolution kernels of size 5*128.

优选地，S4具体包括如下步骤：Preferably, S4 specifically includes the following steps:

S41、将输入的图片送入第一层卷积网络，通过设计的卷积核个数提取图片中相应的特征个数，输出卷积层结果；S41. Send the input image to the first-layer convolutional network, extract the corresponding number of features in the image through the number of designed convolution kernels, and output the result of the convolution layer;

S42、将卷积层的输出进行池化，通过卷积核压缩数据核参数的量，减少过拟合，再将池化结果输入下一层卷积，反复经过4次卷积池化，使卷积核内的权值初始化为随机值，并不断训练获得模型参数；S42. Pool the output of the convolution layer, compress the amount of data kernel parameters through the convolution kernel, reduce overfitting, and then input the pooling result into the next layer of convolution, and repeat the convolution pooling for 4 times to make The weights in the convolution kernel are initialized to random values, and the model parameters are obtained through continuous training;

S43、将最后一层池化结果输入全连接层，经过随机丢弃，计算得到图像分类预测结果向量输出。S43: Input the pooling result of the last layer into the fully connected layer, and after random discarding, the image classification prediction result vector output is obtained by calculation.

优选地，S5中所述计算平均欧式距离并以此作为损失值，具体包括如下步骤：使用损失函数S计算损失值，所述损失函数S的计算公式为Preferably, calculating the average Euclidean distance as described in S5 and using it as the loss value specifically includes the following steps: using the loss function S to calculate the loss value, and the calculation formula of the loss function S is:

其中，h₁＝H(p₁,q₁),h₂＝H(p₂,q₂),h₃＝H(p₁,p₂)，p₁为S3中输出的文本分类预测结果向量，p₂为S4中输出的图像分类预测结果向量，q₁为文本分类标准结果向量，q₂为图像分类标准结果向量，H(·)为交叉熵函数。Among them, h ₁ =H(p ₁ ,q ₁ ),h ₂ =H(p ₂ ,q ₂ ),h ₃ =H(p ₁ ,p ₂ ), p ₁ is the text classification prediction result vector output in S3 , p ₂ is the image classification prediction result vector output in S4, q ₁ is the text classification standard result vector, q ₂ is the image classification standard result vector, and H( ) is the cross entropy function.

优选地，S6具体包括如下步骤：使用Softmax函数将训练好的文本和图像预测结果向量相加，得到最终的分类结果。Preferably, S6 specifically includes the following steps: using the Softmax function to add the trained text and image prediction result vectors to obtain the final classification result.

与现有技术相比，本发明的优点主要体现在以下几个方面：Compared with the prior art, the advantages of the present invention are mainly reflected in the following aspects:

本发明所提供的基于深度学习的多模态场景识别方法，提出了一种全新的多模态场景搜索方式，为用户提供了更加精准、方便的场景识别手段。本发明的方法全面提取了文字与图像的特征，并设计了新的损失函数，利用多种模态的信息，提高了场景识别的准确率。The deep learning-based multi-modal scene recognition method provided by the present invention proposes a brand-new multi-modal scene search method, and provides users with a more accurate and convenient scene recognition method. The method of the invention comprehensively extracts the features of text and images, designs a new loss function, and utilizes the information of various modalities to improve the accuracy of scene recognition.

本发明也为同领域内的其他相关问题提供了参考，可以以此为依据进行拓展延伸，运用于其他与场景识别方法相关的技术方案中，具有十分广阔的应用前景。The present invention also provides reference for other related problems in the same field, and can be expanded and extended based on this, and has a very broad application prospect when applied to other technical solutions related to the scene recognition method.

以下便结合实施例附图，对本发明的具体实施方式作进一步的详述，以使本发明技术方案更易于理解、掌握。The specific embodiments of the present invention will be further described in detail below in conjunction with the accompanying drawings of the embodiments, so as to make the technical solutions of the present invention easier to understand and grasp.

附图说明Description of drawings

图1为本发明所构建的多模态场景识别模型的结构示意图。FIG. 1 is a schematic structural diagram of a multimodal scene recognition model constructed by the present invention.

具体实施方式Detailed ways

本发明针对现有场景识别方法结果不准确、复杂度高等问题提供了一种新的基于深度学习的多模态场景识别方法，将输入的多模态信息，利用卷积神经网络分别提取图像和文本模态的特征信息，并将多模态特征信息进行融合，提高场景识别的准确率。Aiming at the problems of inaccurate results and high complexity of the existing scene recognition methods, the present invention provides a new multi-modal scene recognition method based on deep learning. feature information of text modalities, and fuse multi-modal feature information to improve the accuracy of scene recognition.

进一步而言，本发明的基于深度学习的多模态场景识别方法，包括如下步骤。Further, the deep learning-based multimodal scene recognition method of the present invention includes the following steps.

S1、使用结巴分词工具对短文本进行分词处理。S1. Use the stuttering word segmentation tool to segment the short text.

S2、将一组图片和短文本分词及相应的标签输入各自的卷积神经网络中进行训练。S2. Input a set of pictures and short text word segmentation and corresponding labels into the respective convolutional neural networks for training.

S3、训练短文本分类模型。具体包括如下步骤：S3. Train a short text classification model. Specifically include the following steps:

S31、在短文本分类模型的训练过程中，将输入的短文本分词结果量化，输入三个并行卷积层中。S31. In the training process of the short text classification model, quantify the input short text word segmentation result and input it into three parallel convolutional layers.

所述三个并行卷积层包括第一卷积层、第二卷积层以及第三卷积层，所述第一卷积层具有384个3*128大小的卷积核，所述第二卷积层具有256个4*128大小的卷积核，所述第三卷积层具有128个5*128大小的卷积核。The three parallel convolutional layers include a first convolutional layer, a second convolutional layer, and a third convolutional layer, the first convolutional layer has 384 convolution kernels of size 3*128, the second convolutional layer The convolution layer has 256 convolution kernels of size 4*128, and the third convolution layer has 128 convolution kernels of size 5*128.

S32、将所述三个并行卷积层的输出依次送入线性修正单元(relu)层和池化层中，得到多个池化输出结果。S32. Send the outputs of the three parallel convolutional layers to the linear correction unit (relu) layer and the pooling layer in turn to obtain multiple pooled output results.

S4、训练图片分类模型。具体包括如下步骤：S4. Train the image classification model. Specifically include the following steps:

S41、将输入的图片送入第一层卷积网络，通过设计的卷积核个数提取图片中相应的特征个数，输出卷积层结果。S41. Send the input image to the first-layer convolutional network, extract the corresponding number of features in the image through the designed number of convolution kernels, and output the result of the convolution layer.

S42、将卷积层的输出进行池化，通过卷积核压缩数据核参数的量，减少过拟合，再将池化结果输入下一层卷积，反复经过4次卷积池化，使卷积核内的权值初始化为随机值，并不断训练获得适用于本发明方法所使用的模型参数。S42. Pool the output of the convolution layer, compress the amount of data kernel parameters through the convolution kernel, reduce overfitting, and then input the pooling result into the next layer of convolution, and repeat the convolution pooling for 4 times to make The weights in the convolution kernel are initialized to random values, and are continuously trained to obtain model parameters suitable for the method of the present invention.

S5、将S3与S4中的全连接层输出分别与标准分类结果计算交叉熵，计算平均欧式距离并以此作为损失值，随后再反馈回各自的卷积神经网络，反复进行训练，直至模型收敛，最终得到完整的多模态场景识别模型。模型结构如图1所示。S5. Calculate the cross-entropy of the fully connected layer output in S3 and S4 and the standard classification result respectively, calculate the average Euclidean distance and use this as the loss value, and then feed back to the respective convolutional neural network, and repeat the training until the model converges , and finally a complete multimodal scene recognition model is obtained. The model structure is shown in Figure 1.

所述计算平均欧式距离并以此作为损失值，具体包括如下步骤：使用损失函数S计算损失值，所述损失函数S的计算公式为The calculating the average Euclidean distance as the loss value specifically includes the following steps: using the loss function S to calculate the loss value, and the calculation formula of the loss function S is:

S6、使用Softmax函数将训练好的文本和图像预测结果向量相加，得到最终的分类结果。S6. Use the Softmax function to add the trained text and image prediction result vectors to obtain the final classification result.

总体而言，本发明将图像卷积神经网络与短文本卷积神经网络相融合，采用一种新的决策融合方式，先经过训练获得两类分类结果再与标准结果之间计算交叉熵，然后再计算两种模态之间训练所得的分类结果之间的交叉熵，最后计算三者的平均欧式距离作为损失值返回到前馈网络更新参数，相比于现有技术具有更高的识别率。In general, the present invention integrates the image convolutional neural network and the short text convolutional neural network, adopts a new decision fusion method, first obtains two types of classification results through training, and then calculates the cross-entropy between the standard results and the results. Then calculate the cross entropy between the classification results obtained from the training between the two modalities, and finally calculate the average Euclidean distance of the three as the loss value to return to the feedforward network to update the parameters, which has a higher recognition rate than the prior art. .

本发明也为同领域内的其他相关问题提供了参考，可以以此为依据进行拓展延伸，运用于其他与对情绪识别分析方法相关的技术方案中，具有十分广阔的应用前景。The present invention also provides reference for other related problems in the same field, and can be expanded and extended on the basis of this, and has a very broad application prospect when applied to other technical solutions related to the emotion recognition and analysis method.

对于本领域技术人员而言，显然本发明不限于上述示范性实施例的细节，而且在不背离本发明的精神和基本特征的情况下，能够以其他的具体形式实现本发明。因此，无论从哪一点来看，均应将实施例看作是示范性的，而且是非限制性的，本发明的范围由所附权利要求而不是上述说明限定，因此旨在将落在权利要求的等同要件的含义和范围内的所有变化囊括在本发明内，不应将权利要求中的任何附图标记视为限制所涉及的权利要求。It will be apparent to those skilled in the art that the present invention is not limited to the details of the above-described exemplary embodiments, but that the present invention may be embodied in other specific forms without departing from the spirit and essential characteristics of the present invention. Therefore, the embodiments are to be regarded in all respects as illustrative and not restrictive, and the scope of the invention is to be defined by the appended claims rather than the foregoing description, which are therefore intended to fall within the scope of the claims. All changes that come within the meaning and range of equivalents of , are intended to be embraced within the invention, and any reference signs in the claims shall not be construed as limiting the involved claim.

此外，应当理解，虽然本说明书按照实施方式加以描述，但并非每个实施方式仅包含一个独立的技术方案，说明书的这种叙述方式仅仅是为清楚起见，本领域技术人员应当将说明书作为一个整体，各实施例中的技术方案也可以经适当组合，形成本领域技术人员可以理解的其他实施方式。In addition, it should be understood that although this specification is described in terms of embodiments, not each embodiment only includes an independent technical solution, and this description in the specification is only for the sake of clarity, and those skilled in the art should take the specification as a whole , the technical solutions in each embodiment can also be appropriately combined to form other implementations that can be understood by those skilled in the art.

Claims

1. a kind of multi-modal scene recognition method based on deep learning, which comprises the steps of:

S1, word segmentation processing is carried out to short text；

S2, one group of picture and short text participle and corresponding label are inputted in respective convolutional neural networks and is trained；

S3, training short text disaggregated model；

S4, training picture classification model；

S5, the full articulamentum output in S3 and S4 is calculated into cross entropy with criteria classification result respectively, calculates average Euclidean distance And in this, as penalty values, respective convolutional neural networks are then fed back to again, training is repeated, until model is restrained, most Complete multi-modal scene Recognition model is obtained eventually；

S6, trained text is added with image prediction result vector, obtains final classification results；

S7, short text to be identified and image are inputted to the multi-modal scene Recognition model trained respectively, carry out field Scape identification.

2. the multi-modal scene recognition method according to claim 1 based on deep learning, which is characterized in that S1 is specifically wrapped It includes following steps: word segmentation processing being carried out to short text using stammerer participle tool.

3. the multi-modal scene recognition method according to claim 1 based on deep learning, which is characterized in that S3 is specifically wrapped Include following steps:

S31, the short text word segmentation result of input is quantified, is inputted in three parallel-convolution layers；

S32, the output of three parallel-convolution layers is sequentially sent in linear amending unit layer and pond layer, obtains multiple ponds Change output result；

S33, multiple pondization output results are linked together, by random drop, as the input of full articulamentum, finally Full articulamentum is calculated, the output of text classification prediction result vector is obtained.

4. the multi-modal scene recognition method according to claim 3 based on deep learning, it is characterised in that: described three Parallel-convolution layer includes the first convolutional layer, the second convolutional layer and third convolutional layer, and first convolutional layer has 384 3* The convolution kernel of 128 sizes, second convolutional layer have the convolution kernel of 256 4*128 sizes, and the third convolutional layer has The convolution kernel of 128 5*128 sizes.

5. the multi-modal scene recognition method according to claim 3 based on deep learning, which is characterized in that S4 is specifically wrapped Include following steps:

S41, the picture of input is sent into first layer convolutional network, is extracted by the convolution kernel number of design corresponding special in picture Number is levied, convolutional layer result is exported；

S42, the output of convolutional layer is subjected to pond, by the amount of convolution kernel compressed data nuclear parameter, reduces over-fitting, then by pond Change result and input next layer of convolution, passes through 4 convolution ponds repeatedly, make the weight initialization random value in convolution kernel, not Disconnected training obtains model parameter；

S43, image classification prediction result is calculated by random drop in the full articulamentum of the last layer pond result input Vector output.

6. the multi-modal scene recognition method according to claim 5 based on deep learning, which is characterized in that described in S5 Calculate average Euclidean distance and in this, as penalty values, specifically comprise the following steps: using loss function S calculate penalty values, institute The calculation formula for stating loss function S is

Wherein, h₁=H (p₁,q₁),h₂=H (p₂,q₂),h₃=H (p₁,p₂), p₁For the text classification prediction result that is exported in S3 to Amount, p₂For the image classification prediction result vector exported in S4, q₁For text classification standard results vector, q₂For image classification mark Quasi- result vector, H () are to intersect entropy function.

7. the multi-modal scene recognition method according to claim 1 based on deep learning, which is characterized in that S6 is specifically wrapped It includes following steps: trained text being added with image prediction result vector using Softmax function, obtains final classification As a result.