CN113610080B - Cross-modal perception-based sensitive image identification method, device, equipment and medium - Google Patents


Info

Publication number
CN113610080B
CN113610080B
Authority
CN
China
Prior art keywords
image
cross
modal
sensitive
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110892160.1A
Other languages
Chinese (zh)
Other versions
CN113610080A (en)
Inventor
吴旭
吴京宸
高丽
颉夏青
杨金翠
孙利娟
张熙
方滨兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202110892160.1A priority Critical patent/CN113610080B/en
Publication of CN113610080A publication Critical patent/CN113610080A/en
Application granted granted Critical
Publication of CN113610080B publication Critical patent/CN113610080B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The application discloses a sensitive image identification method, device, equipment and medium based on cross-modal sensing, wherein the method comprises the following steps: acquiring image information to be identified in a network community; inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain a cross-modal text description of the image information; and inputting the cross-modal text description of the image information into a sensitive information recognition module in the sensitive image recognition model to obtain a sensitive image containing sensitive information. According to the sensitive image recognition method provided by the embodiments of the present disclosure, the semantic content of network community images is expressed cross-modally, a large amount of prior knowledge about sensitive text content in the network community is fused, the content of community images is analyzed and judged more accurately, and acquiring the cross-modal text description of an image makes tracing the propagation of sensitive image information possible.

Description

Cross-modal perception-based sensitive image identification method, device, equipment and medium
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a method, apparatus, device, and medium for recognizing a sensitive image based on cross-modal sensing.
Background
With the development of the multimedia era, the image-oriented network community environment has become an important characteristic of how network information is transmitted and develops. Paying attention to this image-oriented community environment, accurately judging and identifying image information, and intervening in sensitive content in a timely and necessary manner are beneficial to maintaining the stability of the network community environment and social security.
In the prior art, the information carried by an image is expressed in an intuitive visual form; its semantic content cannot be acquired directly and must be read and understood by the human brain. With the rapid development of image processing technology in recent years, image content recognition has also improved to a certain extent. At present, sensitive image content identification and analysis at home and abroad basically adopt image classification technology. Sensitive image recognition based on image classification extracts image features through a neural network, takes the feature vectors as input to a fully connected layer at the end of the network, and obtains the image content analysis result from the output of that layer. Low-level features such as subjects, lines and colors in the sensitive image are captured through this process and discriminant analysis is performed on them. However, image semantic content such as the relationships among image subjects and the subject behaviors cannot be obtained during recognition and classification, and a large amount of network community knowledge closely related to sensitive image content recognition is formally separated from this process, so the image content recognition result cannot be combined with the prior knowledge carried by the large body of network community text for discrimination; the recognition accuracy is therefore low and the results are poorly interpretable. Moreover, the network community environment is complex, and the propagation and fermentation of image information over time are a key concern for maintaining network security, yet sensitive image recognition based on image classification cannot trace the propagation of image information.
Disclosure of Invention
The embodiment of the disclosure provides a sensitive image identification method, device, equipment and medium based on cross-modal sensing. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, an embodiment of the present disclosure provides a method for identifying a sensitive image based on cross-modal sensing, including:
acquiring image information to be identified in a network community;
inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information;
and inputting the cross-modal text description of the image information into a sensitive information recognition module in the sensitive image recognition model to obtain a sensitive image containing sensitive information.
In one embodiment, before inputting the image information into the preset sensitive image recognition model, the method further comprises:
and constructing a training data set, and training a sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-modal sensing module and a sensitive information recognition module.
In one embodiment, inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain a cross-modal text description of the image information, including:
identifying a salient body in the image according to the cross-modal sensing module;
determining a cross-modal description model corresponding to the identified image subject in the pre-trained cross-modal description model group;
and performing generalized content text modal transformation on the main body in the image, the relation among the main bodies and the advanced semantic information of the main body behavior according to the cross-modal description model to obtain cross-modal text description of the image information.
In one embodiment, identifying salient subjects within an image from a cross-modality awareness module includes:
identifying a salient subject in the image according to a subject capturing unit in the cross-modal sensing module, wherein the subject capturing unit comprises a DenseNet-Block network structure whose calculation formula is as follows:
$X_l = H_l([X_0, X_1, \ldots, X_{l-1}])$
wherein $[X_0, X_1, \ldots, X_{l-1}]$ represents the channel-wise concatenation of the feature maps of layers 0 to l-1, $H_l$ represents the normalization, activation and convolution operations applied to the concatenated features, and $X_l$ represents the result of the l-th layer convolution calculation.
In one embodiment, performing generalized content text modal transformation on subjects within an image, relationships between subjects, and advanced semantic information of subject behavior according to a cross-modal description model to obtain a cross-modal text description of the image information, including:
extracting image features through a VGGNET network structure in a cross-modal description model;
and inputting the extracted image features into a long-term and short-term memory recurrent neural network containing an attention mechanism to obtain a main body in the image, a relation among the main bodies and a cross-modal text description of high-level semantic information of main body behaviors.
In one embodiment, inputting a cross-modal text description of image information into a sensitive information recognition module in a sensitive image recognition model to obtain a sensitive image containing sensitive information, comprising:
training a textCNN convolutional neural network according to a pre-constructed training set to obtain a trained sensitive information identification module;
inputting the cross-modal text description of the image information into a sensitive information identification module to obtain identified sensitive text information;
and taking the image corresponding to the sensitive text information as a sensitive image.
In a second aspect, embodiments of the present disclosure provide a sensitive image recognition apparatus based on cross-modal sensing, including:
the acquisition module is used for acquiring image information to be identified in the network community;
the cross-modal description module is used for inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information;
the identification module is used for inputting the cross-modal text description of the image information into the sensitive information identification module in the sensitive image identification model to obtain a sensitive image containing sensitive information.
In one embodiment, further comprising:
the training module is used for constructing a training data set and training a sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-mode sensing module and a sensitive information recognition module.
In a third aspect, an embodiment of the present disclosure provides a sensitive image recognition device based on cross-modal sensing, including a processor and a memory storing program instructions, where the processor is configured to execute the sensitive image recognition method based on cross-modal sensing provided in the above embodiment when executing the program instructions.
In a fourth aspect, embodiments of the present disclosure provide a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement a cross-modal awareness based sensitive image recognition method provided by the above embodiments.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
Embodiments of the present disclosure provide a network community sensitive image recognition model (SIR-CM) based on cross-modal content perception. The model mainly comprises an image content cross-modal perception module and a network community sensitive text information recognition module. SIR-CM implements a generalized network community image content-to-text conversion module on the MSCOCO dataset and a network community sensitive image annotation dataset, and can produce fine-grained cross-modal text expressions of community image content. The network community sensitive text information recognition module fuses prior knowledge from the sensitive content text corpus of the network community environment, giving it the ability to analyze and recognize sensitive information in that environment and yielding more accurate and more interpretable sensitive image recognition results. In addition, added comments, the fermentation of topics and the like are information propagated along the time dimension; once the text-modal content of an image is obtained, the image information and related follow-up information such as additional comments and topic texts become unified in form, so the propagation of the image information can further be traced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow diagram illustrating a cross-modality perception based sensitive image recognition method in accordance with an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating the structure of a sensitive image recognition model according to an exemplary embodiment;
FIG. 3 is a schematic diagram of a subject capture unit, shown in accordance with an exemplary embodiment;
FIG. 4 is a diagram illustrating a content text context generation process according to an example embodiment;
FIG. 5 is a diagram illustrating a hidden state generation process according to an example embodiment;
FIG. 6 is a diagram illustrating a process of generating a gate variable according to an example embodiment;
FIG. 7 is a diagram illustrating a current word generation process according to an example embodiment;
FIG. 8 is a block diagram of a textCNN convolutional neural network, shown in accordance with an exemplary embodiment;
FIG. 9 is a diagram illustrating an image content description according to an exemplary embodiment;
FIG. 10 is a schematic diagram illustrating a configuration of a cross-modal awareness based sensitive image recognition device, in accordance with an exemplary embodiment;
FIG. 11 is a schematic diagram illustrating a configuration of a cross-modality awareness based sensitive image recognition device according to an exemplary embodiment;
fig. 12 is a schematic diagram of a computer storage medium shown according to an example embodiment.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the application to enable those skilled in the art to practice them.
It should be understood that the described embodiments are merely some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of systems and methods that are consistent with aspects of the application as detailed in the accompanying claims.
In the description of the present application, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application will be understood in specific cases by those of ordinary skill in the art. Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
FIG. 1 is a diagram illustrating a cross-modal awareness based sensitive image recognition method, as shown in FIG. 1, that specifically includes the following steps.
S101, acquiring image information to be identified in a network community.
The embodiment provides a network community sensitive image recognition model based on cross-modal content perception by utilizing an image description technology, aims at cross-modal expression perception of semantic information content of a network community image, fuses a large amount of prior knowledge of network community sensitive text content, performs more accurate and more understandable analysis and discrimination on the content of the community image, and enables propagation and traceability of sensitive image information to be possible by acquiring cross-modal content text of the network community image.
Firstly, acquiring image information to be identified in a network community, and then inputting the acquired image into a trained sensitive image identification model for identification.
S102, inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information.
In one embodiment, before inputting the image information into the preset sensitive image recognition model, the method further comprises: and constructing a network community sensitive image annotation data set, and training a sensitive image recognition model based on the network community sensitive image annotation data set and the MSCOCO data set.
In one exemplary scenario, a manually annotated network community image dataset and the MSCOCO dataset are used as the training dataset for the cross-modal perception module: 25000 images are used for training, 2500 images in the validation set are used for validation, and experimental results are verified on 1000 images in the test set, yielding a trained sensitive image recognition model. The sensitive image recognition model comprises a cross-modal sensing module and a sensitive information recognition module, and the cross-modal text description of the image information is obtained through the cross-modal sensing module.
In one embodiment, obtaining a cross-modal text description of image information from a cross-modal awareness module includes: and identifying a significant main body in the image according to the cross-modal sensing module, determining a cross-modal description model corresponding to the identified image main body in the pre-trained cross-modal description model group, and performing generalized content text modal conversion on the main body in the image, the relation among the main bodies and the high-level semantic information of the main body behavior according to the cross-modal description model to obtain cross-modal text description of the image information.
Specifically, the cross-modal sensing module comprises a subject capturing unit, which is used to identify the salient subject in the image and perform a pre-analysis for the subsequent cross-modal step, improving the accuracy of the image-to-text modal conversion.
FIG. 3 is a schematic diagram of the subject capturing unit according to an exemplary embodiment. As shown in FIG. 3, the subject capturing unit comprises a 3-layer DenseNet-Block network structure, whose calculation formula is as follows:
$X_l = H_l([X_0, X_1, \ldots, X_{l-1}])$
wherein $[X_0, X_1, \ldots, X_{l-1}]$ represents the channel-wise concatenation of the feature maps of layers 0 to l-1, $H_l$ represents the normalization, activation and convolution operations applied to the concatenated features, and $X_l$ represents the result of the l-th layer convolution calculation.
The result of each layer's convolution is obtained according to this formula, and the per-layer results are combined to obtain the features of the identified image subject.
The DenseNet network structure concatenates the output features of layers 0 to l on the channel dimension, which alleviates the vanishing-gradient problem during CNN training and requires fewer training parameters, so it stands out in training the subject capturing unit.
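As an illustration only, the following is a minimal tf.keras sketch of one such DenseNet-Block, in which every layer receives the channel-wise concatenation of all earlier feature maps; the growth rate, kernel sizes and the stem convolution are illustrative assumptions, since the description above only fixes the 3-layer dense connectivity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=3, growth_rate=32):
    """X_l = H_l([X_0, X_1, ..., X_{l-1}]): each layer sees all earlier feature maps."""
    features = [x]
    for _ in range(num_layers):
        concat = layers.Concatenate(axis=-1)(features) if len(features) > 1 else features[0]
        h = layers.BatchNormalization()(concat)               # normalization operation
        h = layers.ReLU()(h)                                   # activation operation
        h = layers.Conv2D(growth_rate, 3, padding="same")(h)   # convolution operation
        features.append(h)
    return layers.Concatenate(axis=-1)(features)               # combine per-layer results

inputs = tf.keras.Input(shape=(224, 224, 3))
stem = layers.Conv2D(64, 7, strides=2, padding="same")(inputs)  # assumed stem convolution
subject_features = dense_block(stem)
subject_capturer = tf.keras.Model(inputs, subject_features)
```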
Further, after the subjects of different images are obtained in the training process of the model, training a plurality of cross-modal description models based on the identified image subjects to obtain a trained cross-modal description model group.
And then selecting a cross-modal description model which is most matched with the main body from the cross-modal description model group according to the identified image main body, and obtaining cross-modal text description of the image information based on the selected cross-modal description model.
Further, obtaining a cross-modal text description of the image information according to the cross-modal description model, including:
and extracting image features according to a feature extraction unit in the cross-modal description model, and then inputting the extracted image features into an image content cross-modal unit to obtain text description of the image content.
In an alternative embodiment, feature extraction is performed using VGGNET pre-trained on the ImageNet standard dataset of tens of millions of images, because its smaller convolution kernels and deeper network make its feature extraction capability more suitable for the model herein than other networks. The network takes 224×224 images as input and uses convolution kernels to extract global and local features. An activation function increases the nonlinearity of the network structure, and pooling layers compress the input feature maps, reducing the computational complexity of the network; the commonly used max-pooling also has the effect of highlighting the main features.
In the VGGNET structure, only the smallest 3×3 convolution kernel is used, which increases the network depth for the same receptive field. The first layer uses 64 3×3 convolution kernels with a stride of 1×1 and produces a 224×224×64 feature map; the second layer uses 64 3×3 convolution kernels on this 224×224×64 input, and its output then passes through a 2×2 max-pooling layer to give 112×112×64. This constitutes one whole convolution segment. The entire VGGNET has 5 convolution segments, and this embodiment uses the 14×14×512-dimensional output of the conv5_3 layer of the VGGNET network as the feature representation, which is finally flattened into feature vectors with region number L and dimension D, denoted as $\{a_1, \ldots, a_i, \ldots, a_L\}$, where L = 14×14 = 196 and the dimension D is 512. The convolution calculation formula is as follows:
$B(i,j) = \sum_{m=1}\sum_{n=1} K(m,n) \times A(i-m+1,\ j-n+1)$
wherein A is a convolved matrix, K is a convolution kernel, and B is a convolution result.
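For illustration, the following small numpy sketch evaluates this formula directly (a "valid" 2-D convolution, i.e. a sliding window with the kernel flipped); the toy matrices are arbitrary and the boundary handling is one possible reading of the index range.

```python
import numpy as np

def conv2d_valid(A, K):
    """B(i,j) = sum_m sum_n K(m,n) * A(i-m+1, j-n+1), over positions where the window fits."""
    Kf = np.flip(K)                                # convolution flips the kernel
    kh, kw = K.shape
    out_h, out_w = A.shape[0] - kh + 1, A.shape[1] - kw + 1
    B = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            B[i, j] = np.sum(Kf * A[i:i + kh, j:j + kw])
    return B

A = np.arange(16.0).reshape(4, 4)                  # toy "convolved matrix"
K = np.array([[1.0, 0.0], [0.0, -1.0]])            # toy convolution kernel
print(conv2d_valid(A, K))                          # 3x3 convolution result
```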
The formula for the activation function is as follows:
tanh(x)=2σ(2x)-1;
wherein σ(x) is the sigmoid function; the activation function improves the nonlinearity of the feature extraction, and a MaxPooling function applied after activation retains the maximum value of each region.
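As a sketch under the assumption that the Keras VGG16 weights pre-trained on ImageNet stand in for the VGGNET described above, the block5_conv3 layer (the conv5_3 of the text) yields the 14×14×512 map, which is then flattened into L = 196 region features of dimension D = 512; the random array is only a placeholder image.

```python
import numpy as np
import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                  input_shape=(224, 224, 3))
feature_extractor = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv3").output)

image = np.random.rand(1, 224, 224, 3).astype("float32") * 255.0    # placeholder image
features = feature_extractor(tf.keras.applications.vgg16.preprocess_input(image))
regions = tf.reshape(features, (-1, 14 * 14, 512))                   # {a_1 ... a_196}, D = 512
```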
Further, the extracted image features are input into an image content cross-mode unit to obtain text description of the image content.
Specifically, the input of the image content cross-modal unit is the image feature vectors obtained by the feature extraction unit, and a long short-term memory (LSTM) recurrent neural network injected with an attention mechanism is used to generate the content text description of the network community image.
The attention-injected LSTM uses the focusing mechanism, through self-learned attention weights, to pay adaptive attention of different degrees to different parts of the input image, so that a more accurate text description of the image features is achieved based on the distribution of attention weights. Under continuous memory, the LSTM produces a natural-language text description that expresses the important features and forgets the irrelevant ones.
FIG. 4 is a diagram illustrating the content text context generation process according to an exemplary embodiment. As shown in FIG. 4, the image features A are acted upon by the attention module to obtain C contexts of dimension D, $\{Z_1, \ldots, Z_t, \ldots, Z_C\}$, where C can be understood as the word length of the output content text and $Z_t$ as the D-dimensional feature representing the context of each word. The context generation accompanies the word-by-word generation of $y_t$.
The context $Z_t$ is the weighted sum of the original features A, with weights $\alpha_{t,i}$, namely:
$Z_t = \sum_{i=1}^{L} \alpha_{t,i}\, a_i$
The weight vector has length L = 196, one attention value for each image feature region. The weights are obtained from the hidden variable $h_{t-1}$ of the previous step through a fully connected layer, as shown in FIG. 5, which illustrates the hidden state generation process according to an exemplary embodiment; in addition, the first step has no hidden state, so its weights are generated entirely from the image features.
Further, FIG. 6 is a diagram illustrating the gate variable generation process according to an exemplary embodiment. As shown in FIG. 6, the hidden variable $h_t$ simulates a memory function: from the previous hidden state, the input gate $i_t$, output gate $o_t$ and forget gate $f_t$ are generated for the context; the candidate $g_t$ controls the strength of the current input, and the memory cell $c_t$ controls how much of the previous word is stored. The hidden state $h_t$ is jointly controlled by the memory $c_t$ and the output gate $o_t$.
Further, FIG. 7 is a diagram illustrating the current word generation process according to an exemplary embodiment. As shown in FIG. 7, the current hidden variable is then used to generate the current output word $y_t$ through a fully connected network.
The text description of the image content is obtained from the words generated one by one in this way.
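Purely as an illustration of one decoding step of this attention-LSTM, the following numpy sketch computes attention weights from the previous hidden state, forms the context $Z_t$ as the weighted sum of the region features, and draws the current word from a softmax over the dictionary; all weight matrices are random stand-ins for the learned parameters.

```python
import numpy as np

L, D, H, V = 196, 512, 512, 5000             # regions, feature dim, hidden dim, dictionary size
a = np.random.randn(L, D)                     # region features {a_1 ... a_L}
h_prev = np.random.randn(H)                   # hidden state h_{t-1} from the previous step

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W_att = 0.01 * np.random.randn(L, H)
alpha = softmax(W_att @ h_prev)               # one attention weight per image region
z_t = alpha @ a                               # context Z_t = sum_i alpha_{t,i} * a_i

W_out = 0.01 * np.random.randn(V, H + D)
y_t = softmax(W_out @ np.concatenate([h_prev, z_t]))   # distribution over dictionary words
print(int(y_t.argmax()))                      # index of the word generated at this step
```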
S103, inputting the cross-modal text description of the image information into a sensitive information recognition module in the sensitive image recognition model to obtain a sensitive image containing sensitive information.
In one embodiment, a training data set is first constructed: this embodiment crawls posts from 50 popular websites and, according to network community sensitivity, distinguishes 35000 sensitive texts and 50000 non-sensitive texts as the training set.
Further, training the textCNN convolutional neural network according to a pre-constructed training set to obtain a trained sensitive information identification module.
TextCNN is a convolutional neural network for text. It represents the text as a matrix of word vectors, with each row of the input matrix representing one word; the convolution kernel acts as a local feature extractor whose width matches the word-vector length, and different Filter-Sizes extract n-gram features of different spans. FIG. 8 is a block diagram of a TextCNN convolutional neural network according to an exemplary embodiment. As shown in FIG. 8, the TextCNN convolutional neural network takes as input sentences composed of n k-dimensional word vectors and then outputs the result through multi-scale feature-extracting convolution layers, a pooling layer, and a fully connected layer.
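The following is a minimal tf.keras sketch of such a TextCNN, using the text length 50, dictionary size 5000 and Filter-Sizes 2, 3, 4, 5, 7 and 10 reported later in this description; the embedding dimension and the number of filters per size are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_textcnn(seq_len=50, vocab_size=5000, embed_dim=128,
                  filter_sizes=(2, 3, 4, 5, 7, 10), num_filters=64):
    inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inputs)         # word-vector matrix
    pooled = []
    for fs in filter_sizes:                                     # n-gram style local extractors
        c = layers.Conv1D(num_filters, fs, activation="relu")(x)
        pooled.append(layers.GlobalMaxPooling1D()(c))
    x = layers.Concatenate()(pooled)
    outputs = layers.Dense(1, activation="sigmoid")(x)          # sensitive vs. non-sensitive
    return tf.keras.Model(inputs, outputs)

model = build_textcnn()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```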
The cross-modal text description of the image information is input into a sensitive information identification module, the identified sensitive text information is obtained, and the image corresponding to the sensitive text information is the sensitive image.
The sensitive information identification module in the embodiment of the disclosure fuses a large number of sensitive text knowledge bases in a network community environment, and can identify the sensitive information of the image content text more accurately and reliably on the basis of the knowledge bases.
FIG. 2 is a schematic diagram of the structure of a sensitive image recognition model, as shown in FIG. 2, according to an exemplary embodiment, the sensitive image recognition model comprising: the network community image content cross-mode sensing module and the network community sensitive text information identification module.
In the network community image content cross-modal perception module, a network community image is first input, then the image subject is captured based on DenseNet, and during model training a corresponding cross-modal description model is trained separately for each type of subject, yielding the cross-modal description model group. Each cross-modal description model comprises a feature extraction unit, which extracts features based on VGGNET, and an image content description unit, which generates the content text description of the network community image based on an LSTM model injected with an attention mechanism.
In the network community sensitive text information recognition module, a TextCNN convolutional neural network is trained on a network community prior knowledge base to obtain a trained sensitive text information recognition model, and the image content text description generated by the cross-modal perception module is input into this model to obtain the recognized sensitive images.
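Putting the two modules together, the following sketch shows the overall two-stage inference flow in plain Python; the function and dictionary names are hypothetical placeholders for the trained subject capturer, the cross-modal description model group and the TextCNN sensitive-text recognizer, and the 0.5 decision threshold is an assumption.

```python
def identify_sensitive_images(images, capture_subject, description_models, sensitive_text_score):
    """Two-stage flow: image -> cross-modal text description -> sensitive-text recognition."""
    results = []
    for img in images:
        subject = capture_subject(img)               # DenseNet-based subject capture
        describe = description_models[subject]       # pick the model matching this subject
        caption = describe(img)                      # cross-modal text description of the image
        if sensitive_text_score(caption) >= 0.5:     # TextCNN sensitive-text probability
            results.append((img, caption))           # image judged sensitive, with its description
    return results
```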
In one exemplary scenario, the disclosed embodiments verify the above method experimentally. First, the experimental data are acquired: manually annotated network community description data and MSCOCO data are used as the training dataset of the cross-modal perception module, with 25000 images finally used for training, 2500 images for validation on the validation set, and experimental results verified on 1000 images in the test set. The sensitive text information recognition module is trained on a total of 35000 sensitive and 50000 non-sensitive training texts from self-crawled college network community posts, and experimental results are verified on 1000 test texts.
Then, an experimental environment was constructed, and a DenseNet subject capture unit was trained in a Tensorflow training environment with a maximum Epoch set to 100, batch-Size set to 64, and a learning rate set to 0.001.
The cross-modal description model group was trained in a Tensorflow environment using the RMSprop optimizer and a cross-entropy loss function, with Batch-Size set to 16, maximum Epoch set to 500, and a learning rate of 0.001.
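A minimal sketch of this training configuration in tf.keras is shown below (RMSprop optimizer, cross-entropy loss, batch size 16, learning rate 0.001); the toy model, the random placeholder data and the reduced epoch count are stand-ins for the real description model, the image-feature/caption pairs and the 500-epoch schedule.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(512,)),
    tf.keras.layers.Dense(5000, activation="softmax"),    # dictionary length 5000
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",     # cross-entropy loss
              metrics=["accuracy"])

x = np.random.rand(160, 512).astype("float32")            # placeholder image features
y = np.random.randint(0, 5000, size=(160,))               # placeholder word indices
model.fit(x, y, batch_size=16, epochs=2)                  # 500 epochs in the reported experiments
```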
The text length is set to 50, the dictionary length is 5000, the Batch-Size is set to 64, the maximum Epoch is set to 100, and the filter-Size is set to 2, 3, 4, 5, 7 and 10 respectively in the Keras environment to train the web community sensitive text information recognition module.
In a Tensorflow training environment, with maximum Epoch set to 100, Batch-Size set to 64, and the learning rate set to 0.001, VGGNet and DenseNet sensitive image classification models were trained as control models. VGGNET is one of the representative deep neural networks; its smaller convolution kernels allow a deeper network, achieving an error rate below 7.3% in image recognition and classification. Because of its prominent standing as a classical neural network in image classification, this embodiment selects it as control model 1. DenseNet, as a more elaborate successor to ResNet, offers tight cross-layer connections and reduced memory consumption that make it a state-of-the-art representative of classical neural networks; it is chosen herein as control model 2.
Further, the experiment began by training the DenseNet subject capturing unit on the data set, using the 12 labeled sensitive subjects plus the MSCOCO non-sensitive subjects as 13 classification labels. The subject capture module keeps the trained model with the highest classification accuracy as the capturer.
A model is trained for each subject category, and the one whose descriptive text performs best on the validation-set images under the BLEU standard is selected as that category's network community cross-modal description model. All of these dedicated models together constitute the network community image content cross-modal description model group.
The TextCNN content text recognition model training selects the model with the highest recognition sensitivity accuracy on the test text set as the final network community image sensitive text information recognition model.
VGGNet and DenseNet image classification models are trained on the data set, and the one performing best on the image set is taken as the control model.
The training results of the DenseNet subject capture unit are shown in the following table:
         Loss     Acc_val    Acc_test
Score    0.225    0.9206     0.917
The mean value of the performance parameters of the cross-modal description model of the content of each subject category in the model group on the verification set is shown in the following table:
Bleu_1 0.7727
Bleu_2 0.6809
Bleu_3 0.6143
Bleu_4 0.5834
the cross-modal perceived web community image semantic information is represented in fig. 9, and the extracted text is described as "a blurry view of a building covered with fire", i.e. "a blurred view of a building covered by fire".
The mean scores of the network community image content cross-modal description model group are relatively good, mainly because this embodiment carries out targeted training for network community sensitive image description on the training image set, so the image content description model adapts well to the specific application field of the network community environment.
The image content text obtained in this process makes it possible to grasp the information content in text form as the image information propagates, gives stronger interpretability to the discrimination result of the subsequent sensitive image recognition step, and makes it possible to fuse the network community text knowledge base into the sensitive recognition.
The accuracy of TextCNN sensitive text recognition on 1000 test text sets is shown in the following table:
         Loss     Acc_val    Acc_test
Score    0.104    0.981      0.97
The result of the sensitive recognition accuracy rate comparison experiment of the web community sensitive image recognition model and the comparison model on the verification set and the test set based on the content text is shown in the following table:
Experimental results show that recognizing network community sensitive images based on cross-modal perception, while fusing a large amount of prior knowledge for sensitive text discrimination in the network community, clearly improves the discrimination and prediction accuracy achieved herein compared with performing sensitive recognition directly through a neural network.
In the discrimination process of the network community sensitive image recognition model based on cross-modal content perception, obtaining the cross-modal text of the image content allows a large amount of prior knowledge for network community sensitive text recognition to be fused, so a more accurate and more interpretable sensitive image recognition result can be obtained. In addition, added comments, the fermentation of topics and the like are information propagated along the time dimension; once the text-modal content of an image is obtained, the image information and related follow-up information such as additional comments and topic texts become unified in form, so the propagation of the image information can further be traced.
The embodiment of the disclosure further provides a sensitive image recognition device based on cross-modal sensing, which is configured to execute the sensitive image recognition method based on cross-modal sensing in the foregoing embodiment, as shown in fig. 10, and the device includes:
an obtaining module 1001, configured to obtain image information to be identified in a network community;
the cross-modal description module 1002 is configured to input image information into a cross-modal sensing module in a preset sensitive image recognition model, so as to obtain a cross-modal text description of the image information;
the recognition module 1003 is configured to input a cross-modal text description of the image information into the sensitive information recognition module in the sensitive image recognition model, so as to obtain a sensitive image containing sensitive information.
In one embodiment, further comprising:
the training module is used for constructing a training data set and training a sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-mode sensing module and a sensitive information recognition module.
It should be noted that, when the sensitive image recognition device based on cross-modal sensing provided in the foregoing embodiment performs the sensitive image recognition method based on cross-modal sensing, the division into the above functional modules is used only as an illustration; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the sensitive image recognition device based on cross-modal sensing provided in the above embodiment belongs to the same concept as the embodiment of the sensitive image recognition method based on cross-modal sensing; its detailed implementation process is described in the method embodiment and is not repeated herein.
The embodiment of the disclosure also provides an electronic device corresponding to the sensitive image recognition method based on cross-modal sensing provided by the previous embodiment, so as to execute the sensitive image recognition method based on cross-modal sensing.
Referring to fig. 11, a schematic diagram of an electronic device according to some embodiments of the application is shown. As shown in fig. 11, the electronic device includes: processor 1100, memory 1101, bus 1102, and communication interface 1103, processor 1100, communication interface 1103, and memory 1101 being connected by bus 1102; the memory 1101 stores a computer program executable on the processor 1100, and when the processor 1100 runs the computer program, the method for identifying sensitive images based on cross-modal sensing provided by any one of the foregoing embodiments of the present application is executed.
The memory 1101 may include a high-speed random access memory (RAM: random Access Memory), and may further include a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 1103 (which may be wired or wireless), and may use the internet, a wide area network, a local network, a metropolitan area network, etc.
Bus 1102 may be an ISA bus, PCI bus, EISA bus, or the like. The buses may be divided into address buses, data buses, control buses, etc. The memory 1101 is configured to store a program, and the processor 1100 executes the program after receiving an execution instruction, and the sensitive image recognition method based on cross-modal sensing disclosed in any of the foregoing embodiments of the present application may be applied to the processor 1100 or implemented by the processor 1100.
The processor 1100 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the methods described above may be performed by integrated logic circuitry in hardware or instructions in software in processor 1100. The processor 1100 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; but may also be a Digital Signal Processor (DSP), application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 1101, and the processor 1100 reads information in the memory 1101, and performs the steps of the above method in combination with its hardware.
The electronic equipment provided by the embodiment of the application and the sensitive image recognition method based on cross-modal sensing provided by the embodiment of the application have the same beneficial effects as the method adopted, operated or realized by the same inventive concept.
An embodiment of the present application further provides a computer readable storage medium corresponding to the cross-modal sensing-based sensitive image recognition method provided in the foregoing embodiment, referring to fig. 12, the computer readable storage medium is shown as an optical disc 1200, on which a computer program (i.e. a program product) is stored, where the computer program, when executed by a processor, performs the cross-modal sensing-based sensitive image recognition method provided in any of the foregoing embodiments.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
The computer readable storage medium provided by the above embodiment of the present application has the same beneficial effects as the method adopted, operated or implemented by the application program stored in the computer readable storage medium, because of the same inventive concept as the method for identifying the sensitive image based on cross-modal sensing provided by the embodiment of the present application.
The foregoing examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (8)

1. The sensitive image recognition method based on cross-modal sensing is characterized by comprising the following steps of:
acquiring image information to be identified in a network community;
inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information; comprising the following steps: identifying a significant subject within an image according to the cross-modal awareness module; determining a cross-modal description model corresponding to the identified image subject in the pre-trained cross-modal description model group; performing generalized content text modal transformation on the main body in the image, the relation among the main bodies and the advanced semantic information of the main body behavior according to the cross-modal description model to obtain cross-modal text description of the image information;
extracting image features through a VGGNET network structure in the cross-modal description model; inputting the extracted image features into a long-term and short-term memory cyclic neural network containing an attention mechanism to obtain a main body in the image, a relation among the main bodies and a cross-modal text description of high-level semantic information of main body behaviors;
and inputting the cross-modal text description of the image information into a sensitive information recognition module in the sensitive image recognition model to obtain a sensitive image containing sensitive information.
2. The method of claim 1, further comprising, prior to inputting the image information into a pre-set sensitive image recognition model:
and constructing a training data set, and training the sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-modal sensing module and a sensitive information recognition module.
3. The method of claim 1, wherein identifying salient subjects within an image according to the cross-modality awareness module comprises:
identifying a significant subject in an image according to a subject capturing unit in the cross-modal sensing module, wherein the subject capturing unit comprises a DenseNet-Block network structure, and the calculation formula is as follows:
$X_l = H_l([X_0, X_1, \ldots, X_{l-1}])$
wherein $[X_0, X_1, \ldots, X_{l-1}]$ represents the channel-wise concatenation of the feature maps of layers 0 to l-1, $H_l$ represents the normalization, activation and convolution operations applied to the concatenated features, and $X_l$ represents the result of the l-th layer convolution calculation.
4. The method of claim 1, wherein inputting the cross-modal text description of the image information into the sensitive information recognition module in the sensitive image recognition model results in a sensitive image containing sensitive information, comprising:
training a textCNN convolutional neural network according to a pre-constructed training set to obtain a trained sensitive information identification module;
inputting the cross-modal text description of the image information into the sensitive information identification module to obtain identified sensitive text information;
and taking the image corresponding to the sensitive text information as a sensitive image.
5. A cross-modal awareness based sensitive image recognition device, comprising:
the acquisition module is used for acquiring image information to be identified in the network community;
the cross-modal description module is used for inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information; comprising the following steps: identifying a significant subject within an image according to the cross-modal awareness module; determining a cross-modal description model corresponding to the identified image subject in the pre-trained cross-modal description model group; performing generalized content text modal transformation on the main body in the image, the relation among the main bodies and the advanced semantic information of the main body behavior according to the cross-modal description model to obtain cross-modal text description of the image information;
extracting image features through a VGGNET network structure in the cross-modal description model; inputting the extracted image features into a long-term and short-term memory cyclic neural network containing an attention mechanism to obtain a main body in the image, a relation among the main bodies and a cross-modal text description of high-level semantic information of main body behaviors;
the identification module is used for inputting the cross-modal text description of the image information into the sensitive information identification module in the sensitive image identification model to obtain a sensitive image containing sensitive information.
6. The apparatus as recited in claim 5, further comprising:
the training module is used for constructing a training data set and training the sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-modal sensing module and a sensitive information recognition module.
7. A cross-modal awareness based sensitive image recognition device comprising a processor and a memory storing program instructions, the processor being configured, when executing the program instructions, to perform the cross-modal awareness based sensitive image recognition method of any one of claims 1 to 4.
8. A computer readable medium having stored thereon computer readable instructions executable by a processor to implement a cross-modality perception based sensitive image recognition method as claimed in any one of claims 1 to 4.
CN202110892160.1A 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium Active CN113610080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110892160.1A CN113610080B (en) 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110892160.1A CN113610080B (en) 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium

Publications (2)

Publication Number     Publication Date
CN113610080A (en)      2021-11-05
CN113610080B (en)      2023-08-25

Family

ID=78306845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110892160.1A Active CN113610080B (en) 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113610080B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399816B (en) * 2021-12-28 2023-04-07 北方工业大学 Community fire risk sensing method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455630A (en) * 2013-09-23 2013-12-18 江苏刻维科技信息有限公司 Internet multimedia information mining and analyzing system
CN108960073A (en) * 2018-06-05 2018-12-07 大连理工大学 Cross-module state image steganalysis method towards Biomedical literature
CN109151502A (en) * 2018-10-11 2019-01-04 百度在线网络技术(北京)有限公司 Identify violation video method, device, terminal and computer readable storage medium
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
CN112364198A (en) * 2020-11-17 2021-02-12 深圳大学 Cross-modal Hash retrieval method, terminal device and storage medium
CN112508077A (en) * 2020-12-02 2021-03-16 齐鲁工业大学 Social media emotion analysis method and system based on multi-modal feature fusion
CN113094533A (en) * 2021-04-07 2021-07-09 北京航空航天大学 Mixed granularity matching-based image-text cross-modal retrieval method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014062716A1 (en) * 2012-10-15 2014-04-24 Visen Medical, Inc. Systems, methods, and apparatus for imaging of diffuse media featuring cross-modality weighting of fluorescent and bioluminescent sources
US10769791B2 (en) * 2017-10-13 2020-09-08 Beijing Keya Medical Technology Co., Ltd. Systems and methods for cross-modality image segmentation
US11768262B2 (en) * 2019-03-14 2023-09-26 Massachusetts Institute Of Technology Interface responsive to two or more sensor modalities


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on automated detection of sensitive information based on BERT; Meng Ding et al.; Journal of Physics; full text *

Also Published As

Publication number Publication date
CN113610080A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
RU2701995C2 (en) Automatic determination of set of categories for document classification
Singh et al. Image classification: a survey
US11238310B2 (en) Training data acquisition method and device, server and storage medium
CN112270196B (en) Entity relationship identification method and device and electronic equipment
US20210295114A1 (en) Method and apparatus for extracting structured data from image, and device
CN111767228B (en) Interface testing method, device, equipment and medium based on artificial intelligence
CN111209384A (en) Question and answer data processing method and device based on artificial intelligence and electronic equipment
CN109783666A (en) A kind of image scene map generation method based on iteration fining
CN112257441B (en) Named entity recognition enhancement method based on counterfactual generation
CN111475622A (en) Text classification method, device, terminal and storage medium
CN111523421A (en) Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
Salewski et al. Clevr-x: A visual reasoning dataset for natural language explanations
CN113722474A (en) Text classification method, device, equipment and storage medium
CN113610080B (en) Cross-modal perception-based sensitive image identification method, device, equipment and medium
Zhu et al. NAGNet: A novel framework for real‐time students' sentiment analysis in the wisdom classroom
Lin et al. Detecting multimedia generated by large ai models: A survey
CN113434722B (en) Image classification method, device, equipment and computer readable storage medium
CN111767710B (en) Indonesia emotion classification method, device, equipment and medium
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
Berg et al. Do you see what I see? Measuring the semantic differences in image‐recognition services' outputs
Ghaemmaghami et al. Integrated-Block: A New Combination Model to Improve Web Page Segmentation
Wang et al. Multi‐Task and Attention Collaborative Network for Facial Emotion Recognition
Yue et al. NRSTRNet: A Novel Network for Noise-Robust Scene Text Recognition
CN115658964B (en) Training method and device for pre-training model and somatosensory wind identification model
WO2024066927A1 (en) Training method and apparatus for image classification model, and device

Legal Events

Date Code Title Description
PB01    Publication
SE01    Entry into force of request for substantive examination
GR01    Patent grant