CN113610080A - Cross-modal perception-based sensitive image identification method, device, equipment and medium - Google Patents


Info

Publication number
CN113610080A
CN113610080A (application CN202110892160.1A)
Authority
CN
China
Prior art keywords
image
cross
modal
sensitive
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110892160.1A
Other languages
Chinese (zh)
Other versions
CN113610080B (en)
Inventor
吴旭
吴京宸
高丽
颉夏青
杨金翠
孙利娟
张熙
方滨兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202110892160.1A priority Critical patent/CN113610080B/en
Publication of CN113610080A publication Critical patent/CN113610080A/en
Application granted granted Critical
Publication of CN113610080B publication Critical patent/CN113610080B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a cross-modal perception-based sensitive image identification method, device, equipment and medium. The method comprises the following steps: acquiring image information to be identified in a network community; inputting the image information into a cross-modal perception module in a preset sensitive image recognition model to obtain a cross-modal text description of the image information; and inputting the cross-modal text description of the image information into a sensitive information identification module in the sensitive image identification model to obtain sensitive images containing sensitive information. According to the sensitive image identification method provided by the embodiments of the disclosure, the semantic information content of network community images is expressed in a cross-modal manner and a large amount of prior knowledge of network community sensitive text content is fused, so the content of community images is analyzed and judged more accurately, and obtaining the cross-modal text description of an image makes the propagation and tracing of sensitive image information possible.

Description

Cross-modal perception-based sensitive image identification method, device, equipment and medium
Technical Field
The invention relates to the technical field of image recognition, and in particular to a cross-modal perception-based sensitive image recognition method, device, equipment and medium.
Background
With the development of the multimedia era, the increasingly image-centered network community environment has become an important characteristic of how network information propagates and develops. Paying attention to this image-centered environment, accurately judging and identifying image information, and intervening on sensitive content in a timely and necessary manner help maintain the stability of the network community environment and social security.
In the prior art, the information carried by an image is expressed in an intuitive, visual form, and its semantic information cannot be obtained directly without reading and understanding by the human brain. With the rapid development of image processing technology in recent years, image content identification technology has also improved to some extent. At present, sensitive image content is identified and analyzed at home and abroad essentially with image classification technology: a neural network extracts image features, the feature vectors serve as the input of a fully connected layer at the end of the network, and the output of that layer is the image content analysis result. Low-level features of a sensitive image, such as subjects, lines and colors, are captured through this process and discriminant analysis is performed on that basis. However, image semantic content such as the relationships among image subjects and subject behaviors cannot be acquired in this identification and classification process, and the large body of network community knowledge closely related to sensitive image content identification remains formally separated from it; the identification result therefore cannot draw on the prior knowledge in massive network community text information, so identification accuracy is low and understandability is poor. Moreover, the network community environment is complex, and the propagation and fermentation of image information along the time dimension is a key point in maintaining network security, yet sensitive image recognition technology based on image classification cannot trace the propagation of image information.
Disclosure of Invention
The embodiment of the disclosure provides a cross-modal perception-based sensitive image identification method, device, equipment and medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview; it is intended neither to identify key or critical elements nor to delineate the scope of those embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description presented later.
In a first aspect, an embodiment of the present disclosure provides a sensitive image identification method based on cross-modal perception, including:
acquiring image information to be identified in a network community;
inputting image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information;
and inputting the cross-modal text description of the image information into a sensitive information identification module in the sensitive image identification model to obtain a sensitive image containing sensitive information.
In one embodiment, before inputting the image information into the preset sensitive image recognition model, the method further comprises:
and constructing a training data set, and training a sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-modal perception module and a sensitive information recognition module.
In one embodiment, inputting image information into a cross-modal perception module in a preset sensitive image recognition model to obtain a cross-modal text description of the image information includes:
identifying a salient subject within the image according to a cross-modal perception module;
determining a cross-modal description model corresponding to the identified image main body in a pre-trained cross-modal description model group;
and carrying out generalized content text modal transformation on the main body in the image, the relationship among the main bodies and the high-level semantic information of the main body behaviors according to the cross-modal description model to obtain the cross-modal text description of the image information.
In one embodiment, identifying salient objects within an image according to a cross-modal perception module includes:
identifying a salient subject in the image according to a subject capture unit in the cross-modal perception module, wherein the subject capture unit comprises a DenseNet-Block network structure whose calculation formula is as follows:
X_l = H_l([X_0, X_1, ..., X_{l-1}])
wherein [X_0, X_1, ..., X_{l-1}] represents the channel-wise concatenation of the feature maps of layers 0 to l-1, H_l denotes the normalization, activation and convolution operations performed on the concatenated features, and X_l represents the result of the l-th layer's convolution calculation.
In one embodiment, performing generalized content text mode conversion on the high-level semantic information of the main body, the relationship between the main bodies and the main body behavior in the image according to the cross-mode description model to obtain the cross-mode text description of the image information, includes:
extracting image features through a VGGNET network structure in the cross-modal description model;
and inputting the extracted image features into a long-term and short-term memory recurrent neural network containing an attention mechanism to obtain the cross-modal text description of the main body in the image, the relationship among the main bodies and the high-level semantic information of the main body behaviors.
In one embodiment, inputting the cross-modal text description of the image information into a sensitive information recognition module in a sensitive image recognition model to obtain a sensitive image containing sensitive information, includes:
training a TextCNN convolutional neural network according to a pre-constructed training set to obtain a trained sensitive information identification module;
inputting the cross-modal text description of the image information into a sensitive information identification module to obtain identified sensitive text information;
and taking the image corresponding to the sensitive text information as a sensitive image.
In a second aspect, an embodiment of the present disclosure provides a sensitive image recognition apparatus based on cross-modal perception, including:
the acquisition module is used for acquiring image information to be identified in the network community;
the cross-modal description module is used for inputting the image information into a cross-modal perception module in a preset sensitive image recognition model to obtain cross-modal text description of the image information;
and the identification module is used for inputting the cross-modal text description of the image information into the sensitive information identification module in the sensitive image identification model to obtain the sensitive image containing the sensitive information.
In one embodiment, further comprising:
the training module is used for constructing a training data set and training a sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-modal perception module and a sensitive information recognition module.
In a third aspect, the disclosed embodiment provides a cross-modal perception-based sensitive image recognition device, which includes a processor and a memory storing program instructions, where the processor is configured to execute the cross-modal perception-based sensitive image recognition method provided in the foregoing embodiment when executing the program instructions.
In a fourth aspect, the disclosed embodiments provide a computer-readable medium, on which computer-readable instructions are stored, where the computer-readable instructions are executable by a processor to implement a cross-modal perception-based sensitive image recognition method provided by the foregoing embodiments.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the embodiment of the disclosure provides a cross-modal content awareness-based network community sensitive image recognition model (SIR-CM). The model mainly comprises an image content cross-modal perception module and a network community sensitive text information identification module. The SIR-CM realizes a generalized network community image content text conversion module on an MSCOCO data set and a network community sensitive image labeling data set, and can perform fine-grained cross-modal expression on community image content texts. The sensitive content text set prior knowledge under the network community environment is fused in the network community sensitive text information identification module, so that the sensitive content text set prior knowledge has the analysis and identification capabilities on sensitive information under the network community environment, and a more accurate and more understandable sensitive image identification result is obtained. In addition, the image information and the subsequent information related to the additional comment, the topic text and the like are unified in form after the text modal content of the image is obtained based on the information propagation content in the time dimension, and the propagation of the image information can be further traced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic flow diagram illustrating a cross-modal perception based sensitive image recognition method according to an exemplary embodiment;
FIG. 2 is a diagram illustrating the structure of a sensitive image recognition model in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram of a subject capture unit shown in accordance with an exemplary embodiment;
FIG. 4 is a diagram illustrating a content text context generation process in accordance with an illustrative embodiment;
FIG. 5 is a diagram illustrating a hidden state generation process in accordance with an exemplary embodiment;
FIG. 6 is a diagram illustrating a gate variable generation process in accordance with an exemplary embodiment;
FIG. 7 is a diagram illustrating a current word generation process in accordance with an illustrative embodiment;
FIG. 8 is a block diagram illustrating a TextCNN convolutional neural network, according to an exemplary embodiment;
FIG. 9 is an image content description diagram, shown in accordance with an exemplary embodiment;
FIG. 10 is a block diagram illustrating a cross-modal perception based sensitive image recognition apparatus according to an exemplary embodiment;
FIG. 11 is a block diagram illustrating a cross-modal perception based sensitive image recognition device according to an exemplary embodiment;
FIG. 12 is a schematic diagram illustrating a computer storage medium in accordance with an exemplary embodiment.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of systems and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "And/or" describes the association relationship of associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Fig. 1 is a diagram illustrating a cross-modal perception-based sensitive image recognition method according to an exemplary embodiment, where the method specifically includes the following steps, as shown in fig. 1.
S101, image information to be identified in the network community is obtained.
This embodiment uses image description technology to provide a network community sensitive image recognition model based on cross-modal content perception. It aims to express and perceive the semantic information content of network community images across modalities, to fuse a large amount of prior knowledge of network community sensitive text content, to analyze and judge community image content more accurately and more intelligibly, and, by obtaining the cross-modal content text of network community images, to make the propagation tracing of sensitive image information possible.
Firstly, obtaining image information to be recognized in a network community, and then inputting the obtained image into a trained sensitive image recognition model for recognition.
S102, inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information.
In one embodiment, before inputting the image information into the preset sensitive image recognition model, the method further comprises: and constructing a network community sensitive image labeling data set, and training a sensitive image recognition model based on the network community sensitive image labeling data set and the MSCOCO data set.
In one exemplary scenario, a manually annotated network community image dataset and the MSCOCO dataset are used as the training data of the cross-modal perception module: the final training set contains 25000 images and the validation set 2500 images, and experimental results are verified on a 1000-image test set. This yields a trained sensitive image recognition model comprising a cross-modal perception module and a sensitive information recognition module, with the cross-modal text description of the image information obtained through the cross-modal perception module.
In one embodiment, obtaining a cross-modal textual description of image information according to a cross-modal awareness module includes: identifying a significant subject in the image according to a cross-modal perception module, determining a cross-modal description model corresponding to the identified image subject in a pre-trained cross-modal description model group, and performing generalized content text modal transformation on the subject in the image, the relationship among the subjects and the high-level semantic information of the subject behavior according to the cross-modal description model to obtain the cross-modal text description of the image information.
Specifically, the cross-modal perception module comprises a subject capture unit that identifies the salient subject in the image, providing a pre-analysis for the subsequent cross-modal conversion and improving the accuracy of the image-information text modality conversion.
Fig. 3 is a schematic diagram illustrating a body capture unit according to an exemplary embodiment, where, as shown in fig. 3, the body capture unit includes a 3-layer DenseNet-Block network structure, and its calculation formula is as follows:
X_l = H_l([X_0, X_1, ..., X_{l-1}])
wherein [X_0, X_1, ..., X_{l-1}] represents the channel-wise concatenation of the feature maps of layers 0 to l-1, H_l denotes the normalization, activation and convolution operations performed on the concatenated features, and X_l represents the result of the l-th layer's convolution calculation.
And obtaining the result after each layer of convolution calculation according to the formula, and combining the results after each layer of convolution calculation to obtain the identified image main features.
The DenseNet structure concatenates the output features of layers 0 through l-1 on the channel dimension, which alleviates the vanishing-gradient problem during CNN training and keeps the number of training parameters small; it therefore shows outstanding performance in training the subject capture unit.
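As a concrete illustration, the following is a minimal sketch of such a 3-layer dense block in TensorFlow/Keras (the framework named in the experiments below); the growth rate and kernel size are illustrative assumptions, not values given in this disclosure:

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=3, growth_rate=32):
    """Implements X_l = H_l([X_0, X_1, ..., X_{l-1}]): every layer sees the
    channel-wise concatenation of the block input and all earlier outputs."""
    features = [x]
    for _ in range(num_layers):
        concat = layers.Concatenate(axis=-1)(features)        # [X_0 ... X_{l-1}]
        h = layers.BatchNormalization()(concat)               # normalization
        h = layers.ReLU()(h)                                  # activation
        h = layers.Conv2D(growth_rate, 3, padding="same")(h)  # convolution -> X_l
        features.append(h)
    return layers.Concatenate(axis=-1)(features)
```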
Further, in the training process of the model, after the subjects of different images are obtained, a plurality of cross-modal description models are trained based on the identified subjects of the images, and a trained cross-modal description model group is obtained.
And then according to the identified image main body, selecting a cross-modal description model which is most matched with the main body from the cross-modal description model group, and obtaining cross-modal text description of the image information based on the selected cross-modal description model.
Further, obtaining the cross-modal text description of the image information according to the cross-modal description model includes:
and extracting image features according to a feature extraction unit in the cross-modal description model, and then inputting the extracted image features into an image content cross-modal unit to obtain text description of the image content.
In an alternative embodiment, a VGGNET pre-trained on the million-scale standard image dataset ImageNet is used for feature extraction, since VGGNET's smaller convolution kernels and deeper network make it better suited to feature extraction in this model than other networks. The network takes 224 x 224 pictures as input and extracts global and local features with convolution kernels. Activation functions increase the nonlinearity of the network structure, and pooling layers compress the input feature maps, simplifying the computational complexity of the network; the commonly used max-pooling method also serves to highlight the dominant features.
In the VGGNET structure, only the smallest 3 x 3 convolution kernel is used, which allows the network depth to be increased over an equivalent receptive field. The first layer applies 64 convolution kernels of 3 x 3 with a stride of 1 x 1 and outputs a 224 x 224 x 64 feature map; the second layer again applies 64 kernels of 3 x 3 to this 224 x 224 x 64 input and is followed by a 2 x 2 max-pooling layer, giving a 112 x 112 x 64 output. This constitutes one complete convolution segment. The whole VGGNET contains 5 such segments, and this embodiment uses the 14 x 14 x 512-dimensional output of the conv5_3 layer of the VGGNET network as the feature representation, finally flattened into L feature vectors of dimension D, denoted {a_1, ..., a_i, ..., a_L}, where L = 14 x 14 = 196 and D = 512. The convolution calculation formula is as follows:
B(i,j) = Σ_m Σ_n K(m,n) × A(i-m+1, j-n+1)
wherein A is the matrix being convolved, K is the convolution kernel, and B is the convolution result.
The formula for the activation function is as follows:
tanh(x)=2σ(2x)-1;
wherein σ(x) is the sigmoid function; the activation function is used to increase the nonlinearity of feature extraction, and a MaxPooling function retains the maximum value of each region after activation.
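To make the feature-extraction step concrete, here is a minimal sketch assuming Keras's ImageNet-pretrained VGG16, whose layer 'block5_conv3' corresponds to the conv5_3 output described above (the exact VGGNET variant is an assumption):

```python
import tensorflow as tf

# ImageNet-pretrained VGG, truncated at conv5_3 ('block5_conv3' in Keras)
vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                  input_shape=(224, 224, 3))
feature_extractor = tf.keras.Model(
    inputs=vgg.input,
    outputs=vgg.get_layer("block5_conv3").output)   # (batch, 14, 14, 512)

def extract_regions(images):
    """images: preprocessed float tensor of shape (batch, 224, 224, 3)."""
    fmap = feature_extractor(images)                # 14 x 14 x 512 feature map
    return tf.reshape(fmap, (-1, 14 * 14, 512))     # L = 196 regions, D = 512
```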
Further, the extracted image features are input into an image content cross-modal unit, and text description of the image content is obtained.
Specifically, the input of the image content cross-modal unit is the image feature vector obtained by the feature extraction unit, and the content text description of the network community image is generated using a long short-term memory recurrent neural network (LSTM) model with an attention mechanism.
The attention-equipped LSTM uses a focusing mechanism that, through self-learned attention weights, adaptively attends to different parts of the input image to different degrees, so that the distribution of attention weights yields a more accurate text description of the image features. Through its continually updated memory, the LSTM can express important features and forget irrelevant ones in the natural-language text description it outputs.
FIG. 4 is a diagram illustrating the content text context generation process according to an exemplary embodiment. As shown in FIG. 4, the image feature A passes through the attention module to obtain C contexts of dimension D, {Z_1, ..., Z_t, ..., Z_C}, where C can be understood as the word length of the output content text and Z_t as the D-dimensional feature representing the context corresponding to each word. Context generation proceeds word by word, in step with the generation of each word y_t.
Wherein the context Z_t is a weighted sum of the original features A with weights α_{t,i}, namely:
Z_t = Σ_{i=1}^{L} α_{t,i} · a_i
the weight is L196, which corresponds to the degree of attention of each image feature region. The weight is represented by the hidden variable h of the previous networkt-1Obtained through the full connection layer, as shown in fig. 5, fig. 5 is a hidden state generation process diagram shown in an exemplary embodiment, and in addition, the weight of the first step has no hidden state and is completely generated by the image features.
Further, FIG. 6 is a diagram illustrating the gate-variable generation process according to an exemplary embodiment. As shown in FIG. 6, the hidden variable h_t simulates a memory function: from the previous hidden state and the context, the network generates an input gate i_t, an output gate o_t and a forget gate f_t; a candidate g_t controls the strength of the current input, and a memory cell c_t controls how much of the previous word is stored. The hidden state h_t is jointly controlled by the memory c_t and the output gate o_t.
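The gate interactions just described correspond to the standard update equations of an attention LSTM; written out (in the common "Show, Attend and Tell" parameterization, which this disclosure describes in prose rather than printing, so the exact form here is an assumption):

```latex
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1};\,Ey_{t-1};\,z_t] + b_i\right) && \text{input gate}\\
f_t &= \sigma\left(W_f\,[h_{t-1};\,Ey_{t-1};\,z_t] + b_f\right) && \text{forget gate}\\
o_t &= \sigma\left(W_o\,[h_{t-1};\,Ey_{t-1};\,z_t] + b_o\right) && \text{output gate}\\
g_t &= \tanh\left(W_g\,[h_{t-1};\,Ey_{t-1};\,z_t] + b_g\right) && \text{candidate}\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{memory}\\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state}
\end{aligned}
```

Here Ey_{t-1} is the embedding of the previous word, z_t the attention context, and ⊙ the element-wise product.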
Further, FIG. 7 is a diagram illustrating the current-word generation process according to an exemplary embodiment: as shown in FIG. 7, the current hidden variable generates the current output word y_t through a fully connected network.
The words generated one by one are then assembled into the text description of the image content.
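Putting the attention step into code, the following is a minimal sketch of the soft-attention layer described above: the weights come from h_{t-1} through fully connected layers, are normalized with a softmax over the L = 196 regions, and the context z_t is the weighted sum of the region features (layer sizes are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

class SoftAttention(layers.Layer):
    def __init__(self, units=512):
        super().__init__()
        self.w_feat = layers.Dense(units)    # projects each region feature a_i
        self.w_hidden = layers.Dense(units)  # projects previous hidden state h_{t-1}
        self.score = layers.Dense(1)         # scalar attention score per region

    def call(self, features, h_prev):
        # features: (batch, L, D), h_prev: (batch, hidden_dim)
        e = self.score(tf.nn.tanh(
            self.w_feat(features) + self.w_hidden(h_prev)[:, None, :]))
        alpha = tf.nn.softmax(e, axis=1)        # weights over the L regions
        z = tf.reduce_sum(alpha * features, 1)  # context z_t = sum_i alpha_i * a_i
        return z, alpha
```

At each step, z_t and the embedding of the previous word feed the LSTM cell, and the resulting hidden state is decoded into the next word as described above.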
S103, inputting the cross-modal text description of the image information into a sensitive information identification module in the sensitive image identification model to obtain a sensitive image containing sensitive information.
In one embodiment, a training data set is first constructed. In this embodiment, posts from 50 popular websites were crawled, and 35000 sensitive texts and 50000 non-sensitive texts were labeled according to network community sensitivity to form the training set.
Further, training the TextCNN convolutional neural network according to a pre-constructed training set to obtain a trained sensitive information identification module.
TextCNN is a convolutional neural network for text. It represents the text as a word-vector matrix in which each row of the input matrix corresponds to one word, and uses convolution kernels as local feature extractors; the width of each kernel matches the word-vector dimension, and several Filter-Sizes are applied simultaneously to extract n-gram features of different lengths. FIG. 8 is a block diagram illustrating a TextCNN convolutional neural network according to an exemplary embodiment. As shown in FIG. 8, the TextCNN convolutional neural network takes as input a sentence composed of n k-dimensional word vectors and produces the output result through a multi-size convolutional layer, a pooling layer and a fully connected layer.
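A minimal sketch of such a TextCNN in Keras follows, using the text length (50), dictionary size (5000) and filter sizes (2, 3, 4, 5, 7, 10) quoted in the experiments below; the embedding dimension and filter count are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_textcnn(vocab_size=5000, seq_len=50, embed_dim=128,
                  filter_sizes=(2, 3, 4, 5, 7, 10), num_filters=64):
    tokens = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(tokens)   # word-vector matrix
    pooled = []
    for k in filter_sizes:
        c = layers.Conv1D(num_filters, k, activation="relu")(x)  # n-grams of length k
        pooled.append(layers.GlobalMaxPooling1D()(c))             # keep strongest match
    x = layers.Concatenate()(pooled)
    out = layers.Dense(1, activation="sigmoid")(x)        # sensitive vs. non-sensitive
    return tf.keras.Model(tokens, out)
```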
And inputting the cross-modal text description of the image information into a sensitive information identification module to obtain the identified sensitive text information, wherein the image corresponding to the sensitive text information is the sensitive image.
The sensitive information identification module in the embodiment of the disclosure fuses a large sensitive-text knowledge base from the network community environment, and on that basis can perform more accurate and more credible sensitive information identification on the image content text.
Fig. 2 is a schematic structural diagram illustrating a sensitive image recognition model according to an exemplary embodiment, where the sensitive image recognition model includes, as shown in fig. 2: the network community image content cross-modal perception module and the network community sensitive text information identification module.
In the network community image content cross-modal perception module, the network community image is first input and image subject capture is performed based on DenseNet; during model training, a corresponding cross-modal description model is trained for each type of subject, yielding the cross-modal description model group. Each cross-modal description model comprises a feature extraction unit, which extracts features based on VGGNET, and an image content description unit, which generates the content text description of the network community image based on an attention-equipped long short-term memory (LSTM) recurrent neural network model.
In a network community sensitive text information recognition module, a TextCNN convolutional neural network is trained based on a network community priori knowledge base to obtain a trained sensitive information text recognition model, and image content text description generated by a cross-modal perception module is input into the sensitive text information recognition model to obtain a recognized sensitive image.
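At inference time the two modules compose as a simple pipeline; the sketch below shows the data flow of FIG. 2, where capture_subject, description_models and textcnn are hypothetical handles to the trained components described above:

```python
def recognize_sensitive_image(image):
    subject = capture_subject(image)          # DenseNet-based subject capture
    describer = description_models[subject]   # pick the matching cross-modal model
    caption = describer.generate(image)       # image -> content text description
    is_sensitive = textcnn.predict(caption)   # text -> sensitive / non-sensitive
    return caption, is_sensitive
```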
In an exemplary scenario, the embodiments of the present disclosure verified the above method experimentally. Experimental data were first acquired: manually annotated network community description data and MSCOCO data served as the training data set of the cross-modal perception module, with a final training set of 25000 images and a validation set of 2500 images, and experimental results verified on a 1000-image test set. A total of 35000 sensitive and 50000 non-sensitive training texts published in university network communities were used to train the sensitive text information identification module, with experimental results verified on 1000 test texts.
The experimental environment was then constructed, and the DenseNet subject capture unit was trained in a TensorFlow training environment with the maximum Epoch set to 100, Batch-Size set to 64, and the learning rate set to 0.001.
The cross-modal description model group was trained in a TensorFlow environment using an RMSProp optimizer and a cross-entropy loss function, with Batch-Size set to 16, the maximum Epoch set to 500, and a learning rate of 0.001.
The network community sensitive text information recognition module was trained in a Keras environment with the text length set to 50, the dictionary length to 5000, Batch-Size to 64, the maximum Epoch to 100, and Filter-Sizes of 2, 3, 4, 5, 7 and 10.
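For reference, the hyperparameters quoted in the three paragraphs above can be collected as follows (only the listed values come from this description; all other training details are unspecified):

```python
TRAIN_CONFIG = {
    "densenet_subject_capture": {
        "framework": "TensorFlow", "max_epoch": 100,
        "batch_size": 64, "learning_rate": 0.001},
    "cross_modal_description_group": {
        "framework": "TensorFlow", "optimizer": "RMSProp",
        "loss": "cross_entropy", "batch_size": 16,
        "max_epoch": 500, "learning_rate": 0.001},
    "textcnn_sensitive_text": {
        "framework": "Keras", "text_length": 50, "dict_length": 5000,
        "batch_size": 64, "max_epoch": 100,
        "filter_sizes": [2, 3, 4, 5, 7, 10]},
}
```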
VGGNet and DenseNet sensitive-image classification recognition models were trained as control models in a TensorFlow training environment with the maximum Epoch set to 100, Batch-Size set to 64, and the learning rate set to 0.001. VGGNET is a representative deep neural network: its smaller convolution kernels allow a deeper network, giving it an error rate below 7.3% in image recognition and classification. It is selected as control model 1 because of its prominent standing as a classical neural network in image classification and recognition. DenseNet, a more intricate descendant of ResNet in the development of classical neural networks, with its tight cross-layer connectivity and reduced memory consumption, is the most recent representative of the classical neural networks and is selected here as control model 2.
Further, the experiment began with training the DenseNet subject capture unit on the data set, using the 12 annotated classes of sensitive subjects plus the MSCOCO non-sensitive subjects as 13 classification labels; the model with the highest classification accuracy was retained as the capturer in the subject capture module.
A model was then trained for each subject category, and the model whose description texts achieved the best BLEU-standard performance on the validation-set images was selected as that category's network community cross-modal description model; together, all these dedicated models form the network community image content cross-modal description model group.
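The BLEU-based selection can be sketched with NLTK's corpus_bleu (this disclosure names the BLEU standard but not a particular implementation, so the scorer below is an assumption):

```python
from nltk.translate.bleu_score import corpus_bleu

def bleu_score(references, candidates, n=4):
    """references: one list of tokenized reference captions per image;
    candidates: one tokenized generated caption per image."""
    weights = tuple(1.0 / n for _ in range(n))   # uniform weights up to n-grams
    return corpus_bleu(references, candidates, weights=weights)

# e.g. keep the candidate model whose validation-set captions score highest:
# best = max(models, key=lambda m: bleu_score(refs, [m.describe(i) for i in val]))
```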
The TextCNN content text recognition model was trained, and the model with the highest sensitivity-recognition accuracy on the test text set was selected as the final network community image sensitive text information recognition model.
VGGNet and DenseNet image recognition classification models were trained on the same data set, and the best-performing model on the image set served as the control model.
The DenseNet subject capture unit training results are shown in the following table:
        Loss    Acc_val   Acc_test
Score   0.225   0.9206    0.917
The mean values of the performance parameters of the cross-modal description models of the contents of all the main body categories in the model group on the verification set are shown in the following table:
Bleu_1 0.7727
Bleu_2 0.6809
Bleu_3 0.6143
Bleu_4 0.5834
the semantic information expression of the cross-modal perception network community image is shown in fig. 9, and the extracted text is described as "a blank view of a building covered with fire" which is a fuzzy view of a building covered by fire.
The network community image content cross-modal description model group performs well on average score mainly because the training image set specifically targets network community sensitive image description, so the image content description model adapts well to the particular application field of the network community environment.
The image content text obtained in this process captures, in textual form, the information conveyed as the image propagates; it makes the subsequent sensitive-image judgment more interpretable, and it makes sensitive identification that fuses a network community text knowledge base possible.
The accuracy of TextCNN sensitive text recognition over a 1000 test text set is shown in the following table:
        Loss    Acc_val   Acc_test
Score   0.104   0.981     0.97
The results of the sensitive identification accuracy comparison experiment of the network community sensitive image identification model and the comparison model on the verification set and the test set based on the content text are shown in the following table:
[Table: content-text-based sensitive-identification accuracy of the network community sensitive image recognition model versus the control models on the verification and test sets; the values appear only as an image in the original publication]
The experimental results show that identifying network community sensitive images based on cross-modal perception while fusing a large amount of prior knowledge from network community sensitive-text recognition clearly improves identification and prediction accuracy compared with performing sensitive identification directly through a neural network.
In the discrimination process of the cross-modal content perception-based network community sensitive image recognition model, obtaining the cross-modal text of the image content allows a large amount of network community text sensitive-recognition prior knowledge to be fused, yielding more accurate and more intelligible sensitive image recognition results. In addition, once the text-modality content of an image is obtained, the image information and the subsequent information attached to it over time (additional comments, topic text and the like) are unified in form, so the propagation of the image information can be further traced.
The embodiment of the present disclosure further provides a cross-modality sensing-based sensitive image recognition apparatus, which is configured to execute the cross-modality sensing-based sensitive image recognition method according to the foregoing embodiment, as shown in fig. 10, and the apparatus includes:
an obtaining module 1001, configured to obtain image information to be identified in a network community;
the cross-modal description module 1002 is configured to input the image information into a cross-modal sensing module in a preset sensitive image recognition model, so as to obtain a cross-modal text description of the image information;
the identifying module 1003 is configured to input the cross-modal text description of the image information into the sensitive information identifying module in the sensitive image identifying model, so as to obtain a sensitive image containing sensitive information.
In one embodiment, further comprising:
the training module is used for constructing a training data set and training a sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-modal perception module and a sensitive information recognition module.
It should be noted that, when the cross-modal sensing-based sensitive image recognition apparatus provided in the foregoing embodiment executes the cross-modal sensing-based sensitive image recognition method, only the division of the functional modules is taken as an example, and in practical application, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the cross-modal sensing-based sensitive image identification device provided in the above embodiment and the cross-modal sensing-based sensitive image identification method embodiment belong to the same concept, and details of the implementation process are found in the method embodiment and are not described herein again.
The embodiment of the present disclosure further provides an electronic device corresponding to the cross-modal sensing-based sensitive image identification method provided in the foregoing embodiment, so as to execute the cross-modal sensing-based sensitive image identification method.
Referring to fig. 11, a schematic diagram of an electronic device provided in some embodiments of the present application is shown. As shown in fig. 11, the electronic apparatus includes: a processor 1100, a memory 1101, a bus 1102 and a communication interface 1103, the processor 1100, the communication interface 1103 and the memory 1101 being connected by the bus 1102; the memory 1101 stores a computer program that can be executed on the processor 1100, and the processor 1100 executes the computer program to execute the cross-modal perception-based sensitive image recognition method provided by any of the foregoing embodiments of the present application.
The Memory 1101 may include a Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between the system's network element and at least one other network element is realized through at least one communication interface 1103 (wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, and the like may be used.
Bus 1102 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 1101 is used for storing a program, and the processor 1100 executes the program after receiving an execution instruction, and the method for recognizing a sensitive image based on cross-modal perception disclosed in any one of the foregoing embodiments of the present application may be applied to the processor 1100, or implemented by the processor 1100.
Processor 1100 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated hardware logic circuits or by software-form instructions in the processor 1100. The Processor 1100 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or registers. The storage medium is located in the memory 1101, and the processor 1100 reads the information in the memory 1101 and completes the steps of the above method in combination with its hardware.
The electronic device provided by the embodiment of the application and the cross-mode sensing-based sensitive image identification method provided by the embodiment of the application have the same inventive concept and have the same beneficial effects as the method adopted, operated or realized by the electronic device.
Referring to FIG. 12, the computer-readable storage medium is shown as an optical disc 1200 on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the cross-modal perception-based sensitive image recognition method of any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above-mentioned embodiment of the present application and the cross-modal perception-based sensitive image identification method provided by the embodiment of the present application have the same inventive concept, and have the same beneficial effects as the method adopted, run or implemented by the application program stored in the computer-readable storage medium.
The above examples present only some embodiments of the present invention, and although their description is specific and detailed, they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and all of these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A sensitive image identification method based on cross-modal perception is characterized by comprising the following steps:
acquiring image information to be identified in a network community;
inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information;
and inputting the cross-modal text description of the image information into a sensitive information identification module in the sensitive image identification model to obtain a sensitive image containing sensitive information.
2. The method according to claim 1, wherein before inputting the image information into a preset sensitive image recognition model, the method further comprises:
and constructing a training data set, and training the sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-modal perception module and a sensitive information recognition module.
3. The method of claim 1, wherein inputting the image information into a cross-modal perception module in a preset sensitive image recognition model to obtain a cross-modal textual description of the image information comprises:
identifying a salient subject within an image according to the cross-modal perception module;
determining a cross-modal description model corresponding to the identified image main body in a pre-trained cross-modal description model group;
and carrying out generalized content text modal transformation on the main body in the image, the relationship among the main bodies and the high-level semantic information of the main body behaviors according to the cross-modal description model to obtain the cross-modal text description of the image information.
4. The method of claim 3, wherein identifying salient objects within an image according to the cross-modal perception module comprises:
identifying a salient subject within an image according to a subject capture unit in the cross-modality awareness module, the subject capture unit comprising a DenseNet-Block network structure, the calculation formula of which is as follows:
X_l = H_l([X_0, X_1, ..., X_{l-1}])
wherein [X_0, X_1, ..., X_{l-1}] represents the channel-wise concatenation of the feature maps of layers 0 to l-1, H_l denotes the normalization, activation and convolution operations performed on the concatenated features, and X_l represents the result of the l-th layer's convolution calculation.
5. The method according to claim 3, wherein performing generalized content text mode conversion on the high-level semantic information of the subject, the relationship between subjects, and the subject behavior in the image according to the cross-modal description model to obtain the cross-modal text description of the image information comprises:
extracting image features through a VGGNET network structure in the cross-modal description model;
and inputting the extracted image features into a long-term and short-term memory recurrent neural network containing an attention mechanism to obtain the cross-modal text description of the main body in the image, the relationship among the main bodies and the high-level semantic information of the main body behaviors.
6. The method of claim 1, wherein inputting the cross-modal textual description of the image information into a sensitive information recognition module in the sensitive image recognition model to obtain a sensitive image containing sensitive information comprises:
training a TextCNN convolutional neural network according to a pre-constructed training set to obtain a trained sensitive information identification module;
inputting the cross-modal text description of the image information into the sensitive information identification module to obtain the identified sensitive text information;
and taking the image corresponding to the sensitive text information as a sensitive image.
7. A sensitive image recognition device based on cross-modal perception is characterized by comprising:
the acquisition module is used for acquiring image information to be identified in the network community;
the cross-modal description module is used for inputting the image information into a cross-modal perception module in a preset sensitive image recognition model to obtain cross-modal text description of the image information;
and the identification module is used for inputting the cross-modal text description of the image information into the sensitive information identification module in the sensitive image identification model to obtain a sensitive image containing sensitive information.
8. The apparatus of claim 7, further comprising:
and the training module is used for constructing a training data set and training the sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-modal perception module and a sensitive information recognition module.
9. A cross-modal awareness based sensitive image recognition device comprising a processor and a memory storing program instructions, the processor being configured to perform the cross-modal awareness based sensitive image recognition method of any one of claims 1 to 6 when executing the program instructions.
10. A computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement a cross-modal perception based sensitive image recognition method according to any one of claims 1 to 6.
CN202110892160.1A 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium Active CN113610080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110892160.1A CN113610080B (en) 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110892160.1A CN113610080B (en) 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113610080A true CN113610080A (en) 2021-11-05
CN113610080B CN113610080B (en) 2023-08-25

Family

ID=78306845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110892160.1A Active CN113610080B (en) 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113610080B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399816A (en) * 2021-12-28 2022-04-26 北方工业大学 Community fire risk sensing method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455630A (en) * 2013-09-23 2013-12-18 江苏刻维科技信息有限公司 Internet multimedia information mining and analyzing system
US20140105825A1 (en) * 2012-10-15 2014-04-17 Visen Medical, Inc. Systems, Methods, and Apparatus for Imaging of Diffuse Media Featuring Cross-Modality Weighting of Fluorescent and Bioluminescent Sources
CN108960073A (en) * 2018-06-05 2018-12-07 大连理工大学 Cross-module state image steganalysis method towards Biomedical literature
CN109151502A (en) * 2018-10-11 2019-01-04 百度在线网络技术(北京)有限公司 Identify violation video method, device, terminal and computer readable storage medium
US20190114773A1 (en) * 2017-10-13 2019-04-18 Beijing Curacloud Technology Co., Ltd. Systems and methods for cross-modality image segmentation
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
US20200292646A1 (en) * 2019-03-14 2020-09-17 Massachusetts Institute Of Technology Interface responsive to two or more sensor modalities
CN112364198A (en) * 2020-11-17 2021-02-12 深圳大学 Cross-modal Hash retrieval method, terminal device and storage medium
CN112508077A (en) * 2020-12-02 2021-03-16 齐鲁工业大学 Social media emotion analysis method and system based on multi-modal feature fusion
CN113094533A (en) * 2021-04-07 2021-07-09 北京航空航天大学 Mixed granularity matching-based image-text cross-modal retrieval method

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140105825A1 (en) * 2012-10-15 2014-04-17 Visen Medical, Inc. Systems, Methods, and Apparatus for Imaging of Diffuse Media Featuring Cross-Modality Weighting of Fluorescent and Bioluminescent Sources
CN103455630A (en) * 2013-09-23 2013-12-18 江苏刻维科技信息有限公司 Internet multimedia information mining and analyzing system
US20190114773A1 (en) * 2017-10-13 2019-04-18 Beijing Curacloud Technology Co., Ltd. Systems and methods for cross-modality image segmentation
CN108960073A (en) * 2018-06-05 2018-12-07 大连理工大学 Cross-module state image steganalysis method towards Biomedical literature
CN109151502A (en) * 2018-10-11 2019-01-04 百度在线网络技术(北京)有限公司 Identify violation video method, device, terminal and computer readable storage medium
US20200292646A1 (en) * 2019-03-14 2020-09-17 Massachusetts Institute Of Technology Interface responsive to two or more sensor modalities
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
CN112364198A (en) * 2020-11-17 2021-02-12 深圳大学 Cross-modal Hash retrieval method, terminal device and storage medium
CN112508077A (en) * 2020-12-02 2021-03-16 齐鲁工业大学 Social media emotion analysis method and system based on multi-modal feature fusion
CN113094533A (en) * 2021-04-07 2021-07-09 北京航空航天大学 Mixed granularity matching-based image-text cross-modal retrieval method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MENG DING ET AL.: "Research on automated detection of sensitive information based on BERT", JOURNAL OF PHYSICS *
田钟林 et al.: "A real-time short-text analysis model based on domain semantic relation graphs" (一种基于领域语义关系图的短文本实时分析模型), 数据分析与知识发现 (Data Analysis and Knowledge Discovery), vol. 3, no. 2

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399816A (en) * 2021-12-28 2022-04-26 北方工业大学 Community fire risk sensing method and device

Also Published As

Publication number Publication date
CN113610080B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
US11238310B2 (en) Training data acquisition method and device, server and storage medium
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN110490081B (en) Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
CN111209384A (en) Question and answer data processing method and device based on artificial intelligence and electronic equipment
CN109919252B (en) Method for generating classifier by using few labeled images
CN112819686B (en) Image style processing method and device based on artificial intelligence and electronic equipment
US11687716B2 (en) Machine-learning techniques for augmenting electronic documents with data-verification indicators
CN112131383A (en) Specific target emotion polarity classification method
CN111475622A (en) Text classification method, device, terminal and storage medium
CN113722474A (en) Text classification method, device, equipment and storage medium
CN111523421A (en) Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
CN113221882B (en) Image text aggregation method and system for curriculum field
KR102244982B1 (en) Text filtering method and device using the image learning
CN111694959A (en) Network public opinion multi-mode emotion recognition method and system based on facial expressions and text information
US20220156489A1 (en) Machine learning techniques for identifying logical sections in unstructured data
CN113610080B (en) Cross-modal perception-based sensitive image identification method, device, equipment and medium
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN109657710B (en) Data screening method and device, server and storage medium
CN113434722B (en) Image classification method, device, equipment and computer readable storage medium
CN113129399A (en) Pattern generation
Berg et al. Do you see what I see? Measuring the semantic differences in image‐recognition services' outputs
CN116050428B (en) Intention recognition method, device, equipment and storage medium
CN116612466B (en) Content identification method, device, equipment and medium based on artificial intelligence
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
Ghaemmaghami et al. Integrated-Block: A New Combination Model to Improve Web Page Segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant