CN113610080B - Cross-modal perception-based sensitive image identification method, device, equipment and medium - Google Patents


Info

Publication number
CN113610080B
CN113610080B
Authority
CN
China
Prior art keywords
image
cross
modal
sensitive
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110892160.1A
Other languages
Chinese (zh)
Other versions
CN113610080A (en)
Inventor
吴旭
吴京宸
高丽
颉夏青
杨金翠
孙利娟
张熙
方滨兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202110892160.1A priority Critical patent/CN113610080B/en
Publication of CN113610080A publication Critical patent/CN113610080A/en
Application granted granted Critical
Publication of CN113610080B publication Critical patent/CN113610080B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The application discloses a sensitive image identification method, device, equipment and medium based on cross-modal sensing, wherein the method comprises the following steps: acquiring image information to be identified in a network community; inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain a cross-modal text description of the image information; and inputting the cross-modal text description of the image information into a sensitive information recognition module in the sensitive image recognition model to obtain a sensitive image containing sensitive information. According to the sensitive image recognition method provided by the embodiments of the present disclosure, the semantic content of network community images is expressed cross-modally, a large amount of prior knowledge about sensitive text content in the network community is fused, the content of community images is analyzed and judged more accurately, and acquiring the cross-modal text description of an image makes tracing the propagation of sensitive image information possible.

Description

Cross-modal perception-based sensitive image identification method, device, equipment and medium
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a method, apparatus, device, and medium for recognizing a sensitive image based on cross-modal sensing.
Background
With the development of the multimedia era, the image-oriented network community environment has become an important characteristic of how network information is transmitted and develops. Paying attention to this image-oriented community environment, accurately judging and identifying image information, and intervening in sensitive content in a timely and necessary manner are beneficial to maintaining the stability of the network community environment and social security.
In the prior art, the information carried by an image is expressed in an intuitive visual form; its semantic content cannot be acquired directly and must be read and understood by the human brain. With the rapid development of image processing technology in recent years, image content recognition has also improved to a certain extent. At present, sensitive image content identification and analysis at home and abroad basically adopt image classification technology. Sensitive image recognition based on image classification extracts image features through a neural network, takes the feature vectors as input to a fully connected layer at the end of the network, and obtains the image content analysis result from the output of that layer. Low-level features such as subjects, lines and colors in the sensitive image are captured through this process and discriminant analysis is performed on them. However, image semantic content such as the relationships among image subjects and the subject behaviors cannot be obtained during recognition and classification, and a large amount of network community knowledge closely related to sensitive image content recognition is formally separated from this process, so the image content recognition result cannot be combined with the prior knowledge carried by the large body of network community text for discrimination; the recognition accuracy is therefore low and the results are poorly interpretable. Moreover, the network community environment is complex, and the propagation and fermentation of image information over time are a key concern for maintaining network security, yet sensitive image recognition based on image classification cannot trace the propagation of image information.
Disclosure of Invention
The embodiment of the disclosure provides a sensitive image identification method, device, equipment and medium based on cross-modal sensing. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, an embodiment of the present disclosure provides a method for identifying a sensitive image based on cross-modal sensing, including:
acquiring image information to be identified in a network community;
inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information;
and inputting the cross-modal text description of the image information into a sensitive information recognition module in the sensitive image recognition model to obtain a sensitive image containing sensitive information.
In one embodiment, before inputting the image information into the preset sensitive image recognition model, the method further comprises:
and constructing a training data set, and training a sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-modal sensing module and a sensitive information recognition module.
In one embodiment, inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain a cross-modal text description of the image information, including:
identifying a salient body in the image according to the cross-modal sensing module;
determining a cross-modal description model corresponding to the identified image subject in the pre-trained cross-modal description model group;
and performing generalized content text modal transformation on the main body in the image, the relation among the main bodies and the advanced semantic information of the main body behavior according to the cross-modal description model to obtain cross-modal text description of the image information.
In one embodiment, identifying salient subjects within an image from a cross-modality awareness module includes:
identifying a salient subject in the image according to a subject capturing unit in the cross-modal sensing module, wherein the subject capturing unit comprises a DenseNet-Block network structure whose calculation formula is as follows:
$X_l = H_l([X_0, X_1, \ldots, X_{l-1}])$
wherein $[X_0, X_1, \ldots, X_{l-1}]$ represents the channel-wise concatenation of the feature maps of layers 0 to l-1, $H_l$ represents the normalization, activation and convolution operations applied to the concatenated features, and $X_l$ represents the result of the l-th layer convolution calculation.
In one embodiment, performing generalized content text modal transformation on subjects within an image, relationships between subjects, and advanced semantic information of subject behavior according to a cross-modal description model to obtain a cross-modal text description of the image information, including:
extracting image features through a VGGNET network structure in a cross-modal description model;
and inputting the extracted image features into a long-term and short-term memory recurrent neural network containing an attention mechanism to obtain a main body in the image, a relation among the main bodies and a cross-modal text description of high-level semantic information of main body behaviors.
In one embodiment, inputting a cross-modal text description of image information into a sensitive information recognition module in a sensitive image recognition model to obtain a sensitive image containing sensitive information, comprising:
training a textCNN convolutional neural network according to a pre-constructed training set to obtain a trained sensitive information identification module;
inputting the cross-modal text description of the image information into a sensitive information identification module to obtain identified sensitive text information;
and taking the image corresponding to the sensitive text information as a sensitive image.
In a second aspect, embodiments of the present disclosure provide a sensitive image recognition apparatus based on cross-modal sensing, including:
the acquisition module is used for acquiring image information to be identified in the network community;
the cross-modal description module is used for inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information;
the identification module is used for inputting the cross-modal text description of the image information into the sensitive information identification module in the sensitive image identification model to obtain a sensitive image containing sensitive information.
In one embodiment, further comprising:
the training module is used for constructing a training data set and training a sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-mode sensing module and a sensitive information recognition module.
In a third aspect, an embodiment of the present disclosure provides a sensitive image recognition device based on cross-modal sensing, including a processor and a memory storing program instructions, where the processor is configured to execute the sensitive image recognition method based on cross-modal sensing provided in the above embodiment when executing the program instructions.
In a fourth aspect, embodiments of the present disclosure provide a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement a cross-modal awareness based sensitive image recognition method provided by the above embodiments.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
Embodiments of the present disclosure provide a network community sensitive image recognition model (SIR-CM) based on cross-modal content perception. The model mainly comprises an image content cross-modal perception module and a network community sensitive text information recognition module. SIR-CM implements a generalized network community image content-to-text conversion module on the MSCOCO dataset and a network community sensitive image annotation dataset, and can produce fine-grained cross-modal text expressions of community image content. The network community sensitive text information recognition module fuses prior knowledge from the sensitive content text corpus of the network community environment, giving it the ability to analyze and recognize sensitive information in that environment and yielding more accurate and more interpretable sensitive image recognition results. In addition, added comments, the fermentation of topics and the like are information propagated along the time dimension; once the text-modal content of an image is obtained, the image information and related follow-up information such as additional comments and topic texts become unified in form, so the propagation of the image information can further be traced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow diagram illustrating a cross-modality perception based sensitive image recognition method in accordance with an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating the structure of a sensitive image recognition model according to an exemplary embodiment;
FIG. 3 is a schematic diagram of a subject capture unit, shown in accordance with an exemplary embodiment;
FIG. 4 is a diagram illustrating a content text context generation process according to an example embodiment;
FIG. 5 is a diagram illustrating a hidden state generation process according to an example embodiment;
FIG. 6 is a diagram illustrating a process of generating a gate variable according to an example embodiment;
FIG. 7 is a diagram illustrating a current word generation process according to an example embodiment;
FIG. 8 is a block diagram of a textCNN convolutional neural network, shown in accordance with an exemplary embodiment;
FIG. 9 is a diagram illustrating an image content description according to an exemplary embodiment;
FIG. 10 is a schematic diagram illustrating a configuration of a cross-modal awareness based sensitive image recognition device, in accordance with an exemplary embodiment;
FIG. 11 is a schematic diagram illustrating a configuration of a cross-modality awareness based sensitive image recognition device according to an exemplary embodiment;
fig. 12 is a schematic diagram of a computer storage medium shown according to an example embodiment.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the application to enable those skilled in the art to practice them.
It should be understood that the described embodiments are merely some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of systems and methods that are consistent with aspects of the application as detailed in the accompanying claims.
In the description of the present application, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application will be understood in specific cases by those of ordinary skill in the art. Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
FIG. 1 is a diagram illustrating a cross-modal awareness based sensitive image recognition method, as shown in FIG. 1, that specifically includes the following steps.
S101, acquiring image information to be identified in a network community.
The embodiment provides a network community sensitive image recognition model based on cross-modal content perception by utilizing an image description technology, aims at cross-modal expression perception of semantic information content of a network community image, fuses a large amount of prior knowledge of network community sensitive text content, performs more accurate and more understandable analysis and discrimination on the content of the community image, and enables propagation and traceability of sensitive image information to be possible by acquiring cross-modal content text of the network community image.
Firstly, acquiring image information to be identified in a network community, and then inputting the acquired image into a trained sensitive image identification model for identification.
S102, inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information.
In one embodiment, before inputting the image information into the preset sensitive image recognition model, the method further comprises: and constructing a network community sensitive image annotation data set, and training a sensitive image recognition model based on the network community sensitive image annotation data set and the MSCOCO data set.
In one exemplary scenario, a manually annotated network community image dataset and the MSCOCO dataset are used as the training dataset for the cross-modal perception module: 25000 images are used for training, 2500 images in the validation set are used for validation, and experimental results are verified on 1000 images in the test set, yielding a trained sensitive image recognition model. The sensitive image recognition model comprises a cross-modal sensing module and a sensitive information recognition module, and the cross-modal text description of the image information is obtained through the cross-modal sensing module.
In one embodiment, obtaining a cross-modal text description of image information from a cross-modal awareness module includes: and identifying a significant main body in the image according to the cross-modal sensing module, determining a cross-modal description model corresponding to the identified image main body in the pre-trained cross-modal description model group, and performing generalized content text modal conversion on the main body in the image, the relation among the main bodies and the high-level semantic information of the main body behavior according to the cross-modal description model to obtain cross-modal text description of the image information.
Specifically, the cross-modal sensing module comprises a subject capturing unit, which is used to identify the salient subject in the image and perform a pre-analysis for the subsequent cross-modal step, improving the accuracy of the image-to-text modal conversion.
FIG. 3 is a schematic diagram of the subject capturing unit according to an exemplary embodiment. As shown in FIG. 3, the subject capturing unit comprises a 3-layer DenseNet-Block network structure, whose calculation formula is as follows:
$X_l = H_l([X_0, X_1, \ldots, X_{l-1}])$
wherein $[X_0, X_1, \ldots, X_{l-1}]$ represents the channel-wise concatenation of the feature maps of layers 0 to l-1, $H_l$ represents the normalization, activation and convolution operations applied to the concatenated features, and $X_l$ represents the result of the l-th layer convolution calculation.
The result of each layer's convolution is obtained according to this formula, and the per-layer results are combined to obtain the features of the identified image subject.
The DenseNet network structure concatenates the output features of layers 0 to l on the channel dimension, which alleviates the vanishing-gradient problem during CNN training and requires fewer training parameters, so it stands out in training the subject capturing unit.
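As an illustration only, the following is a minimal tf.keras sketch of one such DenseNet-Block, in which every layer receives the channel-wise concatenation of all earlier feature maps; the growth rate, kernel sizes and the stem convolution are illustrative assumptions, since the description above only fixes the 3-layer dense connectivity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=3, growth_rate=32):
    """X_l = H_l([X_0, X_1, ..., X_{l-1}]): each layer sees all earlier feature maps."""
    features = [x]
    for _ in range(num_layers):
        concat = layers.Concatenate(axis=-1)(features) if len(features) > 1 else features[0]
        h = layers.BatchNormalization()(concat)               # normalization operation
        h = layers.ReLU()(h)                                   # activation operation
        h = layers.Conv2D(growth_rate, 3, padding="same")(h)   # convolution operation
        features.append(h)
    return layers.Concatenate(axis=-1)(features)               # combine per-layer results

inputs = tf.keras.Input(shape=(224, 224, 3))
stem = layers.Conv2D(64, 7, strides=2, padding="same")(inputs)  # assumed stem convolution
subject_features = dense_block(stem)
subject_capturer = tf.keras.Model(inputs, subject_features)
```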
Further, after the subjects of different images are obtained in the training process of the model, training a plurality of cross-modal description models based on the identified image subjects to obtain a trained cross-modal description model group.
And then selecting a cross-modal description model which is most matched with the main body from the cross-modal description model group according to the identified image main body, and obtaining cross-modal text description of the image information based on the selected cross-modal description model.
Further, obtaining a cross-modal text description of the image information according to the cross-modal description model, including:
and extracting image features according to a feature extraction unit in the cross-modal description model, and then inputting the extracted image features into an image content cross-modal unit to obtain text description of the image content.
In an alternative embodiment, feature extraction is performed using VGGNET pre-trained on the ImageNet standard dataset of tens of millions of images, because its smaller convolution kernels and deeper network make its feature extraction capability more suitable for the model herein than other networks. The network takes 224×224 images as input and uses convolution kernels to extract global and local features. An activation function increases the nonlinearity of the network structure, and pooling layers compress the input feature maps, reducing the computational complexity of the network; the commonly used max-pooling also has the effect of highlighting the main features.
In the VGGNET structure, only the smallest 3×3 convolution kernel is used, which increases the network depth for the same receptive field. The first layer uses 64 3×3 convolution kernels with a stride of 1×1 and produces a 224×224×64 feature map; the second layer uses 64 3×3 convolution kernels on this 224×224×64 input, and its output then passes through a 2×2 max-pooling layer to give 112×112×64. This constitutes one whole convolution segment. The entire VGGNET has 5 convolution segments, and this embodiment uses the 14×14×512-dimensional output of the conv5_3 layer of the VGGNET network as the feature representation, which is finally flattened into feature vectors with region number L and dimension D, denoted as $\{a_1, \ldots, a_i, \ldots, a_L\}$, where L = 14×14 = 196 and the dimension D is 512. The convolution calculation formula is as follows:
$B(i,j) = \sum_{m=1}\sum_{n=1} K(m,n) \times A(i-m+1,\ j-n+1)$
wherein A is a convolved matrix, K is a convolution kernel, and B is a convolution result.
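For illustration, the following small numpy sketch evaluates this formula directly (a "valid" 2-D convolution, i.e. a sliding window with the kernel flipped); the toy matrices are arbitrary and the boundary handling is one possible reading of the index range.

```python
import numpy as np

def conv2d_valid(A, K):
    """B(i,j) = sum_m sum_n K(m,n) * A(i-m+1, j-n+1), over positions where the window fits."""
    Kf = np.flip(K)                                # convolution flips the kernel
    kh, kw = K.shape
    out_h, out_w = A.shape[0] - kh + 1, A.shape[1] - kw + 1
    B = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            B[i, j] = np.sum(Kf * A[i:i + kh, j:j + kw])
    return B

A = np.arange(16.0).reshape(4, 4)                  # toy "convolved matrix"
K = np.array([[1.0, 0.0], [0.0, -1.0]])            # toy convolution kernel
print(conv2d_valid(A, K))                          # 3x3 convolution result
```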
The formula for the activation function is as follows:
tanh(x)=2σ(2x)-1;
wherein σ(x) is the sigmoid function; the activation function improves the nonlinearity of the feature extraction, and a MaxPooling function applied after activation retains the maximum value of each region.
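As a sketch under the assumption that the Keras VGG16 weights pre-trained on ImageNet stand in for the VGGNET described above, the block5_conv3 layer (the conv5_3 of the text) yields the 14×14×512 map, which is then flattened into L = 196 region features of dimension D = 512; the random array is only a placeholder image.

```python
import numpy as np
import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                  input_shape=(224, 224, 3))
feature_extractor = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv3").output)

image = np.random.rand(1, 224, 224, 3).astype("float32") * 255.0    # placeholder image
features = feature_extractor(tf.keras.applications.vgg16.preprocess_input(image))
regions = tf.reshape(features, (-1, 14 * 14, 512))                   # {a_1 ... a_196}, D = 512
```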
Further, the extracted image features are input into an image content cross-mode unit to obtain text description of the image content.
Specifically, the input of the image content cross-modal unit is the image feature vectors obtained by the feature extraction unit, and a long short-term memory (LSTM) recurrent neural network injected with an attention mechanism is used to generate the content text description of the network community image.
The attention-injected LSTM uses the focusing mechanism, through self-learned attention weights, to pay adaptive attention of different degrees to different parts of the input image, so that a more accurate text description of the image features is achieved based on the distribution of attention weights. Under continuous memory, the LSTM produces a natural-language text description that expresses the important features and forgets the irrelevant ones.
FIG. 4 is a diagram illustrating the content text context generation process according to an exemplary embodiment. As shown in FIG. 4, the image features A are acted upon by the attention module to obtain C contexts of dimension D, $\{Z_1, \ldots, Z_t, \ldots, Z_C\}$, where C can be understood as the word length of the output content text and $Z_t$ as the D-dimensional feature representing the context of each word. The context generation accompanies the word-by-word generation of $y_t$.
The context $Z_t$ is the weighted sum of the original features A, with weights $\alpha_{t,i}$, namely:
$Z_t = \sum_{i=1}^{L} \alpha_{t,i}\, a_i$
The weight vector has length L = 196, one attention value for each image feature region. The weights are obtained from the hidden variable $h_{t-1}$ of the previous step through a fully connected layer, as shown in FIG. 5, which illustrates the hidden state generation process according to an exemplary embodiment; in addition, the first step has no hidden state, so its weights are generated entirely from the image features.
Further, FIG. 6 is a diagram illustrating the gate variable generation process according to an exemplary embodiment. As shown in FIG. 6, the hidden variable $h_t$ simulates a memory function: from the previous hidden state, the input gate $i_t$, output gate $o_t$ and forget gate $f_t$ are generated for the context; the candidate $g_t$ controls the strength of the current input, and the memory cell $c_t$ controls how much of the previous word is stored. The hidden state $h_t$ is jointly controlled by the memory $c_t$ and the output gate $o_t$.
Further, FIG. 7 is a diagram illustrating the current word generation process according to an exemplary embodiment. As shown in FIG. 7, the current hidden variable is then used to generate the current output word $y_t$ through a fully connected network.
The text description of the image content is obtained from the words generated one by one in this way.
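Purely as an illustration of one decoding step of this attention-LSTM, the following numpy sketch computes attention weights from the previous hidden state, forms the context $Z_t$ as the weighted sum of the region features, and draws the current word from a softmax over the dictionary; all weight matrices are random stand-ins for the learned parameters.

```python
import numpy as np

L, D, H, V = 196, 512, 512, 5000             # regions, feature dim, hidden dim, dictionary size
a = np.random.randn(L, D)                     # region features {a_1 ... a_L}
h_prev = np.random.randn(H)                   # hidden state h_{t-1} from the previous step

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W_att = 0.01 * np.random.randn(L, H)
alpha = softmax(W_att @ h_prev)               # one attention weight per image region
z_t = alpha @ a                               # context Z_t = sum_i alpha_{t,i} * a_i

W_out = 0.01 * np.random.randn(V, H + D)
y_t = softmax(W_out @ np.concatenate([h_prev, z_t]))   # distribution over dictionary words
print(int(y_t.argmax()))                      # index of the word generated at this step
```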
S103, inputting the cross-modal text description of the image information into a sensitive information recognition module in the sensitive image recognition model to obtain a sensitive image containing sensitive information.
In one embodiment, a training data set is first constructed: this embodiment crawls posts from 50 popular websites and, according to network community sensitivity, distinguishes 35000 sensitive texts and 50000 non-sensitive texts as the training set.
Further, training the textCNN convolutional neural network according to a pre-constructed training set to obtain a trained sensitive information identification module.
TextCNN is a convolutional neural network for text. It represents the text as a matrix of word vectors, with each row of the input matrix representing one word; the convolution kernel acts as a local feature extractor whose width matches the word-vector length, and different Filter-Sizes extract n-gram features of different spans. FIG. 8 is a block diagram of a TextCNN convolutional neural network according to an exemplary embodiment. As shown in FIG. 8, the TextCNN convolutional neural network takes as input sentences composed of n k-dimensional word vectors and then outputs the result through multi-scale feature-extracting convolution layers, a pooling layer, and a fully connected layer.
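The following is a minimal tf.keras sketch of such a TextCNN, using the text length 50, dictionary size 5000 and Filter-Sizes 2, 3, 4, 5, 7 and 10 reported later in this description; the embedding dimension and the number of filters per size are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_textcnn(seq_len=50, vocab_size=5000, embed_dim=128,
                  filter_sizes=(2, 3, 4, 5, 7, 10), num_filters=64):
    inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inputs)         # word-vector matrix
    pooled = []
    for fs in filter_sizes:                                     # n-gram style local extractors
        c = layers.Conv1D(num_filters, fs, activation="relu")(x)
        pooled.append(layers.GlobalMaxPooling1D()(c))
    x = layers.Concatenate()(pooled)
    outputs = layers.Dense(1, activation="sigmoid")(x)          # sensitive vs. non-sensitive
    return tf.keras.Model(inputs, outputs)

model = build_textcnn()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```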
The cross-modal text description of the image information is input into a sensitive information identification module, the identified sensitive text information is obtained, and the image corresponding to the sensitive text information is the sensitive image.
The sensitive information identification module in the embodiment of the disclosure fuses a large number of sensitive text knowledge bases in a network community environment, and can identify the sensitive information of the image content text more accurately and reliably on the basis of the knowledge bases.
FIG. 2 is a schematic diagram of the structure of a sensitive image recognition model, as shown in FIG. 2, according to an exemplary embodiment, the sensitive image recognition model comprising: the network community image content cross-mode sensing module and the network community sensitive text information identification module.
In the network community image content cross-modal perception module, a network community image is first input, then the image subject is captured based on DenseNet, and during model training a corresponding cross-modal description model is trained separately for each type of subject, yielding the cross-modal description model group. Each cross-modal description model comprises a feature extraction unit, which extracts features based on VGGNET, and an image content description unit, which generates the content text description of the network community image based on an LSTM model injected with an attention mechanism.
In the network community sensitive text information recognition module, a TextCNN convolutional neural network is trained on a network community prior knowledge base to obtain a trained sensitive text information recognition model, and the image content text description generated by the cross-modal perception module is input into this model to obtain the recognized sensitive images.
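Putting the two modules together, the following sketch shows the overall two-stage inference flow in plain Python; the function and dictionary names are hypothetical placeholders for the trained subject capturer, the cross-modal description model group and the TextCNN sensitive-text recognizer, and the 0.5 decision threshold is an assumption.

```python
def identify_sensitive_images(images, capture_subject, description_models, sensitive_text_score):
    """Two-stage flow: image -> cross-modal text description -> sensitive-text recognition."""
    results = []
    for img in images:
        subject = capture_subject(img)               # DenseNet-based subject capture
        describe = description_models[subject]       # pick the model matching this subject
        caption = describe(img)                      # cross-modal text description of the image
        if sensitive_text_score(caption) >= 0.5:     # TextCNN sensitive-text probability
            results.append((img, caption))           # image judged sensitive, with its description
    return results
```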
In one exemplary scenario, the disclosed embodiments verify the above method experimentally. First, the experimental data are acquired: manually annotated network community description data and MSCOCO data are used as the training dataset of the cross-modal perception module, with 25000 images finally used for training, 2500 images for validation on the validation set, and experimental results verified on 1000 images in the test set. The sensitive text information recognition module is trained on a total of 35000 sensitive and 50000 non-sensitive training texts from self-crawled college network community posts, and experimental results are verified on 1000 test texts.
Then, an experimental environment was constructed, and a DenseNet subject capture unit was trained in a Tensorflow training environment with a maximum Epoch set to 100, batch-Size set to 64, and a learning rate set to 0.001.
The cross-modal description model group was trained in a Tensorflow environment using the RMSprop optimizer and a cross-entropy loss function, with Batch-Size set to 16, maximum Epoch set to 500, and a learning rate of 0.001.
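A minimal sketch of this training configuration in tf.keras is shown below (RMSprop optimizer, cross-entropy loss, batch size 16, learning rate 0.001); the toy model, the random placeholder data and the reduced epoch count are stand-ins for the real description model, the image-feature/caption pairs and the 500-epoch schedule.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(512,)),
    tf.keras.layers.Dense(5000, activation="softmax"),    # dictionary length 5000
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",     # cross-entropy loss
              metrics=["accuracy"])

x = np.random.rand(160, 512).astype("float32")            # placeholder image features
y = np.random.randint(0, 5000, size=(160,))               # placeholder word indices
model.fit(x, y, batch_size=16, epochs=2)                  # 500 epochs in the reported experiments
```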
The text length is set to 50, the dictionary length is 5000, the Batch-Size is set to 64, the maximum Epoch is set to 100, and the filter-Size is set to 2, 3, 4, 5, 7 and 10 respectively in the Keras environment to train the web community sensitive text information recognition module.
In a Tensorflow training environment, with maximum Epoch set to 100, Batch-Size set to 64, and the learning rate set to 0.001, VGGNet and DenseNet sensitive image classification models were trained as control models. VGGNET is one of the representative deep neural networks; its smaller convolution kernels allow a deeper network, achieving an error rate below 7.3% in image recognition and classification. Because of its prominent standing as a classical neural network in image classification, this embodiment selects it as control model 1. DenseNet, as a more elaborate successor to ResNet, offers tight cross-layer connections and reduced memory consumption that make it a state-of-the-art representative of classical neural networks; it is chosen herein as control model 2.
Further, the experiment began by training the DenseNet subject capturing unit on the data set, using the 12 labeled sensitive subjects plus the MSCOCO non-sensitive subjects as 13 classification labels. The subject capture module keeps the trained model with the highest classification accuracy as the capturer.
A model is trained for each subject category, and the one whose descriptive text performs best on the validation-set images under the BLEU standard is selected as that category's network community cross-modal description model. All of these dedicated models together constitute the network community image content cross-modal description model group.
The TextCNN content text recognition model training selects the model with the highest recognition sensitivity accuracy on the test text set as the final network community image sensitive text information recognition model.
VGGNet and DenseNet image classification models are trained on the data set, and the one performing best on the image set is taken as the control model.
The training results of the DenseNet subject capture unit are shown in the following table:
         Loss     Acc_val    Acc_test
Score    0.225    0.9206     0.917
The mean value of the performance parameters of the cross-modal description model of the content of each subject category in the model group on the verification set is shown in the following table:
Bleu_1 0.7727
Bleu_2 0.6809
Bleu_3 0.6143
Bleu_4 0.5834
the cross-modal perceived web community image semantic information is represented in fig. 9, and the extracted text is described as "a blurry view of a building covered with fire", i.e. "a blurred view of a building covered by fire".
The mean scores of the network community image content cross-modal description model group are relatively good, mainly because this embodiment carries out targeted training for network community sensitive image description on the training image set, so the image content description model adapts well to the specific application field of the network community environment.
The image content text obtained in this process makes it possible to grasp the information content in text form as the image information propagates, gives stronger interpretability to the discrimination result of the subsequent sensitive image recognition step, and makes it possible to fuse the network community text knowledge base into the sensitive recognition.
The accuracy of TextCNN sensitive text recognition on 1000 test text sets is shown in the following table:
         Loss     Acc_val    Acc_test
Score    0.104    0.981      0.97
The result of the sensitive recognition accuracy rate comparison experiment of the web community sensitive image recognition model and the comparison model on the verification set and the test set based on the content text is shown in the following table:
Experimental results show that recognizing network community sensitive images based on cross-modal perception, while fusing a large amount of prior knowledge for sensitive text discrimination in the network community, clearly improves the discrimination and prediction accuracy achieved herein compared with performing sensitive recognition directly through a neural network.
In the discrimination process of the network community sensitive image recognition model based on cross-modal content perception, obtaining the cross-modal text of the image content allows a large amount of prior knowledge for network community sensitive text recognition to be fused, so a more accurate and more interpretable sensitive image recognition result can be obtained. In addition, added comments, the fermentation of topics and the like are information propagated along the time dimension; once the text-modal content of an image is obtained, the image information and related follow-up information such as additional comments and topic texts become unified in form, so the propagation of the image information can further be traced.
The embodiment of the disclosure further provides a sensitive image recognition device based on cross-modal sensing, which is configured to execute the sensitive image recognition method based on cross-modal sensing in the foregoing embodiment, as shown in fig. 10, and the device includes:
an obtaining module 1001, configured to obtain image information to be identified in a network community;
the cross-modal description module 1002 is configured to input image information into a cross-modal sensing module in a preset sensitive image recognition model, so as to obtain a cross-modal text description of the image information;
the recognition module 1003 is configured to input a cross-modal text description of the image information into the sensitive information recognition module in the sensitive image recognition model, so as to obtain a sensitive image containing sensitive information.
In one embodiment, further comprising:
the training module is used for constructing a training data set and training a sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-mode sensing module and a sensitive information recognition module.
It should be noted that, when the sensitive image recognition device based on cross-modal sensing provided in the foregoing embodiment performs the sensitive image recognition method based on cross-modal sensing, the division into the above functional modules is used only as an illustration; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the sensitive image recognition device based on cross-modal sensing provided in the above embodiment belongs to the same concept as the embodiment of the sensitive image recognition method based on cross-modal sensing; its detailed implementation process is described in the method embodiment and is not repeated herein.
The embodiment of the disclosure also provides an electronic device corresponding to the sensitive image recognition method based on cross-modal sensing provided by the previous embodiment, so as to execute the sensitive image recognition method based on cross-modal sensing.
Referring to fig. 11, a schematic diagram of an electronic device according to some embodiments of the application is shown. As shown in fig. 11, the electronic device includes: processor 1100, memory 1101, bus 1102, and communication interface 1103, processor 1100, communication interface 1103, and memory 1101 being connected by bus 1102; the memory 1101 stores a computer program executable on the processor 1100, and when the processor 1100 runs the computer program, the method for identifying sensitive images based on cross-modal sensing provided by any one of the foregoing embodiments of the present application is executed.
The memory 1101 may include a high-speed random access memory (RAM: random Access Memory), and may further include a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 1103 (which may be wired or wireless), and may use the internet, a wide area network, a local network, a metropolitan area network, etc.
Bus 1102 may be an ISA bus, PCI bus, EISA bus, or the like. The buses may be divided into address buses, data buses, control buses, etc. The memory 1101 is configured to store a program, and the processor 1100 executes the program after receiving an execution instruction, and the sensitive image recognition method based on cross-modal sensing disclosed in any of the foregoing embodiments of the present application may be applied to the processor 1100 or implemented by the processor 1100.
The processor 1100 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the methods described above may be performed by integrated logic circuitry in hardware or instructions in software in processor 1100. The processor 1100 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; but may also be a Digital Signal Processor (DSP), application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 1101, and the processor 1100 reads information in the memory 1101, and performs the steps of the above method in combination with its hardware.
The electronic equipment provided by the embodiment of the application and the sensitive image recognition method based on cross-modal sensing provided by the embodiment of the application have the same beneficial effects as the method adopted, operated or realized by the same inventive concept.
An embodiment of the present application further provides a computer readable storage medium corresponding to the cross-modal sensing-based sensitive image recognition method provided in the foregoing embodiment, referring to fig. 12, the computer readable storage medium is shown as an optical disc 1200, on which a computer program (i.e. a program product) is stored, where the computer program, when executed by a processor, performs the cross-modal sensing-based sensitive image recognition method provided in any of the foregoing embodiments.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
The computer readable storage medium provided by the above embodiment of the present application has the same beneficial effects as the method adopted, operated or implemented by the application program stored in the computer readable storage medium, because of the same inventive concept as the method for identifying the sensitive image based on cross-modal sensing provided by the embodiment of the present application.
The foregoing examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (8)

1. The sensitive image recognition method based on cross-modal sensing is characterized by comprising the following steps of:
acquiring image information to be identified in a network community;
inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information; comprising the following steps: identifying a significant subject within an image according to the cross-modal awareness module; determining a cross-modal description model corresponding to the identified image subject in the pre-trained cross-modal description model group; performing generalized content text modal transformation on the main body in the image, the relation among the main bodies and the advanced semantic information of the main body behavior according to the cross-modal description model to obtain cross-modal text description of the image information;
extracting image features through a VGGNET network structure in the cross-modal description model; inputting the extracted image features into a long-term and short-term memory cyclic neural network containing an attention mechanism to obtain a main body in the image, a relation among the main bodies and a cross-modal text description of high-level semantic information of main body behaviors;
and inputting the cross-modal text description of the image information into a sensitive information recognition module in the sensitive image recognition model to obtain a sensitive image containing sensitive information.
2. The method of claim 1, further comprising, prior to inputting the image information into a pre-set sensitive image recognition model:
and constructing a training data set, and training the sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-modal sensing module and a sensitive information recognition module.
3. The method of claim 1, wherein identifying salient subjects within an image according to the cross-modality awareness module comprises:
identifying a significant subject in an image according to a subject capturing unit in the cross-modal sensing module, wherein the subject capturing unit comprises a DenseNet-Block network structure, and the calculation formula is as follows:
$X_l = H_l([X_0, X_1, \ldots, X_{l-1}])$
wherein $[X_0, X_1, \ldots, X_{l-1}]$ represents the channel-wise concatenation of the feature maps of layers 0 to l-1, $H_l$ represents the normalization, activation and convolution operations applied to the concatenated features, and $X_l$ represents the result of the l-th layer convolution calculation.
4. The method of claim 1, wherein inputting the cross-modal text description of the image information into the sensitive information recognition module in the sensitive image recognition model results in a sensitive image containing sensitive information, comprising:
training a textCNN convolutional neural network according to a pre-constructed training set to obtain a trained sensitive information identification module;
inputting the cross-modal text description of the image information into the sensitive information identification module to obtain identified sensitive text information;
and taking the image corresponding to the sensitive text information as a sensitive image.
5. A cross-modal awareness based sensitive image recognition device, comprising:
the acquisition module is used for acquiring image information to be identified in the network community;
the cross-modal description module is used for inputting the image information into a cross-modal sensing module in a preset sensitive image recognition model to obtain cross-modal text description of the image information; comprising the following steps: identifying a significant subject within an image according to the cross-modal awareness module; determining a cross-modal description model corresponding to the identified image subject in the pre-trained cross-modal description model group; performing generalized content text modal transformation on the main body in the image, the relation among the main bodies and the advanced semantic information of the main body behavior according to the cross-modal description model to obtain cross-modal text description of the image information;
extracting image features through a VGGNET network structure in the cross-modal description model; inputting the extracted image features into a long-term and short-term memory cyclic neural network containing an attention mechanism to obtain a main body in the image, a relation among the main bodies and a cross-modal text description of high-level semantic information of main body behaviors;
the identification module is used for inputting the cross-modal text description of the image information into the sensitive information identification module in the sensitive image identification model to obtain a sensitive image containing sensitive information.
6. The apparatus as recited in claim 5, further comprising:
the training module is used for constructing a training data set and training the sensitive image recognition model based on the training data set, wherein the sensitive image recognition model comprises a cross-modal sensing module and a sensitive information recognition module.
7. A cross-modal awareness based sensitive image recognition device comprising a processor and a memory storing program instructions, the processor being configured, when executing the program instructions, to perform the cross-modal awareness based sensitive image recognition method of any one of claims 1 to 4.
8. A computer readable medium having stored thereon computer readable instructions executable by a processor to implement a cross-modality perception based sensitive image recognition method as claimed in any one of claims 1 to 4.
CN202110892160.1A 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium Active CN113610080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110892160.1A CN113610080B (en) 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110892160.1A CN113610080B (en) 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium

Publications (2)

Publication Number     Publication Date
CN113610080A (en)      2021-11-05
CN113610080B (en)      2023-08-25

Family

ID=78306845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110892160.1A Active CN113610080B (en) 2021-08-04 2021-08-04 Cross-modal perception-based sensitive image identification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113610080B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399816B (en) * 2021-12-28 2023-04-07 北方工业大学 Community fire risk sensing method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455630A (en) * 2013-09-23 2013-12-18 江苏刻维科技信息有限公司 Internet multimedia information mining and analyzing system
CN108960073A (en) * 2018-06-05 2018-12-07 大连理工大学 Cross-module state image steganalysis method towards Biomedical literature
CN109151502A (en) * 2018-10-11 2019-01-04 百度在线网络技术(北京)有限公司 Identify violation video method, device, terminal and computer readable storage medium
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
CN112364198A (en) * 2020-11-17 2021-02-12 深圳大学 Cross-modal Hash retrieval method, terminal device and storage medium
CN112508077A (en) * 2020-12-02 2021-03-16 齐鲁工业大学 Social media emotion analysis method and system based on multi-modal feature fusion
CN113094533A (en) * 2021-04-07 2021-07-09 北京航空航天大学 Mixed granularity matching-based image-text cross-modal retrieval method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014062716A1 (en) * 2012-10-15 2014-04-24 Visen Medical, Inc. Systems, methods, and apparatus for imaging of diffuse media featuring cross-modality weighting of fluorescent and bioluminescent sources
US10769791B2 (en) * 2017-10-13 2020-09-08 Beijing Keya Medical Technology Co., Ltd. Systems and methods for cross-modality image segmentation
US11768262B2 (en) * 2019-03-14 2023-09-26 Massachusetts Institute Of Technology Interface responsive to two or more sensor modalities


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on automated detection of sensitive information based on BERT; Meng Ding et al.; Journal of Physics; full text *

Also Published As

Publication number Publication date
CN113610080A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
RU2701995C2 (en) Automatic determination of set of categories for document classification
Singh et al. Image classification: a survey
US11238310B2 (en) Training data acquisition method and device, server and storage medium
CN112270196B (en) Entity relationship identification method and device and electronic equipment
US20210295114A1 (en) Method and apparatus for extracting structured data from image, and device
CN111767228B (en) Interface testing method, device, equipment and medium based on artificial intelligence
CN111209384A (en) Question and answer data processing method and device based on artificial intelligence and electronic equipment
CN109783666A (en) A kind of image scene map generation method based on iteration fining
CN112257441B (en) Named entity recognition enhancement method based on counterfactual generation
CN111475622A (en) Text classification method, device, terminal and storage medium
CN111523421A (en) Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
Salewski et al. Clevr-x: A visual reasoning dataset for natural language explanations
CN113722474A (en) Text classification method, device, equipment and storage medium
CN113610080B (en) Cross-modal perception-based sensitive image identification method, device, equipment and medium
Zhu et al. NAGNet: A novel framework for real‐time students' sentiment analysis in the wisdom classroom
Lin et al. Detecting multimedia generated by large ai models: A survey
CN113434722B (en) Image classification method, device, equipment and computer readable storage medium
CN111767710B (en) Indonesia emotion classification method, device, equipment and medium
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
Berg et al. Do you see what I see? Measuring the semantic differences in image‐recognition services' outputs
Ghaemmaghami et al. Integrated-Block: A New Combination Model to Improve Web Page Segmentation
Wang et al. Multi‐Task and Attention Collaborative Network for Facial Emotion Recognition
Yue et al. NRSTRNet: A Novel Network for Noise-Robust Scene Text Recognition
CN115658964B (en) Training method and device for pre-training model and somatosensory wind identification model
WO2024066927A1 (en) Training method and apparatus for image classification model, and device

Legal Events

Date Code Title Description
PB01    Publication
SE01    Entry into force of request for substantive examination
GR01    Patent grant