CN113033610B - Multi-mode fusion sensitive information classification detection method - Google Patents

Multi-mode fusion sensitive information classification detection method Download PDF

Info

Publication number
CN113033610B
CN113033610B (application CN202110203458.7A)
Authority
CN
China
Prior art keywords
sensitive
emotion
classification
text
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110203458.7A
Other languages
Chinese (zh)
Other versions
CN113033610A (en)
Inventor
张志勇
宋斌
张蓝方
梁腾翔
徐艳艳
苗坤霖
赵长伟
黄帅娜
李静
张孝国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Science and Technology
Original Assignee
Henan University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Science and Technology
Priority to CN202110203458.7A
Publication of CN113033610A
Application granted
Publication of CN113033610B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

A multi-modal fusion sensitive information classification detection method comprises: step 1, primary sensitivity detection of texts and pictures; step 2, emotion-based judgment of text sensitivity; and step 3, multi-modal sensitivity detection with image-text fusion. Sensitivity detection is performed on the text and the picture separately, and the sensitivity of the content can be judged accurately by taking into account the influence of emotion polarity and intensity on the sensitive information. The image-text sensitivity problem is solved with a suitable fusion method, and the detection precision is high.

Description

Multi-mode fusion sensitive information classification detection method
Technical Field
The invention relates to the technical field of internet, in particular to a multi-mode fusion sensitive information classification detection method.
Background
The number of netizens worldwide is huge, and online social networks have become the preferred platform for information interaction. As social networks become ever more popular and widely used, their information, carried by pictures, texts, audio and video, shows trends of diversification, complication and massive growth, and a large amount of sensitive information floods social networks, seriously affecting network safety and people's physical and mental health. How to detect sensitive information efficiently and accurately with artificial intelligence technology has become an urgent problem for academia and industry.
Most existing research on sensitive information detection performs sensitive identification with single-modal features, i.e., single-modal data analysis. For example, the paper "Hate Speech on Twitter: A Pragmatic Approach to Collect Hateful and Offensive Expressions and Perform Hate Speech Detection" by Watanabe et al. proposes a hate-speech detection method for Twitter that automatically detects linguistic patterns and the most common hateful phrase combinations and, by combining emotion and semantic features, classifies tweets as hateful, offensive or clean. The article "Sensitive Information Detection on Cyber-Space" by Lin M. et al. proposes an iteration-based semi-supervised deep learning model and a humming-melody-based search model to detect abnormal audio and video information. The paper "Text classification based on deep belief network and softmax regression" by Jiang M. et al. proposes a hybrid text classification model based on a deep belief network and Softmax regression: the deep belief network solves the sparse high-dimensional matrix computation problem of text data, and after the DBN extracts features, Softmax regression classifies texts in the learned feature space. The paper "Convolutional Neural Network for Pornographic Images Classification" by I Made Artha Agastya et al. proposes pornographic picture classification based on a convolutional neural network, adapting it to pornographic picture detection by changing the learning rate, the algorithm and the structure of the fully connected layers, which improves the accuracy of the detection result. The paper "Bag of Tricks for Efficient Text Classification" by Joulin A. et al. proposes a fast text classifier that matches deep learning classifiers in accuracy while being many orders of magnitude faster in training and evaluation. The paper "Multimodal Sentiment Analysis To Explore the Structure of Emotions" by Anthony Hu et al. proposes a multimodal sentiment analysis method that uses fusion technology to combine multimodal features, fully mining users' emotion types and improving emotion classification accuracy.
Some achievements have been made in sensitive information detection, but the following problems remain: 1) The internal connections and complementary roles between multi-modal data features are not considered. In fact, in sensitive information detection it is necessary to consider the interaction among modalities, since mutually complementary information gives a fuller understanding of the sensitive information. 2) Although the importance of emotion factors for text sensitivity detection has been partially considered, the influence of emotion polarity and emotion intensity on the text sensitivity judgment is ignored. 3) The picture classification problem is neglected: detecting illegal pictures is in essence a picture classification task. Sensitive pictures have numerous sensitive features whose feature parts are difficult to extract, and a simple binary sensitive/non-sensitive classification yields low accuracy. Compared with single-modal sensitivity methods, a global sensitivity analysis of the tweet has unique advantages, so it is more reasonable and accurate to consider how the data of the modalities act jointly on the result.
Disclosure of Invention
In order to solve the above technical problems, and aiming at the insufficient and inaccurate detection of sensitive information in social networks, the invention provides a deep-learning-based multi-modal fusion sensitive information classification detection method that combines the two modalities of text and picture.
In order to achieve this technical purpose, the adopted technical scheme is as follows: a multi-modal fusion sensitive information classification detection method comprises the following steps:
step 1, carrying out sensitivity primary detection on texts and pictures
Adopting FastText to detect the sensitivity of the text, judging which sensitive class or non-sensitive class the text belongs to, and obtaining a classification probability set of the text, recorded as:
P_w = [p_w1, p_w2, ..., p_wn]
detecting the sensitivity of the picture with an InceptionV3 network, judging which sensitive class or non-sensitive class the picture belongs to, and obtaining a classification probability set of the picture, recorded as:
P_t = [p_t1, p_t2, ..., p_tn]
wherein n represents the number of classifications of the pictures or the texts, the number of picture classifications and the number of text classifications being equal; if the text belongs to a sensitive class, step 2 is executed, and if the text belongs to the non-sensitive class, step 3 is executed;
step 2, judging the sensitivity of the text based on the emotion
Step 2.1, segmenting the text into a plurality of words by adopting jieba word segmentation, matching the words with the existing emotion word library and sensitive word library to obtain an emotion word set and a sensitive word set, and performing Cartesian product operation on the two sets to judge whether the emotion words and the sensitive words co-occur or not, wherein the emotion words have emotion polarity intensity comprising emotion polarity and emotion intensity;
step 2.2, judging the sensitivity of the text by combining the emotional polarity intensity of the emotional words and the sensitive words, wherein the calculation method comprises the following steps:
PositiveSensitiveCount = Σ_{i=1..n} Σ_j λ C+(w_i, w_j)  (2)
NegativeSensitiveCount = Σ_{i=1..n} Σ_j β C-(w_i, w_j)  (3)
AllSensitiveCount = PositiveSensitiveCount - NegativeSensitiveCount  (4)
wherein PositiveSensitiveCount represents the positive emotion score of the sensitive words, NegativeSensitiveCount represents the negative emotion score of the sensitive words, AllSensitiveCount represents the overall emotion score of the sensitive words, C+(w_i, w_j) is the number of co-occurrences of sensitive word w_i with positive emotion word w_j, C-(w_i, w_j) is the number of co-occurrences of sensitive word w_i with negative emotion word w_j, n is the total number of words after jieba word segmentation, λ is the positive emotion intensity of the emotion word, and β is the negative emotion intensity of the emotion word;
step 2.3, if the overall emotion score AllSensitiveCount is greater than 0, the text is directly judged as its original sensitive classification, and the probability set formed by the sensitive classification probabilities is still recorded as:
P_w = [p_w1, p_w2, ..., p_wn]
when AllSensitiveCount is less than or equal to 0, a secondary judgment is needed: the word frequency of the sensitive words is calculated, and when it is greater than a set threshold the original sensitive classification is kept, the probability set formed by the sensitive classification probabilities still being recorded as:
P_w = [p_w1, p_w2, ..., p_wn]
otherwise, the text is judged as the other classification, and the probability set formed by the sensitive classification probabilities is recorded as: P_g = [0, 0, ..., 0, 1];
Step 3, multi-modal sensitivity detection of image-text fusion
The sensitivity classification probability of the text and the sensitivity classification probability of the picture are passed through the fusion algorithm to obtain the sensitivity type probabilities P_i; MAX(P_i) takes the maximum sensitivity type probability P, and the sensitive classification corresponding to P is taken as the final sensitive classification result.
The final sensitive classification probability P is calculated as follows:
P_i = w · p_wi + (1 - w) · p_ti  (5)
P = MAX(P_i)  (6)
wherein w is the fusion weight with value range [0, 1], and P_i is the sensitive type probability distribution.
Whether an emotion word and a sensitive word co-occur is judged by the shortest-distance principle: within the sentence, the emotion word at the minimum distance from the sensitive word is taken as co-occurring with the sensitive word.
The beneficial effects of the invention are as follows: the invention provides a deep-learning-based multi-modal fusion sensitive information classification detection method that can accurately judge the sensitivity of content by combining the influence of emotion polarity and intensity on sensitive information. The image-text sensitivity problem is solved with a suitable fusion method, and the detection precision is high.
Drawings
FIG. 1 is a diagram of a detection framework of the present invention;
FIG. 2 is a flow chart of the detection according to the present invention.
Detailed Description
The multi-modal fusion sensitive information classification detection method provided by the invention can be roughly divided into three stages: an image-text sensitive feature extraction stage, a sensitivity detection and classification stage, and an image-text feature fusion stage; the complete framework is shown in FIG. 1. It mainly comprises three parts: text sensitive information classification, picture sensitive information classification, and image-text fusion sensitive classification based on the two. The text sensitive classification model performs the final sensitive classification of the text by combining the training result of the FastText model with the text classification word library. The picture sensitive classification model first loads a skeleton model and adjusts the parameters of each layer, then performs fine-tuning training on the picture training set, and finally applies the trained model to picture classification on the test set. Combining the sensitive classification probabilities of the text and the picture, the final image-text fusion sensitive classification probability is calculated according to the fusion formula.
The invention combines two modes of text and picture to detect and classify sensitive information. The sensitivity detection needs to be performed on the text and the picture respectively, and then the detection results are subjected to fusion processing to obtain the final detection result.
1. Primary detection of text sensitivity. The invention uses FastText to detect the sensitivity of the text, determining which of the sensitive classes or the non-sensitive class the text belongs to; for example, the text may be classified into four classes: sensitive class A, sensitive class B, sensitive class C, and others. The sensitive classes comprise sensitive class A, sensitive class B and sensitive class C (three sensitive classes), while "others" can be a single class or several classes that do not belong to the sensitive classes. FastText is a machine learning training tool that integrates word2vec, text classification and more, and is a simple, efficient text classification model. The FastText model has three layers: an input layer, a hidden layer and an output layer. The model first decomposes the text vocabulary into character-level N-gram features and adds them to the original words to obtain a text sequence (x_1, x_2, ..., x_{n-1}, x_n) as the data input of the network input layer. The hidden layer is the averaged superposition of the word vectors, and the output layer finally outputs the classified category. FastText adopts hierarchical Softmax, constructing a Huffman tree according to class frequency, which greatly reduces the number of model prediction targets and improves the training and classification efficiency of the model.
The text is classified for sensitivity with the FastText method, and the probability set of each sensitive classification is obtained, recorded as:
P_w = [p_w1, p_w2, ..., p_wn]
where n, the number of sensitive classifications of the text, is consistent with the number of picture classifications.
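As an illustration only, this primary text detection step maps naturally onto the open-source fasttext Python package; in the sketch below the training file name train.txt, the label names and the hyper-parameter values are assumptions, not values fixed by the invention.

```python
# A minimal sketch of the FastText primary text classification step.
# train.txt is assumed to hold one jieba-segmented sample per line, e.g.
#   __label__sensitiveA word1 word2 word3 ...
import fasttext

# loss="hs" selects hierarchical softmax, mirroring the Huffman-tree
# construction described above.
model = fasttext.train_supervised(
    input="train.txt", lr=0.1, epoch=25, wordNgrams=2, loss="hs")

LABELS = ["sensitiveA", "sensitiveB", "sensitiveC", "other"]  # assumed names

def text_probability_set(segmented_text):
    """Return the probability set P_w in a fixed label order, so it can be
    compared element-wise with the picture set P_t later."""
    labels, probs = model.predict(segmented_text, k=len(LABELS))
    by_label = dict(zip(labels, probs))
    return [float(by_label.get(f"__label__{c}", 0.0)) for c in LABELS]
```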
2. A text sensitivity judgment method based on fine-grained emotion. Detecting and classifying text sensitivity directly with the FastText model carries a certain error. For example, a tweet may contain a certain amount of terrorism-related information while its context reflects objection to and condemnation of that sensitive information; directly labelling such text as a sensitive type would certainly be biased. The author's subjective emotion in the text therefore plays a decisive role in judging the sensitivity of the text, and the invention introduces emotion polarity to judge the overall sensitivity of the text. If the text belongs to a sensitive class in the primary detection of text sensitivity, the fine-grained-emotion text sensitivity judgment method is executed; if it belongs to the non-sensitive class, the method is not performed.
(1) Fine-grained emotion analysis. Taking the emotion polarity of the text into account, fine-grained emotion analysis is adopted according to the emotion polarity and strength of the emotion words in the text, and a sensitive information identification method based on co-occurrence analysis of emotion words and sensitive words is proposed. The invention matches the emotion words in the text against the Dalian University of Technology emotion vocabulary ontology library, in which each emotion word carries one of three emotion polarities: positive (1), negative (-1) and neutral (0). An emotion word has an emotion polarity strength comprising emotion polarity and emotion intensity; emotion polarity strength is a higher-level treatment in emotion analysis. The invention sets the value range of the emotion strength to [-3, 3]: the sign of the value represents the emotion polarity (negative values for negative emotion, positive values for positive emotion, 0 for a neutral attitude) and its magnitude represents the emotion intensity. The text is segmented into words with jieba word segmentation, the words are matched against the existing emotion word library and sensitive word library to obtain an emotion word set and a sensitive word set, and the Cartesian product of the two sets is computed. The word frequency of co-occurring emotion words and sensitive words and the emotion intensity of each emotion word are obtained according to whether the elements of the Cartesian product co-occur, from which the emotion polarity of the text is calculated. Co-occurrence in this method means an emotion word and a sensitive word occurring together: by the shortest-distance principle, the emotion word at the minimum distance from the sensitive word within the sentence is taken as co-occurring with it. The distance between the two is calculated as follows:
dis(w_i, w_j) = |index(w_i) - index(w_j)|  (1)
wherein dis(w_i, w_j) denotes the distance between w_i and w_j, and index(w_i) and index(w_j) are the position subscripts of the two words in the word sequence after jieba segmentation; the subscript of the first word is 1 and increases in order.
(2) Emotion-based sensitivity determination. The invention reflects the emotional characteristics of the text in two aspects, emotion polarity and emotion intensity, and judges the sensitivity of the text by combining the emotion polarity strength of the emotion words with the sensitive words. The calculation is as follows:
PositiveSensitiveCount = Σ_{i=1..n} Σ_j λ C+(w_i, w_j)  (2)
NegativeSensitiveCount = Σ_{i=1..n} Σ_j β C-(w_i, w_j)  (3)
AllSensitiveCount = PositiveSensitiveCount - NegativeSensitiveCount  (4)
wherein PositiveSensitiveCount represents the positive emotion score of the sensitive words, NegativeSensitiveCount represents the negative emotion score of the sensitive words, AllSensitiveCount represents the overall emotion score of the sensitive words, C+(w_i, w_j) is the number of co-occurrences of sensitive word w_i with positive emotion word w_j, C-(w_i, w_j) is the number of co-occurrences of sensitive word w_i with negative emotion word w_j, n is the total number of words after jieba word segmentation, λ is the positive emotion intensity of the emotion word, and β is the negative emotion intensity of the emotion word.
The invention divides emotion polarity into three categories, following experience and the results of most researchers. Because most sensitive words carry a negative part of speech, combining them with positive emotion indicates that the sensitive content is supported or condoned, so sensitive text containing positive emotion more readily leads to the conclusion of sensitive information. If the overall emotion score AllSensitiveCount is greater than 0, the text is directly judged as its original sensitive classification, and the probability set formed by the sensitive classification probabilities is still recorded as:
P_w = [p_w1, p_w2, ..., p_wn]
When AllSensitiveCount is less than or equal to 0, a secondary judgment is needed: the word frequency of the sensitive words is calculated, and when it is greater than a set threshold the original sensitive classification is kept, the probability set still being recorded as:
P_w = [p_w1, p_w2, ..., p_wn]
Otherwise the text is judged as the other classification, and the probability set formed by the sensitive classification probabilities is recorded as P_g = [0, 0, ..., 0, 1], i.e. the probability of the "others" class is 1 and all remaining probabilities are 0.
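Continuing the sketch, formulas (2)-(4) and the step 2.3 decision could be computed as below; the frequency threshold of 5 is taken from the worked example in the embodiment and, like the helper names, is an assumption.

```python
def emotion_based_sensitivity(words, pairs, emotion_lexicon,
                              sensitive_lexicon, P_w, freq_threshold=5):
    """Apply formulas (2)-(4), then the step 2.3 decision rule."""
    positive = negative = 0.0
    for ws, we in pairs:
        strength = emotion_lexicon[we]     # polarity strength in [-3, 3]
        if strength > 0:
            positive += strength           # (2): lambda summed over co-occurrences
        elif strength < 0:
            negative += -strength          # (3): beta summed over co-occurrences
    all_count = positive - negative        # (4): AllSensitiveCount
    if all_count > 0:
        return P_w                         # keep the original sensitive class
    freq = sum(1 for w in words if w in sensitive_lexicon)
    if freq > freq_threshold:
        return P_w                         # secondary judgment: still sensitive
    return [0.0] * (len(P_w) - 1) + [1.0]  # P_g = [0, 0, ..., 0, 1]
```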
3. Picture sensitivity detection. The invention uses an InceptionV3 network to detect the sensitivity of pictures. The number of sensitive classes is the same as for the text; for example, if the text is classified into four classes (sensitive class A, sensitive class B, sensitive class C and others), the picture is likewise classified into those four classes. The skeleton model is loaded first to build a pre-trained model without a classifier; a global average pooling layer is then added, which saves a large number of parameters, accelerates computation and reduces overfitting while adding a layer of non-linearity that expands the model's expressive capability; this is followed by a fully connected layer with 512 nodes and finally an output layer with 4 nodes, with probability filtering by a Softmax activation function. The picture's sensitivity is detected through the InceptionV3 network, the sensitive or non-sensitive class it belongs to is judged, and its classification probability set is obtained, recorded as:
P_t = [p_t1, p_t2, ..., p_tn]
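The picture branch as described (pre-trained skeleton without a classifier, global average pooling, a 512-node fully connected layer and a 4-node Softmax output) maps directly onto Keras; the ImageNet weights and the 224 × 224 input size of the embodiment below are assumptions of this sketch.

```python
# Sketch of the InceptionV3 picture sensitivity classifier.
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import layers, models

base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3))       # skeleton, no classifier
x = layers.GlobalAveragePooling2D()(base.output)    # fewer parameters, less overfitting
x = layers.Dense(512, activation="relu")(x)         # 512-node fully connected layer
out = layers.Dense(4, activation="softmax")(x)      # classes A, B, C, others
picture_model = models.Model(base.input, out)       # predicts the set P_t
```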
4. Multi-modal sensitivity detection with image-text fusion. The invention adopts a decision-layer fusion strategy: the sensitivity classification probability of the text and the sensitivity classification probability of the picture are passed through the fusion algorithm to obtain the sensitivity type probabilities P_i, MAX(P_i) takes the maximum sensitivity type probability P, and the sensitive classification corresponding to P is taken as the final sensitive classification result.
Compared with single-modal sensitivity detection, the image-text fusion mode can effectively make the features complementary. The final classification probability distribution is calculated as follows:
P_i = w · p_wi + (1 - w) · p_ti  (5)
P = MAX(P_i)  (6)
wherein w is the fusion weight with value range [0, 1] and P_i is the sensitive type probability distribution. By introducing the fusion weight w, detection is converted from a single-modal method (text or picture alone) into a multi-modal method determined jointly by both: p_wi and p_ti are multiplied by their respective fusion weights and added to obtain the sensitive type probability P_i, MAX(P_i) takes the maximum type probability, and after the fusion algorithm the corresponding classification is taken as the final sensitive classification result.
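The decision-layer fusion of formulas (5) and (6) then reduces to a weighted element-wise sum followed by taking the maximum; a short sketch (the class names are assumed) follows.

```python
CLASSES = ["sensitiveA", "sensitiveB", "sensitiveC", "other"]  # assumed order

def fuse(P_w, P_t, w=0.5):
    """Formulas (5) and (6): fuse the text and picture probability sets."""
    P_i = [w * pw + (1 - w) * pt for pw, pt in zip(P_w, P_t)]  # (5)
    best = max(range(len(P_i)), key=P_i.__getitem__)           # (6) MAX(P_i)
    return CLASSES[best], P_i[best]
```

With w = 0.5, as in the embodiment below, text and picture contribute equally to the fused probability.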
Example 1
In order to verify the effectiveness of the invention, different types of pictures crawled from the web by a crawler program are divided into four classes: sensitive class A, sensitive class B, sensitive class C, and others. The sensitive text data set is obtained by manually splicing and recombining related sensitive words and likewise comprises sensitive class A, sensitive class B, sensitive class C, and others. The normal text data set is derived from a set of normal microblog comments. The technical scheme of the invention can be implemented as follows:
(1) Sensitive model training stage. The text model is trained with the sensitive word library under jieba word segmentation. For the picture model, the data set is first expanded by randomly flipping, cropping and zooming the input pictures to increase their diversity; the pictures are then normalized to the same size with consistent label lengths, the input size being uniformly set to 3 × 224 × 224 and the batch_size to 32. To further improve performance, the model is fine-tuned with rmsprop as the optimizer, the cross-entropy function as the loss function and accuracy as the evaluation metric, the learning rate being set to 0.001, as in the sketch below.
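Under the stated configuration, the fine-tuning stage could be wired up in Keras roughly as follows, reusing picture_model from the earlier sketch; the directory layout and the epoch count are assumptions.

```python
# Sketch of augmentation and fine-tuning: random flips/shifts/zooms,
# 224 x 224 x 3 inputs, batch_size 32, rmsprop, cross-entropy, lr 0.001.
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255, horizontal_flip=True,
                             zoom_range=0.2, width_shift_range=0.1,
                             height_shift_range=0.1)
train_gen = datagen.flow_from_directory("pictures/train",      # assumed path
                                        target_size=(224, 224), batch_size=32)
picture_model.compile(optimizer=RMSprop(learning_rate=0.001),
                      loss="categorical_crossentropy", metrics=["accuracy"])
picture_model.fit(train_gen, epochs=10)                        # assumed epochs
```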
(2) Image-text detection result fusion stage. Setting the fusion weight w to 0.5 indicates that the text and the picture have equal influence on the sensitivity determination.
(3) Detection result evaluation stage. The method performs sensitivity detection on 1000 tweets containing both pictures and text, and the detection results are evaluated by accuracy, recall rate and F value.
(4) A specific example is given.
After a tweet whose text reads "support the public security department's severe crackdown on X work, oppose X education" and which contains pictures related to X work activities is published on a social network, the algorithm first uses FastText to classify the text as sensitive class B information. Then, according to the emotion word and sensitive word co-occurrence algorithm, the emotion word set is {"support", "severe crackdown", "oppose"} and the sensitive word set is {"X work", "X education"}; the Cartesian product of the two sets is {("support", "X work"), ("support", "X education"), ("severe crackdown", "X work"), ("severe crackdown", "X education"), ("oppose", "X work"), ("oppose", "X education")}, which the co-occurrence screening reduces to {("severe crackdown", "X work"), ("oppose", "X education")}. Formulas (2), (3) and (4) give PositiveSensitiveCount = 0 and NegativeSensitiveCount = 3 (assuming the emotion intensity of "severe crackdown" is 2, with sensitive words specified as negative and negative emotion words also taking negative values), hence AllSensitiveCount = -3; since the number of sensitive words is less than the threshold 5, the text is judged to be of a non-sensitive type. The InceptionV3 model then detects the picture content, the detection result is a sensitive class B picture, and according to the multi-modal fusion algorithm the whole tweet is judged to be sensitive class B. In reality the tweet is clearly opposing the X work, but because the picture is sensitive class B content, which is sensitive information not allowed to be published, the whole tweet still counts as sensitive and is therefore judged to be sensitive information, namely sensitive class B. Conversely, if the picture were a normal (others class) picture, the tweet would be non-sensitive information.

Claims (3)

1. A multi-mode fusion sensitive information classification detection method, characterized by comprising the following steps:
step 1, carrying out sensitivity primary detection on texts and pictures
Adopting FastText to detect the sensitivity of the text, judging which sensitive class or non-sensitive class the text belongs to, and obtaining a classification probability set of the text, recorded as:
P_w = [p_w1, p_w2, ..., p_wn]
detecting the sensitivity of the picture with an InceptionV3 network, judging which sensitive class or non-sensitive class the picture belongs to, and obtaining a classification probability set of the picture, recorded as:
P_t = [p_t1, p_t2, ..., p_tn]
wherein n represents the number of classifications of the pictures or the texts, the number of picture classifications and the number of text classifications being equal; if the text belongs to a sensitive class, step 2 is executed, and if the text belongs to the non-sensitive class, step 3 is executed;
step 2, judging the text sensitivity based on emotion
Step 2.1, segmenting the text into a plurality of words by adopting jieba word segmentation, matching the words with the existing emotion word library and sensitive word library to obtain an emotion word set and a sensitive word set, and performing Cartesian product operation on the two sets to judge whether emotion and sensitive words coexist, wherein the emotion words have emotion polarity strength including emotion polarity and emotion strength;
step 2.2, judging the sensitivity of the text by combining the emotional polarity intensity of the emotional words and the sensitive words, wherein the calculation method comprises the following steps:
PositiveSensitiveCount = Σ_{i=1..n} Σ_j λ C+(w_i, w_j)  (2)
NegativeSensitiveCount = Σ_{i=1..n} Σ_j β C-(w_i, w_j)  (3)
AllSensitiveCount = PositiveSensitiveCount - NegativeSensitiveCount  (4)
wherein PositiveSensitiveCount represents the positive emotion score of the sensitive words, NegativeSensitiveCount represents the negative emotion score of the sensitive words, AllSensitiveCount represents the overall emotion score of the sensitive words, C+(w_i, w_j) is the number of co-occurrences of sensitive word w_i with positive emotion word w_j, C-(w_i, w_j) is the number of co-occurrences of sensitive word w_i with negative emotion word w_j, n is the total number of words after jieba word segmentation, λ is the positive emotion intensity of the emotion word, and β is the negative emotion intensity of the emotion word;
step 2.3, if the overall emotion score AllSensitiveCount is greater than 0, the text is directly judged as its original sensitive classification, and the probability set formed by the sensitive classification probabilities is still recorded as:
P_w = [p_w1, p_w2, ..., p_wn]
when AllSensitiveCount is less than or equal to 0, a secondary judgment is needed: the word frequency of the sensitive words is calculated, and when it is greater than a set threshold the original sensitive classification is kept, the probability set formed by the sensitive classification probabilities still being recorded as:
P_w = [p_w1, p_w2, ..., p_wn]
otherwise, the text is judged as the other classification, and the probability set formed by the sensitive classification probabilities is recorded as: P_g = [0, 0, ..., 0, 1];
Step 3, multi-modal sensitivity detection of image-text fusion
Obtaining the sensitivity type probabilities P_i by the fusion algorithm from the sensitivity classification probability of the text and the sensitivity classification probability of the picture, taking the maximum sensitivity type probability P through MAX(P_i), and taking the sensitive classification corresponding to P as the final sensitive classification result.
2. The method according to claim 1, characterized in that the final sensitive classification probability P is calculated as follows:
P_i = w · p_wi + (1 - w) · p_ti  (5)
P = MAX(P_i)  (6)
wherein w is the fusion weight with value range [0, 1], and P_i is the sensitive type probability distribution.
3. The method according to claim 1, characterized in that whether an emotion word and a sensitive word co-occur is judged by the shortest-distance principle: within the sentence, the emotion word at the minimum distance from the sensitive word is taken as co-occurring with the sensitive word.
CN202110203458.7A 2021-02-23 2021-02-23 Multi-mode fusion sensitive information classification detection method Active CN113033610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110203458.7A CN113033610B (en) 2021-02-23 2021-02-23 Multi-mode fusion sensitive information classification detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110203458.7A CN113033610B (en) 2021-02-23 2021-02-23 Multi-mode fusion sensitive information classification detection method

Publications (2)

Publication Number Publication Date
CN113033610A (en) 2021-06-25
CN113033610B (en) 2022-09-13

Family

ID=76460956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110203458.7A Active CN113033610B (en) 2021-02-23 2021-02-23 Multi-mode fusion sensitive information classification detection method

Country Status (1)

Country Link
CN (1) CN113033610B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627550A (en) * 2021-08-17 2021-11-09 北京计算机技术及应用研究所 Image-text emotion analysis method based on multi-mode fusion
CN115909374A (en) * 2021-09-30 2023-04-04 腾讯科技(深圳)有限公司 Information identification method, device, equipment, storage medium and program product
CN114579964A (en) * 2022-04-29 2022-06-03 成都明途科技有限公司 Information monitoring method and device, electronic equipment and storage medium
CN114782670A (en) * 2022-05-11 2022-07-22 中航信移动科技有限公司 Multi-mode sensitive information identification method, equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832663A (en) * 2017-09-30 2018-03-23 天津大学 A kind of multi-modal sentiment analysis method based on quantum theory
CN107908715A (en) * 2017-11-10 2018-04-13 中国民航大学 Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion
CN108984530A (en) * 2018-07-23 2018-12-11 北京信息科技大学 A kind of detection method and detection system of network sensitive content
CN109934260A (en) * 2019-01-31 2019-06-25 中国科学院信息工程研究所 Image, text and data fusion sensibility classification method and device based on random forest
CN110874531A (en) * 2020-01-20 2020-03-10 湖南蚁坊软件股份有限公司 Topic analysis method and device and storage medium
CN112256878A (en) * 2020-10-29 2021-01-22 沈阳农业大学 Rice knowledge text classification method based on deep convolution

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200293874A1 (en) * 2019-03-12 2020-09-17 Microsoft Technology Licensing, Llc Matching based intent understanding with transfer learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832663A (en) * 2017-09-30 2018-03-23 天津大学 A kind of multi-modal sentiment analysis method based on quantum theory
CN107908715A (en) * 2017-11-10 2018-04-13 中国民航大学 Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion
CN108984530A (en) * 2018-07-23 2018-12-11 北京信息科技大学 A kind of detection method and detection system of network sensitive content
CN109934260A (en) * 2019-01-31 2019-06-25 中国科学院信息工程研究所 Image, text and data fusion sensibility classification method and device based on random forest
CN110874531A (en) * 2020-01-20 2020-03-10 湖南蚁坊软件股份有限公司 Topic analysis method and device and storage medium
CN112256878A (en) * 2020-10-29 2021-01-22 沈阳农业大学 Rice knowledge text classification method based on deep convolution

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Survey of Popular Image and Text Analysis Techniques; Rahul Suresh et al.; 2019 4th International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS); 2020-03-12; pp. 1-8 *
Recognition of sensitive text in pictures based on spatial-transformation dense convolutional network; Lin Jinzhao et al.; Computer Systems & Applications; January 2020; Vol. 29, No. 1; pp. 137-143 *

Also Published As

Publication number Publication date
CN113033610A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN113033610B (en) Multi-mode fusion sensitive information classification detection method
CN109933664B (en) Fine-grained emotion analysis improvement method based on emotion word embedding
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
KR102008845B1 (en) Automatic classification method of unstructured data
Sundararajan et al. Multi-rule based ensemble feature selection model for sarcasm type detection in twitter
CN109670039B (en) Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis
CN107092596A (en) Text emotion analysis method based on attention CNNs and CCR
CN113343126B (en) Rumor detection method based on event and propagation structure
CN116578705A (en) Microblog emotion classification method based on pre-training language model and integrated neural network
CN110909529A (en) User emotion analysis and prejudgment system of company image promotion system
CN106599824A (en) GIF cartoon emotion identification method based on emotion pairs
Rauf et al. Using bert for checking the polarity of movie reviews
CN115329085A (en) Social robot classification method and system
Saha et al. The Corporeality of Infotainment on Fans Feedback Towards Sports Comment Employing Convolutional Long-Short Term Neural Network
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN111259651A (en) User emotion analysis method based on multi-model fusion
Shalinda et al. Hate words detection among sri lankan social media text messages
Al-Onazi et al. Modified Seagull Optimization with Deep Learning for Affect Classification in Arabic Tweets
Jalani et al. Performance of Sentiment Classification on Tweets of Clothing Brands
Gao et al. Chinese short text classification method based on word embedding and Long Short-Term Memory Neural Network
Cumalat Puig Sentiment analysis on short Spanish and Catalan texts using contextual word embeddings
Lora et al. Ben-sarc: A corpus for sarcasm detection from bengali social media comments and its baseline evaluation
Lv et al. Stakeholder opinion classification for supporting large-scale transportation project decision making
Agbesi et al. Multichannel 2D-CNN Attention-Based BiLSTM Method for Low-Resource Ewe Sentiment Analysis
CN113535948B (en) LSTM-Attention text classification method introducing essential point information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant