CN111897952A - Sensitive data discovery method for social media - Google Patents

Sensitive data discovery method for social media

Info

Publication number
CN111897952A
Authority
CN
China
Prior art keywords
word
sensitive information
words
document
topic
Prior art date
Legal status
Granted
Application number
CN202010523627.0A
Other languages
Chinese (zh)
Other versions
CN111897952B (en)
Inventor
杨翊
朱嘉奇
王宏安
Current Assignee
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Software of CAS
Priority to CN202010523627.0A
Publication of CN111897952A
Application granted
Publication of CN111897952B
Status: Active

Classifications

    • G06F16/35 — Information retrieval of unstructured textual data: clustering; classification
    • G06F18/24 — Pattern recognition; analysing: classification techniques
    • G06F40/194 — Handling natural language data; text processing: calculation of difference between files
    • G06F40/242 — Natural language analysis; lexical tools: dictionaries
    • G06F40/284 — Recognition of textual entities: lexical analysis, e.g. tokenisation or collocates
    • G06Q50/01 — ICT specially adapted for specific business sectors: social networking


Abstract

The invention provides a social media-oriented sensitive data discovery method in the field of artificial intelligence. Using a topic model and a word vector model, the method implements a weakly supervised text classification algorithm that exploits word similarity and word co-occurrence information within documents. By setting only a small number of keywords related to each type of sensitive information and combining them with word vectors trained on a large-scale corpus, the method classifies and filters sensitive information, solving the problem of sensitive data discovery on social media efficiently and at low cost.

Description

Sensitive data discovery method for social media
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a social media-oriented sensitive data discovery method.
Background
Social media, including news websites, forums, microblogs, WeChat and the like, have become part of daily life: people acquire and exchange information through them, so social media information is growing explosively. In applications such as public opinion analysis and public security investigation, task-relevant sensitive information must be discovered within this massive volume of data. The task is very challenging: traditional keyword matching and supervised classification algorithms struggle to solve the sensitive information discovery problem efficiently and accurately at social media scale.
Much sensitive information is communicated through jargon and coded terminology to evade surveillance. Traditional keyword filtering methods filter sensitive information by constructing a sensitive information dictionary and matching character strings. For example, application No. CN201911195301.3 discloses a data center query system in which a front page manually defines patterns of sensitive data, and the data are then matched one by one through regular-expression matching and dictionary matching to find sensitive data. However, because of polysemy, non-standard social media wording and jargon, such methods match a large amount of irrelevant information when applied to social media data; screening meaningful content from it consumes considerable manpower and has poor timeliness.
Meanwhile, text classification models based on neural network algorithms require experts to label a large amount of sensitive information before model training, and such classification algorithms have been successful in many scenarios. For example, application No. CN201911195301.3 discloses a sensitive information discovery method and system based on text recognition, which identifies sensitive information by collecting sample data, constructing data features, building a labeled training data set, and then constructing a classification model with the CatBoost algorithm. However, the large amount of labeling such algorithms require is time-consuming and labor-intensive, and the frequent changes of social media terminology make it difficult to label sensitive information in quantity efficiently.
Disclosure of Invention
Aiming at these problems, the invention provides a social media-oriented sensitive data discovery method. Using a topic model and a word vector model, it implements a weakly supervised text classification algorithm that exploits word similarity and word co-occurrence information within documents. By setting only a small number of keywords related to each type of sensitive information and combining them with word vectors trained on a large-scale corpus, it classifies and filters sensitive information, solving the problem of sensitive data discovery on social media efficiently and at low cost.
The invention solves the technical problems through the following technical means:
a social media-oriented sensitive data discovery method comprises the following steps:
extracting all words of the documents to be examined to obtain the document vocabulary;
calculating, based on word vectors, the maximum similarity between each document word and the representative words of each type of sensitive information, and taking it as the similarity between the document word and that type of sensitive information, wherein each type of sensitive information forms a sensitive information category and the representative words are the annotated keywords of each type of sensitive information;
inputting the similarities between the document words and the sensitive information categories into a weakly supervised text classification model to obtain topic words and the corresponding documents;
calculating the similarity between the topic words and the sensitive information categories, and if the similarity is above a set threshold and the number of such topic words is not less than a given number, judging that the topic is consistent with that sensitive information category and is a sensitive information topic;
and screening out, among the documents of a sensitive information topic, those whose maximum probability topic is that sensitive information category; if a document's maximum topic probability is greater than a set threshold, judging that the document content belongs to sensitive data.
Further, if the sensitive information is general and common, word vectors published on the internet can be used, such as the large-scale, high-quality Chinese word vector data open-sourced by Tencent AI Lab; if the content is highly domain-specific, a large amount of text corpora of the related domains must be crawled, the obtained content segmented into words, and word vectors then trained through a word vector model.
Further, the word vector model includes the Word2Vec and GloVe algorithms.
Further, the cosine similarity between the word vectors of each document word and each representative word is calculated, and a minimum semantic similarity threshold is set: when the cosine similarity is smaller than the minimum threshold, it is clamped to the minimum threshold. Because each type of sensitive information contains several representative words, the maximum similarity between a document word and the representative words of that type is taken as the similarity between the document word and that sensitive information category. The maximum similarity is computed as:

ε(z,w) = max_i sim(s_{z,i}, w)

where s_{z,i} is the i-th representative word of the z-th type of sensitive information, w is a document word, ε(z,w) is the calculated similarity between sensitive information category z and document word w, and sim() is the cosine similarity of word vectors.
Further, the weakly supervised text classification model is preferably the SeedTBTM model, a topic model based on the user short-text topic model Twitter-BTM: texts are aggregated by the user who posted them, and word pairs (biterms) and document words are then generated from a two-layer Dirichlet distribution. Twitter-BTM assumes that each user's aggregated text has an independent topic distribution, and that each topic has an independent topic-word distribution. The invention incorporates the document word/sensitive information category similarities into the Twitter-BTM model as a prior topic-word distribution; through this prior knowledge, the generated topics can be made to correspond one-to-one with the sensitive information.
Further, when a document collection contains no sensitive information of some category, the prior knowledge is equally consistent with every category, so the topic model generates new topics for the documents based only on word co-occurrence information. The invention therefore judges the consistency between topic words and a sensitive information category by their word vector similarity: when the similarity between a topic word and the sensitive information category is greater than a specified threshold, the topic word is considered consistent with the category; and when the number of topic words consistent with the category is greater than or equal to a given number, the topic is considered consistent with that sensitive information.
Further, the thresholds for the similarities and the probability, and the minimum number of topic words, are set according to actual conditions.
Compared with the prior art, the invention has the following positive effects: for each type of sensitive information to be discovered, only a few representative sensitive words per category need be given, and the sensitive information can then be found through the weakly supervised text classification algorithm; good results are obtained without a large amount of labeled data. The main innovations are:
1) the similarity between document words and sensitive words is calculated through word vectors, and the maximum similarity between each document word and each category's sensitive words is used as prior knowledge, so the similarities between all document words and the sensitive categories can be computed from only a small number of representative sensitive words;
2) based on the Twitter-BTM topic model, a SeedTBTM model that takes these similarities as prior knowledge is proposed; the model exploits word similarity and word co-occurrence information simultaneously to put sensitive information in one-to-one correspondence with topics, thereby discovering the sensitive words of interest. In addition, the method can use the posting user's identity on social media as supplementary information to improve accuracy, which conventional keyword matching and supervised classification algorithms cannot do.
Drawings
FIG. 1 is a flowchart of the proposed method;
FIG. 2 is the SeedTBTM probabilistic graphical model used in step S05 of the proposed method.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and specific examples, so that those skilled in the art may better understand it; the invention is not, however, limited to these examples.
Fig. 1 is a flowchart of a social media-oriented sensitive data discovery method according to an embodiment of the present invention. Referring to fig. 1, the embodiment of the present invention specifically includes the following steps:
s01, constructing a sensitive information dictionary library, defining various types of sensitive information to be found, and adding new sensitive words for each type of sensitive information; taking two types of sensitive information, namely a counterfeit money sale crime (counterfeit money crime for short) and a gun sale crime (gun crime for short) in public security investigation as an example, the following representative words of the sensitive information can be set as follows:
category1 counterfeit money crime: red bull, watermark, fluorescent liquid, gold stamping and silk printing.
Category2 gun crime: bald hawk, cricket, tube, silencer, dovetail.
From these keywords it can be seen that much sensitive information is referred to by jargon adopted to evade supervision: for example, red bull refers to 100-yuan counterfeit notes among counterfeit money traffickers, bald eagle to a US high-pressure air gun, and dovetail to a gun body. Similar phenomena exist on many social media platforms, where the people involved use jargon for sensitive information in order to escape supervision, so a keyword matching method matches a large amount of irrelevant information.
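For illustration, the sensitive information dictionary of S01 can be held in a simple mapping; a minimal Python sketch follows (the variable name and data structure are assumptions, not prescribed by the method; the category names and seed words are those of this embodiment):

# Sensitive information dictionary of S01: category -> representative words.
SENSITIVE_SEEDS = {
    "counterfeit_money_crime": ["red bull", "watermark", "fluorescent liquid",
                                "gold stamping", "silk printing"],
    "gun_crime": ["bald eagle", "cricket", "tube", "silencer", "dovetail"],
}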
S02, train word vectors: based on massive social media text data containing the various types of sensitive information, obtain word vectors through a classical word vector model such as the Word2Vec or GloVe algorithm.
In some scenarios the sensitive information is relatively common on the internet, and word vectors trained on large-scale public data give good results, such as the word vectors trained on massive public data released by Tencent AI Lab, or word vectors trained on Wikipedia. In most scenarios, however, sensitive information belongs to a minority category and is easily confounded with common information, so content related to the sensitive information must be crawled and word vectors trained on it together with common-information content for use in the subsequent steps. The word vector training steps are as follows:
1) crawl related texts from papers, microblogs and news websites according to the sensitive information keywords of S01;
2) merge the crawled texts with a corpus such as Wikipedia, then segment the result into words;
3) train the word vectors, as sketched below.
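A minimal sketch of these three steps, assuming jieba for Chinese word segmentation and gensim's Word2Vec implementation (both library choices are assumptions; the method only prescribes a classical word vector model such as Word2Vec or GloVe):

import jieba                        # assumed Chinese segmentation tool
from gensim.models import Word2Vec  # classical word vector model named in S02

# 1)-2) crawled sensitive-domain texts merged with a public corpus, segmented
corpus = crawled_texts + wikipedia_texts          # assumed input lists of strings
sentences = [jieba.lcut(doc) for doc in corpus]

# 3) train word vectors (hyperparameters are illustrative)
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, sg=1)
model.wv.save("social_media.wordvectors")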
S03, extract the vocabulary of the documents to be examined: for all documents to be examined, extract all the words they contain, i.e. the document words, obtaining the document vocabulary W = {w_1, w_2, …, w_V}.
S04, calculate the similarity between each document word w and the sensitive information categories Z = {z_1, z_2, …, z_K}: for each category, the maximum similarity between w and that category's representative words is taken as the similarity between w and the category, computed as:

ε(z,w) = max_i sim(s_{z,i}, w)
Examples of the similarities are shown in Table 1 below:
TABLE 1
[Table 1: similarities between example document words and the counterfeit money crime and gun crime categories; the original table is an image placeholder.]
Hot printing and gold stamping are words related to counterfeit money crime, so their similarity with the counterfeit money crime category is high; gold stamping is itself a seed sensitive word, so its similarity is 1. Bald eagle is a gun crime sensitive word, with similarity 0.99 to the gun crime category. Goods leans toward neither counterfeit money crime nor gun crime, so its similarity with both is low.
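A minimal sketch of the S04 computation, assuming the word vectors saved in S02 and the SENSITIVE_SEEDS mapping sketched under S01 (the floor value and the helper name are illustrative assumptions):

from gensim.models import KeyedVectors

wv = KeyedVectors.load("social_media.wordvectors")
MIN_SIM = 0.0  # minimum semantic similarity threshold of S04 (illustrative value)

def category_similarity(word, seed_words):
    # epsilon(z, w): maximum cosine similarity between a document word and
    # the representative words of one sensitive information category,
    # clamped from below at MIN_SIM as S04 prescribes
    sims = [float(wv.similarity(word, s))
            for s in seed_words if word in wv and s in wv]
    return max(sims + [MIN_SIM])

# similarity of every document word to every category, as in Table 1
epsilon = {(z, w): category_similarity(w, seeds)
           for z, seeds in SENSITIVE_SEEDS.items()
           for w in document_vocabulary}   # document_vocabulary: W from S03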
S05, run the weakly supervised text classification model SeedTBTM: the similarities between document words and sensitive information are combined with the user short-text topic model Twitter-BTM, entering the topic model's generative process as prior knowledge, to obtain topic words and the corresponding documents. The topic words correspond to the sensitive information (representative words/keywords), and the topics correspond to the sensitive information categories. The SeedTBTM probabilistic graphical model is shown in FIG. 2: the invention adds to the probabilistic graphical model of Twitter-BTM the similarity ε(z,w) between document words and sensitive information categories computed in S04. In FIG. 2, the topic interest θ_u of user u is a multinomial distribution over the K topics drawn from the Dirichlet prior Dir(α). Each topic t has a topic-word multinomial distribution φ_t over the dictionary V, and the background topic B has a background topic-word multinomial distribution φ_B over the dictionary V, both drawn from the Dirichlet prior Dir(β) (β is the Dirichlet prior parameter for generating topic-word distributions; a topic with a larger β is sampled with relatively higher probability, and in the invention all topics share the same β). The Dirichlet prior Dir(β) is scaled proportionally by ε(z,w), so that word/sensitive-information similarity is integrated into the model. The tendency of user u to choose a topic word or a background topic word is represented by a Bernoulli distribution π_u drawn from the Beta prior Beta(γ) (γ is the Beta prior parameter governing whether user u chooses a topic word or a background topic word; in the invention all users share the same γ).
The SeedTBTM generative process is as follows (a sketch of the biterm construction follows the list):
1) Generate the topic prior distributions: draw the background topic-word prior distribution φ_B ~ Dir(β) of the background topic B, and draw π_u ~ Beta(γ), the tendency of user u to choose topic words or background topic words.
2) For each topic z = 1, …, K (topic z is a sensitive information category, K a positive integer):
(a) draw the topic-word distribution φ_z ~ Dir(β) of topic z;
(b) update the topic-word prior of topic z with ε(z,w), i.e. scale it to Dir(β·ε(z,·)), incorporating the similarities into the model.
3) For each user u = 1, …, U (U a positive integer):
(a) draw the user topic distribution θ_u ~ Dir(α) of user u;
(b) for each biterm b = 1, …, N_u (N_u a positive integer; each biterm is a pair of any two distinct words of the document):
(i) sample the topic z_{u,b} of biterm b of user u from θ_u;
(ii) for the two words n = 1, 2 of the biterm:
(A) sample the switch variable y_{u,b,n} ~ Bernoulli(π_u); y_{u,b,n}, the switch value of the n-th word of biterm b of user u, determines whether that word is a background word or a topic word;
(B) when y_{u,b,n} equals 0 the word is a background word, and w_{u,b,n} (the n-th word of biterm b of user u) is drawn from Multi(φ_B), Multi denoting a multinomial distribution; when y_{u,b,n} equals 1 the word is a topic word, and w_{u,b,n} is drawn from Multi(φ_{z_{u,b}}).
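As an illustration of step 3), the biterms the model consumes can be built per user as in the following sketch (user_documents and the function names are assumptions):

from itertools import combinations

def biterms(words):
    # all unordered pairs of two distinct words in one document
    return [tuple(sorted(p)) for p in combinations(set(words), 2)]

# Twitter-BTM aggregates texts by the posting user before pairing words
user_biterms = {user: [b for doc in docs for b in biterms(doc)]
                for user, docs in user_documents.items()}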
Because the probability P(z|b) that a biterm b takes topic z cannot be computed directly, the topic z of each biterm and the switch variable y of each word are sampled iteratively by Gibbs sampling. The conditional distribution of z is:

P(z_{u,b} = z | z_{¬(u,b)}, ·) ∝ (n_{z|u,¬b} + α) · (n_{w_{u,b,1}|z} + β·ε(z, w_{u,b,1})) · (n_{w_{u,b,2}|z} + β·ε(z, w_{u,b,2})) / [(n_{·|z} + Σ_{w∈V} β·ε(z, w)) · (n_{·|z} + 1 + Σ_{w∈V} β·ε(z, w))]

where n_{z|u,¬b} is the number of biterms of user u assigned to topic z, excluding biterm b; n_{w_{u,b,1}|z} is the number of times word w_{u,b,1} is assigned topic z; n_{·|z} is the number of all words assigned to topic z; z_{¬(u,b)} is the set of topics of all biterms except biterm b of user u; ε(z, w_{u,b,1}) is the similarity between topic z and word w_{u,b,1}; w_{u,b,1} is the first word of biterm b of user u; and y_{u,b,1} is the switch value of the first word of biterm b of user u.
The conditional distribution of y is:

P(y_{u,b,n} = 0 | y_{¬(u,b,n)}, ·) ∝ (n_u(0) + γ) · (n_{w_{u,b,n}|B} + β) / (n_{·|B} + V·β)

P(y_{u,b,n} = 1 | y_{¬(u,b,n)}, ·) ∝ (n_u(1) + γ) · (n_{w_{u,b,n}|z_{u,b}} + β·ε(z_{u,b}, w_{u,b,n})) / (n_{·|z_{u,b}} + Σ_{w∈V} β·ε(z_{u,b}, w))

where n_u(0) and n_u(1) are the numbers of words assigned to the background topic B and to the category topics respectively; y_{¬(u,b,n)} is the set of all switch variables other than y_{u,b,n}; n_{·|B} is the number of times any word is treated as a background word; n_{w|B} is the number of times word w is treated as a background word; V is the size of the vocabulary; and V·β is V multiplied by the Dirichlet prior parameter β. Because the background topic B has no seed words and therefore no prior similarity values, the topic-word probability in P(y_{u,b,n} = 1) must be normalized.
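A minimal numpy sketch of the conditional for z as reconstructed above (array names, shapes and the count bookkeeping are assumptions; eps is the similarity matrix ε(z,w) of S04):

import numpy as np

def topic_conditional(u, w1, w2, n_z_u, n_w_z, n_dot_z, eps, alpha, beta):
    # Unnormalized P(z | rest) for one biterm (w1, w2) of user u, with
    #   n_z_u[u, z]: biterms of user u assigned topic z (current biterm removed)
    #   n_w_z[w, z]: times word w was assigned topic z
    #   n_dot_z[z] : words assigned topic z
    beta_eps = beta * eps                      # shape (K, V)
    beta_eps_sum = beta_eps.sum(axis=1)        # sum over w of beta * eps(z, w)
    p = ((n_z_u[u] + alpha)
         * (n_w_z[w1] + beta_eps[:, w1])
         * (n_w_z[w2] + beta_eps[:, w2])
         / ((n_dot_z + beta_eps_sum) * (n_dot_z + 1 + beta_eps_sum)))
    return p / p.sum()

# One Gibbs step removes the biterm's current assignment from the counts,
# samples z from topic_conditional(...), then adds the counts back.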
After the topic distribution of each biterm of each user u is obtained, the topic distribution of a document is obtained by aggregating the topic distributions of all biterms in the document:

P(z|d) = Σ_b P(z|b) · P(b|d)

P(z|b) has been computed from the conditional distribution of z, and P(b|d) is estimated from the relative frequency of biterm b in document d.
The maximum probability topic of the document then gives the topic z to which document d belongs:

z_d = argmax_z P(z|d)
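A sketch of this aggregation, assuming p_z_given_b holds P(z|b) from the sampler and doc_biterms holds the biterms of document d (both names are illustrative):

import numpy as np

def document_topics(doc_biterms, p_z_given_b):
    # P(z|d) = sum_b P(z|b) * P(b|d); P(b|d) is the relative frequency
    # of biterm b in document d
    return sum(p_z_given_b[b] for b in doc_biterms) / len(doc_biterms)

p_z_d = document_topics(doc_biterms, p_z_given_b)
z_d = int(np.argmax(p_z_d))   # the maximum probability topic of document d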
After 50 iterations, the representative words of the topics are as follows:
Topic 1: fluorescent, yellow-blue, anti-fake, stamping, printing, yellow goods, price, paper, color.
Topic 2: size, finished, cm, shipment, price, problem, manufacturer, agency, printing.
From the result it can be seen that because the input documents contain chat records of counterfeit money criminal suspects, Topic 1 corresponds one-to-one with the sensitive information Category 1, and documents belonging to Topic 1 are with high probability related to counterfeit money crime. However, since the input documents contain no documents related to gun crime, Topic 2 is not influenced by the prior knowledge of Category 2, and the representative words of Topic 2 bear no relation to Category 2. The subsequent step S06 is therefore needed to further process the SeedTBTM results.
S06, confirm the sensitive information: judge whether each discovered topic and its corresponding documents are the sensitive information of interest by calculating the similarity between the topic words and the sensitive information categories. As the example of S05 shows, when the documents contain no sensitive information of some category, the corresponding topic's words are the keywords of some other, unrelated topic.
The invention judges the consistency between topic words and a sensitive information category by word vector similarity, as follows (a code sketch follows the list):
1) when the similarity between a topic word and the sensitive information category is greater than the specified threshold (here 0.4), the topic word is considered consistent with the category;
2) when the number of topic words consistent with the category is greater than or equal to n (here 5), the topic is considered consistent with that type of sensitive information.
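A sketch of this confirmation rule, reusing the category_similarity helper from the S04 sketch (the thresholds 0.4 and 5 are the embodiment's values; the function name is an assumption):

SIM_THRESHOLD = 0.4   # topic word / category similarity threshold (embodiment)
MIN_MATCHES = 5       # minimum number of consistent topic words (embodiment)

def is_sensitive_topic(topic_words, seed_words):
    # S06: a topic is a sensitive information topic when at least
    # MIN_MATCHES of its representative words reach SIM_THRESHOLD
    matches = sum(1 for w in topic_words
                  if category_similarity(w, seed_words) >= SIM_THRESHOLD)
    return matches >= MIN_MATCHES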
Examples are as follows:
table 2 below shows the similarity between the topic-representative word and the category-counterfeit currency.
TABLE 2
Figure BDA0002532928460000071
Because five topic words of Topic 1 have similarity greater than or equal to 0.4 with counterfeit money crime, Topic 1 is considered to correspond one-to-one with the sensitive information counterfeit money crime.
Table 3 below shows the similarities between the Topic 2 representative words and the category gun crime.
TABLE 3
[Table 3: similarities between the Topic 2 representative words and the gun crime category; the original table is an image placeholder.]
Because zero topic words of Topic 2 have similarity greater than or equal to 0.4 with gun crime, Topic 2 is considered not to correspond to the sensitive information gun crime, and there are no documents related to gun crime among the input documents.
S07, screen the sensitive data: based on the confirmed sensitive information topics, compute the topic distribution of each article; when an article's maximum probability topic is a sensitive information topic and that probability is greater than the specified threshold 0.2, the article is considered to belong to that sensitive information.
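A sketch of this screening rule applied to the document topic distribution p_z_d from S05 (the 0.2 threshold is the embodiment's value; sensitive_topics, the set of topic indices confirmed in S06, is an assumed input):

import numpy as np

PROB_THRESHOLD = 0.2  # minimum maximum-topic probability (embodiment value)

def screen_document(p_z_d, sensitive_topics):
    # S07: a document is sensitive data when its maximum probability topic
    # is a confirmed sensitive information topic above the threshold
    z_max = int(np.argmax(p_z_d))
    if z_max in sensitive_topics and p_z_d[z_max] > PROB_THRESHOLD:
        return z_max          # the sensitive category of the document
    return None

The examples below correspond to applying this rule to the two articles of Table 4.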
Examples are as follows:
TABLE 4
[Table 4: topic probability distributions of the two example articles; the original table is an image placeholder.]
For the first article, "ask how much for a bald eagle without the tube?", the gun crime probability 0.46 is the maximum and greater than the specified threshold 0.2, so the article belongs to gun crime.
For the second article, "original factory series ox detritus, finished yellow goods with line", the counterfeit money crime probability 0.42 is the maximum and greater than the specified threshold 0.2, so the article belongs to counterfeit money crime.
The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it; a person skilled in the art may modify the technical solution or substitute equivalents, and the protection scope of the invention is defined by the claims.

Claims (10)

1. A social media-oriented sensitive data discovery method, characterized by comprising the following steps:
extracting all words of the documents to be examined to obtain the document vocabulary;
calculating, based on word vectors, the maximum similarity between each document word and the representative words of each type of sensitive information, and taking it as the similarity between the document word and that type of sensitive information, wherein each type of sensitive information forms a sensitive information category and the representative words are the annotated keywords of each type of sensitive information;
inputting the similarities between the document words and the sensitive information categories into a weakly supervised text classification model to obtain topic words and the corresponding documents;
calculating the similarity between the topic words and the sensitive information categories, and if the similarity is above a set threshold and the number of such topic words is not less than a given number, judging that the topic is consistent with that sensitive information category and is a sensitive information topic;
and screening out, among the documents of a sensitive information topic, those whose maximum probability topic is that sensitive information category; if a document's maximum topic probability is greater than a set threshold, judging that the document content belongs to sensitive data.
2. The method of claim 1, wherein the word vectors comprise word vector data published on the internet or word vector data obtained by training through a word vector model, the word vector model comprising the Word2Vec or GloVe algorithm.
3. The method of claim 2, wherein the step of word vector training comprises:
crawling texts related to the keywords from papers, microblogs and news websites according to the sensitive information keywords;
merging the crawled texts with a public corpus, the corpus comprising Wikipedia and Baidu Encyclopedia, and then segmenting the result into words;
and performing word vector training on the segmented texts.
4. The method of claim 1, wherein the similarity between a document word and a representative word is obtained by calculating the cosine similarity between their word vectors.
5. The method of claim 4, wherein the maximum similarity is computed as:

ε(z,w) = max_i sim(s_{z,i}, w)

wherein ε(z,w) is the calculated similarity between sensitive information category z and document word w, s_{z,i} is the i-th representative word of the z-th type of sensitive information, and sim() is the cosine similarity of word vectors.
6. The method of claim 1, wherein the weakly supervised text classification model is preferably the SeedTBTM model, which is based on the user short-text topic model Twitter-BTM and adds the similarity parameter ε(z,w) between document words and sensitive information categories.
7. The method of claim 6, wherein the similarities between the document words and the sensitive information categories are input into the SeedTBTM model, the processing steps comprising:
1) drawing the background topic-word prior distribution φ_B ~ Dir(β) of the background topic B, and drawing π_u ~ Beta(γ), the tendency of user u to choose topic words or background topic words, wherein φ_B is the background topic-word multinomial distribution of the background topic B over the dictionary, Dir(β) is a Dirichlet prior distribution with Dirichlet prior parameter β, π_u is a Bernoulli distribution, Beta(γ) is a Beta prior distribution, and γ is the Beta prior parameter;
2) for each topic z = 1, …, K, topic z being a sensitive information category and K a positive integer, drawing the topic-word distribution φ_z ~ Dir(β) of topic z, and updating the topic-word prior of topic z with ε(z,w), i.e. Dir(β·ε(z,·));
3) for each user u = 1, …, U, drawing the user topic distribution θ_u ~ Dir(α) of user u;
for each biterm b = 1, …, N_u, N_u being a positive integer and each biterm being a pair of any two distinct words in the document, sampling from θ_u the topic z_{u,b} of biterm b of user u;
for the two words n = 1, 2 of the biterm, sampling the switch variable y_{u,b,n} ~ Bernoulli(π_u), wherein y_{u,b,n} is the switch value of the n-th word of biterm b of user u;
if y_{u,b,n} equals 0, the word is a background word and w_{u,b,n} is drawn from Multi(φ_B); if y_{u,b,n} equals 1, the word is a topic word and w_{u,b,n} is drawn from Multi(φ_{z_{u,b}}), wherein Multi denotes a multinomial distribution and w_{u,b,n} is the n-th word of biterm b of user u.
8. The method of claim 7, wherein the probability P(z|b) that a biterm takes topic z is obtained as follows: the topic z of each biterm and the switch variable y of each word are sampled iteratively by Gibbs sampling, the conditional distribution of z being:

P(z_{u,b} = z | z_{¬(u,b)}, ·) ∝ (n_{z|u,¬b} + α) · (n_{w_{u,b,1}|z} + β·ε(z, w_{u,b,1})) · (n_{w_{u,b,2}|z} + β·ε(z, w_{u,b,2})) / [(n_{·|z} + Σ_{w∈V} β·ε(z, w)) · (n_{·|z} + 1 + Σ_{w∈V} β·ε(z, w))]

wherein n_{z|u,¬b} is the number of biterms of user u assigned to topic z, excluding biterm b; n_{w_{u,b,1}|z} is the number of times word w_{u,b,1} is assigned topic z; n_{·|z} is the number of all words assigned to topic z; z_{¬(u,b)} is the set of topics of all biterms except biterm b of user u; ε(z, w_{u,b,1}) is the similarity between topic z and word w_{u,b,1}; w_{u,b,1} is the first word of biterm b of user u; and y_{u,b,1} is the switch value of the first word of biterm b of user u.
9. The method of claim 8, wherein the conditional distribution of y is:

P(y_{u,b,n} = 0 | y_{¬(u,b,n)}, ·) ∝ (n_u(0) + γ) · (n_{w_{u,b,n}|B} + β) / (n_{·|B} + V·β)

P(y_{u,b,n} = 1 | y_{¬(u,b,n)}, ·) ∝ (n_u(1) + γ) · (n_{w_{u,b,n}|z_{u,b}} + β·ε(z_{u,b}, w_{u,b,n})) / (n_{·|z_{u,b}} + Σ_{w∈V} β·ε(z_{u,b}, w))

wherein n_u(0) and n_u(1) are the numbers of words assigned to the background topic B and to the category topics respectively; y_{¬(u,b,n)} is the set of all switch variables other than y_{u,b,n}; n_{·|B} is the number of times any word is treated as a background word; n_{w|B} is the number of times word w is treated as a background word; and V is the size of the vocabulary.
10. The method of claim 9, wherein the step of obtaining topic words and corresponding documents comprises:
1) after the topic distribution of each biterm of each user u is obtained, aggregating the topic distributions of all biterms in a document to obtain the document's topic distribution:

P(z|d) = Σ_b P(z|b) · P(b|d)

wherein P(z|b) is computed from the conditional distribution of z, and P(b|d) is estimated from the relative frequency of biterm b in document d;
2) computing the maximum probability topic of the document to obtain the topic z to which document d belongs:

z_d = argmax_z P(z|d)
CN202010523627.0A 2020-06-10 2020-06-10 Sensitive data discovery method for social media Active CN111897952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010523627.0A CN111897952B (en) 2020-06-10 2020-06-10 Sensitive data discovery method for social media


Publications (2)

Publication Number Publication Date
CN111897952A true CN111897952A (en) 2020-11-06
CN111897952B CN111897952B (en) 2022-10-14

Family

ID=73206697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010523627.0A Active CN111897952B (en) 2020-06-10 2020-06-10 Sensitive data discovery method for social media

Country Status (1)

Country Link
CN (1) CN111897952B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280164A (en) * 2018-01-18 2018-07-13 武汉大学 A kind of short text filtering and sorting technique based on classification related words
CN108710611A (en) * 2018-05-17 2018-10-26 南京大学 A kind of short text topic model generation method of word-based network and term vector
CN109086375A (en) * 2018-07-24 2018-12-25 武汉大学 A kind of short text subject extraction method based on term vector enhancing
CN109726394A (en) * 2018-12-18 2019-05-07 电子科技大学 Short text Subject Clustering method based on fusion BTM model
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text classification method based on theme term vector and convolutional neural networks


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XINGYUAN CHEN et al.: "Dataless Text Classification with Descriptive LDA", Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence *
HAN Dong et al.: "Text classification method combining semi-supervised learning and the LDA model", Computer Engineering and Design *

Also Published As

Publication number Publication date
CN111897952B (en) 2022-10-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant