CN111897952A - Sensitive data discovery method for social media - Google Patents

Sensitive data discovery method for social media

Info

Publication number
CN111897952A
Authority
CN
China
Prior art keywords
word
sensitive information
words
document
topic
Prior art date
Legal status
Granted
Application number
CN202010523627.0A
Other languages
Chinese (zh)
Other versions
CN111897952B (en)
Inventor
杨翊
朱嘉奇
王宏安
Current Assignee
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Software of CAS
Priority to CN202010523627.0A
Publication of CN111897952A
Application granted
Publication of CN111897952B
Status: Active

Classifications

    • G06F16/35 — Information retrieval of unstructured textual data: clustering; classification
    • G06F18/24 — Pattern recognition; analysing: classification techniques
    • G06F40/194 — Handling natural language data; text processing: calculation of difference between files
    • G06F40/242 — Natural language analysis; lexical tools: dictionaries
    • G06F40/284 — Recognition of textual entities: lexical analysis, e.g. tokenisation or collocates
    • G06Q50/01 — ICT specially adapted for specific business sectors: social networking


Abstract

The invention provides a social media-oriented sensitive data discovery method in the field of artificial intelligence. Using a topic model and a word vector model, the method implements a weakly supervised text classification algorithm that exploits word similarity and word co-occurrence information within documents. By setting only a small number of keywords related to each type of sensitive information and combining them with word vectors trained on a large-scale corpus, the method classifies and filters sensitive information, solving the problem of sensitive data discovery on social media efficiently and at low cost.

Description

Sensitive data discovery method for social media
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a social media-oriented sensitive data discovery method.
Background
Social media, including news websites, forums, microblogs, WeChat and the like, have become part of daily life: people acquire and exchange information through them, so social media information is growing explosively. In applications such as public opinion analysis and public security investigation, task-relevant sensitive information must be discovered within this massive volume of data. The task is very challenging: traditional keyword matching and supervised classification algorithms struggle to solve the sensitive information discovery problem efficiently and accurately at social media scale.
Much sensitive information is communicated through jargon and coded terminology to evade surveillance. Traditional keyword filtering methods filter sensitive information by constructing a sensitive information dictionary and matching character strings. For example, application No. CN201911195301.3 discloses a data center query system in which a front page manually defines patterns of sensitive data, and the data are then matched one by one through regular-expression matching and dictionary matching to find sensitive data. However, because of polysemy, non-standard social media wording and jargon, such methods match a large amount of irrelevant information when applied to social media data; screening meaningful content from it consumes considerable manpower and has poor timeliness.
Meanwhile, text classification models based on neural network algorithms require experts to label a large amount of sensitive information before model training, and such classification algorithms have been successful in many scenarios. For example, application No. CN201911195301.3 discloses a sensitive information discovery method and system based on text recognition, which identifies sensitive information by collecting sample data, constructing data features, building a labeled training data set, and then constructing a classification model with the CatBoost algorithm. However, the large amount of labeling such algorithms require is time-consuming and labor-intensive, and the frequent changes of social media terminology make it difficult to label sensitive information in quantity efficiently.
Disclosure of Invention
Aiming at these problems, the invention provides a social media-oriented sensitive data discovery method. Using a topic model and a word vector model, it implements a weakly supervised text classification algorithm that exploits word similarity and word co-occurrence information within documents. By setting only a small number of keywords related to each type of sensitive information and combining them with word vectors trained on a large-scale corpus, it classifies and filters sensitive information, solving the problem of sensitive data discovery on social media efficiently and at low cost.
The invention solves the technical problems through the following technical means:
a social media-oriented sensitive data discovery method comprises the following steps:
extracting all words of the documents to be examined to obtain the document vocabulary;
calculating, based on word vectors, the maximum similarity between each document word and the representative words of each type of sensitive information, and taking it as the similarity between the document word and that type of sensitive information, wherein each type of sensitive information forms a sensitive information category and the representative words are the annotated keywords of each type of sensitive information;
inputting the similarities between the document words and the sensitive information categories into a weakly supervised text classification model to obtain topic words and the corresponding documents;
calculating the similarity between the topic words and the sensitive information categories, and if the similarity is above a set threshold and the number of such topic words is not less than a given number, judging that the topic is consistent with that sensitive information category and is a sensitive information topic;
and screening out, among the documents of a sensitive information topic, those whose maximum probability topic is that sensitive information category; if a document's maximum topic probability is greater than a set threshold, judging that the document content belongs to sensitive data.
Further, if the sensitive information is general and common, word vectors published on the internet can be used, such as the large-scale, high-quality Chinese word vector data open-sourced by Tencent AI Lab; if the content is highly domain-specific, a large amount of text corpora of the related domains must be crawled, the obtained content segmented into words, and word vectors then trained through a word vector model.
Further, the word vector model includes the Word2Vec and GloVe algorithms.
Further, the cosine similarity between the word vectors of each document word and each representative word is calculated, and a minimum semantic similarity threshold is set: when the cosine similarity is smaller than the minimum threshold, it is clamped to the minimum threshold. Because each type of sensitive information contains several representative words, the maximum similarity between a document word and the representative words of that type is taken as the similarity between the document word and that sensitive information category. The maximum similarity is computed as:

ε(z,w) = max_i sim(s_{z,i}, w)

where s_{z,i} is the i-th representative word of the z-th type of sensitive information, w is a document word, ε(z,w) is the calculated similarity between sensitive information category z and document word w, and sim() is the cosine similarity of word vectors.
Further, the weakly supervised text classification model is preferably the SeedTBTM model, a topic model based on the user short-text topic model Twitter-BTM: texts are aggregated by the user who posted them, and word pairs (biterms) and document words are then generated from a two-layer Dirichlet distribution. Twitter-BTM assumes that each user's aggregated text has an independent topic distribution, and that each topic has an independent topic-word distribution. The invention incorporates the document word/sensitive information category similarities into the Twitter-BTM model as a prior topic-word distribution; through this prior knowledge, the generated topics can be made to correspond one-to-one with the sensitive information.
Further, when a document collection contains no sensitive information of some category, the prior knowledge is equally consistent with every category, so the topic model generates new topics for the documents based only on word co-occurrence information. The invention therefore judges the consistency between topic words and a sensitive information category by their word vector similarity: when the similarity between a topic word and the sensitive information category is greater than a specified threshold, the topic word is considered consistent with the category; and when the number of topic words consistent with the category is greater than or equal to a given number, the topic is considered consistent with that sensitive information.
Further, the thresholds for the similarities and the probability, and the minimum number of topic words, are set according to actual conditions.
Compared with the prior art, the invention has the following positive effects: for each type of sensitive information to be discovered, only a few representative sensitive words per category need be given, and the sensitive information can then be found through the weakly supervised text classification algorithm; good results are obtained without a large amount of labeled data. The main innovations are:
1) the similarity between document words and sensitive words is calculated through word vectors, and the maximum similarity between each document word and each category's sensitive words is used as prior knowledge, so the similarities between all document words and the sensitive categories can be computed from only a small number of representative sensitive words;
2) based on the Twitter-BTM topic model, a SeedTBTM model that takes these similarities as prior knowledge is proposed; the model exploits word similarity and word co-occurrence information simultaneously to put sensitive information in one-to-one correspondence with topics, thereby discovering the sensitive words of interest. In addition, the method can use the posting user's identity on social media as supplementary information to improve accuracy, which conventional keyword matching and supervised classification algorithms cannot do.
Drawings
FIG. 1 is a flowchart of the proposed method;
FIG. 2 is the SeedTBTM probabilistic graphical model used in step S05 of the proposed method.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and specific examples, so that those skilled in the art may better understand it; the invention is not, however, limited to these examples.
Fig. 1 is a flowchart of a social media-oriented sensitive data discovery method according to an embodiment of the present invention. Referring to fig. 1, the embodiment of the present invention specifically includes the following steps:
s01, constructing a sensitive information dictionary library, defining various types of sensitive information to be found, and adding new sensitive words for each type of sensitive information; taking two types of sensitive information, namely a counterfeit money sale crime (counterfeit money crime for short) and a gun sale crime (gun crime for short) in public security investigation as an example, the following representative words of the sensitive information can be set as follows:
category1 counterfeit money crime: red bull, watermark, fluorescent liquid, gold stamping and silk printing.
Category2 gun crime: bald hawk, cricket, tube, silencer, dovetail.
From these keywords it can be seen that much sensitive information is referred to by jargon adopted to evade supervision: for example, red bull refers to 100-yuan counterfeit notes among counterfeit money traffickers, bald eagle to a US high-pressure air gun, and dovetail to a gun body. Similar phenomena exist on many social media platforms, where the people involved use jargon for sensitive information in order to escape supervision, so a keyword matching method matches a large amount of irrelevant information.
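For illustration, the sensitive information dictionary of S01 can be held in a simple mapping; a minimal Python sketch follows (the variable name and data structure are assumptions, not prescribed by the method; the category names and seed words are those of this embodiment):

# Sensitive information dictionary of S01: category -> representative words.
SENSITIVE_SEEDS = {
    "counterfeit_money_crime": ["red bull", "watermark", "fluorescent liquid",
                                "gold stamping", "silk printing"],
    "gun_crime": ["bald eagle", "cricket", "tube", "silencer", "dovetail"],
}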
S02, train word vectors: based on massive social media text data containing the various types of sensitive information, obtain word vectors through a classical word vector model such as the Word2Vec or GloVe algorithm.
In some scenarios the sensitive information is relatively common on the internet, and word vectors trained on large-scale public data give good results, such as the word vectors trained on massive public data released by Tencent AI Lab, or word vectors trained on Wikipedia. In most scenarios, however, sensitive information belongs to a minority category and is easily confounded with common information, so content related to the sensitive information must be crawled and word vectors trained on it together with common-information content for use in the subsequent steps. The word vector training steps are as follows:
1) crawl related texts from papers, microblogs and news websites according to the sensitive information keywords of S01;
2) merge the crawled texts with a corpus such as Wikipedia, then segment the result into words;
3) train the word vectors, as sketched below.
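A minimal sketch of these three steps, assuming jieba for Chinese word segmentation and gensim's Word2Vec implementation (both library choices are assumptions; the method only prescribes a classical word vector model such as Word2Vec or GloVe):

import jieba                        # assumed Chinese segmentation tool
from gensim.models import Word2Vec  # classical word vector model named in S02

# 1)-2) crawled sensitive-domain texts merged with a public corpus, segmented
corpus = crawled_texts + wikipedia_texts          # assumed input lists of strings
sentences = [jieba.lcut(doc) for doc in corpus]

# 3) train word vectors (hyperparameters are illustrative)
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, sg=1)
model.wv.save("social_media.wordvectors")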
S03, extract the vocabulary of the documents to be examined: for all documents to be examined, extract all the words they contain, i.e. the document words, obtaining the document vocabulary W = {w_1, w_2, …, w_V}.
S04, calculate the similarity between each document word w and the sensitive information categories Z = {z_1, z_2, …, z_K}: for each category, the maximum similarity between w and that category's representative words is taken as the similarity between w and the category, computed as:

ε(z,w) = max_i sim(s_{z,i}, w)
Examples of the similarities are shown in Table 1 below:
TABLE 1
[Table 1: similarities between example document words and the counterfeit money crime and gun crime categories; the original table is an image placeholder.]
Hot printing and gold stamping are words related to counterfeit money crime, so their similarity with the counterfeit money crime category is high; gold stamping is itself a seed sensitive word, so its similarity is 1. Bald eagle is a gun crime sensitive word, with similarity 0.99 to the gun crime category. Goods leans toward neither counterfeit money crime nor gun crime, so its similarity with both is low.
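A minimal sketch of the S04 computation, assuming the word vectors saved in S02 and the SENSITIVE_SEEDS mapping sketched under S01 (the floor value and the helper name are illustrative assumptions):

from gensim.models import KeyedVectors

wv = KeyedVectors.load("social_media.wordvectors")
MIN_SIM = 0.0  # minimum semantic similarity threshold of S04 (illustrative value)

def category_similarity(word, seed_words):
    # epsilon(z, w): maximum cosine similarity between a document word and
    # the representative words of one sensitive information category,
    # clamped from below at MIN_SIM as S04 prescribes
    sims = [float(wv.similarity(word, s))
            for s in seed_words if word in wv and s in wv]
    return max(sims + [MIN_SIM])

# similarity of every document word to every category, as in Table 1
epsilon = {(z, w): category_similarity(w, seeds)
           for z, seeds in SENSITIVE_SEEDS.items()
           for w in document_vocabulary}   # document_vocabulary: W from S03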
S05, run the weakly supervised text classification model SeedTBTM: the similarities between document words and sensitive information are combined with the user short-text topic model Twitter-BTM, entering the topic model's generative process as prior knowledge, to obtain topic words and the corresponding documents. The topic words correspond to the sensitive information (representative words/keywords), and the topics correspond to the sensitive information categories. The SeedTBTM probabilistic graphical model is shown in FIG. 2: the invention adds to the probabilistic graphical model of Twitter-BTM the similarity ε(z,w) between document words and sensitive information categories computed in S04. In FIG. 2, the topic interest θ_u of user u is a multinomial distribution over the K topics drawn from the Dirichlet prior Dir(α). Each topic t has a topic-word multinomial distribution φ_t over the dictionary V, and the background topic B has a background topic-word multinomial distribution φ_B over the dictionary V, both drawn from the Dirichlet prior Dir(β) (β is the Dirichlet prior parameter for generating topic-word distributions; a topic with a larger β is sampled with relatively higher probability, and in the invention all topics share the same β). The Dirichlet prior Dir(β) is scaled proportionally by ε(z,w), so that word/sensitive-information similarity is integrated into the model. The tendency of user u to choose a topic word or a background topic word is represented by a Bernoulli distribution π_u drawn from the Beta prior Beta(γ) (γ is the Beta prior parameter governing whether user u chooses a topic word or a background topic word; in the invention all users share the same γ).
The SeedTBTM generative process is as follows (a sketch of the biterm construction follows the list):
1) Generate the topic prior distributions: draw the background topic-word prior distribution φ_B ~ Dir(β) of the background topic B, and draw π_u ~ Beta(γ), the tendency of user u to choose topic words or background topic words.
2) For each topic z = 1, …, K (topic z is a sensitive information category, K a positive integer):
(a) draw the topic-word distribution φ_z ~ Dir(β) of topic z;
(b) update the topic-word prior of topic z with ε(z,w), i.e. scale it to Dir(β·ε(z,·)), incorporating the similarities into the model.
3) For each user u = 1, …, U (U a positive integer):
(a) draw the user topic distribution θ_u ~ Dir(α) of user u;
(b) for each biterm b = 1, …, N_u (N_u a positive integer; each biterm is a pair of any two distinct words of the document):
(i) sample the topic z_{u,b} of biterm b of user u from θ_u;
(ii) for the two words n = 1, 2 of the biterm:
(A) sample the switch variable y_{u,b,n} ~ Bernoulli(π_u); y_{u,b,n}, the switch value of the n-th word of biterm b of user u, determines whether that word is a background word or a topic word;
(B) when y_{u,b,n} equals 0 the word is a background word, and w_{u,b,n} (the n-th word of biterm b of user u) is drawn from Multi(φ_B), Multi denoting a multinomial distribution; when y_{u,b,n} equals 1 the word is a topic word, and w_{u,b,n} is drawn from Multi(φ_{z_{u,b}}).
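As an illustration of step 3), the biterms the model consumes can be built per user as in the following sketch (user_documents and the function names are assumptions):

from itertools import combinations

def biterms(words):
    # all unordered pairs of two distinct words in one document
    return [tuple(sorted(p)) for p in combinations(set(words), 2)]

# Twitter-BTM aggregates texts by the posting user before pairing words
user_biterms = {user: [b for doc in docs for b in biterms(doc)]
                for user, docs in user_documents.items()}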
Because the probability P(z|b) that a biterm b takes topic z cannot be computed directly, the topic z of each biterm and the switch variable y of each word are sampled iteratively by Gibbs sampling. The conditional distribution of z is:

P(z_{u,b} = z | z_{¬(u,b)}, ·) ∝ (n_{z|u,¬b} + α) · (n_{w_{u,b,1}|z} + β·ε(z, w_{u,b,1})) · (n_{w_{u,b,2}|z} + β·ε(z, w_{u,b,2})) / [(n_{·|z} + Σ_{w∈V} β·ε(z, w)) · (n_{·|z} + 1 + Σ_{w∈V} β·ε(z, w))]

where n_{z|u,¬b} is the number of biterms of user u assigned to topic z, excluding biterm b; n_{w_{u,b,1}|z} is the number of times word w_{u,b,1} is assigned topic z; n_{·|z} is the number of all words assigned to topic z; z_{¬(u,b)} is the set of topics of all biterms except biterm b of user u; ε(z, w_{u,b,1}) is the similarity between topic z and word w_{u,b,1}; w_{u,b,1} is the first word of biterm b of user u; and y_{u,b,1} is the switch value of the first word of biterm b of user u.
The conditional distribution of y is:

P(y_{u,b,n} = 0 | y_{¬(u,b,n)}, ·) ∝ (n_u(0) + γ) · (n_{w_{u,b,n}|B} + β) / (n_{·|B} + V·β)

P(y_{u,b,n} = 1 | y_{¬(u,b,n)}, ·) ∝ (n_u(1) + γ) · (n_{w_{u,b,n}|z_{u,b}} + β·ε(z_{u,b}, w_{u,b,n})) / (n_{·|z_{u,b}} + Σ_{w∈V} β·ε(z_{u,b}, w))

where n_u(0) and n_u(1) are the numbers of words assigned to the background topic B and to the category topics respectively; y_{¬(u,b,n)} is the set of all switch variables other than y_{u,b,n}; n_{·|B} is the number of times any word is treated as a background word; n_{w|B} is the number of times word w is treated as a background word; V is the size of the vocabulary; and V·β is V multiplied by the Dirichlet prior parameter β. Because the background topic B has no seed words and therefore no prior similarity values, the topic-word probability in P(y_{u,b,n} = 1) must be normalized.
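A minimal numpy sketch of the conditional for z as reconstructed above (array names, shapes and the count bookkeeping are assumptions; eps is the similarity matrix ε(z,w) of S04):

import numpy as np

def topic_conditional(u, w1, w2, n_z_u, n_w_z, n_dot_z, eps, alpha, beta):
    # Unnormalized P(z | rest) for one biterm (w1, w2) of user u, with
    #   n_z_u[u, z]: biterms of user u assigned topic z (current biterm removed)
    #   n_w_z[w, z]: times word w was assigned topic z
    #   n_dot_z[z] : words assigned topic z
    beta_eps = beta * eps                      # shape (K, V)
    beta_eps_sum = beta_eps.sum(axis=1)        # sum over w of beta * eps(z, w)
    p = ((n_z_u[u] + alpha)
         * (n_w_z[w1] + beta_eps[:, w1])
         * (n_w_z[w2] + beta_eps[:, w2])
         / ((n_dot_z + beta_eps_sum) * (n_dot_z + 1 + beta_eps_sum)))
    return p / p.sum()

# One Gibbs step removes the biterm's current assignment from the counts,
# samples z from topic_conditional(...), then adds the counts back.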
After the topic distribution of each biterm of each user u is obtained, the topic distribution of a document is obtained by aggregating the topic distributions of all biterms in the document:

P(z|d) = Σ_b P(z|b) · P(b|d)

P(z|b) has been computed from the conditional distribution of z, and P(b|d) is estimated from the relative frequency of biterm b in document d.
The maximum probability topic of the document then gives the topic z to which document d belongs:

z_d = argmax_z P(z|d)
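A sketch of this aggregation, assuming p_z_given_b holds P(z|b) from the sampler and doc_biterms holds the biterms of document d (both names are illustrative):

import numpy as np

def document_topics(doc_biterms, p_z_given_b):
    # P(z|d) = sum_b P(z|b) * P(b|d); P(b|d) is the relative frequency
    # of biterm b in document d
    return sum(p_z_given_b[b] for b in doc_biterms) / len(doc_biterms)

p_z_d = document_topics(doc_biterms, p_z_given_b)
z_d = int(np.argmax(p_z_d))   # the maximum probability topic of document d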
After 50 iterations, the representative words of the topics are as follows:
Topic 1: fluorescent, yellow-blue, anti-fake, stamping, printing, yellow goods, price, paper, color.
Topic 2: size, finished, cm, shipment, price, problem, manufacturer, agency, printing.
From the result it can be seen that because the input documents contain chat records of counterfeit money criminal suspects, Topic 1 corresponds one-to-one with the sensitive information Category 1, and documents belonging to Topic 1 are with high probability related to counterfeit money crime. However, since the input documents contain no documents related to gun crime, Topic 2 is not influenced by the prior knowledge of Category 2, and the representative words of Topic 2 bear no relation to Category 2. The subsequent step S06 is therefore needed to further process the SeedTBTM results.
S06, confirm the sensitive information: judge whether each discovered topic and its corresponding documents are the sensitive information of interest by calculating the similarity between the topic words and the sensitive information categories. As the example of S05 shows, when the documents contain no sensitive information of some category, the corresponding topic's words are the keywords of some other, unrelated topic.
The invention judges the consistency between topic words and a sensitive information category by word vector similarity, as follows (a code sketch follows the list):
1) when the similarity between a topic word and the sensitive information category is greater than the specified threshold (here 0.4), the topic word is considered consistent with the category;
2) when the number of topic words consistent with the category is greater than or equal to n (here 5), the topic is considered consistent with that type of sensitive information.
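A sketch of this confirmation rule, reusing the category_similarity helper from the S04 sketch (the thresholds 0.4 and 5 are the embodiment's values; the function name is an assumption):

SIM_THRESHOLD = 0.4   # topic word / category similarity threshold (embodiment)
MIN_MATCHES = 5       # minimum number of consistent topic words (embodiment)

def is_sensitive_topic(topic_words, seed_words):
    # S06: a topic is a sensitive information topic when at least
    # MIN_MATCHES of its representative words reach SIM_THRESHOLD
    matches = sum(1 for w in topic_words
                  if category_similarity(w, seed_words) >= SIM_THRESHOLD)
    return matches >= MIN_MATCHES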
Examples are as follows:
table 2 below shows the similarity between the topic-representative word and the category-counterfeit currency.
TABLE 2
Figure BDA0002532928460000071
Because five topic words of Topic 1 have similarity greater than or equal to 0.4 with counterfeit money crime, Topic 1 is considered to correspond one-to-one with the sensitive information counterfeit money crime.
Table 3 below shows the similarities between the Topic 2 representative words and the category gun crime.
TABLE 3
[Table 3: similarities between the Topic 2 representative words and the gun crime category; the original table is an image placeholder.]
Because zero topic words of Topic 2 have similarity greater than or equal to 0.4 with gun crime, Topic 2 is considered not to correspond to the sensitive information gun crime, and there are no documents related to gun crime among the input documents.
S07, screen the sensitive data: based on the confirmed sensitive information topics, compute the topic distribution of each article; when an article's maximum probability topic is a sensitive information topic and that probability is greater than the specified threshold 0.2, the article is considered to belong to that sensitive information.
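A sketch of this screening rule applied to the document topic distribution p_z_d from S05 (the 0.2 threshold is the embodiment's value; sensitive_topics, the set of topic indices confirmed in S06, is an assumed input):

import numpy as np

PROB_THRESHOLD = 0.2  # minimum maximum-topic probability (embodiment value)

def screen_document(p_z_d, sensitive_topics):
    # S07: a document is sensitive data when its maximum probability topic
    # is a confirmed sensitive information topic above the threshold
    z_max = int(np.argmax(p_z_d))
    if z_max in sensitive_topics and p_z_d[z_max] > PROB_THRESHOLD:
        return z_max          # the sensitive category of the document
    return None

The examples below correspond to applying this rule to the two articles of Table 4.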
Examples are as follows:
TABLE 4
[Table 4: topic probability distributions of the two example articles; the original table is an image placeholder.]
For the first article, "ask how much for a bald eagle without the tube?", the gun crime probability 0.46 is the maximum and greater than the specified threshold 0.2, so the article belongs to gun crime.
For the second article, "original factory series ox detritus, finished yellow goods with line", the counterfeit money crime probability 0.42 is the maximum and greater than the specified threshold 0.2, so the article belongs to counterfeit money crime.
The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it; a person skilled in the art may modify the technical solution or substitute equivalents, and the protection scope of the invention is defined by the claims.

Claims (10)

1. A social media-oriented sensitive data discovery method, characterized by comprising the following steps:
extracting all words of the documents to be examined to obtain the document vocabulary;
calculating, based on word vectors, the maximum similarity between each document word and the representative words of each type of sensitive information, and taking it as the similarity between the document word and that type of sensitive information, wherein each type of sensitive information forms a sensitive information category and the representative words are the annotated keywords of each type of sensitive information;
inputting the similarities between the document words and the sensitive information categories into a weakly supervised text classification model to obtain topic words and the corresponding documents;
calculating the similarity between the topic words and the sensitive information categories, and if the similarity is above a set threshold and the number of such topic words is not less than a given number, judging that the topic is consistent with that sensitive information category and is a sensitive information topic;
and screening out, among the documents of a sensitive information topic, those whose maximum probability topic is that sensitive information category; if a document's maximum topic probability is greater than a set threshold, judging that the document content belongs to sensitive data.
2. The method of claim 1, wherein the word vectors comprise word vector data published on the internet or word vector data obtained by training through a word vector model, the word vector model comprising the Word2Vec or GloVe algorithm.
3. The method of claim 2, wherein the step of word vector training comprises:
crawling texts related to the keywords from papers, microblogs and news websites according to the sensitive information keywords;
merging the crawled texts with a public corpus, the corpus comprising Wikipedia and Baidu Encyclopedia, and then segmenting the result into words;
and performing word vector training on the segmented texts.
4. The method of claim 1, wherein the similarity between a document word and a representative word is obtained by calculating the cosine similarity between their word vectors.
5. The method of claim 4, wherein the maximum similarity is computed as:

ε(z,w) = max_i sim(s_{z,i}, w)

wherein ε(z,w) is the calculated similarity between sensitive information category z and document word w, s_{z,i} is the i-th representative word of the z-th type of sensitive information, and sim() is the cosine similarity of word vectors.
6. The method of claim 1, wherein the weakly supervised text classification model is preferably the SeedTBTM model, which is based on the user short-text topic model Twitter-BTM and adds the similarity parameter ε(z,w) between document words and sensitive information categories.
7. The method of claim 6, wherein the similarities between the document words and the sensitive information categories are input into the SeedTBTM model, the processing steps comprising:
1) drawing the background topic-word prior distribution φ_B ~ Dir(β) of the background topic B, and drawing π_u ~ Beta(γ), the tendency of user u to choose topic words or background topic words, wherein φ_B is the background topic-word multinomial distribution of the background topic B over the dictionary, Dir(β) is a Dirichlet prior distribution with Dirichlet prior parameter β, π_u is a Bernoulli distribution, Beta(γ) is a Beta prior distribution, and γ is the Beta prior parameter;
2) for each topic z = 1, …, K, topic z being a sensitive information category and K a positive integer, drawing the topic-word distribution φ_z ~ Dir(β) of topic z, and updating the topic-word prior of topic z with ε(z,w), i.e. Dir(β·ε(z,·));
3) for each user u = 1, …, U, drawing the user topic distribution θ_u ~ Dir(α) of user u;
for each biterm b = 1, …, N_u, N_u being a positive integer and each biterm being a pair of any two distinct words in the document, sampling from θ_u the topic z_{u,b} of biterm b of user u;
for the two words n = 1, 2 of the biterm, sampling the switch variable y_{u,b,n} ~ Bernoulli(π_u), wherein y_{u,b,n} is the switch value of the n-th word of biterm b of user u;
if y_{u,b,n} equals 0, the word is a background word and w_{u,b,n} is drawn from Multi(φ_B); if y_{u,b,n} equals 1, the word is a topic word and w_{u,b,n} is drawn from Multi(φ_{z_{u,b}}), wherein Multi denotes a multinomial distribution and w_{u,b,n} is the n-th word of biterm b of user u.
8. The method of claim 7, wherein the probability P(z|b) that a biterm takes topic z is obtained as follows: the topic z of each biterm and the switch variable y of each word are sampled iteratively by Gibbs sampling, the conditional distribution of z being:

P(z_{u,b} = z | z_{¬(u,b)}, ·) ∝ (n_{z|u,¬b} + α) · (n_{w_{u,b,1}|z} + β·ε(z, w_{u,b,1})) · (n_{w_{u,b,2}|z} + β·ε(z, w_{u,b,2})) / [(n_{·|z} + Σ_{w∈V} β·ε(z, w)) · (n_{·|z} + 1 + Σ_{w∈V} β·ε(z, w))]

wherein n_{z|u,¬b} is the number of biterms of user u assigned to topic z, excluding biterm b; n_{w_{u,b,1}|z} is the number of times word w_{u,b,1} is assigned topic z; n_{·|z} is the number of all words assigned to topic z; z_{¬(u,b)} is the set of topics of all biterms except biterm b of user u; ε(z, w_{u,b,1}) is the similarity between topic z and word w_{u,b,1}; w_{u,b,1} is the first word of biterm b of user u; and y_{u,b,1} is the switch value of the first word of biterm b of user u.
9. The method of claim 8, wherein the conditional distribution of y is:

P(y_{u,b,n} = 0 | y_{¬(u,b,n)}, ·) ∝ (n_u(0) + γ) · (n_{w_{u,b,n}|B} + β) / (n_{·|B} + V·β)

P(y_{u,b,n} = 1 | y_{¬(u,b,n)}, ·) ∝ (n_u(1) + γ) · (n_{w_{u,b,n}|z_{u,b}} + β·ε(z_{u,b}, w_{u,b,n})) / (n_{·|z_{u,b}} + Σ_{w∈V} β·ε(z_{u,b}, w))

wherein n_u(0) and n_u(1) are the numbers of words assigned to the background topic B and to the category topics respectively; y_{¬(u,b,n)} is the set of all switch variables other than y_{u,b,n}; n_{·|B} is the number of times any word is treated as a background word; n_{w|B} is the number of times word w is treated as a background word; and V is the size of the vocabulary.
10. The method of claim 9, wherein the step of obtaining topic words and corresponding documents comprises:
1) after the topic distribution of each biterm of each user u is obtained, aggregating the topic distributions of all biterms in a document to obtain the document's topic distribution:

P(z|d) = Σ_b P(z|b) · P(b|d)

wherein P(z|b) is computed from the conditional distribution of z, and P(b|d) is estimated from the relative frequency of biterm b in document d;
2) computing the maximum probability topic of the document to obtain the topic z to which document d belongs:

z_d = argmax_z P(z|d)
CN202010523627.0A 2020-06-10 2020-06-10 Sensitive data discovery method for social media Active CN111897952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010523627.0A CN111897952B (en) 2020-06-10 2020-06-10 Sensitive data discovery method for social media


Publications (2)

Publication Number Publication Date
CN111897952A true CN111897952A (en) 2020-11-06
CN111897952B CN111897952B (en) 2022-10-14

Family

ID=73206697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010523627.0A Active CN111897952B (en) 2020-06-10 2020-06-10 Sensitive data discovery method for social media

Country Status (1)

Country Link
CN (1) CN111897952B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280164A (en) * 2018-01-18 2018-07-13 武汉大学 A kind of short text filtering and sorting technique based on classification related words
CN108710611A (en) * 2018-05-17 2018-10-26 南京大学 A kind of short text topic model generation method of word-based network and term vector
CN109086375A (en) * 2018-07-24 2018-12-25 武汉大学 A kind of short text subject extraction method based on term vector enhancing
CN109726394A (en) * 2018-12-18 2019-05-07 电子科技大学 Short text Subject Clustering method based on fusion BTM model
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text classification method based on theme term vector and convolutional neural networks


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XINGYUAN CHEN et al.: "Dataless Text Classification with Descriptive LDA", Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence *
HAN Dong et al.: "Text classification method combining semi-supervised learning and the LDA model", Computer Engineering and Design *

Also Published As

Publication number Publication date
CN111897952B (en) 2022-10-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant