CN115718792A - Sensitive information extraction method based on natural semantic processing and deep learning - Google Patents

Sensitive information extraction method based on natural semantic processing and deep learning

Info

Publication number
CN115718792A
CN115718792A (application CN202211270245.7A)
Authority
CN
China
Prior art keywords
text
lstm
sensitive information
file
model
Prior art date
Legal status
Pending
Application number
CN202211270245.7A
Other languages
Chinese (zh)
Inventor
程兴防
陈剑飞
王云霄
徐明伟
赵丽娜
Current Assignee
Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Original Assignee
Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd filed Critical Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Priority to CN202211270245.7A priority Critical patent/CN115718792A/en
Publication of CN115718792A publication Critical patent/CN115718792A/en
Pending legal-status Critical Current

Abstract

The invention discloses a sensitive information extraction method based on natural semantic processing and deep learning, which can be used to extract sensitive information from large-scale unstructured data. First, a number of text files are downloaded from the Internet, and rich text files of different formats are parsed with open-source tools. Then, sensitive information with predictable patterns is extracted from the parsed text files using a regular matching algorithm. Next, the natural semantic processing model BERT is used for word embedding and word vector generation, and a text sequence classification model is built by combining it with a deep learning algorithm, a bidirectional long short-term memory network with an attention mechanism (ABi-LSTM). Finally, sensitive information is extracted from the text sequences using the trained ABi-LSTM model. Throughout this process, a BIO labeling strategy is adopted for labeling the text sequences. The invention aims to provide a sensitive information extraction method for unstructured data while ensuring the accuracy of sensitive information extraction.

Description

Sensitive information extraction method based on natural semantic processing and deep learning
Technical Field
The invention relates to the technical field of information security, and in particular to a network sensitive information extraction method based on natural semantic processing techniques such as regular matching and BERT together with an improved long short-term memory network; it covers both an intelligent method for extracting sensitive information from a network and the processing of text files.
Background
Information leakage has always been a major problem in the field of information security; once sensitive information is leaked, serious consequences can follow. Most sensitive information is stored in unstructured data, and extracting it from large amounts of unstructured data has become one of the most important challenges. Research on the extraction of sensitive information in networks has arisen in this context.
The purpose of sensitive information extraction is to pull sensitive information out of text files through natural semantic processing and, by determining its presence, to protect it. Most text files in a network are stored as unstructured data, and methods based solely on regular matching suffer from incomplete extraction and low extraction accuracy. It is therefore necessary to process the unstructured data and then apply existing intelligent methods to extract sensitive information, so as to achieve high completeness and accuracy of extraction.
Unstructured data files come in many formats, and files in different formats must be parsed with specific tools or software. Current research on sensitive information extraction can be broadly divided into methods based on regular matching and methods based on machine learning. The former focuses on defining sensitive information patterns and extracts information through extraction templates; its accuracy depends strongly on how those templates are defined, and considerable manual effort is required. Machine learning based information extraction mainly relies on statistical models such as hidden Markov models, maximum entropy models and support vector machines, but these methods are prone to data sparseness caused by limitations of the underlying corpus; in practice, non-sensitive information is often mistaken for sensitive information, so the false alarm rate is high. Deep learning methods, popular in recent years, can take text content as sequence input for model training and learn the word vector relationships within a text sequence. To convert text content into word vectors for intelligent extraction, the existing Word2Vec approach, being a static model, cannot handle ambiguous words in a text file, so a dynamic model is needed to process the text content.
Meanwhile, the long short-term memory network (LSTM) is an efficient deep learning algorithm that can take a text sequence as input at one time and learn the relationships among the word vectors in the sequence, which undoubtedly increases the accuracy of sensitive information identification. The invention improves the LSTM into a bidirectional long short-term memory network with an attention mechanism (ABi-LSTM) to achieve accurate identification of sensitive information. For text with predictable patterns, sensitive information is extracted with the traditional regular matching method; to improve the identification effect further, the method uses the BERT model to vectorize text content, so that the same word receives different vectors in different sentences, then generates text sequences and trains the ABi-LSTM to achieve a second round of sensitive information extraction.
Disclosure of Invention
The invention aims to provide a sensitive information extraction method based on natural semantic processing and deep learning, which can be applied to sensitive information extraction in the fields of information security, information retrieval and the like.
A sensitive information extraction method based on natural semantic processing and deep learning comprises the following steps:
step S1: collecting a text file, namely dividing the text file into a plain text file set P and a rich text file set R according to formats, wherein the format of the rich text file comprises HTML, XML, pdf, doc, pst and rtf;
step S2: analyzing the rich text file, namely analyzing the rich text file with different formats by using open source tools HTMLParser, pugixml, PDFLib, python-docx, libpst and win32 com;
and step S3: extracting sensitive information with predictable patterns, namely using a regular matching method to extract sensitive information from text with predictable patterns, such as IP addresses, MAC addresses, mailboxes, API keywords, certificate requests and private key content;
and step S4: generating a text sequence, and performing text cleaning, text segmentation and text replacement on the analyzed text file;
step S5: embedding word vectors, and performing vector transformation on words in a text sequence by using a dynamic word embedding algorithm BERT;
step S6: training, verifying and testing data set dividing, namely dividing word vectors into a training data set, a verifying data set and a testing data set in proportion;
step S7: model training, inputting the training word vector data set into a bidirectional long short-term memory network (Bi-LSTM) and adding an attention mechanism to form a bidirectional attention long short-term memory network model (ABi-LSTM);
step S8: and (4) testing the effectiveness of the model, namely testing the ABi-LSTM model by using a test set.
In the above technical solution, in step S2, the specific steps of parsing the rich text file are as follows:
step S201: for an HTML file, creating a Parser with the Parser class in HTMLParser, creating a Filter or Visitor rule, using the Parser to obtain the text nodes that satisfy the Filter or Visitor, and parsing those text nodes;
step S202: aiming at the XML file, converting the original XML file into an object model set according to the label in the file, storing the converted data structure by using a DOM tree, and randomly accessing the stored data through a DOM interface to realize the analysis of the text file;
step S203: for a pdf file, parsing the end of the file to obtain the cross-reference table and the root object number, and parsing the file layer by layer with the PDFLib library according to the cross-reference table and the root object number;
step S204: aiming at doc and docx documents, acquiring a document object to be analyzed, outputting each segment of content in the document, and outputting a segment number and segment content to finish analysis;
step S205: aiming at the pst file, the file is directly analyzed by using a libpst, and the mail text, the attachment and the like are extracted for analysis; for the rtf file, the file text is extracted and analyzed by using win32 com.
In the above technical solution, step S3 extracts sensitive information with predictable pattern features, and the extraction of sensitive information from predictable-pattern text may be specifically expressed as:
step S301: defining sensitive information with predictable model characteristics, such as IP addresses, email addresses, API keywords, private keys and certificate text;
step S302: extracting the sensitive information with predictable pattern features defined in step S301 by using the sub() function of the re module in Python;
step S303: and saving the sensitive information.
In the above technical solution, in step S4, the text sequence generated for the parsed text may be specifically expressed as:
step S401: cleaning the text, removing non-ASCII characters and space characters at the beginning and the end of each line in the text, and converting capital characters into corresponding lowercase characters;
step S402: segmenting a text, namely segmenting the text into a plurality of lines, taking each line of text as a sentence, and performing word segmentation by using WordPiece;
step S403: text replacement, namely performing format replacement on the text content URL and the email, wherein the format after the replacement is as follows: email username domain and http domain letters.
In the above technical solution, in step S5, vector conversion is performed on words in the text sequence, which may specifically be represented as:
step S501: defining the text sequence generated in step S4 as X = {x_1, x_2, x_3, ..., x_n}, where x_n is the nth word in the text sequence;
step S502: using the BERT algorithm to calculate the word vector sequence E = {e_1, e_2, e_3, ..., e_n} corresponding to the text sequence X, where e_n is the word vector corresponding to the nth word x_n.
In the above technical solution, in step S6, the word vector sequence generated in step S5 is divided into a training data set, a verification data set and a test data set, which may be specifically expressed as:
step S601: performing labeling operation on the text sequence, wherein the labeling operation adopts a BIO strategy;
step S602: dividing the word vector sequence set into a training data set, a verification data set and a test data set in a fixed proportion, the data volume ratio of the three being 7:1:2.
In the above technical solution, in step S7, the training data set generated in step S6 is used to train the model and the verification data set to tune the ABi-LSTM model, which may be specifically expressed as:
step S701: updating the gate structures of the long short-term memory network (LSTM), with the following specific steps:
(1) update the forget gate of the LSTM:
f_t = σ(W_f e_t + U_f h_{t-1} + b_f)   (1)
(2) update the input gate of the LSTM:
i_t = σ(W_i e_t + U_i h_{t-1} + b_i)   (2)
(3) update the input modulation gate of the LSTM:
m_t = tanh(W_m e_t + U_m h_{t-1} + b_m)   (3)
(4) update the output gate of the LSTM:
o_t = σ(W_o e_t + U_o h_{t-1} + b_o)   (4)
(5) generate the next hidden state:
h_t = o_t ⊙ tanh(C_t)   (5)
where e_t is the input word vector, h_{t-1} is the hidden state of the LSTM at the previous time step, C_t is the cell state, and σ and tanh are the Sigmoid and hyperbolic tangent functions;
step S702: capturing past and future states with a forward and a backward LSTM and training a bidirectional LSTM (Bi-LSTM) model, generating the forward and backward hidden states at time t and combining them into the Bi-LSTM hidden state h_t (formula given as an image in the original);
Step S703: adding an attention layer on the uppermost layer of the Bi-LSTM to form a bidirectional attention LSTM (ABi-LSTM), and specifically comprising the following steps:
m t,t' =tanh(W m h t +W m' h m' +b m ) (6)
a t,t' =σ(W a m t,t' +b a ) (7)
wherein a is t,t' Is an element in the attention matrix and can be used to capture the hidden state h t And h' t Similarity between, W m And W' m For implying a layer state h t And h' t Corresponding weight matrix, W a For corresponding non-linear combining weights, b m And b a Is a bias vector.
Step S704: generation of hidden layer State Classification tags l Using an attention mechanism, according to the following formula t And completing ABi-LSTM model training.
Figure SMS_2
Step S705: using the validation dataset, the model is adjusted to the optimum according to the F1 value.
In the above technical solution, in step S8, the test data set is input to the ABi-LSTM classification model to obtain a classification label, and the detection accuracy, precision, and recall ratio are calculated by comparing the classification label with an actual label, specifically including the following steps:
step S801: inputting a test text sequence sample X_t = {x_{1t}, x_{2t}, ..., x_{nt}} into the trained ABi-LSTM classifier and calculating the label value corresponding to the sample X_t;
step S802: inputting all test text sequence samples into ABi-LSTM, and counting the values of TP (True Positive), FP (False Positive), FN (False Negative) and TN (True Negative);
step S803: the detection Accuracy (Accuracy), precision (Precision), recall (Recall) were calculated according to the following formula:
Precision=TP/(TP+FP)*100%
Recall=TP/(TP+FN)*100%
Accuracy=(TP+TN)/(TP+FN+TN+FP)*100%
compared with the prior art, the invention has the following beneficial effects:
the method is used for analyzing a sensitive information extraction method, different tools are used for analyzing the content of a rich text file, text cleaning, text segmentation and text replacement are carried out on text data, a BERT model is used for word vectorization on the text to generate a data sample set, the data sample set is further divided into a training data set, a verification data set and a test data set, and the three data sets are applied to the provided improved long-short term memory network model to achieve model training, adjustment and verification of sensitive information extraction. The experimental result verifies the effectiveness of the method.
The invention adopts a bidirectional long-short term memory network with attention mechanism as a detector, the detection model has high detection accuracy and strong generalization capability, and the invention can be applied to the extraction process of large-scale sensitive information.
Drawings
In order to clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings needed for the description of the embodiments or the prior art are briefly introduced below; they are used only for illustration and are not to be construed as limiting the patent;
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of the classification model of the present invention combining BERT and ABi-LSTM;
table 1 is the "BIO" tagging strategy involved in the present invention;
FIG. 3 is a schematic diagram illustrating validity verification of the method of the present invention;
Detailed Description
To better illustrate the objects, technical solutions and advantages of the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The invention provides a sensitive information extraction method based on natural semantic processing and deep learning, which combines regular matching and BERT with an improved long short-term memory network to extract sensitive information from a network. The flow is shown in FIG. 1 and comprises the following steps:
step S1: collecting a text file, and dividing the text file into a plain text file set P and a rich text file set R according to formats, wherein the format of the rich text file comprises HTML, XML, pdf, doc, pst and rtf;
step S2: parsing the rich text files, namely parsing rich text files of different formats with the open-source tools HTMLParser, pugixml, PDFLib, python-docx, libpst and win32com, with the following specific steps:
step S201: for an HTML file, creating a Parser with the Parser class in HTMLParser, creating a Filter or Visitor rule, using the Parser to obtain the text nodes that satisfy the Filter or Visitor, and parsing those text nodes;
step S202: aiming at the XML file, converting the original XML file into an object model set according to the label in the file, storing the converted data structure by using a DOM tree, and randomly accessing the stored data through a DOM interface to realize the analysis of the text file;
step S203: for a pdf file, parsing the end of the file to obtain the cross-reference table and the root object number, and parsing the file layer by layer with the PDFLib library according to the cross-reference table and the root object number;
step S204: aiming at doc and docx documents, acquiring a document object to be analyzed, outputting each section of content in the documents, and outputting a section number and the section content to finish analysis;
step S205: aiming at the pst file, the file is directly analyzed by using a libpst, and the mail text, the attachment and the like are extracted for analysis; for the rtf file, the text of the file is extracted and analyzed by the win32 com.
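By way of illustration of step S204 above, the following is a minimal Python sketch of paragraph-level parsing with the python-docx library; the file name is a placeholder and not part of the patent.

from docx import Document

def parse_docx(path: str) -> list[tuple[int, str]]:
    # Return (paragraph number, paragraph content) pairs, as in step S204.
    document = Document(path)
    return [(i, p.text) for i, p in enumerate(document.paragraphs)]

for number, content in parse_docx("example.docx"):  # placeholder file name
    print(number, content)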
And step S3: extracting sensitive information with predictable patterns, namely using a regular matching method to extract sensitive information from text with predictable patterns, such as IP addresses, MAC addresses, mailboxes, API keywords, certificate requests and private key content, with the following specific steps:
step S301: defining sensitive information with predictable model characteristics, such as IP addresses, email addresses, API keywords, private keys and certificate text;
step S302: extracting the sensitive information with predictable pattern features defined in step S301 by using the sub() function of the re module in Python;
step S303: saving the sensitive information.
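As an illustration of steps S301 to S303, a minimal Python sketch follows. The regular expressions are simplified placeholders (the patent does not give its actual pattern definitions), and re.findall is used here to collect matches, while the patent itself names the re module's sub() function.

import re

# Illustrative, simplified patterns; the patent's real definitions are not given.
PATTERNS = {
    "ip_address": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",
    "api_keyword": r"\bapi[_-]?key\s*[:=]\s*\S+",
    "private_key": r"-----BEGIN (?:RSA )?PRIVATE KEY-----",
}

def extract_sensitive(text: str) -> dict[str, list[str]]:
    # Step S302: collect the matches of every predictable-pattern category.
    return {name: re.findall(pattern, text, flags=re.IGNORECASE)
            for name, pattern in PATTERNS.items()}

# Step S303: save the extracted information (printed here for brevity).
print(extract_sensitive("contact alice@example.com on host 10.0.0.1"))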
And step S4: generating a text sequence, and performing text cleaning, text segmentation and text replacement on the analyzed text file, wherein the specific steps are as follows:
step S401: cleaning the text, removing non-ASCII characters and space characters at the beginning and the end of each line in the text, and converting capital characters into corresponding lowercase characters;
step S402: segmenting a text, namely segmenting the text into a plurality of lines, taking each line of text as a sentence, and performing word segmentation by using WordPiece;
step S403: text replacement, namely performing format replacement on URLs and emails in the text content, the format after replacement being: email username domain and http domain letters.
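A minimal sketch of steps S401 to S403 might look as follows; the replacement tokens are taken verbatim from step S403, while the regular expressions and function name are illustrative assumptions. WordPiece segmentation itself is performed later by the BERT tokenizer of step S5.

import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL_RE = re.compile(r"https?://\S+")

def generate_text_sequences(raw: str) -> list[str]:
    sequences = []
    for line in raw.splitlines():                          # S402: one line = one sentence
        line = line.strip()                                # S401: strip boundary spaces
        line = line.encode("ascii", "ignore").decode()     # S401: drop non-ASCII characters
        line = line.lower()                                # S401: lowercase
        line = URL_RE.sub("http domain letters", line)     # S403: replace URLs
        line = EMAIL_RE.sub("email username domain", line) # S403: replace emails
        if line:
            sequences.append(line)
    return sequences

print(generate_text_sequences("Contact Alice@Example.com\nSee https://example.com/x"))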
Step S5: embedding word vectors, and performing vector transformation on words in a text sequence by using a dynamic word embedding algorithm BERT, wherein the method comprises the following specific steps:
step S501: defining the text sequence generated in step S4 as X = {x_1, x_2, x_3, ..., x_n}, where x_n is the nth word in the text sequence;
step S502: using the BERT algorithm to calculate the word vector sequence E = {e_1, e_2, e_3, ..., e_n} corresponding to the text sequence X, where e_n is the word vector corresponding to the nth word x_n.
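The patent does not name a BERT implementation; the sketch below assumes the HuggingFace transformers library and the bert-base-uncased checkpoint, whose tokenizer also performs the WordPiece segmentation of step S402.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # checkpoint is an assumption
model = BertModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    # Return the word vector sequence E = {e_1, ..., e_n} for a text sequence X.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (n_tokens, 768)

E = embed("email username domain leaked from host 10.0.0.1")
print(E.shape)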
Step S6: training, verifying and testing data set dividing, wherein word vectors are divided into a training data set, a verifying data set and a testing data set according to proportion, and the method comprises the following specific steps:
step S601: performing labeling operation on the text sequence, wherein the labeling operation adopts a BIO strategy;
step S602: dividing the word vector sequence set into a training data set, a verification data set and a test data set according to a certain proportion, wherein the data volume proportion of the training data set, the verification data set and the test data set is 7:1:2.
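The 7:1:2 division of step S602 and the BIO labeling of step S601 can be sketched as follows; the concrete tag names are hypothetical, since Table 1 is not reproduced in this text.

import random

def split_dataset(samples: list, seed: int = 42):
    # Step S602: shuffle, then split 70% / 10% / 20%.
    data = samples[:]
    random.Random(seed).shuffle(data)
    n_train, n_val = int(0.7 * len(data)), int(0.1 * len(data))
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

# Step S601, illustrated: B marks the beginning of a sensitive span,
# I its continuation, O everything outside a span.
sequences = [(["contact", "alice@example.com", "now"],
              ["O", "B-EMAIL", "O"])]  # hypothetical tag names
train, val, test = split_dataset(sequences)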
step S7: model training, inputting the training word vector data set into a bidirectional long short-term memory network (Bi-LSTM) and adding an attention mechanism to form a bidirectional attention long short-term memory network model (ABi-LSTM), with the following specific steps:
step S701: updating the gate structures of the long short-term memory network (LSTM), with the following specific steps:
(1) update the forget gate of the LSTM:
f_t = σ(W_f e_t + U_f h_{t-1} + b_f)   (1)
(2) update the input gate of the LSTM:
i_t = σ(W_i e_t + U_i h_{t-1} + b_i)   (2)
(3) update the input modulation gate of the LSTM:
m_t = tanh(W_m e_t + U_m h_{t-1} + b_m)   (3)
(4) update the output gate of the LSTM:
o_t = σ(W_o e_t + U_o h_{t-1} + b_o)   (4)
(5) generate the next hidden state:
h_t = o_t ⊙ tanh(C_t)   (5)
where e_t is the input word vector, h_{t-1} is the hidden state of the LSTM at the previous time step, C_t is the cell state, and σ and tanh are the Sigmoid and hyperbolic tangent functions;
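For concreteness, a NumPy sketch of one time step of equations (1) to (5) follows. The standard cell-state update C_t = f_t * C_{t-1} + i_t * m_t, which equation (5) presupposes but the text does not list, is included; all dimensions and weights are placeholders.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(e_t, h_prev, C_prev, W, U, b):
    # W, U, b are dicts keyed by gate name: f, i, m, o.
    f_t = sigmoid(W["f"] @ e_t + U["f"] @ h_prev + b["f"])  # (1) forget gate
    i_t = sigmoid(W["i"] @ e_t + U["i"] @ h_prev + b["i"])  # (2) input gate
    m_t = np.tanh(W["m"] @ e_t + U["m"] @ h_prev + b["m"])  # (3) input modulation gate
    o_t = sigmoid(W["o"] @ e_t + U["o"] @ h_prev + b["o"])  # (4) output gate
    C_t = f_t * C_prev + i_t * m_t       # standard cell-state update (implied by the text)
    h_t = o_t * np.tanh(C_t)             # (5) next hidden state
    return h_t, C_t

d = 8  # hidden size, placeholder
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((d, d)) for g in "fimo"}
U = {g: rng.standard_normal((d, d)) for g in "fimo"}
b = {g: np.zeros(d) for g in "fimo"}
h_t, C_t = lstm_step(rng.standard_normal(d), np.zeros(d), np.zeros(d), W, U, b)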
step S702: capturing past and future states with a forward and a backward LSTM and training a bidirectional LSTM (Bi-LSTM) model, generating the forward and backward hidden states at time t and combining them into the Bi-LSTM hidden state h_t (formula given as an image in the original);
Step S703: adding an attention layer on the uppermost layer of the Bi-LSTM to form a bidirectional attention LSTM (ABi-LSTM), which is as follows:
m t,t' =tanh(W m h t +W m' h m' +b m ) (6)
a t,t' =σ(W a m t,t' +b a ) (7)
wherein a is t,t' Is an element of the attention matrix that can be used to capture the hidden state h t And h' t Similarity between, W m And W' m For implying a layer state h t And h' t Corresponding weight matrix, W a For corresponding non-linear combining weights, b m And b a Is a bias vector.
Step S704: generation of hidden layer State Classification tags l Using the attention mechanism according to the following formula t And completing ABi-LSTM model training.
Figure SMS_4
Step S705: using the validation dataset, the model was adjusted to the optimum according to the F1 value.
Step S8: the test set is used for testing the effectiveness of the ABi-LSTM model, and the specific steps are as follows:
step S801: inputting a test text sequence sample X_t = {x_{1t}, x_{2t}, ..., x_{nt}} into the trained ABi-LSTM classifier and calculating the label value corresponding to the sample X_t;
step S802: inputting all test text sequence samples into ABi-LSTM, and counting the values of TP (True Positive), FP (False Positive), FN (False Negative) and TN (True Negative);
step S803: the detection Accuracy (Accuracy), precision (Precision), recall (Recall) was calculated according to the following formula:
Precision=TP/(TP+FP)*100%
Recall=TP/(TP+FN)*100%
Accuracy=(TP+TN)/(TP+FN+TN+FP)*100%
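The metric computation of step S803 translates directly into code; a small sketch with hypothetical counts:

def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    # Step S803: precision, recall and accuracy, as percentages.
    return {
        "precision": tp / (tp + fp) * 100,
        "recall": tp / (tp + fn) * 100,
        "accuracy": (tp + tn) / (tp + fn + tn + fp) * 100,
    }

print(detection_metrics(tp=950, fp=15, fn=4, tn=31))  # hypothetical counts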
the method adopts data on an open source text sharing platform Pastebin, the data are collected from 11 months in 2019 to 2 months in 2020, and 1035634 text files are collected in total. To train the ABi-LSTM model, 12673 documents were manually selected and 144967 text sequences were obtained as training data. The training data is tagged according to the tagging strategy shown in table 1.
Table 1: the "BIO" tagging strategy (the table appears as an image in the original).
The invention compares detection methods built from combinations of ABi-LSTM, Bi-LSTM, BERT, CRF and BiGRU in terms of detection accuracy, precision and recall. The proposed ABi-LSTM + BERT method performs best, reaching a precision of 98.42%, a recall of 99.58% and an accuracy of 98.96%. The results demonstrate the correctness and effectiveness of the proposed method.
While the invention has been described with reference to specific preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A sensitive information extraction method based on natural semantic processing and deep learning is characterized by comprising the following steps:
step S1: collecting a text file, namely dividing the text file into a plain text file set P and a rich text file set R according to formats, wherein the format of the rich text file comprises HTML, XML, pdf, doc, pst and rtf;
step S2: analyzing the rich text files, namely analyzing the rich text files with different formats by using open source tools HTMLParser, pugixml, PDFLib, python-docx, libpst and win32 com;
and step S3: extracting sensitive information with predictable patterns, namely using a regular matching method to extract sensitive information from text with predictable patterns, such as IP addresses, MAC addresses, mailboxes, API keywords, certificate requests and private key content;
and step S4: generating a text sequence, and performing text cleaning, text segmentation and text replacement on the analyzed text file;
step S5: embedding word vectors, and performing vector conversion on words in a text sequence by using a dynamic word embedding algorithm BERT;
step S6: training, verifying and testing data set dividing, namely dividing word vectors into a training data set, a verifying data set and a testing data set in proportion;
step S7: model training, inputting the training word vector data set into a bidirectional long short-term memory network (Bi-LSTM) and adding an attention mechanism to form a bidirectional attention long short-term memory network model (ABi-LSTM);
step S8: and (4) testing the effectiveness of the model, namely testing the ABi-LSTM model by using a test set.
2. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 1, wherein in step S2, the specific steps of parsing the rich text file are as follows:
step S201: for an HTML file, creating a Parser with the Parser class in HTMLParser, creating a Filter or Visitor rule, using the Parser to obtain the text nodes that satisfy the Filter or Visitor, and parsing those text nodes;
step S202: aiming at the XML file, converting the original XML file into an object model set according to the label in the file, storing the converted data structure by using a DOM tree, and randomly accessing the stored data through a DOM interface to realize the analysis of the text file;
step S203: for a pdf file, parsing the end of the file to obtain the cross-reference table and the root object number, and parsing the file layer by layer with the PDFLib library according to the cross-reference table and the root object number;
step S204: aiming at doc and docx documents, acquiring a document object to be analyzed, outputting each segment of content in the document, and outputting a segment number and segment content to finish analysis;
step S205: aiming at the pst file, the file is directly analyzed by using a libpst, and the mail text, the attachment and the like are extracted for analysis; for the rtf file, the file text is extracted and analyzed by using win32 com.
3. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 2, wherein step S3 extracts sensitive information with predictable pattern features, and the extraction of sensitive information from predictable-pattern text can be specifically expressed as:
step S301: defining sensitive information with predictable model characteristics, such as IP addresses, email addresses, API keywords, private keys and certificate text;
step S302: extracting the sensitive information with predictable pattern features defined in step S301 by using the sub() function of the re module in Python;
step S303: and storing the sensitive information.
4. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 2, wherein in step S4, a text sequence is generated for the parsed text, which can be specifically expressed as:
step S401: cleaning the text, removing non-ASCII characters and space characters at the beginning and the end of each line in the text, and converting capital characters into corresponding lowercase characters;
step S402: segmenting a text, namely segmenting the text into a plurality of lines, taking each line of text as a sentence, and performing word segmentation by using WordPiece;
step S403: text replacement, namely performing format replacement on the text content URL and the email, wherein the format after the replacement is as follows: email username domain and http domain letters.
5. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 4, wherein in step S5, vector transformation is performed on words in the text sequence, which can be specifically expressed as:
step S501: defining the text sequence generated in step S4 as X = {x_1, x_2, x_3, ..., x_n}, where x_n is the nth word in the text sequence;
step S502: using the BERT algorithm to calculate the word vector sequence E = {e_1, e_2, e_3, ..., e_n} corresponding to the text sequence X, where e_n is the word vector corresponding to the nth word x_n.
6. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 5, wherein in step S6, the word vector sequence generated in step S5 is divided into a training data set, a verification data set and a test data set, which may be specifically expressed as:
step S601: performing a labeling operation on the text sequence, wherein the labeling operation adopts a BIO strategy;
step S602: dividing the word vector sequence set into a training data set, a verification data set and a test data set according to a certain proportion, wherein the data volume proportion of the training data set, the verification data set and the test data set is 7:1:2.
7. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 6, wherein in step S7, the training data set generated in step S6 is used to train the model and the verification data set to tune the ABi-LSTM model, which can be specifically expressed as:
step S701: updating the gate structures of the long short-term memory network (LSTM), with the following specific steps:
(1) update the forget gate of the LSTM:
f_t = σ(W_f e_t + U_f h_{t-1} + b_f)   (1)
(2) update the input gate of the LSTM:
i_t = σ(W_i e_t + U_i h_{t-1} + b_i)   (2)
(3) update the input modulation gate of the LSTM:
m_t = tanh(W_m e_t + U_m h_{t-1} + b_m)   (3)
(4) update the output gate of the LSTM:
o_t = σ(W_o e_t + U_o h_{t-1} + b_o)   (4)
(5) generate the next hidden state:
h_t = o_t ⊙ tanh(C_t)   (5)
where e_t is the input word vector, h_{t-1} is the hidden state of the LSTM at the previous time step, C_t is the cell state, and σ and tanh are the Sigmoid and hyperbolic tangent functions;
step S702: capturing past and future states with a forward and a backward LSTM and training a bidirectional LSTM (Bi-LSTM) model, generating the forward and backward hidden states at time t and combining them into the Bi-LSTM hidden state h_t (formula given as an image in the original);
Step S703: and adding an attention layer on the uppermost layer of the Bi-LSTM to form a bidirectional attention LSTM (ABi-LSTM), which is as follows:
m t,t′ =tanh(W m h t +W m' h m' +b m ) (6)
a t,t' =σ(W a m t,t' +b a ) (7)
wherein a is t,t' Is an element in the attention matrix and can be used to capture the hidden state h t And h' t Similarity between, W m And W' m For implying a layer state h t And h' t Corresponding weight matrix, W a For corresponding non-linear combining weights, b m And b a Is a bias vector.
Step S704: using attention mechanism according to the following formulaHidden layer state classification label l t And completing ABi-LSTM model training.
Figure FDA0003894871010000032
Step S705: using the validation dataset, the model is adjusted to the optimum according to the F1 value.
8. The method for extracting sensitive information based on natural semantic processing and deep learning as claimed in claim 7, wherein in step S8, the test data set is input to the ABi-LSTM classification model in real time to obtain classification labels, and the detection accuracy, precision and recall are calculated by comparing with actual labels, the specific steps are as follows:
step S801: inputting a test text sequence sample X_t = {x_{1t}, x_{2t}, ..., x_{nt}} into the trained ABi-LSTM classifier and calculating the label value corresponding to the sample X_t;
step S802: inputting all test text sequence samples into ABi-LSTM, and counting the values of TP (True Positive), FP (False Positive), FN (False Negative) and TN (True Negative);
step S803: the detection Accuracy (Accuracy), precision (Precision), recall (Recall) was calculated according to the following formula:
Precision=TP/(TP+FP)*100%
Recall=TP/(TP+FN)*100%
Accuracy=(TP+TN)/(TP+FN+TN+FP)*100%
CN202211270245.7A 2022-10-18 2022-10-18 Sensitive information extraction method based on natural semantic processing and deep learning Pending CN115718792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211270245.7A CN115718792A (en) 2022-10-18 2022-10-18 Sensitive information extraction method based on natural semantic processing and deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211270245.7A CN115718792A (en) 2022-10-18 2022-10-18 Sensitive information extraction method based on natural semantic processing and deep learning

Publications (1)

Publication Number Publication Date
CN115718792A true CN115718792A (en) 2023-02-28

Family

ID=85254186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211270245.7A Pending CN115718792A (en) 2022-10-18 2022-10-18 Sensitive information extraction method based on natural semantic processing and deep learning

Country Status (1)

Country Link
CN (1) CN115718792A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116432208A (en) * 2023-06-08 2023-07-14 长扬科技(北京)股份有限公司 Security management method, device, server and system for industrial Internet data
CN116432208B (en) * 2023-06-08 2023-09-05 长扬科技(北京)股份有限公司 Security management method, device, server and system for industrial Internet data
CN116932487A (en) * 2023-09-15 2023-10-24 北京安联通科技有限公司 Quantized data analysis method and system based on data paragraph division
CN116932487B (en) * 2023-09-15 2023-11-28 北京安联通科技有限公司 Quantized data analysis method and system based on data paragraph division
CN117389769A (en) * 2023-12-11 2024-01-12 杭州中房信息科技有限公司 Browser-end rich text copying method and system based on cloud service and cloud platform
CN117389769B (en) * 2023-12-11 2024-04-30 杭州中房信息科技有限公司 Browser-end rich text copying method and system based on cloud service and cloud platform

Similar Documents

Publication Publication Date Title
CN115718792A (en) Sensitive information extraction method based on natural semantic processing and deep learning
CN112069831A (en) Unreal information detection method based on BERT model and enhanced hybrid neural network
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN111931935A (en) Network security knowledge extraction method and device based on One-shot learning
CN112507337A (en) Implementation method of malicious JavaScript code detection model based on semantic analysis
CN113569050A (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
Suyanto Synonyms-based augmentation to improve fake news detection using bidirectional LSTM
CN113505601A (en) Positive and negative sample pair construction method and device, computer equipment and storage medium
CN113987125A (en) Text structured information extraction method based on neural network and related equipment thereof
CN115759071A (en) Government affair sensitive information identification system and method based on big data
Wang et al. Cyber threat intelligence entity extraction based on deep learning and field knowledge engineering
CN111325036A (en) Emerging technology prediction-oriented evidence fact extraction method and system
Wang et al. A novel calibrated label ranking based method for multiple emotions detection in Chinese microblogs
Sagcan et al. Toponym recognition in social media for estimating the location of events
Arbaatun et al. Hate Speech Detection on Twitter through Natural Language Processing using LSTM Model
CN114169447B (en) Event detection method based on self-attention convolution bidirectional gating cyclic unit network
CN115712713A (en) Text matching method, device and system and storage medium
CN112528653B (en) Short text entity recognition method and system
CN111159405B (en) Irony detection method based on background knowledge
CN114298041A (en) Network security named entity identification method and identification device
Bhanu Prasad et al. Author verification using rich set of linguistic features
CN110705287B (en) Method and system for generating text abstract
CN112287072A (en) Multi-dimensional Internet text risk data identification method
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination