CN115718792A - Sensitive information extraction method based on natural semantic processing and deep learning - Google Patents

Sensitive information extraction method based on natural semantic processing and deep learning

Info

Publication number
CN115718792A
CN115718792A (application CN202211270245.7A)
Authority
CN
China
Prior art keywords
text
lstm
sensitive information
file
model
Prior art date
Legal status
Pending
Application number
CN202211270245.7A
Other languages
Chinese (zh)
Inventor
程兴防
陈剑飞
王云霄
徐明伟
赵丽娜
Current Assignee
Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Original Assignee
Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd filed Critical Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Priority to CN202211270245.7A priority Critical patent/CN115718792A/en
Publication of CN115718792A publication Critical patent/CN115718792A/en
Pending legal-status Critical Current

Abstract

The invention discloses a sensitive information extraction method based on natural semantic processing and deep learning, which can be used to extract sensitive information from large-scale unstructured data. First, a number of text files are downloaded from the Internet, and rich text files of different formats are parsed with open-source tools. Then, sensitive information with predictable patterns is extracted from the parsed text files using a regular matching algorithm. Next, the natural semantic processing model BERT is used for word embedding and word vector generation, and a text sequence classification model is built by combining it with a deep learning algorithm, a bidirectional long short-term memory network with an attention mechanism (ABi-LSTM). Finally, sensitive information is extracted from the text sequences using the trained ABi-LSTM model. Throughout this process, a BIO labeling strategy is adopted for labeling the text sequences. The invention aims to provide a sensitive information extraction method for unstructured data while ensuring the accuracy of sensitive information extraction.

Description

Sensitive information extraction method based on natural semantic processing and deep learning
Technical Field
The invention relates to the technical field of information security, and in particular to a network sensitive information extraction method based on natural semantic processing techniques such as regular matching and BERT together with an improved long short-term memory network; it covers both an intelligent method for extracting sensitive information from a network and the processing of text files.
Background
Information leakage has always been a major problem in the field of information security; once sensitive information is leaked, serious consequences can follow. Most sensitive information is stored in unstructured data, and extracting it from large amounts of unstructured data has become one of the most important challenges. Research on the extraction of sensitive information in networks has arisen in this context.
The purpose of sensitive information extraction is to pull sensitive information out of text files through natural semantic processing and, by determining its presence, to protect it. Most text files in a network are stored as unstructured data, and methods based solely on regular matching suffer from incomplete extraction and low extraction accuracy. It is therefore necessary to process the unstructured data and then apply existing intelligent methods to extract sensitive information, so as to achieve high completeness and accuracy of extraction.
Unstructured data files come in many formats, and files in different formats must be parsed with specific tools or software. Current research on sensitive information extraction can be broadly divided into methods based on regular matching and methods based on machine learning. The former focuses on defining sensitive information patterns and extracts information through extraction templates; its accuracy depends strongly on how those templates are defined, and considerable manual effort is required. Machine learning based information extraction mainly relies on statistical models such as hidden Markov models, maximum entropy models and support vector machines, but these methods are prone to data sparseness caused by limitations of the underlying corpus; in practice, non-sensitive information is often mistaken for sensitive information, so the false alarm rate is high. Deep learning methods, popular in recent years, can take text content as sequence input for model training and learn the word vector relationships within a text sequence. To convert text content into word vectors for intelligent extraction, the existing Word2Vec approach, being a static model, cannot handle ambiguous words in a text file, so a dynamic model is needed to process the text content.
Meanwhile, the long short-term memory network (LSTM) is an efficient deep learning algorithm that can take a text sequence as input at one time and learn the relationships among the word vectors in the sequence, which undoubtedly increases the accuracy of sensitive information identification. The invention improves the LSTM into a bidirectional long short-term memory network with an attention mechanism (ABi-LSTM) to achieve accurate identification of sensitive information. For text with predictable patterns, sensitive information is extracted with the traditional regular matching method; to improve the identification effect further, the method uses the BERT model to vectorize text content, so that the same word receives different vectors in different sentences, then generates text sequences and trains the ABi-LSTM to achieve a second round of sensitive information extraction.
Disclosure of Invention
The invention aims to provide a sensitive information extraction method based on natural semantic processing and deep learning, which can be applied to sensitive information extraction in the fields of information security, information retrieval and the like.
A sensitive information extraction method based on natural semantic processing and deep learning comprises the following steps:
step S1: collecting a text file, namely dividing the text file into a plain text file set P and a rich text file set R according to formats, wherein the format of the rich text file comprises HTML, XML, pdf, doc, pst and rtf;
step S2: analyzing the rich text file, namely analyzing the rich text file with different formats by using open source tools HTMLParser, pugixml, PDFLib, python-docx, libpst and win32 com;
and step S3: extracting sensitive information with predictable patterns, namely using a regular matching method to extract sensitive information from text with predictable patterns, such as IP addresses, MAC addresses, mailboxes, API keywords, certificate requests and private key content;
and step S4: generating a text sequence, and performing text cleaning, text segmentation and text replacement on the analyzed text file;
step S5: embedding word vectors, and performing vector transformation on words in a text sequence by using a dynamic word embedding algorithm BERT;
step S6: training, verifying and testing data set dividing, namely dividing word vectors into a training data set, a verifying data set and a testing data set in proportion;
step S7: model training, inputting the training word vector data set into a bidirectional long short-term memory network (Bi-LSTM) and adding an attention mechanism to form a bidirectional attention long short-term memory network model (ABi-LSTM);
step S8: and (4) testing the effectiveness of the model, namely testing the ABi-LSTM model by using a test set.
In the above technical solution, in step S2, the specific steps of parsing the rich text file are as follows:
step S201: for an HTML file, creating a Parser with the Parser class in HTMLParser, creating a Filter or Visitor rule, using the Parser to obtain the text nodes that satisfy the Filter or Visitor, and parsing those text nodes;
step S202: aiming at the XML file, converting the original XML file into an object model set according to the label in the file, storing the converted data structure by using a DOM tree, and randomly accessing the stored data through a DOM interface to realize the analysis of the text file;
step S203: for a pdf file, parsing the end of the file to obtain the cross-reference table and the root object number, and parsing the file layer by layer with the PDFLib library according to the cross-reference table and the root object number;
step S204: aiming at doc and docx documents, acquiring a document object to be analyzed, outputting each segment of content in the document, and outputting a segment number and segment content to finish analysis;
step S205: aiming at the pst file, the file is directly analyzed by using a libpst, and the mail text, the attachment and the like are extracted for analysis; for the rtf file, the file text is extracted and analyzed by using win32 com.
In the above technical solution, step S3 extracts sensitive information with predictable pattern features, and the extraction of sensitive information from predictable-pattern text may be specifically expressed as:
step S301: defining sensitive information with predictable model characteristics, such as IP addresses, email addresses, API keywords, private keys and certificate text;
step S302: extracting the sensitive information with predictable pattern features defined in step S301 by using the sub() function of the re module in Python;
step S303: and saving the sensitive information.
In the above technical solution, in step S4, the text sequence generated for the parsed text may be specifically expressed as:
step S401: cleaning the text, removing non-ASCII characters and space characters at the beginning and the end of each line in the text, and converting capital characters into corresponding lowercase characters;
step S402: segmenting a text, namely segmenting the text into a plurality of lines, taking each line of text as a sentence, and performing word segmentation by using WordPiece;
step S403: text replacement, namely performing format replacement on the text content URL and the email, wherein the format after the replacement is as follows: email username domain and http domain letters.
In the above technical solution, in step S5, vector conversion is performed on words in the text sequence, which may specifically be represented as:
step S501: defining the text sequence generated in step S4 as X = {x_1, x_2, x_3, ..., x_n}, where x_n is the nth word in the text sequence;
step S502: using the BERT algorithm to calculate the word vector sequence E = {e_1, e_2, e_3, ..., e_n} corresponding to the text sequence X, where e_n is the word vector corresponding to the nth word x_n.
In the above technical solution, in step S6, the word vector sequence generated in step S5 is divided into a training data set, a verification data set and a test data set, which may be specifically expressed as:
step S601: performing labeling operation on the text sequence, wherein the labeling operation adopts a BIO strategy;
step S602: dividing the word vector sequence set into a training data set, a verification data set and a test data set in a fixed proportion, the data volume ratio of the three being 7:1:2.
In the above technical solution, in step S7, the training data set generated in step S6 is used to train the model and the verification data set to tune the ABi-LSTM model, which may be specifically expressed as:
step S701: updating the gate structures of the long short-term memory network (LSTM), with the following specific steps:
(1) update the forget gate of the LSTM:
f_t = σ(W_f e_t + U_f h_{t-1} + b_f)   (1)
(2) update the input gate of the LSTM:
i_t = σ(W_i e_t + U_i h_{t-1} + b_i)   (2)
(3) update the input modulation gate of the LSTM:
m_t = tanh(W_m e_t + U_m h_{t-1} + b_m)   (3)
(4) update the output gate of the LSTM:
o_t = σ(W_o e_t + U_o h_{t-1} + b_o)   (4)
(5) generate the next hidden state:
h_t = o_t ⊙ tanh(C_t)   (5)
where e_t is the input word vector, h_{t-1} is the hidden state of the LSTM at the previous time step, C_t is the cell state, and σ and tanh are the Sigmoid and hyperbolic tangent functions;
step S702: capturing past and future states with a forward and a backward LSTM and training a bidirectional LSTM (Bi-LSTM) model, generating the forward and backward hidden states at time t and combining them into the Bi-LSTM hidden state h_t (formula given as an image in the original);
Step S703: adding an attention layer on the uppermost layer of the Bi-LSTM to form a bidirectional attention LSTM (ABi-LSTM), and specifically comprising the following steps:
m t,t' =tanh(W m h t +W m' h m' +b m ) (6)
a t,t' =σ(W a m t,t' +b a ) (7)
wherein a is t,t' Is an element in the attention matrix and can be used to capture the hidden state h t And h' t Similarity between, W m And W' m For implying a layer state h t And h' t Corresponding weight matrix, W a For corresponding non-linear combining weights, b m And b a Is a bias vector.
Step S704: generation of hidden layer State Classification tags l Using an attention mechanism, according to the following formula t And completing ABi-LSTM model training.
Figure SMS_2
Step S705: using the validation dataset, the model is adjusted to the optimum according to the F1 value.
In the above technical solution, in step S8, the test data set is input to the ABi-LSTM classification model to obtain a classification label, and the detection accuracy, precision, and recall ratio are calculated by comparing the classification label with an actual label, specifically including the following steps:
step S801: inputting a test text sequence sample X_t = {x_{1t}, x_{2t}, ..., x_{nt}} into the trained ABi-LSTM classifier and calculating the label value corresponding to the sample X_t;
step S802: inputting all test text sequence samples into ABi-LSTM, and counting the values of TP (True Positive), FP (False Positive), FN (False Negative) and TN (True Negative);
step S803: the detection Accuracy (Accuracy), precision (Precision), recall (Recall) were calculated according to the following formula:
Precision=TP/(TP+FP)*100%
Recall=TP/(TP+FN)*100%
Accuracy=(TP+TN)/(TP+FN+TN+FP)*100%
compared with the prior art, the invention has the following beneficial effects:
the method is used for analyzing a sensitive information extraction method, different tools are used for analyzing the content of a rich text file, text cleaning, text segmentation and text replacement are carried out on text data, a BERT model is used for word vectorization on the text to generate a data sample set, the data sample set is further divided into a training data set, a verification data set and a test data set, and the three data sets are applied to the provided improved long-short term memory network model to achieve model training, adjustment and verification of sensitive information extraction. The experimental result verifies the effectiveness of the method.
The invention adopts a bidirectional long-short term memory network with attention mechanism as a detector, the detection model has high detection accuracy and strong generalization capability, and the invention can be applied to the extraction process of large-scale sensitive information.
Drawings
In order to clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings needed for the description of the embodiments or the prior art are briefly introduced below; they are used only for illustration and are not to be construed as limiting the patent;
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of the classification model of the present invention combining BERT and ABi-LSTM;
table 1 is the "BIO" tagging strategy involved in the present invention;
FIG. 3 is a schematic diagram illustrating validity verification of the method of the present invention;
Detailed Description
To better illustrate the objects, technical solutions and advantages of the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The invention provides a sensitive information extraction method based on natural semantic processing and deep learning, which combines regular matching and BERT with an improved long short-term memory network to extract sensitive information from a network. The flow is shown in FIG. 1 and comprises the following steps:
step S1: collecting a text file, and dividing the text file into a plain text file set P and a rich text file set R according to formats, wherein the format of the rich text file comprises HTML, XML, pdf, doc, pst and rtf;
step S2: parsing the rich text files, namely parsing rich text files of different formats with the open-source tools HTMLParser, pugixml, PDFLib, python-docx, libpst and win32com, with the following specific steps:
step S201: for an HTML file, creating a Parser with the Parser class in HTMLParser, creating a Filter or Visitor rule, using the Parser to obtain the text nodes that satisfy the Filter or Visitor, and parsing those text nodes;
step S202: aiming at the XML file, converting the original XML file into an object model set according to the label in the file, storing the converted data structure by using a DOM tree, and randomly accessing the stored data through a DOM interface to realize the analysis of the text file;
step S203: for a pdf file, parsing the end of the file to obtain the cross-reference table and the root object number, and parsing the file layer by layer with the PDFLib library according to the cross-reference table and the root object number;
step S204: aiming at doc and docx documents, acquiring a document object to be analyzed, outputting each section of content in the documents, and outputting a section number and the section content to finish analysis;
step S205: aiming at the pst file, the file is directly analyzed by using a libpst, and the mail text, the attachment and the like are extracted for analysis; for the rtf file, the text of the file is extracted and analyzed by the win32 com.
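By way of illustration of step S204 above, the following is a minimal Python sketch of paragraph-level parsing with the python-docx library; the file name is a placeholder and not part of the patent.

from docx import Document

def parse_docx(path: str) -> list[tuple[int, str]]:
    # Return (paragraph number, paragraph content) pairs, as in step S204.
    document = Document(path)
    return [(i, p.text) for i, p in enumerate(document.paragraphs)]

for number, content in parse_docx("example.docx"):  # placeholder file name
    print(number, content)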
And step S3: extracting sensitive information with predictable patterns, namely using a regular matching method to extract sensitive information from text with predictable patterns, such as IP addresses, MAC addresses, mailboxes, API keywords, certificate requests and private key content, with the following specific steps:
step S301: defining sensitive information with predictable model characteristics, such as IP addresses, email addresses, API keywords, private keys and certificate text;
step S302: extracting the sensitive information with predictable pattern features defined in step S301 by using the sub() function of the re module in Python;
step S303: saving the sensitive information.
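As an illustration of steps S301 to S303, a minimal Python sketch follows. The regular expressions are simplified placeholders (the patent does not give its actual pattern definitions), and re.findall is used here to collect matches, while the patent itself names the re module's sub() function.

import re

# Illustrative, simplified patterns; the patent's real definitions are not given.
PATTERNS = {
    "ip_address": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",
    "api_keyword": r"\bapi[_-]?key\s*[:=]\s*\S+",
    "private_key": r"-----BEGIN (?:RSA )?PRIVATE KEY-----",
}

def extract_sensitive(text: str) -> dict[str, list[str]]:
    # Step S302: collect the matches of every predictable-pattern category.
    return {name: re.findall(pattern, text, flags=re.IGNORECASE)
            for name, pattern in PATTERNS.items()}

# Step S303: save the extracted information (printed here for brevity).
print(extract_sensitive("contact alice@example.com on host 10.0.0.1"))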
And step S4: generating a text sequence, and performing text cleaning, text segmentation and text replacement on the analyzed text file, wherein the specific steps are as follows:
step S401: cleaning the text, removing non-ASCII characters and space characters at the beginning and the end of each line in the text, and converting capital characters into corresponding lowercase characters;
step S402: segmenting a text, namely segmenting the text into a plurality of lines, taking each line of text as a sentence, and performing word segmentation by using WordPiece;
step S403: text replacement, namely performing format replacement on URLs and emails in the text content, the format after replacement being: email username domain and http domain letters.
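A minimal sketch of steps S401 to S403 might look as follows; the replacement tokens are taken verbatim from step S403, while the regular expressions and function name are illustrative assumptions. WordPiece segmentation itself is performed later by the BERT tokenizer of step S5.

import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL_RE = re.compile(r"https?://\S+")

def generate_text_sequences(raw: str) -> list[str]:
    sequences = []
    for line in raw.splitlines():                          # S402: one line = one sentence
        line = line.strip()                                # S401: strip boundary spaces
        line = line.encode("ascii", "ignore").decode()     # S401: drop non-ASCII characters
        line = line.lower()                                # S401: lowercase
        line = URL_RE.sub("http domain letters", line)     # S403: replace URLs
        line = EMAIL_RE.sub("email username domain", line) # S403: replace emails
        if line:
            sequences.append(line)
    return sequences

print(generate_text_sequences("Contact Alice@Example.com\nSee https://example.com/x"))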
Step S5: embedding word vectors, and performing vector transformation on words in a text sequence by using a dynamic word embedding algorithm BERT, wherein the method comprises the following specific steps:
step S501: defining the text sequence generated in step S4 as X = {x_1, x_2, x_3, ..., x_n}, where x_n is the nth word in the text sequence;
step S502: using the BERT algorithm to calculate the word vector sequence E = {e_1, e_2, e_3, ..., e_n} corresponding to the text sequence X, where e_n is the word vector corresponding to the nth word x_n.
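The patent does not name a BERT implementation; the sketch below assumes the HuggingFace transformers library and the bert-base-uncased checkpoint, whose tokenizer also performs the WordPiece segmentation of step S402.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # checkpoint is an assumption
model = BertModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    # Return the word vector sequence E = {e_1, ..., e_n} for a text sequence X.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (n_tokens, 768)

E = embed("email username domain leaked from host 10.0.0.1")
print(E.shape)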
Step S6: training, verifying and testing data set dividing, wherein word vectors are divided into a training data set, a verifying data set and a testing data set according to proportion, and the method comprises the following specific steps:
step S601: performing labeling operation on the text sequence, wherein the labeling operation adopts a BIO strategy;
step S602: dividing the word vector sequence set into a training data set, a verification data set and a test data set according to a certain proportion, wherein the data volume proportion of the training data set, the verification data set and the test data set is 7:1:2.
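The 7:1:2 division of step S602 and the BIO labeling of step S601 can be sketched as follows; the concrete tag names are hypothetical, since Table 1 is not reproduced in this text.

import random

def split_dataset(samples: list, seed: int = 42):
    # Step S602: shuffle, then split 70% / 10% / 20%.
    data = samples[:]
    random.Random(seed).shuffle(data)
    n_train, n_val = int(0.7 * len(data)), int(0.1 * len(data))
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

# Step S601, illustrated: B marks the beginning of a sensitive span,
# I its continuation, O everything outside a span.
sequences = [(["contact", "alice@example.com", "now"],
              ["O", "B-EMAIL", "O"])]  # hypothetical tag names
train, val, test = split_dataset(sequences)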
step S7: model training, inputting the training word vector data set into a bidirectional long short-term memory network (Bi-LSTM) and adding an attention mechanism to form a bidirectional attention long short-term memory network model (ABi-LSTM), with the following specific steps:
step S701: updating the gate structures of the long short-term memory network (LSTM), with the following specific steps:
(1) update the forget gate of the LSTM:
f_t = σ(W_f e_t + U_f h_{t-1} + b_f)   (1)
(2) update the input gate of the LSTM:
i_t = σ(W_i e_t + U_i h_{t-1} + b_i)   (2)
(3) update the input modulation gate of the LSTM:
m_t = tanh(W_m e_t + U_m h_{t-1} + b_m)   (3)
(4) update the output gate of the LSTM:
o_t = σ(W_o e_t + U_o h_{t-1} + b_o)   (4)
(5) generate the next hidden state:
h_t = o_t ⊙ tanh(C_t)   (5)
where e_t is the input word vector, h_{t-1} is the hidden state of the LSTM at the previous time step, C_t is the cell state, and σ and tanh are the Sigmoid and hyperbolic tangent functions;
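For concreteness, a NumPy sketch of one time step of equations (1) to (5) follows. The standard cell-state update C_t = f_t * C_{t-1} + i_t * m_t, which equation (5) presupposes but the text does not list, is included; all dimensions and weights are placeholders.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(e_t, h_prev, C_prev, W, U, b):
    # W, U, b are dicts keyed by gate name: f, i, m, o.
    f_t = sigmoid(W["f"] @ e_t + U["f"] @ h_prev + b["f"])  # (1) forget gate
    i_t = sigmoid(W["i"] @ e_t + U["i"] @ h_prev + b["i"])  # (2) input gate
    m_t = np.tanh(W["m"] @ e_t + U["m"] @ h_prev + b["m"])  # (3) input modulation gate
    o_t = sigmoid(W["o"] @ e_t + U["o"] @ h_prev + b["o"])  # (4) output gate
    C_t = f_t * C_prev + i_t * m_t       # standard cell-state update (implied by the text)
    h_t = o_t * np.tanh(C_t)             # (5) next hidden state
    return h_t, C_t

d = 8  # hidden size, placeholder
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((d, d)) for g in "fimo"}
U = {g: rng.standard_normal((d, d)) for g in "fimo"}
b = {g: np.zeros(d) for g in "fimo"}
h_t, C_t = lstm_step(rng.standard_normal(d), np.zeros(d), np.zeros(d), W, U, b)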
step S702: capturing past and future states with a forward and a backward LSTM and training a bidirectional LSTM (Bi-LSTM) model, generating the forward and backward hidden states at time t and combining them into the Bi-LSTM hidden state h_t (formula given as an image in the original);
Step S703: adding an attention layer on the uppermost layer of the Bi-LSTM to form a bidirectional attention LSTM (ABi-LSTM), which is as follows:
m t,t' =tanh(W m h t +W m' h m' +b m ) (6)
a t,t' =σ(W a m t,t' +b a ) (7)
wherein a is t,t' Is an element of the attention matrix that can be used to capture the hidden state h t And h' t Similarity between, W m And W' m For implying a layer state h t And h' t Corresponding weight matrix, W a For corresponding non-linear combining weights, b m And b a Is a bias vector.
Step S704: generation of hidden layer State Classification tags l Using the attention mechanism according to the following formula t And completing ABi-LSTM model training.
Figure SMS_4
Step S705: using the validation dataset, the model was adjusted to the optimum according to the F1 value.
Step S8: the test set is used for testing the effectiveness of the ABi-LSTM model, and the specific steps are as follows:
step S801: inputting a test text sequence sample X_t = {x_{1t}, x_{2t}, ..., x_{nt}} into the trained ABi-LSTM classifier and calculating the label value corresponding to the sample X_t;
step S802: inputting all test text sequence samples into ABi-LSTM, and counting the values of TP (True Positive), FP (False Positive), FN (False Negative) and TN (True Negative);
step S803: the detection Accuracy (Accuracy), precision (Precision), recall (Recall) was calculated according to the following formula:
Precision=TP/(TP+FP)*100%
Recall=TP/(TP+FN)*100%
Accuracy=(TP+TN)/(TP+FN+TN+FP)*100%
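The metric computation of step S803 translates directly into code; a small sketch with hypothetical counts:

def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    # Step S803: precision, recall and accuracy, as percentages.
    return {
        "precision": tp / (tp + fp) * 100,
        "recall": tp / (tp + fn) * 100,
        "accuracy": (tp + tn) / (tp + fn + tn + fp) * 100,
    }

print(detection_metrics(tp=950, fp=15, fn=4, tn=31))  # hypothetical counts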
the method adopts data on an open source text sharing platform Pastebin, the data are collected from 11 months in 2019 to 2 months in 2020, and 1035634 text files are collected in total. To train the ABi-LSTM model, 12673 documents were manually selected and 144967 text sequences were obtained as training data. The training data is tagged according to the tagging strategy shown in table 1.
Table 1: the "BIO" tagging strategy (the table appears as an image in the original).
The invention compares detection methods built from combinations of ABi-LSTM, Bi-LSTM, BERT, CRF and BiGRU in terms of detection accuracy, precision and recall. The proposed ABi-LSTM + BERT method performs best, reaching a precision of 98.42%, a recall of 99.58% and an accuracy of 98.96%. The results demonstrate the correctness and effectiveness of the proposed method.
While the invention has been described with reference to specific preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A sensitive information extraction method based on natural semantic processing and deep learning is characterized by comprising the following steps:
step S1: collecting a text file, namely dividing the text file into a plain text file set P and a rich text file set R according to formats, wherein the format of the rich text file comprises HTML, XML, pdf, doc, pst and rtf;
step S2: analyzing the rich text files, namely analyzing the rich text files with different formats by using open source tools HTMLParser, pugixml, PDFLib, python-docx, libpst and win32 com;
and step S3: extracting sensitive information with predictable patterns, namely using a regular matching method to extract sensitive information from text with predictable patterns, such as IP addresses, MAC addresses, mailboxes, API keywords, certificate requests and private key content;
and step S4: generating a text sequence, and performing text cleaning, text segmentation and text replacement on the analyzed text file;
step S5: embedding word vectors, and performing vector conversion on words in a text sequence by using a dynamic word embedding algorithm BERT;
step S6: training, verifying and testing data set dividing, namely dividing word vectors into a training data set, a verifying data set and a testing data set in proportion;
step S7: model training, inputting the training word vector data set into a bidirectional long short-term memory network (Bi-LSTM) and adding an attention mechanism to form a bidirectional attention long short-term memory network model (ABi-LSTM);
step S8: and (4) testing the effectiveness of the model, namely testing the ABi-LSTM model by using a test set.
2. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 1, wherein in step S2, the specific steps of parsing the rich text file are as follows:
step S201: for an HTML file, creating a Parser with the Parser class in HTMLParser, creating a Filter or Visitor rule, using the Parser to obtain the text nodes that satisfy the Filter or Visitor, and parsing those text nodes;
step S202: aiming at the XML file, converting the original XML file into an object model set according to the label in the file, storing the converted data structure by using a DOM tree, and randomly accessing the stored data through a DOM interface to realize the analysis of the text file;
step S203: for a pdf file, parsing the end of the file to obtain the cross-reference table and the root object number, and parsing the file layer by layer with the PDFLib library according to the cross-reference table and the root object number;
step S204: aiming at doc and docx documents, acquiring a document object to be analyzed, outputting each segment of content in the document, and outputting a segment number and segment content to finish analysis;
step S205: aiming at the pst file, the file is directly analyzed by using a libpst, and the mail text, the attachment and the like are extracted for analysis; for the rtf file, the file text is extracted and analyzed by using win32 com.
3. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 2, wherein step S3 extracts sensitive information with predictable pattern features, and the extraction of sensitive information from predictable-pattern text can be specifically expressed as:
step S301: defining sensitive information with predictable model characteristics, such as IP addresses, email addresses, API keywords, private keys and certificate text;
step S302: extracting the sensitive information with predictable pattern features defined in step S301 by using the sub() function of the re module in Python;
step S303: and storing the sensitive information.
4. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 2, wherein in step S4, a text sequence is generated for the parsed text, which can be specifically expressed as:
step S401: cleaning the text, removing non-ASCII characters and space characters at the beginning and the end of each line in the text, and converting capital characters into corresponding lowercase characters;
step S402: segmenting a text, namely segmenting the text into a plurality of lines, taking each line of text as a sentence, and performing word segmentation by using WordPiece;
step S403: text replacement, namely performing format replacement on the text content URL and the email, wherein the format after the replacement is as follows: email username domain and http domain letters.
5. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 4, wherein in step S5, vector transformation is performed on words in the text sequence, which can be specifically expressed as:
step S501: defining the text sequence generated in step S4 as X = {x_1, x_2, x_3, ..., x_n}, where x_n is the nth word in the text sequence;
step S502: using the BERT algorithm to calculate the word vector sequence E = {e_1, e_2, e_3, ..., e_n} corresponding to the text sequence X, where e_n is the word vector corresponding to the nth word x_n.
6. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 5, wherein in step S6, the word vector sequence generated in step S5 is divided into a training data set, a verification data set and a test data set, which may be specifically expressed as:
step S601: performing a labeling operation on the text sequence, wherein the labeling operation adopts a BIO strategy;
step S602: dividing the word vector sequence set into a training data set, a verification data set and a test data set according to a certain proportion, wherein the data volume proportion of the training data set, the verification data set and the test data set is 7:1:2.
7. The method for extracting sensitive information based on natural semantic processing and deep learning according to claim 6, wherein in step S7, the training data set generated in step S6 is used to train the model and the verification data set to tune the ABi-LSTM model, which can be specifically expressed as:
step S701: updating the gate structures of the long short-term memory network (LSTM), with the following specific steps:
(1) update the forget gate of the LSTM:
f_t = σ(W_f e_t + U_f h_{t-1} + b_f)   (1)
(2) update the input gate of the LSTM:
i_t = σ(W_i e_t + U_i h_{t-1} + b_i)   (2)
(3) update the input modulation gate of the LSTM:
m_t = tanh(W_m e_t + U_m h_{t-1} + b_m)   (3)
(4) update the output gate of the LSTM:
o_t = σ(W_o e_t + U_o h_{t-1} + b_o)   (4)
(5) generate the next hidden state:
h_t = o_t ⊙ tanh(C_t)   (5)
where e_t is the input word vector, h_{t-1} is the hidden state of the LSTM at the previous time step, C_t is the cell state, and σ and tanh are the Sigmoid and hyperbolic tangent functions;
step S702: capturing past and future states with a forward and a backward LSTM and training a bidirectional LSTM (Bi-LSTM) model, generating the forward and backward hidden states at time t and combining them into the Bi-LSTM hidden state h_t (formula given as an image in the original);
Step S703: and adding an attention layer on the uppermost layer of the Bi-LSTM to form a bidirectional attention LSTM (ABi-LSTM), which is as follows:
m t,t′ =tanh(W m h t +W m' h m' +b m ) (6)
a t,t' =σ(W a m t,t' +b a ) (7)
wherein a is t,t' Is an element in the attention matrix and can be used to capture the hidden state h t And h' t Similarity between, W m And W' m For implying a layer state h t And h' t Corresponding weight matrix, W a For corresponding non-linear combining weights, b m And b a Is a bias vector.
Step S704: using attention mechanism according to the following formulaHidden layer state classification label l t And completing ABi-LSTM model training.
Figure FDA0003894871010000032
Step S705: using the validation dataset, the model is adjusted to the optimum according to the F1 value.
8. The method for extracting sensitive information based on natural semantic processing and deep learning as claimed in claim 7, wherein in step S8, the test data set is input to the ABi-LSTM classification model in real time to obtain classification labels, and the detection accuracy, precision and recall are calculated by comparing with actual labels, the specific steps are as follows:
step S801: inputting a test text sequence sample X_t = {x_{1t}, x_{2t}, ..., x_{nt}} into the trained ABi-LSTM classifier and calculating the label value corresponding to the sample X_t;
step S802: inputting all test text sequence samples into ABi-LSTM, and counting the values of TP (True Positive), FP (False Positive), FN (False Negative) and TN (True Negative);
step S803: the detection Accuracy (Accuracy), precision (Precision), recall (Recall) was calculated according to the following formula:
Precision=TP/(TP+FP)*100%
Recall=TP/(TP+FN)*100%
Accuracy=(TP+TN)/(TP+FN+TN+FP)*100%
CN202211270245.7A 2022-10-18 2022-10-18 Sensitive information extraction method based on natural semantic processing and deep learning Pending CN115718792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211270245.7A CN115718792A (en) 2022-10-18 2022-10-18 Sensitive information extraction method based on natural semantic processing and deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211270245.7A CN115718792A (en) 2022-10-18 2022-10-18 Sensitive information extraction method based on natural semantic processing and deep learning

Publications (1)

Publication Number Publication Date
CN115718792A true CN115718792A (en) 2023-02-28

Family

ID=85254186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211270245.7A Pending CN115718792A (en) 2022-10-18 2022-10-18 Sensitive information extraction method based on natural semantic processing and deep learning

Country Status (1)

Country Link
CN (1) CN115718792A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116432208A (en) * 2023-06-08 2023-07-14 长扬科技(北京)股份有限公司 Security management method, device, server and system for industrial Internet data
CN116432208B (en) * 2023-06-08 2023-09-05 长扬科技(北京)股份有限公司 Security management method, device, server and system for industrial Internet data
CN116932487A (en) * 2023-09-15 2023-10-24 北京安联通科技有限公司 Quantized data analysis method and system based on data paragraph division
CN116932487B (en) * 2023-09-15 2023-11-28 北京安联通科技有限公司 Quantized data analysis method and system based on data paragraph division
CN117389769A (en) * 2023-12-11 2024-01-12 杭州中房信息科技有限公司 Browser-end rich text copying method and system based on cloud service and cloud platform
CN117389769B (en) * 2023-12-11 2024-04-30 杭州中房信息科技有限公司 Browser-end rich text copying method and system based on cloud service and cloud platform

Similar Documents

Publication Publication Date Title
CN115718792A (en) Sensitive information extraction method based on natural semantic processing and deep learning
CN112069831A (en) Unreal information detection method based on BERT model and enhanced hybrid neural network
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN111931935A (en) Network security knowledge extraction method and device based on One-shot learning
CN112507337A (en) Implementation method of malicious JavaScript code detection model based on semantic analysis
CN113569050A (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
Suyanto Synonyms-based augmentation to improve fake news detection using bidirectional LSTM
CN113505601A (en) Positive and negative sample pair construction method and device, computer equipment and storage medium
CN113987125A (en) Text structured information extraction method based on neural network and related equipment thereof
CN115759071A (en) Government affair sensitive information identification system and method based on big data
Wang et al. Cyber threat intelligence entity extraction based on deep learning and field knowledge engineering
CN111325036A (en) Emerging technology prediction-oriented evidence fact extraction method and system
Wang et al. A novel calibrated label ranking based method for multiple emotions detection in Chinese microblogs
Sagcan et al. Toponym recognition in social media for estimating the location of events
Arbaatun et al. Hate Speech Detection on Twitter through Natural Language Processing using LSTM Model
CN114169447B (en) Event detection method based on self-attention convolution bidirectional gating cyclic unit network
CN115712713A (en) Text matching method, device and system and storage medium
CN112528653B (en) Short text entity recognition method and system
CN111159405B (en) Irony detection method based on background knowledge
CN114298041A (en) Network security named entity identification method and identification device
Bhanu Prasad et al. Author verification using rich set of linguistic features
CN110705287B (en) Method and system for generating text abstract
CN112287072A (en) Multi-dimensional Internet text risk data identification method
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination