CN114298035A - Text recognition desensitization method and system thereof - Google Patents

Text recognition desensitization method and system thereof

Info

Publication number
CN114298035A
Authority
CN
China
Prior art keywords
text
word
sequence
vector
desensitization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111640408.1A
Other languages
Chinese (zh)
Inventor
张宏莉
韩培义
叶麟
余翔湛
李东
于海宁
方滨兴
林华娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Electronic Information Engineering Research Institute of UESTC
Original Assignee
Guangdong Electronic Information Engineering Research Institute of UESTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Electronic Information Engineering Research Institute of UESTC filed Critical Guangdong Electronic Information Engineering Research Institute of UESTC
Priority to CN202111640408.1A priority Critical patent/CN114298035A/en
Publication of CN114298035A publication Critical patent/CN114298035A/en
Pending legal-status Critical Current

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention discloses a text recognition desensitization method and system. The method comprises the following steps: acquiring a text to be recognized, splitting it into a plurality of sentences, and segmenting the sentences into a plurality of words; converting each word of the text into a corresponding vector; inputting the vectors into a BERT model, which converts them into word representation vectors and assigns each word representation vector a classification label; inputting the labeled word representation vectors into a conditional random field, which is computed over the dependency relationships among the labels to obtain the globally optimal label sequence; and desensitizing the words labeled as sensitive entities. The invention supports concurrent execution, extracts the relational features of words within sentences to reflect sentence semantics more comprehensively, and derives word senses from sentence context to avoid ambiguity.

Description

Text recognition desensitization method and system thereof
Technical Field
The invention relates to the technical field of data processing, and in particular to a text recognition desensitization method and system.
Background
Sensitive data of individuals or businesses, such as patents, transaction contracts, and electronic medical records, often take the form of text documents. If a document is encrypted and then uploaded directly to a cloud server, some important cloud service functions become unusable; for example, online editing and preview of an encrypted document will fail. Desensitizing a document can remove private information while preserving the structural integrity of the document. How to automatically locate and desensitize private information in a document is therefore a technical challenge. Within recognition and desensitization, sensitive text recognition is especially important: the core of a sensitive-text protection scheme is to select sensitive words from massive amounts of text and recognize them accurately.
Existing named entity recognition is mainly used to recognize sensitive data entities such as names, addresses, and telephone numbers in text. Rule-based methods identify sensitive entities through regular expressions, rule dictionaries, and the like, and need no large training corpus; however, writing the rules requires expert knowledge, the rules cannot adapt to complex and changeable sensitive data, and recognition accuracy is poor.
Existing machine learning methods employ hidden Markov models, maximum entropy models, cascaded conditional random field models, support vector machines, and the like to identify and label sensitive information in unstructured data, but they need large amounts of labeled data, their ability to extract textual semantic features is weak, and their accuracy in recognizing some sensitive entities is poor.
With the rapid development of deep learning, sensitive data categories have been predicted with a combined model of a unidirectional Long Short-Term Memory neural network (LSTM) and a conditional random field, but such a model's ability to extract contextual semantic features from text is weak and its parallelism is poor.
Disclosure of Invention
In view of the problems in the background art, the invention aims to provide a text recognition desensitization method and system that solve the problems of weak contextual semantic feature extraction and low recognition accuracy.
To achieve this purpose, the invention adopts the following technical solutions:
In a first aspect, the invention discloses a text recognition desensitization method comprising the following steps:
step 1, obtaining a text to be recognized, splitting the text to be recognized into a plurality of sentences, and segmenting the sentences into a plurality of words;
step 2, converting each word of the text to be recognized into a corresponding vector;
step 3, inputting the vectors into a trained BERT model, which converts them into word representation vectors and assigns each word representation vector a classification label;
step 4, inputting the labeled word representation vectors into a conditional random field, which is computed over the dependency relationships among the labels to obtain the globally optimal label sequence;
and step 5, desensitizing the words labeled as sensitive entities.
Further, in step 3, the BERT model is trained in the following manner:
step 3-1, acquiring text to be recognized that contains sensitive entities, and establishing a text data set;
step 3-2, segmenting the text to be recognized into words, classifying and labeling the sensitive entities, and constructing training samples;
and step 3-3, pre-training the BERT model with the training samples to obtain the trained BERT model.
Further, in step 1, when a sentence is an English sentence it is segmented into fine-grained words with the WordPiece segmentation method, and when a sentence is a Chinese sentence it is segmented directly into single characters.
Further, in step 2, the vector is the superposed sum of a word vector, a segment vector, and a position vector.
Further, in step 3, each word is labeled "B-X", "I-X", "O", "E-X", or "S", where "B" marks the starting position of a sensitive entity, "I" a middle position, "E" the ending position, "S" a single-word entity, "O" a word that is not part of any sensitive entity, and "X" is the type label of the sensitive entity.
Further, in step 4, the sequence of word representation vectors is taken as an observation sequence and the sequence of labels as a label sequence; a first probability of each label sequence given the observation sequence is calculated and normalized to obtain a second probability, and the label sequence with the largest second probability is the optimal label sequence.
Further, the first probability of a label sequence given the observation sequence is calculated as:

\mathrm{score}(M \mid K) = \sum_{i=1}^{N} \left( A_{K_{i-1},\,K_i} + P_{i,\,K_i} \right)

where M is the observation sequence, K is a label sequence, i indexes the i-th word in the observation sequence, N ≥ 1 is the number of words, A is the label transition matrix, P_{i,K_i} is the predicted probability that the i-th word carries the label K_i, K_i is the label of the i-th word, and K_{i-1} is the label of the (i-1)-th word.
Further, the score(M|K) values are normalized with Softmax to obtain the final probability value; the second probability P(K|M) is calculated as:

P(K \mid M) = \frac{\exp\left(\mathrm{score}(M \mid K)\right)}{\sum_{K'} \exp\left(\mathrm{score}(M \mid K')\right)}

where K' ranges over all possible label sequences.
Further, in step 5, the desensitization operations include replacement, erasure, data generalization, and format-preserving encryption.
In a second aspect, the present invention discloses a text recognition desensitization system, comprising:
a text input module configured to input a text to be recognized;
the preprocessing module is configured to perform word segmentation processing on the input text to be recognized to obtain words;
a BERT model module configured to convert each word of the text to be recognized into a corresponding vector, convert the vector into a word representation vector, and assign the word representation vector a classification label;
a label sequence prediction module configured to compute the dependency relationships among the labels of the word representation vectors to obtain the globally optimal label sequence;
a desensitization module configured to perform desensitization operations on the labeled sensitive entities;
a text output module configured to output desensitized text.
By adopting the above technical solutions, this application achieves at least the following technical effects:
The invention discloses a text recognition desensitization method and system. Compared with the existing combined model of a unidirectional long short-term memory neural network and a conditional random field, whose contextual semantic feature extraction is weak and whose parallelism is poor, the combined BERT and conditional random field model can run concurrently and extracts the relational features of words within a sentence at several different levels, thereby reflecting sentence semantics more comprehensively; it derives word senses from sentence context to avoid ambiguity; classifying and labeling the words makes it convenient to subsequently apply different desensitization operations to different types of sensitive entities; and predicting the labels with the conditional random field yields the optimal labeling of the sensitive entities, improving the accuracy of the desensitization operation.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flow chart of a text recognition desensitization method in an embodiment of the present invention.
Fig. 2 is a block diagram of a structure of a text recognition desensitization system according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features, and advantages of the present invention comprehensible, embodiments are described in detail below with reference to the accompanying figures. In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. The invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein.
As shown in Fig. 1, the present embodiment provides a text recognition desensitization method, including:
S1, obtaining a text to be recognized, splitting the text to be recognized into a plurality of sentences, and segmenting the sentences into a plurality of words;
S2, converting each word of the text to be recognized into a corresponding vector with the trained BERT model;
S3, inputting the vectors into the trained BERT model, which converts them into word representation vectors and assigns each word representation vector a classification label;
S4, inputting the labeled word representation vectors into a conditional random field, which is computed over the dependency relationships among the labels to obtain the globally optimal label sequence;
and S5, desensitizing the words labeled as sensitive entities.
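To make S1–S5 concrete, the following is a minimal Python sketch of the recognition flow. It assumes the HuggingFace transformers library and a BERT token classifier fine-tuned as described below; the model name, the sentence splitter, and the `crf_decode` and `desensitize` helpers are illustrative assumptions, not elements defined by this text.

```python
# Minimal sketch of S1-S5; `crf_decode` and `desensitize` are hypothetical
# stand-ins for steps S4 and S5.
import re
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tok = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=9)

def recognize_and_desensitize(text, crf_decode, desensitize):
    for sent in filter(None, re.split(r"(?<=[。！？])", text)):  # S1: split into sentences
        enc = tok(sent, return_tensors="pt")                     # S1/S2: characters -> vectors
        with torch.no_grad():
            emissions = model(**enc).logits                      # S3: per-word label scores
        labels = crf_decode(emissions[0])                        # S4: globally optimal labels
        yield desensitize(sent, labels)                          # S5: desensitize entities
```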
Compared with the existing combined model of a unidirectional long short-term memory neural network and a conditional random field, whose contextual semantic feature extraction is weak and whose parallelism is poor, the combined BERT and conditional random field model can run concurrently, extracts the relational features of words within a sentence at several different levels to reflect sentence semantics more comprehensively, derives word senses from sentence context to avoid ambiguity, and finally predicts the labels with the conditional random field to obtain the optimal sensitive-entity labeling, improving the accuracy of the desensitization operation.
Preferably, in S1, when a sentence is an English sentence it is segmented into fine-grained words with the WordPiece segmentation method, and when a sentence is a Chinese sentence it is segmented directly into single characters.
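As an illustration of this preference, a short sketch using off-the-shelf BERT tokenizers; the model names are assumptions chosen for illustration, not mandated by the text.

```python
from transformers import BertTokenizer

en_tok = BertTokenizer.from_pretrained("bert-base-uncased")  # WordPiece for English
zh_tok = BertTokenizer.from_pretrained("bert-base-chinese")  # per-character for Chinese

print(en_tok.tokenize("unaffable"))    # fine-grained subword pieces, e.g. ['una', '##ffa', '##ble']
print(zh_tok.tokenize("小明今年13岁"))  # one token per Chinese character
```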
Preferably, in S2, the vector is the superposed sum of a word vector, a segment vector, and a position vector. Taking these as input, linear transformations yield the word vector representation of the target word, the segment vector representation of each word in context, and the position vector representation of the target word and each word in the context; the word, segment, and position vectors are superposed so that every vector fuses contextual information and represents rich semantic and syntactic features.
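A minimal sketch of this superposition, assuming BERT-base-like dimensions; the sizes and token ids below are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 21128, 512, 768  # bert-base-chinese-like sizes (assumed)
word_emb = nn.Embedding(vocab_size, hidden)    # word (token) embeddings
seg_emb = nn.Embedding(2, hidden)              # segment embeddings (sentence A/B)
pos_emb = nn.Embedding(max_len, hidden)        # position embeddings

ids = torch.tensor([[101, 2207, 3209, 102]])   # hypothetical token ids for a short sentence
segs = torch.zeros_like(ids)                   # all tokens in segment A
pos = torch.arange(ids.size(1)).unsqueeze(0)   # positions 0..3

x = word_emb(ids) + seg_emb(segs) + pos_emb(pos)  # the superposed input vector of S2
```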
Preferably, in S3, the BERT model is trained in the following way:
S3-1, acquiring text to be recognized that contains sensitive entities, and establishing a text data set;
S3-2, segmenting the text to be recognized into words, classifying and labeling the sensitive entities, and constructing training samples;
and S3-3, pre-training the BERT model with the training samples to obtain the trained BERT model.
The BERT model is pre-trained on a large body of text drawn from Wikipedia data (2,500 million words) and book corpora (800 million words); training samples are then constructed by labeling the sensitive entities in text, which strengthens the model's recognition of sensitive entities and improves its recognition accuracy.
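A sketch of the fine-tuning of step S3-3, continuing the pipeline sketch above; `train_loader` is an assumed DataLoader that yields encoded sentences with BIOES label ids, and the optimizer settings are illustrative.

```python
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=3e-5)  # `model` from the earlier sketch
model.train()
for epoch in range(3):
    for batch in train_loader:                  # assumed: dicts with input_ids, labels, ...
        loss = model(**batch).loss              # token-classification loss over BIOES tags
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```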
Preferably, in S3, each word is labeled "B-X", "I-X", "O", "E-X", or "S", where "B" marks the starting position of a sensitive entity, "I" a middle position, "O" a word that is not part of any sensitive entity, "E" the ending position of a sensitive entity, "S" a single-word entity, and "X" is the type to which the labeled sensitive entity belongs. In addition, "[CLS]" and "[SEP]" tokens added at the beginning and end of a sentence mark where the sentence starts and ends. Labeling the sensitive entities makes them easy to identify during desensitization, and labeling their types allows different desensitization operations to be applied to different types of sensitive entities. The types of sensitive entities include: name, age, address, contact information, bank account number, identification card number, medical history, and so on. For example, "Xiaoming is 13 years old this year" (小明今年13岁) corresponds to the labeling: "[CLS] B-NAME E-NAME O O B-AGE I-AGE E-AGE [SEP]".
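A toy sketch of this BIOES tagging for the example sentence; the entity spans are supplied by hand here, whereas in the method the model predicts them.

```python
def bioes_tags(n_tokens, spans):
    """spans: (start, end_inclusive, type) character spans of sensitive entities."""
    tags = ["O"] * n_tokens
    for s, e, t in spans:
        if s == e:
            tags[s] = "S"                 # single-word entity
        else:
            tags[s] = f"B-{t}"            # entity start
            for i in range(s + 1, e):
                tags[i] = f"I-{t}"        # entity middle
            tags[e] = f"E-{t}"            # entity end
    return tags

chars = list("小明今年13岁")
print(bioes_tags(len(chars), [(0, 1, "NAME"), (4, 6, "AGE")]))
# -> ['B-NAME', 'E-NAME', 'O', 'O', 'B-AGE', 'I-AGE', 'E-AGE']
```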
Preferably, in S4, the sequence of word representation vectors is taken as the observation sequence and the sequence of labels as the label sequence; a first probability of each label sequence given the observation sequence is calculated and normalized to obtain a second probability, and the label sequence with the largest second probability is the optimal label sequence. Through this calculation the optimal label sequence is obtained, giving both recognition and desensitization good precision.
Preferably, the first probability of a label sequence given the observation sequence is calculated as:

\mathrm{score}(M \mid K) = \sum_{i=1}^{N} \left( A_{K_{i-1},\,K_i} + P_{i,\,K_i} \right)

where M is the observation sequence, K is a label sequence, i indexes the i-th word in the observation sequence, N ≥ 1 is the number of words, A is the transition matrix, P_{i,K_i} is the predicted probability of the K_i-th label for the i-th word, K_i is the label of the i-th word, and K_{i-1} is the label of the (i-1)-th word.
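A numeric sketch of score(M|K) under this formula; the tiny tag set and random scores are illustrative only, and the separate start-score vector for the first label is an implementation assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_tags = 5, 4
P = rng.random((n_words, n_tags))  # P[i, k]: predicted score of label k for word i (from BERT)
A = rng.random((n_tags, n_tags))   # A[j, k]: transition score from label j to label k
start = rng.random(n_tags)         # score of the first label (it has no predecessor)

def score(K):
    """score(M|K) = start[K_1] + P[0, K_1] + sum over i of (A[K_{i-1}, K_i] + P[i, K_i])."""
    s = start[K[0]] + P[0, K[0]]
    for i in range(1, len(K)):
        s += A[K[i - 1], K[i]] + P[i, K[i]]
    return s
```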
Preferably, the score(M|K) values are normalized with Softmax to obtain the final probability value; the second probability P(K|M) is calculated as:

P(K \mid M) = \frac{\exp\left(\mathrm{score}(M \mid K)\right)}{\sum_{K'} \exp\left(\mathrm{score}(M \mid K')\right)}

where K' ranges over all possible label sequences.
Preferably, in S5, the desensitization operations include replacement, erasure, data generalization, and format-preserving encryption.
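Hedged sketches of these four operations applied to a recognized entity span; the digit-masking function is a format-preserving stand-in, not a real format-preserving cipher such as FF1.

```python
import re

def replace_entity(text, start, end, placeholder="<NAME>"):  # replacement
    return text[:start] + placeholder + text[end:]

def erase_entity(text, start, end):                          # erasure
    return text[:start] + text[end:]

def generalize_age(age):                                     # data generalization
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"                                  # e.g. 13 -> "10-19"

def mask_digits(s):                                          # format-preserving stand-in
    return re.sub(r"\d", "*", s)                             # keeps length/format; real FPE encrypts

print(replace_entity("小明今年13岁", 0, 2))                   # -> "<NAME>今年13岁"
```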
As shown in fig. 2, the present embodiment further provides a text recognition desensitization system, including:
a text input module configured to input a text to be recognized;
the preprocessing module is configured to perform word segmentation processing on an input text to be recognized to obtain words;
a BERT model module configured to convert each word of the text to be recognized into a corresponding vector, convert the vector into a word representation vector, and assign the word representation vector a classification label;
a label sequence prediction module configured to compute the dependency relationships among the labels of the word representation vectors to obtain the globally optimal label sequence;
a desensitization module configured to perform desensitization operations on the labeled sensitive entities;
a text output module configured to output desensitized text.
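A structural sketch of how these modules might be wired together; the component objects and the methods they expose are assumptions for illustration, not an interface defined by this text.

```python
class TextDesensitizationSystem:
    """Sketch of the module wiring of Fig. 2 (all component APIs are assumed)."""

    def __init__(self, preprocessor, bert_module, crf_predictor, desensitizer):
        self.pre = preprocessor      # preprocessing module: text -> words
        self.bert = bert_module      # BERT model module: words -> labeled vectors
        self.crf = crf_predictor     # label sequence prediction module
        self.desens = desensitizer   # desensitization module

    def process(self, text):         # text input module -> text output module
        words = self.pre.segment(text)
        emissions = self.bert.encode(words)
        labels = self.crf.decode(emissions)
        return self.desens.apply(words, labels)
```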
The text recognition desensitization system of this embodiment runs concurrently, extracts the relational features of words within sentences at several different levels to reflect sentence semantics more comprehensively, derives word senses from sentence context to avoid ambiguity, and finally predicts the labels with a conditional random field to obtain the optimal labels of the sensitive entities, improving the accuracy of the desensitization operation.
It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; the invention is not limited in this respect as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments do not limit the scope of protection of the present application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Claims (10)

1. A method of text recognition desensitization, comprising:
step 1, obtaining a text to be recognized, splitting the text to be recognized into a plurality of sentences, and segmenting the sentences into a plurality of words;
step 2, converting each word of the text to be recognized into a corresponding vector;
step 3, inputting the vectors into a BERT model, wherein the BERT model converts the vectors into word representation vectors and assigns each word representation vector a classification label;
step 4, inputting the labeled word representation vectors into a conditional random field, which is computed over the dependency relationships among the labels to obtain the globally optimal label sequence;
and 5, desensitizing the words marked as sensitive entities.
2. The text recognition desensitization method according to claim 1, wherein in said step 3 the BERT model is trained in the following way:
step 3-1, acquiring a text to be identified containing a sensitive entity, and establishing a text data set;
step 3-2, performing word segmentation on the text to be recognized, classifying and labeling sensitive entities, and constructing a training sample;
and 3-3, pre-training the Bert model by using the training samples to obtain the trained Bert model.
3. The text recognition desensitization method according to claim 1, wherein in said step 1, when said sentence is an English sentence, said sentence is segmented into fine-grained words using the WordPiece segmentation method, and when said sentence is a Chinese sentence, said Chinese sentence is segmented directly into single characters.
4. The text recognition desensitization method according to claim 1, wherein in said step 2, said vector is the superposed sum of a word vector, a segment vector, and a position vector.
5. The text recognition desensitization method according to claim 2, wherein in said step 3 each of said words is labeled "B-X", "I-X", "O", "E-X", or "S", wherein said "B" marks the starting position of said sensitive entity, said "I" a middle position, said "O" a word that is not part of said sensitive entity, said "E" the ending position of said sensitive entity, said "S" a single-word entity, and said "X" the type label to which said sensitive entity belongs.
6. The text recognition desensitization method according to claim 1, wherein in said step 4 the sequence of said word representation vectors is taken as an observation sequence and the sequence of labels as a label sequence; a first probability of said label sequence given said observation sequence is calculated and normalized to obtain a second probability, and the label sequence with the largest second probability value is the optimal label sequence.
7. The text recognition desensitization method of claim 6, wherein the first probability of the label sequence given the observation sequence is calculated as:

\mathrm{score}(M \mid K) = \sum_{i=1}^{N} \left( A_{K_{i-1},\,K_i} + P_{i,\,K_i} \right)

wherein M is the observation sequence, K is the label sequence, i indexes the i-th word in the observation sequence, A is the transition matrix, P_{i,K_i} is the predicted probability of the K_i-th label for the i-th word, K_i is the label of the i-th word, K_{i-1} is the label of the (i-1)-th word, and N ≥ 1 is the number of words.
8. The text recognition desensitization method according to claim 7, wherein the score(M|K) values are normalized by Softmax to obtain the final probability value, and the second probability P(K|M) is calculated as:

P(K \mid M) = \frac{\exp\left(\mathrm{score}(M \mid K)\right)}{\sum_{K'} \exp\left(\mathrm{score}(M \mid K')\right)}

wherein K' is any one of the possible label sequences.
9. The text recognition desensitization method according to claim 1, wherein in said step 5, said desensitization operations include replacement, erasure, data generalization, and format-preserving encryption.
10. A text recognition desensitization system, comprising:
a text input module configured to input a text to be recognized;
the preprocessing module is configured to perform word segmentation processing on the input text to be recognized to obtain words;
a BERT model module configured to convert each word of the text to be recognized into a corresponding vector, convert the vector into a word representation vector, and assign the word representation vector a classification label;
a label sequence prediction module configured to compute the dependency relationships among the labels of the word representation vectors to obtain the globally optimal label sequence;
a desensitization module configured to perform desensitization operations on the labeled sensitive entities;
a text output module configured to output desensitized text.
CN202111640408.1A 2021-12-29 2021-12-29 Text recognition desensitization method and system thereof Pending CN114298035A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111640408.1A CN114298035A (en) 2021-12-29 2021-12-29 Text recognition desensitization method and system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111640408.1A CN114298035A (en) 2021-12-29 2021-12-29 Text recognition desensitization method and system thereof

Publications (1)

Publication Number Publication Date
CN114298035A 2022-04-08

Family

ID=80971528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111640408.1A Pending CN114298035A (en) 2021-12-29 2021-12-29 Text recognition desensitization method and system thereof

Country Status (1)

Country Link
CN (1) CN114298035A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544240A (en) * 2022-11-24 2022-12-30 闪捷信息科技有限公司 Text sensitive information identification method and device, electronic equipment and storage medium
CN115828307A (en) * 2023-01-28 2023-03-21 广州佰锐网络科技有限公司 Text recognition method and AI system applied to OCR
CN115828307B (en) * 2023-01-28 2023-05-23 广州佰锐网络科技有限公司 Text recognition method and AI system applied to OCR
CN116910817A (en) * 2023-09-13 2023-10-20 北京国药新创科技发展有限公司 Desensitization processing method and device for medical data and electronic equipment
CN116910817B (en) * 2023-09-13 2023-12-29 北京国药新创科技发展有限公司 Desensitization processing method and device for medical data and electronic equipment
CN116956356A (en) * 2023-09-21 2023-10-27 深圳北控信息发展有限公司 Information transmission method and equipment based on data desensitization processing
CN116956356B (en) * 2023-09-21 2023-11-28 深圳北控信息发展有限公司 Information transmission method and equipment based on data desensitization processing
CN117077678A (en) * 2023-10-13 2023-11-17 河北神玥软件科技股份有限公司 Sensitive word recognition method, device, equipment and medium
CN117077678B (en) * 2023-10-13 2023-12-29 河北神玥软件科技股份有限公司 Sensitive word recognition method, device, equipment and medium

Similar Documents

Publication Publication Date Title
US11568143B2 (en) Pre-trained contextual embedding models for named entity recognition and confidence prediction
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN109858010B (en) Method and device for recognizing new words in field, computer equipment and storage medium
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN114298035A (en) Text recognition desensitization method and system thereof
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN112883732A (en) Method and device for identifying Chinese fine-grained named entities based on associative memory network
Ahmed et al. Offline arabic handwriting recognition using deep machine learning: A review of recent advances
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
Carbonell et al. Joint recognition of handwritten text and named entities with a neural end-to-end model
CN112464669B (en) Stock entity word disambiguation method, computer device, and storage medium
CN111274829A (en) Sequence labeling method using cross-language information
CN111742322A (en) System and method for domain and language independent definition extraction using deep neural networks
CN114139551A (en) Method and device for training intention recognition model and method and device for recognizing intention
Romero et al. Modern vs diplomatic transcripts for historical handwritten text recognition
Quirós et al. From HMMs to RNNs: computer-assisted transcription of a handwritten notarial records collection
CN115098673A (en) Business document information extraction method based on variant attention and hierarchical structure
US20230061731A1 (en) Significance-based prediction from unstructured text
CN112183060B (en) Reference resolution method of multi-round dialogue system
Bhattacharjee et al. Named entity recognition: A survey for indian languages
CN116089605A (en) Text emotion analysis method based on transfer learning and improved word bag model
AU2021104218A4 (en) A system for identification of personality traits and a method thereof
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN114444492A (en) Non-standard word class distinguishing method and computer readable storage medium
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination