CN114298035A - Text recognition desensitization method and system thereof - Google Patents

Text recognition desensitization method and system thereof

Info

Publication number
CN114298035A
Authority
CN
China
Prior art keywords
text
word
sequence
vector
desensitization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111640408.1A
Other languages
Chinese (zh)
Inventor
张宏莉
韩培义
叶麟
余翔湛
李东
于海宁
方滨兴
林华娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Electronic Information Engineering Research Institute of UESTC
Original Assignee
Guangdong Electronic Information Engineering Research Institute of UESTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Electronic Information Engineering Research Institute of UESTC filed Critical Guangdong Electronic Information Engineering Research Institute of UESTC
Priority to CN202111640408.1A priority Critical patent/CN114298035A/en
Publication of CN114298035A publication Critical patent/CN114298035A/en
Pending legal-status Critical Current

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention discloses a text recognition desensitization method and system. The method comprises the following steps: acquiring a text to be recognized, splitting it into a plurality of sentences, and segmenting the sentences into a plurality of words; converting each word of the text into a corresponding vector; inputting the vectors into a BERT model, which converts them into word representation vectors and assigns each word representation vector a classification label; inputting the labeled word representation vectors into a conditional random field, which is computed over the dependency relationships among the labels to obtain the globally optimal label sequence; and desensitizing the words labeled as sensitive entities. The invention supports concurrent execution, extracts the relational features of words within sentences to reflect sentence semantics more comprehensively, and derives word senses from sentence context to avoid ambiguity.

Description

Text recognition desensitization method and system thereof
Technical Field
The invention relates to the technical field of data processing, and in particular to a text recognition desensitization method and system.
Background
Sensitive data of individuals or businesses, such as patents, transaction contracts, and electronic medical records, often take the form of text documents. If a document is encrypted and then uploaded directly to a cloud server, some important cloud service functions become unusable; for example, online editing and preview of an encrypted document will fail. Desensitizing a document can remove private information while preserving the structural integrity of the document. How to automatically locate and desensitize private information in a document is therefore a technical challenge. Within recognition and desensitization, sensitive text recognition is especially important: the core of a sensitive-text protection scheme is to select sensitive words from massive amounts of text and recognize them accurately.
Existing named entity recognition is mainly used to recognize sensitive data entities such as names, addresses, and telephone numbers in text. Rule-based methods identify sensitive entities through regular expressions, rule dictionaries, and the like, and need no large training corpus; however, writing the rules requires expert knowledge, the rules cannot adapt to complex and changeable sensitive data, and recognition accuracy is poor.
Existing machine learning methods employ hidden Markov models, maximum entropy models, cascaded conditional random field models, support vector machines, and the like to identify and label sensitive information in unstructured data, but they need large amounts of labeled data, their ability to extract textual semantic features is weak, and their accuracy in recognizing some sensitive entities is poor.
With the rapid development of deep learning, sensitive data categories have been predicted with a combined model of a unidirectional Long Short-Term Memory neural network (LSTM) and a conditional random field, but such a model's ability to extract contextual semantic features from text is weak and its parallelism is poor.
Disclosure of Invention
In view of the problems in the background art, the invention aims to provide a text recognition desensitization method and system that solve the problems of weak contextual semantic feature extraction and low recognition accuracy.
To achieve this purpose, the invention adopts the following technical solutions:
In a first aspect, the invention discloses a text recognition desensitization method comprising the following steps:
step 1, obtaining a text to be recognized, splitting the text to be recognized into a plurality of sentences, and segmenting the sentences into a plurality of words;
step 2, converting each word of the text to be recognized into a corresponding vector;
step 3, inputting the vectors into a trained BERT model, which converts them into word representation vectors and assigns each word representation vector a classification label;
step 4, inputting the labeled word representation vectors into a conditional random field, which is computed over the dependency relationships among the labels to obtain the globally optimal label sequence;
and step 5, desensitizing the words labeled as sensitive entities.
Further, in step 3, the BERT model is trained in the following manner:
step 3-1, acquiring text to be recognized that contains sensitive entities, and establishing a text data set;
step 3-2, segmenting the text to be recognized into words, classifying and labeling the sensitive entities, and constructing training samples;
and step 3-3, pre-training the BERT model with the training samples to obtain the trained BERT model.
Further, in step 1, when a sentence is an English sentence it is segmented into fine-grained words with the WordPiece segmentation method, and when a sentence is a Chinese sentence it is segmented directly into single characters.
Further, in step 2, the vector is the superposed sum of a word vector, a segment vector, and a position vector.
Further, in step 3, each word is labeled "B-X", "I-X", "O", "E-X", or "S", where "B" marks the starting position of a sensitive entity, "I" a middle position, "E" the ending position, "S" a single-word entity, "O" a word that is not part of any sensitive entity, and "X" is the type label of the sensitive entity.
Further, in step 4, the sequence of word representation vectors is taken as an observation sequence and the sequence of labels as a label sequence; a first probability of each label sequence given the observation sequence is calculated and normalized to obtain a second probability, and the label sequence with the largest second probability is the optimal label sequence.
Further, the first probability of a label sequence given the observation sequence is calculated as:

\mathrm{score}(M \mid K) = \sum_{i=1}^{N} \left( A_{K_{i-1},\,K_i} + P_{i,\,K_i} \right)

where M is the observation sequence, K is a label sequence, i indexes the i-th word in the observation sequence, N ≥ 1 is the number of words, A is the label transition matrix, P_{i,K_i} is the predicted probability that the i-th word carries the label K_i, K_i is the label of the i-th word, and K_{i-1} is the label of the (i-1)-th word.
Further, the score(M|K) values are normalized with Softmax to obtain the final probability value; the second probability P(K|M) is calculated as:

P(K \mid M) = \frac{\exp\left(\mathrm{score}(M \mid K)\right)}{\sum_{K'} \exp\left(\mathrm{score}(M \mid K')\right)}

where K' ranges over all possible label sequences.
Further, in step 5, the desensitization operations include replacement, erasure, data generalization, and format-preserving encryption.
In a second aspect, the present invention discloses a text recognition desensitization system, comprising:
a text input module configured to input a text to be recognized;
the preprocessing module is configured to perform word segmentation processing on the input text to be recognized to obtain words;
a BERT model module configured to convert each word of the text to be recognized into a corresponding vector, convert the vector into a word representation vector, and assign the word representation vector a classification label;
a label sequence prediction module configured to compute the dependency relationships among the labels of the word representation vectors to obtain the globally optimal label sequence;
a desensitization module configured to perform desensitization operations on the labeled sensitive entities;
a text output module configured to output desensitized text.
By adopting the above technical solutions, this application achieves at least the following technical effects:
The invention discloses a text recognition desensitization method and system. Compared with the existing combined model of a unidirectional long short-term memory neural network and a conditional random field, whose contextual semantic feature extraction is weak and whose parallelism is poor, the combined BERT and conditional random field model can run concurrently and extracts the relational features of words within a sentence at several different levels, thereby reflecting sentence semantics more comprehensively; it derives word senses from sentence context to avoid ambiguity; classifying and labeling the words makes it convenient to subsequently apply different desensitization operations to different types of sensitive entities; and predicting the labels with the conditional random field yields the optimal labeling of the sensitive entities, improving the accuracy of the desensitization operation.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flow chart of a text recognition desensitization method in an embodiment of the present invention.
Fig. 2 is a block diagram of a structure of a text recognition desensitization system according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features, and advantages of the present invention comprehensible, embodiments are described in detail below with reference to the accompanying figures. In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. The invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein.
As shown in Fig. 1, the present embodiment provides a text recognition desensitization method, including:
S1, obtaining a text to be recognized, splitting the text to be recognized into a plurality of sentences, and segmenting the sentences into a plurality of words;
S2, converting each word of the text to be recognized into a corresponding vector with the trained BERT model;
S3, inputting the vectors into the trained BERT model, which converts them into word representation vectors and assigns each word representation vector a classification label;
S4, inputting the labeled word representation vectors into a conditional random field, which is computed over the dependency relationships among the labels to obtain the globally optimal label sequence;
and S5, desensitizing the words labeled as sensitive entities.
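To make S1–S5 concrete, the following is a minimal Python sketch of the recognition flow. It assumes the HuggingFace transformers library and a BERT token classifier fine-tuned as described below; the model name, the sentence splitter, and the `crf_decode` and `desensitize` helpers are illustrative assumptions, not elements defined by this text.

```python
# Minimal sketch of S1-S5; `crf_decode` and `desensitize` are hypothetical
# stand-ins for steps S4 and S5.
import re
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tok = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=9)

def recognize_and_desensitize(text, crf_decode, desensitize):
    for sent in filter(None, re.split(r"(?<=[。！？])", text)):  # S1: split into sentences
        enc = tok(sent, return_tensors="pt")                     # S1/S2: characters -> vectors
        with torch.no_grad():
            emissions = model(**enc).logits                      # S3: per-word label scores
        labels = crf_decode(emissions[0])                        # S4: globally optimal labels
        yield desensitize(sent, labels)                          # S5: desensitize entities
```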
Compared with the existing combined model of a unidirectional long short-term memory neural network and a conditional random field, whose contextual semantic feature extraction is weak and whose parallelism is poor, the combined BERT and conditional random field model can run concurrently, extracts the relational features of words within a sentence at several different levels to reflect sentence semantics more comprehensively, derives word senses from sentence context to avoid ambiguity, and finally predicts the labels with the conditional random field to obtain the optimal sensitive-entity labeling, improving the accuracy of the desensitization operation.
Preferably, in S1, when a sentence is an English sentence it is segmented into fine-grained words with the WordPiece segmentation method, and when a sentence is a Chinese sentence it is segmented directly into single characters.
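As an illustration of this preference, a short sketch using off-the-shelf BERT tokenizers; the model names are assumptions chosen for illustration, not mandated by the text.

```python
from transformers import BertTokenizer

en_tok = BertTokenizer.from_pretrained("bert-base-uncased")  # WordPiece for English
zh_tok = BertTokenizer.from_pretrained("bert-base-chinese")  # per-character for Chinese

print(en_tok.tokenize("unaffable"))    # fine-grained subword pieces, e.g. ['una', '##ffa', '##ble']
print(zh_tok.tokenize("小明今年13岁"))  # one token per Chinese character
```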
Preferably, in S2, the vector is the superposed sum of a word vector, a segment vector, and a position vector. Taking these as input, linear transformations yield the word vector representation of the target word, the segment vector representation of each word in context, and the position vector representation of the target word and each word in the context; the word, segment, and position vectors are superposed so that every vector fuses contextual information and represents rich semantic and syntactic features.
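A minimal sketch of this superposition, assuming BERT-base-like dimensions; the sizes and token ids below are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 21128, 512, 768  # bert-base-chinese-like sizes (assumed)
word_emb = nn.Embedding(vocab_size, hidden)    # word (token) embeddings
seg_emb = nn.Embedding(2, hidden)              # segment embeddings (sentence A/B)
pos_emb = nn.Embedding(max_len, hidden)        # position embeddings

ids = torch.tensor([[101, 2207, 3209, 102]])   # hypothetical token ids for a short sentence
segs = torch.zeros_like(ids)                   # all tokens in segment A
pos = torch.arange(ids.size(1)).unsqueeze(0)   # positions 0..3

x = word_emb(ids) + seg_emb(segs) + pos_emb(pos)  # the superposed input vector of S2
```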
Preferably, in S3, the BERT model is trained in the following way:
S3-1, acquiring text to be recognized that contains sensitive entities, and establishing a text data set;
S3-2, segmenting the text to be recognized into words, classifying and labeling the sensitive entities, and constructing training samples;
and S3-3, pre-training the BERT model with the training samples to obtain the trained BERT model.
The BERT model is pre-trained on a large body of text drawn from Wikipedia data (2,500 million words) and book corpora (800 million words); training samples are then constructed by labeling the sensitive entities in text, which strengthens the model's recognition of sensitive entities and improves its recognition accuracy.
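A sketch of the fine-tuning of step S3-3, continuing the pipeline sketch above; `train_loader` is an assumed DataLoader that yields encoded sentences with BIOES label ids, and the optimizer settings are illustrative.

```python
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=3e-5)  # `model` from the earlier sketch
model.train()
for epoch in range(3):
    for batch in train_loader:                  # assumed: dicts with input_ids, labels, ...
        loss = model(**batch).loss              # token-classification loss over BIOES tags
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```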
Preferably, in S3, each word is labeled "B-X", "I-X", "O", "E-X", or "S", where "B" marks the starting position of a sensitive entity, "I" a middle position, "O" a word that is not part of any sensitive entity, "E" the ending position of a sensitive entity, "S" a single-word entity, and "X" is the type to which the labeled sensitive entity belongs. In addition, "[CLS]" and "[SEP]" tokens added at the beginning and end of a sentence mark where the sentence starts and ends. Labeling the sensitive entities makes them easy to identify during desensitization, and labeling their types allows different desensitization operations to be applied to different types of sensitive entities. The types of sensitive entities include: name, age, address, contact information, bank account number, identification card number, medical history, and so on. For example, "Xiaoming is 13 years old this year" (小明今年13岁) corresponds to the labeling: "[CLS] B-NAME E-NAME O O B-AGE I-AGE E-AGE [SEP]".
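A toy sketch of this BIOES tagging for the example sentence; the entity spans are supplied by hand here, whereas in the method the model predicts them.

```python
def bioes_tags(n_tokens, spans):
    """spans: (start, end_inclusive, type) character spans of sensitive entities."""
    tags = ["O"] * n_tokens
    for s, e, t in spans:
        if s == e:
            tags[s] = "S"                 # single-word entity
        else:
            tags[s] = f"B-{t}"            # entity start
            for i in range(s + 1, e):
                tags[i] = f"I-{t}"        # entity middle
            tags[e] = f"E-{t}"            # entity end
    return tags

chars = list("小明今年13岁")
print(bioes_tags(len(chars), [(0, 1, "NAME"), (4, 6, "AGE")]))
# -> ['B-NAME', 'E-NAME', 'O', 'O', 'B-AGE', 'I-AGE', 'E-AGE']
```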
Preferably, in S4, the sequence of word representation vectors is taken as the observation sequence and the sequence of labels as the label sequence; a first probability of each label sequence given the observation sequence is calculated and normalized to obtain a second probability, and the label sequence with the largest second probability is the optimal label sequence. Through this calculation the optimal label sequence is obtained, giving both recognition and desensitization good precision.
Preferably, the first probability of a label sequence given the observation sequence is calculated as:

\mathrm{score}(M \mid K) = \sum_{i=1}^{N} \left( A_{K_{i-1},\,K_i} + P_{i,\,K_i} \right)

where M is the observation sequence, K is a label sequence, i indexes the i-th word in the observation sequence, N ≥ 1 is the number of words, A is the transition matrix, P_{i,K_i} is the predicted probability of the K_i-th label for the i-th word, K_i is the label of the i-th word, and K_{i-1} is the label of the (i-1)-th word.
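A numeric sketch of score(M|K) under this formula; the tiny tag set and random scores are illustrative only, and the separate start-score vector for the first label is an implementation assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_tags = 5, 4
P = rng.random((n_words, n_tags))  # P[i, k]: predicted score of label k for word i (from BERT)
A = rng.random((n_tags, n_tags))   # A[j, k]: transition score from label j to label k
start = rng.random(n_tags)         # score of the first label (it has no predecessor)

def score(K):
    """score(M|K) = start[K_1] + P[0, K_1] + sum over i of (A[K_{i-1}, K_i] + P[i, K_i])."""
    s = start[K[0]] + P[0, K[0]]
    for i in range(1, len(K)):
        s += A[K[i - 1], K[i]] + P[i, K[i]]
    return s
```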
Preferably, the score(M|K) values are normalized with Softmax to obtain the final probability value; the second probability P(K|M) is calculated as:

P(K \mid M) = \frac{\exp\left(\mathrm{score}(M \mid K)\right)}{\sum_{K'} \exp\left(\mathrm{score}(M \mid K')\right)}

where K' ranges over all possible label sequences.
Preferably, in S5, the desensitization operations include replacement, erasure, data generalization, and format-preserving encryption.
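Hedged sketches of these four operations applied to a recognized entity span; the digit-masking function is a format-preserving stand-in, not a real format-preserving cipher such as FF1.

```python
import re

def replace_entity(text, start, end, placeholder="<NAME>"):  # replacement
    return text[:start] + placeholder + text[end:]

def erase_entity(text, start, end):                          # erasure
    return text[:start] + text[end:]

def generalize_age(age):                                     # data generalization
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"                                  # e.g. 13 -> "10-19"

def mask_digits(s):                                          # format-preserving stand-in
    return re.sub(r"\d", "*", s)                             # keeps length/format; real FPE encrypts

print(replace_entity("小明今年13岁", 0, 2))                   # -> "<NAME>今年13岁"
```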
As shown in fig. 2, the present embodiment further provides a text recognition desensitization system, including:
a text input module configured to input a text to be recognized;
the preprocessing module is configured to perform word segmentation processing on an input text to be recognized to obtain words;
a BERT model module configured to convert each word of the text to be recognized into a corresponding vector, convert the vector into a word representation vector, and assign the word representation vector a classification label;
a label sequence prediction module configured to compute the dependency relationships among the labels of the word representation vectors to obtain the globally optimal label sequence;
a desensitization module configured to perform desensitization operations on the labeled sensitive entities;
a text output module configured to output desensitized text.
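A structural sketch of how these modules might be wired together; the component objects and the methods they expose are assumptions for illustration, not an interface defined by this text.

```python
class TextDesensitizationSystem:
    """Sketch of the module wiring of Fig. 2 (all component APIs are assumed)."""

    def __init__(self, preprocessor, bert_module, crf_predictor, desensitizer):
        self.pre = preprocessor      # preprocessing module: text -> words
        self.bert = bert_module      # BERT model module: words -> labeled vectors
        self.crf = crf_predictor     # label sequence prediction module
        self.desens = desensitizer   # desensitization module

    def process(self, text):         # text input module -> text output module
        words = self.pre.segment(text)
        emissions = self.bert.encode(words)
        labels = self.crf.decode(emissions)
        return self.desens.apply(words, labels)
```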
The text recognition desensitization system of this embodiment runs concurrently, extracts the relational features of words within sentences at several different levels to reflect sentence semantics more comprehensively, derives word senses from sentence context to avoid ambiguity, and finally predicts the labels with a conditional random field to obtain the optimal labels of the sensitive entities, improving the accuracy of the desensitization operation.
It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; the invention is not limited in this respect as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments do not limit the scope of protection of the present application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Claims (10)

1. A method of text recognition desensitization, comprising:
step 1, obtaining a text to be recognized, splitting the text to be recognized into a plurality of sentences, and segmenting the sentences into a plurality of words;
step 2, converting each word of the text to be recognized into a corresponding vector;
step 3, inputting the vectors into a BERT model, wherein the BERT model converts the vectors into word representation vectors and assigns each word representation vector a classification label;
step 4, inputting the labeled word representation vectors into a conditional random field, which is computed over the dependency relationships among the labels to obtain the globally optimal label sequence;
and 5, desensitizing the words marked as sensitive entities.
2. The text recognition desensitization method according to claim 1, wherein in said step 3 the BERT model is trained in the following way:
step 3-1, acquiring a text to be identified containing a sensitive entity, and establishing a text data set;
step 3-2, performing word segmentation on the text to be recognized, classifying and labeling sensitive entities, and constructing a training sample;
and 3-3, pre-training the Bert model by using the training samples to obtain the trained Bert model.
3. The text recognition desensitization method according to claim 1, wherein in said step 1, when said sentence is an English sentence, said sentence is segmented into fine-grained words using the WordPiece segmentation method, and when said sentence is a Chinese sentence, said Chinese sentence is segmented directly into single characters.
4. The text recognition desensitization method according to claim 1, wherein in said step 2, said vector is the superposed sum of a word vector, a segment vector, and a position vector.
5. The text recognition desensitization method according to claim 2, wherein in said step 3 each of said words is labeled "B-X", "I-X", "O", "E-X", or "S", wherein said "B" marks the starting position of said sensitive entity, said "I" a middle position, said "O" a word that is not part of said sensitive entity, said "E" the ending position of said sensitive entity, said "S" a single-word entity, and said "X" the type label to which said sensitive entity belongs.
6. The text recognition desensitization method according to claim 1, wherein in said step 4 the sequence of said word representation vectors is taken as an observation sequence and the sequence of labels as a label sequence; a first probability of said label sequence given said observation sequence is calculated and normalized to obtain a second probability, and the label sequence with the largest second probability value is the optimal label sequence.
7. The text recognition desensitization method of claim 6, wherein the first probability of the label sequence given the observation sequence is calculated as:

\mathrm{score}(M \mid K) = \sum_{i=1}^{N} \left( A_{K_{i-1},\,K_i} + P_{i,\,K_i} \right)

wherein M is the observation sequence, K is the label sequence, i indexes the i-th word in the observation sequence, A is the transition matrix, P_{i,K_i} is the predicted probability of the K_i-th label for the i-th word, K_i is the label of the i-th word, K_{i-1} is the label of the (i-1)-th word, and N ≥ 1 is the number of words.
8. The text recognition desensitization method according to claim 7, wherein the score(M|K) values are normalized by Softmax to obtain the final probability value, and the second probability P(K|M) is calculated as:

P(K \mid M) = \frac{\exp\left(\mathrm{score}(M \mid K)\right)}{\sum_{K'} \exp\left(\mathrm{score}(M \mid K')\right)}

wherein K' is any one of the possible label sequences.
9. The text recognition desensitization method according to claim 1, wherein in said step 5, said desensitization operations include replacement, erasure, data generalization, and format-preserving encryption.
10. A text recognition desensitization system, comprising:
a text input module configured to input a text to be recognized;
the preprocessing module is configured to perform word segmentation processing on the input text to be recognized to obtain words;
a BERT model module configured to convert each word of the text to be recognized into a corresponding vector, convert the vector into a word representation vector, and assign the word representation vector a classification label;
a label sequence prediction module configured to compute the dependency relationships among the labels of the word representation vectors to obtain the globally optimal label sequence;
a desensitization module configured to perform desensitization operations on the labeled sensitive entities;
a text output module configured to output desensitized text.
CN202111640408.1A 2021-12-29 2021-12-29 Text recognition desensitization method and system thereof Pending CN114298035A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111640408.1A CN114298035A (en) 2021-12-29 2021-12-29 Text recognition desensitization method and system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111640408.1A CN114298035A (en) 2021-12-29 2021-12-29 Text recognition desensitization method and system thereof

Publications (1)

Publication Number Publication Date
CN114298035A 2022-04-08

Family

ID=80971528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111640408.1A Pending CN114298035A (en) 2021-12-29 2021-12-29 Text recognition desensitization method and system thereof

Country Status (1)

Country Link
CN (1) CN114298035A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544240A (en) * 2022-11-24 2022-12-30 闪捷信息科技有限公司 Text sensitive information identification method and device, electronic equipment and storage medium
CN115828307A (en) * 2023-01-28 2023-03-21 广州佰锐网络科技有限公司 Text recognition method and AI system applied to OCR
CN115828307B (en) * 2023-01-28 2023-05-23 广州佰锐网络科技有限公司 Text recognition method and AI system applied to OCR
CN116910817A (en) * 2023-09-13 2023-10-20 北京国药新创科技发展有限公司 Desensitization processing method and device for medical data and electronic equipment
CN116910817B (en) * 2023-09-13 2023-12-29 北京国药新创科技发展有限公司 Desensitization processing method and device for medical data and electronic equipment
CN116956356A (en) * 2023-09-21 2023-10-27 深圳北控信息发展有限公司 Information transmission method and equipment based on data desensitization processing
CN116956356B (en) * 2023-09-21 2023-11-28 深圳北控信息发展有限公司 Information transmission method and equipment based on data desensitization processing
CN117077678A (en) * 2023-10-13 2023-11-17 河北神玥软件科技股份有限公司 Sensitive word recognition method, device, equipment and medium
CN117077678B (en) * 2023-10-13 2023-12-29 河北神玥软件科技股份有限公司 Sensitive word recognition method, device, equipment and medium

Similar Documents

Publication Publication Date Title
US11568143B2 (en) Pre-trained contextual embedding models for named entity recognition and confidence prediction
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN109858010B (en) Method and device for recognizing new words in field, computer equipment and storage medium
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN114298035A (en) Text recognition desensitization method and system thereof
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN112883732A (en) Method and device for identifying Chinese fine-grained named entities based on associative memory network
Ahmed et al. Offline arabic handwriting recognition using deep machine learning: A review of recent advances
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
Carbonell et al. Joint recognition of handwritten text and named entities with a neural end-to-end model
CN112464669B (en) Stock entity word disambiguation method, computer device, and storage medium
CN111274829A (en) Sequence labeling method using cross-language information
CN111742322A (en) System and method for domain and language independent definition extraction using deep neural networks
CN114139551A (en) Method and device for training intention recognition model and method and device for recognizing intention
Romero et al. Modern vs diplomatic transcripts for historical handwritten text recognition
Quirós et al. From HMMs to RNNs: computer-assisted transcription of a handwritten notarial records collection
CN115098673A (en) Business document information extraction method based on variant attention and hierarchical structure
US20230061731A1 (en) Significance-based prediction from unstructured text
CN112183060B (en) Reference resolution method of multi-round dialogue system
Bhattacharjee et al. Named entity recognition: A survey for indian languages
CN116089605A (en) Text emotion analysis method based on transfer learning and improved word bag model
AU2021104218A4 (en) A system for identification of personality traits and a method thereof
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN114444492A (en) Non-standard word class distinguishing method and computer readable storage medium
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination