CN111159770B - Text data desensitization method, device, medium and electronic equipment - Google Patents


Info

Publication number
CN111159770B
CN111159770B (application CN201911421350.4A)
Authority
CN
China
Prior art keywords
text data
character
label
preset
sensitive entity
Prior art date
Legal status
Active
Application number
CN201911421350.4A
Other languages
Chinese (zh)
Other versions
CN111159770A (en)
Inventor
张子锐
Current Assignee
Yidu Cloud Beijing Technology Co Ltd
Original Assignee
Yidu Cloud Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Yidu Cloud Beijing Technology Co Ltd
Priority to CN201911421350.4A
Publication of CN111159770A
Application granted
Publication of CN111159770B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G06F 21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218: Protecting access to data via a platform to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245: Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text data desensitization method comprising the following steps: acquiring text data; processing the text data through a preset dictionary tree and/or a preset regular expression to obtain first sensitive entity words of the text data; processing the text data through a preset model to obtain a label for each character in the text data; determining second sensitive entity words of the text data according to the labels; determining the sensitive entity words of the text data from the first and second sensitive entity words; and desensitizing the sensitive entity words of the text data. Because the first and second sensitive entity words complement and cross-check each other, the sensitive entity words are determined more accurately, the desensitization result is more accurate, desensitization of non-sensitive entity words is avoided, and user experience is improved.

Description

Text data desensitization method, device, medium and electronic equipment
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text data desensitization method, a text data desensitization device, a text data desensitization medium and electronic equipment.
Background
With the rapid development of the internet, large volumes of text data are generated. Such text data generally contains one or more entity words, some of which relate to user privacy; these are commonly called sensitive entity words. In such cases the sensitive entity words in the text data must be desensitized, which first requires identifying the specific sensitive entity words in the text. In the related art, sensitive entity words are generally determined from text data in several ways: fields whose content is known to be fully private, dictionary- and rule-based methods, and statistical machine learning methods.
However, while implementing the inventive concept, the inventors found the following technical problems in the related art:
the "fully private information field" approach targets fields whose textual content, such as name, telephone, or ID card number, is known in advance to be entirely private, and desensitizes them by directly replacing the original content with placeholder characters. Its drawback is that the meaning of each field must be unambiguous, and the field must consist entirely of private content. Because current text data contains numerous fields from complicated sources, the method does not generalize: the data types it can process are limited, and it applies only to data known to be completely sensitive.
Dictionary- and rule-based methods mainly construct hand-crafted feature recognizers for specific sensitive content and then apply desensitization. For example, to desensitize names, a dictionary of known names is built manually and any text word found in the dictionary is replaced; for dates and telephone numbers, regular expressions extract the values, which are then removed or replaced. The drawbacks are that sensitive entity words determined from a dictionary alone have limited accuracy, while sensitive entity words determined by statistical machine learning depend heavily on large training corpora, and high-quality labelled data is costly, which reduces the practical effect of purely machine-learning-based methods.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present invention and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a text data desensitization method, apparatus, medium, and electronic device that overcome the above defects in the related art at least to some extent and thereby improve the accuracy of obtaining the sensitive entity words of text data.
Additional features and advantages of the invention will be set forth in the detailed description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
According to a first aspect of embodiments of the present invention, there is provided a text data desensitization method, including: acquiring text data; processing the text data through a preset dictionary tree and/or a preset regular expression to obtain a first sensitive entity word of the text data; processing the text data through a preset model to obtain a label of each character in the text data; determining a second sensitive entity word of the text data according to the label of each character in the text data; determining a sensitive entity word of the text data according to a first sensitive entity word of the text data and a second sensitive entity word of the text data; and desensitizing the sensitive entity words of the text data.
In some embodiments of the invention, the method further comprises: combining characters in the general dictionary to obtain a preset dictionary; establishing a dictionary tree based on the words in the preset dictionary, wherein each node in the dictionary tree is a character of each word in the preset dictionary; and establishing an automaton in the dictionary tree to obtain the preset dictionary tree.
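The dictionary combination and trie construction summarized above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the surname and given-name lists are invented example data, and the automaton (fail-pointer) stage is omitted here.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> child TrieNode
        self.is_word = False  # marks the last character of a dictionary word

def build_trie(words):
    """Build a dictionary tree: one node per character, shared prefixes reused."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def contains(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_word

# Combine common surnames with given names to form the preset dictionary,
# easing the limited-coverage problem of a plain name dictionary.
surnames = ["张", "王", "李"]  # invented example data
given_names = ["伟", "芳"]
preset_dict = [s + g for s in surnames for g in given_names]
root = build_trie(preset_dict)
```

Note how combined names such as "李芳" are found even if only their parts appeared in the source dictionaries, which is the recall improvement the embodiment describes.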
In some embodiments of the invention, the method further comprises: constructing the regular expression based on the characteristics of a preset entity vocabulary, wherein the preset entity vocabulary comprises any one or more of the following items: identification number, telephone number, bank card number, passport number, social security card number, house number, mailbox account number, institution name.
In some embodiments of the present invention, before the text data is processed through the preset model to obtain a label of each character in the text data, the method includes: processing each character in the text data to obtain a feature vector of each character; processing each vocabulary in the text data to obtain a feature vector of each vocabulary; and generating a feature vector sequence of the text data based on the feature vector of each character and the feature vector of each vocabulary.
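A minimal sketch of building such a feature vector sequence. It assumes random vectors stand in for pretrained character embeddings and a BMES-style (begin/middle/end/single) segmentation feature per character; the exact features and dimensions used by the patent are not specified here.

```python
import numpy as np

def seg_features(words):
    """BMES-style segmentation feature per character: B=0, M=1, E=2, S=3."""
    feats = []
    for w in words:
        if len(w) == 1:
            feats.append(3)                               # single-character word
        else:
            feats.extend([0] + [1] * (len(w) - 2) + [2])  # begin, middles, end
    return feats

def feature_sequence(words, char_emb, dim=4):
    """Concatenate each character's embedding with a one-hot segmentation feature."""
    chars = [c for w in words for c in w]
    seq = []
    for c, s in zip(chars, seg_features(words)):
        vec = char_emb.setdefault(c, np.random.rand(dim))  # stand-in embedding
        one_hot = np.zeros(4)
        one_hot[s] = 1.0
        seq.append(np.concatenate([vec, one_hot]))
    return seq

seq = feature_sequence(["张三", "住院"], {})
```

Each element of `seq` is one character's input vector, so the sequence length equals the character count of the text.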
In some embodiments of the present invention, processing the text data through the preset model to obtain the label of each character in the text data includes: processing the feature vector sequence of the text data through a bidirectional long short-term memory (BiLSTM) network layer to obtain the probability of the label corresponding to the character at each position in the text data; processing the probability of the label corresponding to the character at each position through a conditional random field (CRF) layer to obtain a score for each label probability; and decoding the label probability scores with the Viterbi algorithm to obtain the label of each character in the text data.
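The Viterbi decoding step can be illustrated as follows. This sketch assumes per-position label scores (emissions) and label-to-label transition scores as plain NumPy arrays; it is a generic Viterbi implementation, not the patent's BiLSTM-CRF code.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best label path given per-position label scores emissions[T, L]
    and label-transition scores transitions[L, L]."""
    T, L = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((T, L), dtype=int)   # backpointers
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)    # best previous label per current label
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):        # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

labels = viterbi(np.array([[5., 0.], [0., 5.], [5., 0.]]), np.zeros((2, 2)))
```

With zero transition scores the decode simply picks the highest-scoring label per position; the CRF's learned transitions are what let neighbouring labels constrain each other.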
In some embodiments of the invention, prior to acquiring the text data, the method further comprises: acquiring training data, wherein the training data comprises text data with known sensitive entity words; processing the training data through a word2vec or GloVe model to obtain a feature vector of each character in the training data; performing word segmentation on the training data to obtain a word segmentation feature sequence of the training data; determining the label of the character at each position in each vocabulary of the training data according to the word segmentation feature sequence; and training a model with the feature vector of each character in the training data and the label of the character at each position in each vocabulary, to obtain the preset model.
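Deriving a label for the character at each position from segmented training data can be sketched with a simple BIO scheme (B for the first character of a sensitive entity word, I for its remaining characters, O otherwise). The actual label inventory used by the patent is not specified, so "B-ENT"/"I-ENT"/"O" are assumptions of this sketch.

```python
def bio_labels(words, entity_words):
    """Label every character: B-ENT opens a known sensitive entity word,
    I-ENT continues it, O marks everything else."""
    labels = []
    for w in words:
        if w in entity_words:
            labels.extend(["B-ENT"] + ["I-ENT"] * (len(w) - 1))
        else:
            labels.extend(["O"] * len(w))
    return labels

train_labels = bio_labels(["张三", "在", "医院"], {"张三"})
```

These per-character labels, paired with the per-character feature vectors, form the supervised training set for the preset model.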
In some embodiments of the invention, the method further comprises: verifying the sensitive entity words of the text data through a preset entity word set; and deleting the sensitive entity words of the text data according to the verification result.
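One plausible reading of this verification step is filtering candidate sensitive entity words against a preset set of known non-sensitive words; whether the preset entity word set acts as such a whitelist is an assumption of this sketch, not something the text states.

```python
def verify_entities(candidates, known_non_sensitive):
    """Keep only candidates that do not appear in the preset set of
    known non-sensitive entity words (whitelist assumption)."""
    return [w for w in candidates if w not in known_non_sensitive]

checked = verify_entities(["张三", "门诊"], {"门诊"})
```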
According to a second aspect of embodiments of the present invention, there is provided a text data desensitizing apparatus including: the first acquisition module is used for acquiring text data; the first processing module is used for processing the text data through a preset dictionary tree and/or a preset regular expression to obtain a first sensitive entity word of the text data; the second processing module is used for processing the text data through a preset model to obtain a label of each character in the text data; the first determining module is used for determining a second sensitive entity word of the text data according to the label of each character in the text data; the second determining module is used for determining the sensitive entity words of the text data according to the first sensitive entity words of the text data and the second sensitive entity words of the text data; and the desensitization processing module is used for desensitizing the sensitive entity words of the text data.
In some embodiments of the invention, the apparatus further comprises: the combination module is used for combining the characters in the general dictionary to obtain a preset dictionary; the first establishing module is used for establishing a dictionary tree based on the words in the preset dictionary, and each node in the dictionary tree is one character of each word in the preset dictionary; and the second establishing module is used for establishing an automaton in the dictionary tree to obtain the preset dictionary tree.
In some embodiments of the invention, the apparatus further comprises: the construction module is used for constructing the regular expression based on the characteristics of a preset entity vocabulary, wherein the preset entity vocabulary comprises any one or more of the following items: identity card number, telephone number, bank card number, passport number, social security card number, house number, mailbox account number, organization name.
In some embodiments of the invention, the apparatus comprises: the third processing module is used for processing each character in the text data to obtain a feature vector of each character; the fourth processing module is used for processing each vocabulary in the text data to obtain a feature vector of each vocabulary; and the generating module is used for generating a feature vector sequence of the text data based on the feature vector of each character and the feature vector of each vocabulary.
In some embodiments of the invention, the second processing module includes: a bidirectional long short-term memory (BiLSTM) network layer processing module, configured to process the feature vector sequence of the text data through the BiLSTM network layer to obtain the probability of the label corresponding to the character at each position in the text data; a conditional random field (CRF) layer processing module, configured to process the probability of the label corresponding to the character at each position through the CRF layer to obtain a score for each label probability; and a Viterbi algorithm processing module, configured to decode the label probability scores with the Viterbi algorithm to obtain the label of each character in the text data.
In some embodiments of the invention, the apparatus further comprises: the second acquisition module is used for acquiring training data, wherein the training data comprises text data of known sensitive entity words; a fifth processing module, configured to process the training data through a word2vec model or a glove model, and obtain a feature vector of each character in the training data; the sixth processing module is used for performing word segmentation processing on the training data to obtain a word segmentation characteristic sequence of the training data; the third determining module is used for determining the label of the character at each position in each vocabulary of the training data according to the word segmentation characteristic sequence of the training data; and the training module is used for training a model by using the feature vector of each character in the training data and the label of the character at each position in each vocabulary to obtain the preset model.
In some embodiments of the invention, the apparatus further comprises: the verification module is used for verifying the sensitive entity words of the text data through a preset entity word set; and the deleting module is used for deleting the sensitive entity words of the text data according to the verification result.
According to a third aspect of embodiments of the present invention, there is provided an electronic apparatus, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of text data desensitization as described in the first aspect of the embodiments above.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which program, when executed by a processor, implements a method of text data desensitization as described in the first aspect of the embodiments above.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
in the technical solution provided by some embodiments of the present invention, the text data is processed through a preset dictionary tree and/or a preset regular expression to obtain first sensitive entity words of the text data; the text data is processed through a preset model to obtain a label for each character in the text data; second sensitive entity words are determined according to those labels; and the sensitive entity words of the text data are then determined from the first and second sensitive entity words.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
fig. 1 is a schematic diagram illustrating an exemplary system architecture to which a text data desensitization method or a text data desensitization apparatus according to an embodiment of the present invention can be applied;
FIG. 2 schematically illustrates a flow diagram of a method of text data desensitization according to an embodiment of the invention;
FIG. 3A schematically illustrates a flow diagram for generating a preset trie, according to an embodiment of the present invention;
FIG. 3B schematically shows a diagram of a pre-set trie according to an embodiment of the invention;
FIG. 4 schematically illustrates a flow diagram of a method of text data desensitization according to another embodiment of the invention;
FIG. 5A schematically illustrates a flow diagram for obtaining a label for each character in text data, in accordance with an embodiment of the present invention;
FIG. 5B schematically shows a diagram of obtaining a label for each character in text data, in accordance with an embodiment of the present invention;
FIG. 6 schematically illustrates a flow diagram of a method of text data desensitization according to another embodiment of the invention;
FIG. 7 schematically illustrates a flow diagram of a method of text data desensitization according to another embodiment of the invention;
FIG. 8 schematically illustrates a block diagram of a text data desensitization apparatus, according to an embodiment of the present invention;
FIG. 9 schematically illustrates a block diagram of a textual data desensitization apparatus according to another embodiment of the present invention;
fig. 10 schematically shows a block diagram of a text data desensitizing apparatus according to another embodiment of the present invention;
fig. 11 schematically shows a block diagram of a text data desensitizing apparatus according to another embodiment of the present invention;
FIG. 12 schematically illustrates a block diagram of a text data desensitization apparatus according to another embodiment of the present invention;
fig. 13 schematically shows a block diagram of a text data desensitizing apparatus according to another embodiment of the present invention;
fig. 14 schematically shows a block diagram of a text data desensitizing apparatus according to another embodiment of the present invention;
FIG. 15 shows a schematic structural diagram of a computer system suitable for implementing the electronic device of an embodiment of the invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 is a schematic diagram illustrating an exemplary system architecture of a text data desensitization method or a text data desensitization apparatus to which an embodiment of the present invention may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and the like.
The server 105 may be a server that provides various services. For example, when a user uploads text data to the server 105 using the terminal device 103 (or terminal device 101 or 102), the server 105 may process the text data through a preset dictionary tree and/or a preset regular expression to obtain first sensitive entity words, process the text data through a preset model to obtain a label for each character, determine second sensitive entity words according to those labels, determine the sensitive entity words of the text data from the first and second sensitive entity words, and desensitize them. In this way the first and second sensitive entity words complement and cross-check each other, so the sensitive entity words are determined more accurately, the desensitization result is more accurate, and desensitization of non-sensitive entity words is avoided; the method is not limited to data known to be completely sensitive, and the heavy dependence of purely machine-learning-based methods on large amounts of labelled corpora is alleviated.
In some embodiments, the text data desensitization method provided in the embodiments of the present invention is generally executed by the server 105, and accordingly the text data desensitization apparatus is generally disposed in the server 105. In other embodiments, some terminals may have functionality similar to the server's and may therefore execute the method, so the text data desensitization method provided by the embodiments of the invention is not limited to execution on the server side.
Fig. 2 schematically shows a flow diagram of a text data desensitization method according to an embodiment of the invention.
As shown in fig. 2, the text data desensitization method may include steps S210 to S260.
In step S210, text data is acquired.
In step S220, the text data is processed through a preset dictionary tree and/or a preset regular expression, and a first sensitive entity word of the text data is obtained.
In step S230, the text data is processed through a preset model, and a label of each character in the text data is obtained.
In step S240, a second sensitive entity word of the text data is determined according to the label of each character in the text data.
In step S250, a sensitive entity word of the text data is determined according to a first sensitive entity word of the text data and a second sensitive entity word of the text data.
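One way to combine the two candidate sets, assuming the mutual proofreading of steps S220 and S240 takes the form of a union (the text does not fix the exact combination rule):

```python
def merge_entities(first, second):
    """Union of dictionary/regex matches (first) and model predictions
    (second), preserving first-seen order and dropping duplicates."""
    return list(dict.fromkeys(first + second))

merged = merge_entities(["张三", "13512345678"], ["张三", "太阳医院"])
```

A union maximizes recall: entities missed by the dictionary and regular expressions can still be caught by the model, and vice versa.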
In step S260, desensitization processing is performed on the sensitive entity words of the text data.
Through the method, the text data can be processed through a preset dictionary tree and/or a preset regular expression to obtain first sensitive entity words of the text data; the text data can be processed through a preset model to obtain a label for each character in the text data; second sensitive entity words can be determined according to those labels; and the sensitive entity words of the text data can then be determined from the first and second sensitive entity words.
In one embodiment of the present invention, the text data may be various long texts or short texts. Wherein, one or more entity words and/or non-entity words can be contained in the long text or the short text. The entity words in this embodiment may be sensitive entity words and/or non-sensitive entity words.
In one embodiment of the present invention, the text data may be text data of various fields. For example, text data in the medical field, text data in the e-commerce field, text data in the social field.
In an embodiment of the present invention, the preset dictionary tree may be obtained by establishing an automaton in a dictionary tree, where the dictionary tree is generated from a new dictionary obtained by combining characters of a related-art dictionary. Taking person-name entities as an example, the invention constructs a name dictionary as follows: (1) a dictionary of common names is collected and constructed in advance; (2) because a pre-acquired name dictionary usually contains only a limited set of common complete names, which cannot cover the names in real data, the invention combines the given names in the name dictionary with common surnames, solving the problems of a limited name dictionary and insufficient recall; (3) to efficiently search large amounts of text for possible keywords, the invention builds the dictionary keywords into a dictionary tree so that keywords (e.g., sensitive entity words) can be extracted efficiently; specifically, a node is created for each character of each word, traversal proceeds downward from the root node, an existing node is reused when the corresponding character already exists and a new character node is created otherwise, and all child nodes share a common prefix (see fig. 3B); (4) an automaton is established over the dictionary tree to improve search efficiency; specifically, fail pointers are constructed in the dictionary tree to identify the next position to jump to after a match fails.
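The fail-pointer automaton described in step (4) is essentially the Aho-Corasick construction. A compact sketch follows, with the trie stored as a list of node dictionaries; the node layout is an implementation choice of this sketch, not taken from the patent.

```python
from collections import deque

def build_automaton(words):
    """Dictionary tree plus BFS-built fail pointers (Aho-Corasick)."""
    trie = [{"next": {}, "fail": 0, "out": []}]  # node 0 is the root
    for w in words:
        cur = 0
        for ch in w:
            if ch not in trie[cur]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[cur]["next"][ch] = len(trie) - 1
            cur = trie[cur]["next"][ch]
        trie[cur]["out"].append(w)
    q = deque(trie[0]["next"].values())          # root children keep fail = 0
    while q:
        u = q.popleft()
        for ch, v in trie[u]["next"].items():
            f = trie[u]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]              # walk fail links upward
            trie[v]["fail"] = trie[f]["next"].get(ch, 0)
            trie[v]["out"] += trie[trie[v]["fail"]]["out"]
            q.append(v)
    return trie

def find_keywords(trie, text):
    """Report (start_index, word) for every dictionary word found in text."""
    hits, cur = [], 0
    for i, ch in enumerate(text):
        while cur and ch not in trie[cur]["next"]:
            cur = trie[cur]["fail"]              # jump after a mismatch
        cur = trie[cur]["next"].get(ch, 0)
        for w in trie[cur]["out"]:
            hits.append((i - len(w) + 1, w))
    return hits

hits = find_keywords(build_automaton(["张伟", "伟大"]), "张伟大")
```

Thanks to the fail pointers, overlapping matches such as "张伟" and "伟大" in "张伟大" are both found in a single left-to-right pass.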
In an embodiment of the present invention, the preset regular expression may be constructed from the characteristics of manually preset entity vocabularies. For example, for identification numbers, the expression matches an 18-digit number (or 17 digits followed by X). For mobile phone numbers, the expression matches 11 digits combined with the three-digit prefixes of domestic mobile operators' number segments, for example China Mobile number segments beginning 134, 135, ..., 150, ..., 187; China Unicom number segments 130, 131, ..., 156, 186; and China Telecom number segments 133, 149, ..., 189.
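The regular expressions described here can be sketched as follows. The patterns are simplified assumptions: real ID numbers also carry a checksum digit, the operator prefix list is illustrative rather than complete, and lookarounds are used instead of `\b` because Chinese characters count as word characters in Python's `re` module.

```python
import re

# 18-digit ID number, or 17 digits followed by X/x
ID_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")

# 11-digit mobile number starting with an (illustrative) operator prefix
PHONE_RE = re.compile(r"(?<!\d)1(?:3[0-9]|5[0-9]|8[0-9])\d{8}(?!\d)")

def find_pattern_entities(text):
    """Collect all ID-number and phone-number matches in the text."""
    return ID_RE.findall(text) + PHONE_RE.findall(text)

found = find_pattern_entities("电话13512345678，身份证11010519900101123X")
```

The `(?<!\d)`/`(?!\d)` guards also prevent the phone pattern from firing on a digit run inside the longer ID number.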
In an embodiment of the present invention, the first sensitive entity word of the text data may be, but is not limited to, a name of a person, a phone number, an identification card, a mailbox, a date (time), a name of a hospital, a name of shopping software, a name of social software, and the like.
In an embodiment of the present invention, the second sensitive entity word of the text data may be a name of a person, an address, a date (time), a name of a hospital, a name of shopping software, a name of social software, and the like, but is not limited thereto.
In an embodiment of the present invention, the sensitive entity words of the text data are determined according to the first sensitive entity words of the text data and the second sensitive entity words of the text data, and in this way, the sensitive entity words of the text data are determined more accurately by using mutual complementary proofreading of the first sensitive entity words and the second sensitive entity words, so that the result of desensitization processing on the text data is more accurate, and desensitization processing on non-sensitive entity words is avoided.
In one embodiment of the invention, desensitization is performed on the sensitive entity words of the text data. For example, after the final sensitive entity words are obtained, each sensitive entity word is generally replaced with specific characters, or an incomplete entity is retained so long as the sensitive information is removed; for instance, "Sun Hospital" becomes "**Hospital" after desensitization.
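A minimal sketch of this replacement step, assuming "*" as the masking character and "医院" (hospital) as an example of a generic suffix that may be kept; both choices are illustrative, not prescribed by the patent.

```python
def desensitize(text, entities, keep_suffixes=("医院",)):
    """Replace each sensitive entity word with '*', optionally keeping a
    generic suffix (e.g. '医院') so the sentence stays readable."""
    for ent in sorted(entities, key=len, reverse=True):  # longest first
        kept = next((s for s in keep_suffixes if ent.endswith(s) and ent != s), "")
        text = text.replace(ent, "*" * (len(ent) - len(kept)) + kept)
    return text

masked = desensitize("张三在太阳医院就诊", ["张三", "太阳医院"])
```

Replacing longer entities first avoids a shorter entity masking part of a longer one that contains it.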
FIG. 3A schematically illustrates a flow diagram for generating a preset trie, according to an embodiment of the present invention.
As shown in fig. 3A, the method further includes steps S310 to S330.
In step S310, the characters in the general dictionary are combined to obtain a preset dictionary.
In step S320, a dictionary tree is built based on the words in the preset dictionary, where each node in the dictionary tree is a character of each word in the preset dictionary.
In step S330, an automaton is established in the dictionary tree to obtain the preset dictionary tree.
The method can combine the characters in the general dictionary to obtain the preset dictionary, so that the problems of limited vocabulary amount in the dictionary and insufficient recall in the process of recognizing sensitive entity words can be solved. After the preset dictionary is obtained, a dictionary tree can be established based on the words in the preset dictionary, and each node in the dictionary tree is one character of each word in the preset dictionary, so that keywords (for example, sensitive entity words) can be extracted efficiently. And then, establishing an automaton in the dictionary tree to obtain the preset dictionary tree, so that the searching efficiency can be improved.
In an embodiment of the present invention, the preset dictionary tree may be obtained by establishing an automaton in a dictionary tree, where the dictionary tree is generated from a new dictionary obtained by combining characters of a dictionary in the related art. Taking person-name entities as an example, the invention constructs a dictionary for names as follows: (1) a dictionary of common person names is constructed in advance; (2) because a pre-acquired name dictionary usually contains only a limited set of common complete names, which is not enough to cover the names in actual data, the invention combines the given names in the person-name dictionary with common surnames to address the limited vocabulary of the dictionary and insufficient recall; (3) to efficiently search a large amount of text for the keywords it may contain, the invention builds the keywords in the dictionary into a dictionary tree, realizing efficient extraction of keywords (such as sensitive entity words): a node is created for each character of each word in the dictionary, each word is traversed downward from the root node, an existing node is reused when the corresponding character already exists and a new character node is created otherwise, so that all child nodes share a common prefix, as shown in fig. 3B; (4) an automaton is established in the dictionary tree to improve search efficiency: specifically, a fail pointer is constructed in the dictionary tree, indicating the next position to jump to after a match fails.
Referring to fig. 3B, a preset dictionary tree is built for person-name entities. After the text data is acquired, it may be processed using the preset dictionary tree shown in fig. 3B. For example, once the preset dictionary tree with an automaton for person-name search has been constructed, the text data is input into it, and the first sensitive entity words are determined from the preset dictionary tree based on each character in the text data. If the text data contains the name "Wang Lin Feng" and the match of "Wang Lin Feng" fails partway, the search jumps directly to "Lin Feng" according to the fail-state pointer, so that the next position after a failed match is found conveniently and search efficiency is improved.
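The dictionary tree with fail pointers described above is essentially an Aho-Corasick-style automaton. A compact sketch (class and method names are illustrative, not the patent's code):

```python
from collections import deque

class TrieAutomaton:
    """Dictionary tree with fail pointers for multi-keyword search."""

    def __init__(self, words):
        self.next = [{}]   # children of each node, keyed by character
        self.fail = [0]    # fail pointer of each node
        self.out = [[]]    # dictionary words ending at each node
        for w in words:
            self._insert(w)
        self._build_fail()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.next[node]:      # create a new character node
                self.next[node][ch] = len(self.next)
                self.next.append({})
                self.fail.append(0)
                self.out.append([])
            node = self.next[node][ch]         # shared prefixes reuse nodes
        self.out[node].append(word)

    def _build_fail(self):
        # BFS: a node's fail pointer targets the longest proper suffix
        # of its path that is also a prefix in the tree.
        q = deque(self.next[0].values())
        while q:
            node = q.popleft()
            for ch, child in self.next[node].items():
                f = self.fail[node]
                while f and ch not in self.next[f]:
                    f = self.fail[f]
                self.fail[child] = self.next[f].get(ch, 0)
                self.out[child] += self.out[self.fail[child]]
                q.append(child)

    def find(self, text):
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.next[node]:
                node = self.fail[node]         # jump after a failed match
            node = self.next[node].get(ch, 0)
            for w in self.out[node]:
                hits.append((i - len(w) + 1, w))
        return hits
```

With the names "Wang Lin" and "Lin Feng" in the dictionary, scanning "Wang Lin Feng" finds both keywords in a single left-to-right pass, the fail pointer carrying the search from the end of "Wang Lin" into "Lin Feng" without restarting.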
In an embodiment of the present invention, after a dictionary similar to the person-name dictionary is acquired, other words may likewise be extracted by building a dictionary tree with fail states, which is not described again here.
In one embodiment of the invention, the method further comprises: constructing the regular expression based on the characteristics of a preset entity vocabulary, wherein the preset entity vocabulary comprises any one or more of the following items: identity card number, telephone number, bank card number, passport number, social security card number, house number, mailbox account number, organization name. In this example, the institution name may be, but is not limited to, a hospital name, a name of a social platform, a name of a shopping platform, and so forth.
In an embodiment of the present invention, the preset regular expression may be constructed from the characteristics of a manually preset entity vocabulary. For example, an identification number can be matched as an 18-digit number (or 17 digits followed by X). A mobile phone number can be matched as 11 digits whose first three digits belong to the number segments of a domestic mobile operator, for example, China Mobile number segments beginning with 134, 135, 150, 187, etc.; China Unicom number segments beginning with 130, 131, ..., 156, 186; and China Telecom number segments beginning with 133, 149, ..., 189.
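The two patterns above can be sketched as follows; the operator prefixes are the examples given in the description (not an exhaustive list), and the pattern details and function name are illustrative assumptions:

```python
import re

# 18 digits, or 17 digits followed by X/x (identity card numbers)
ID_CARD = re.compile(r"\b\d{17}[\dXx]\b")

# 11 digits whose first three digits match a domestic operator number
# segment (only the prefixes mentioned in the description are listed)
MOBILE = re.compile(r"\b(?:134|135|150|187|130|131|156|186|133|149|189)\d{8}\b")

def find_first_sensitive(text):
    """Return (start, end, match) spans for regex-matched sensitive entities."""
    spans = []
    for pattern in (ID_CARD, MOBILE):
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), m.group()))
    return spans
```

Each span records where the candidate was found, so later desensitization can replace exactly that region of the text.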
Fig. 4 schematically shows a flow diagram of a method of text data desensitization according to another embodiment of the invention.
Before step S230, the method further includes steps S410 to S430, as shown in fig. 4.
In step S410, each character in the text data is processed to obtain a feature vector of each character.
In step S420, each vocabulary in the text data is processed to obtain a feature vector of each vocabulary.
In step S430, a feature vector sequence of the text data is generated based on the feature vector of each character and the feature vector of each vocabulary.
The method can generate the feature vector sequence of the text data based on the feature vector of each character and the feature vector of each vocabulary, so that the accuracy of processing the feature vector sequence of the text data through the preset model to obtain the second sensitive entity word is improved.
In an embodiment of the present invention, each character in the text data is processed to obtain a feature vector of each character. For example, each character in the text data may be processed by a word2vec model and/or a glove model to obtain a feature vector for each character. And combining the feature vectors of each character in the text data to form a character-level feature vector sequence of the text data.
In an embodiment of the present invention, each vocabulary in the text data is processed to obtain a feature vector of each vocabulary. For example, word segmentation is performed on the text data to obtain the words in the text data, where a word may consist of a single character or multiple characters. Each word in the text data is then represented using the contents of the feature table of the entity tagging sequence. For example, for the text data "Zhang Xiaoming is a physician", the word sequence after segmentation is "Zhang Xiaoming / is / a / physician", which converts to the word segmentation feature vector sequence [1,2,3,0,1,3,1,2,2,3], where [1,2,3] is the feature vector of the word "Zhang Xiaoming", [0] is the feature vector of the word "is", [1,3] is the feature vector of the word "a", and [1,2,2,3] is the feature vector of the word "physician". The feature vectors of all words in the text data are combined to form the word segmentation feature vector sequence of the text data.
In an embodiment of the present invention, the feature table of the entity tagging sequence is specifically shown in the following table 1:

Feature value Meaning
0 Single-character word
1 First character of a word
2 Middle character of a word
3 Last character of a word
in one embodiment of the present invention, generating the sequence of feature vectors for the text data based on the feature vector for each character and the feature vector for each vocabulary includes generating the sequence of feature vectors for the text data based on a sequence of feature vectors for a character level of the text data and a sequence of feature vectors for a word segmentation of the text data.
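The per-character segmentation features can be sketched as follows, using the 0/1/2/3 coding implied by the "Zhang Xiaoming is a physician" example (0 = single-character word, 1/2/3 = first/middle/last character of a multi-character word); the function name is illustrative:

```python
def segmentation_features(words):
    """Map each character of a segmented word sequence to a word-position
    feature value: 0 = single-character word, 1 = first character,
    2 = middle character, 3 = last character."""
    feats = []
    for w in words:
        n = len(w)
        if n == 1:
            feats.append(0)
        else:
            feats += [1] + [2] * (n - 2) + [3]
    return feats
```

Applied to words of lengths 3, 1, 2, and 4 (as in the example sentence), this reproduces the sequence [1,2,3,0,1,3,1,2,2,3] from the description.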
FIG. 5A schematically shows a flow diagram for obtaining a label for each character in text data, in accordance with an embodiment of the present invention.
As shown in fig. 5A, the step S230 may specifically include steps S510 to S530.
In step S510, the feature vector sequence of the text data is processed by the bidirectional long and short term memory network layer, and the probability of the label corresponding to the character at each position in the text data is obtained.
In step S520, the probabilities of the labels corresponding to the characters at each position are processed by the conditional random field layer to obtain scores of the label probabilities.
In step S530, the label of each character in the text data is obtained by processing the scores of the label probabilities through a Viterbi algorithm.
The method can process the feature vector sequence of the text data through a bidirectional long-short term memory network layer to obtain the probability of a label corresponding to a character at each position in the text data, processes the probability of the label corresponding to the character at each position through a conditional random field layer to obtain the score of the probability of the label, and processes the score of the probability of the label through a Viterbi algorithm to obtain the label of each character in the text data.
Referring to fig. 5B, the feature vector sequence of the text data (i.e., the sequence formed by splicing the character-level feature vector sequence and the word segmentation feature vector sequence of the text data) is input into the bidirectional long-short term memory network layer (i.e., the BiLSTM layer), and the feature vector sequence of the text data is processed by the algorithm in this layer to obtain the probability of the label corresponding to the character at each position in the text data. For example, for the text data "Wang Xiaoer … is older than … visit", the character-level feature vector sequence and the word segmentation feature vector sequence are spliced to obtain the feature vector sequence of the entire text data x = (x_1, x_2, …, x_n) ∈ R^(n×d), where the dimension d of the feature vector sequence of the text data equals the dimension m of the character-level feature vector sequence plus the dimension l of the word segmentation feature vector sequence, and n is the number of characters in the text data. The feature vector sequence x = (x_1, x_2, …, x_n) is input into the layer, and the hidden sequence output by the forward long-short term memory network and the hidden sequence output by the backward long-short term memory network are spliced to obtain the complete hidden sequence ht = (ht_1, ht_2, …, ht_n) ∈ R^(n×m). After dropout, the hidden sequence is mapped from m dimensions to k dimensions through a softmax function, where k is the number of entity-word label classes, yielding the matrix P = (p_1, p_2, …, p_n) ∈ R^(n×k). That is, after the bidirectional long-short term memory network layer, the probability P_ij of each of the k position category labels for the character at each position in the text data is obtained, where P_ij is the probability that the character at position i in the text data has the j-th position category label.
Based on the foregoing scheme, the tag corresponding to the character at each position in the text data may be a position category tag of the character at each position in the text data. Wherein the probability of the position category label of the character of each position may be one or more.
Referring to fig. 5B, the probabilities of the labels corresponding to the characters at each position are processed by the conditional random field layer (i.e., the CRF layer) to obtain scores of the label probabilities. Because the bidirectional long-short term memory network layer outputs the probabilities of individual position category labels, the relationships between entity position category labels are not considered. Therefore, a CRF layer is added to the preset model to impose constraints between position category labels and ensure that the final overall position category label sequence is optimal. Specifically, the parameter matrix of the CRF layer is of size (k+2)×(k+2): for convenience of calculation, two labels START and END are added to represent the start and end positions of the sentence, so the number of position category labels changes from k to k+2. The above P = (p_1, p_2, …, p_n) is input into the conditional random field layer, and the score of the probability of a label sequence is calculated by the following formula (1):

score(x, y) = Σ_{i=1..n} P_{i, y_i} + Σ_{i=1..n+1} A_{y_{i-1}, y_i}    (1)

where x is the text data, y is a position category label sequence with y_0 = START and y_{n+1} = END, y_i is the position category label corresponding to the i-th character x_i in x, n is the number of characters in x, P_{i, y_i} is the probability of the position category label corresponding to the i-th character x_i, and A_{y_{i-1}, y_i} is the transition score from position category label y_{i-1} to position category label y_i.
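Formula (1) can be sketched numerically as follows; the function name and the toy matrices in the test are illustrative, not from the patent:

```python
import numpy as np

def crf_score(P, A, y, start, end):
    """score(x, y) per formula (1): emission scores P[i, y_i] plus
    transition scores A[y_{i-1}, y_i], with START/END labels padded.

    P: (n, k+2) emission matrix, A: (k+2, k+2) transition matrix,
    y: label index sequence of length n, start/end: START/END label indices.
    """
    path = [start] + list(y) + [end]
    emit = sum(P[i, yi] for i, yi in enumerate(y))            # Σ P_{i, y_i}
    trans = sum(A[a, b] for a, b in zip(path[:-1], path[1:]))  # Σ A_{y_{i-1}, y_i}
    return emit + trans
```

The n+1 transition terms come from the padded path START → y_1 → … → y_n → END, matching the (k+2)×(k+2) shape of A described above.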
Based on the above scheme, the label of each character in the text data is obtained by processing the scores of the label probabilities through the Viterbi algorithm. For example, formula (2) of the Viterbi algorithm is:

y* = arg max_y score(x, y)    (2)

where y* is the label sequence that gives the label of each character in the text data.
The optimal position label classification of each character in the text is obtained through formula (2). With reference to fig. 5B, the output entity location category labels may be B_PER, I_PER, E_PER, O, B_AGE, I_AGE, E_AGE, O, O, O, O. The starting position and content of each sensitive entity word can then be constructed from this result.
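The arg-max search over label sequences is performed by dynamic programming. A sketch of Viterbi decoding over the emission matrix P and transition matrix A (START/END padding omitted for brevity; names are illustrative, not the patent's code):

```python
import numpy as np

def viterbi_decode(P, A):
    """Return the label index sequence maximizing
    sum(P[i, y_i]) + sum(A[y_{i-1}, y_i])."""
    n, k = P.shape
    dp = np.zeros((n, k))            # best score ending in label j at step i
    back = np.zeros((n, k), dtype=int)
    dp[0] = P[0]
    for i in range(1, n):
        for j in range(k):
            scores = dp[i - 1] + A[:, j]      # best predecessor for label j
            back[i, j] = int(np.argmax(scores))
            dp[i, j] = scores[back[i, j]] + P[i, j]
    best = [int(np.argmax(dp[-1]))]           # backtrack from the best end label
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]
```

The backtracking step recovers the single best path, which is then mapped to labels such as B_PER, I_PER, E_PER, O.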
Fig. 6 schematically illustrates a flow diagram of a text data desensitization method according to another embodiment of the invention.
Before step S210, the method further includes steps S610 to S650, as shown in fig. 6.
In step S610, training data is obtained, the training data including text data of known sensitive entity words.
In step S620, the training data is processed through a word2vec model or a glove model, and a feature vector of each character in the training data is obtained.
In step S630, the training data is subjected to word segmentation processing to obtain a word segmentation feature sequence of the training data.
In step S640, a label of each character in each position in each vocabulary of the training data is determined according to the word segmentation feature sequence of the training data.
In step S650, a model is trained by using the feature vector of each character in the training data and the label of the character at each position in each vocabulary, so as to obtain the preset model.
The method can train a model by utilizing the feature vector of each character in the training data and the label of the character at each position in each vocabulary to obtain the preset model, so that the trained model can be conveniently used for determining the accurate label probability.
In one embodiment of the present invention, the label of the character at each position in each vocabulary of the training data is determined according to the word segmentation feature sequence of the training data. For example, by processing the segmentation feature sequence of the training data through the entity position category label table 2, the label of the character at each position in each vocabulary of the training data can be obtained. Entity location category labels table 2 is specifically shown below:
Tag Meaning
O Non-entity position
S Single-character entity
B Entity start
I Entity middle
E Entity end
For example, the word segmentation feature sequence is [1,2,3,0,1,3,1,2,2,3], where [1,2,3] is the feature vector of the word "Zhang Xiaoming", [0] is the feature vector of the word "is", [1,3] is the feature vector of the word "a", and [1,2,2,3] is the feature vector of the word "physician". The feature vectors of each vocabulary in the text data are combined to form the word segmentation feature vector sequence of the text data. The labels of the characters at each position in each vocabulary determined from the word segmentation feature sequence may then be [B_PER, I_PER, E_PER, O, O, O, O, O, …].
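The label expansion above can be sketched as follows; the function name is illustrative, and the S_ prefix for single-character entities is an assumption based on the S tag in table 2:

```python
def position_labels(words, types):
    """Expand word-level entity types into per-character position labels:
    B_/I_/E_ prefixes for multi-character entities, S_ for single-character
    entities, and O for every character of a non-entity word."""
    labels = []
    for w, t in zip(words, types):
        n = len(w)
        if t == "O":
            labels += ["O"] * n
        elif n == 1:
            labels.append("S_" + t)
        else:
            labels += ["B_" + t] + ["I_" + t] * (n - 2) + ["E_" + t]
    return labels
```

For a 3-character person name followed by three non-entity words of lengths 1, 2, and 4 (as in the example sentence), this yields [B_PER, I_PER, E_PER, O, O, O, O, O, O, O].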
In one embodiment of the invention, person-name entities can be represented as B_PER, I_PER, E_PER, and location entities can be represented as, for example, O, O, B_LOC, I_LOC, E_LOC, O. In the training phase, the labels of the characters at each position in each vocabulary of the above training data may be denoted y = (y_1, y_2, …, y_n); the training data with feature vectors x and labels y are input into the model, the model is trained, and the parameters of the model are obtained during training. In this embodiment, the parameters may refer to the P and A used in the preset-model phase.
Fig. 7 schematically illustrates a flow diagram of a text data desensitization method according to another embodiment of the invention.
As shown in fig. 7, the method further includes step S710 and step S720.
In step S710, the sensitive entity words of the text data are verified through a preset entity word set.
In step S720, the sensitive entity word of the text data is deleted according to the verification result.
According to the method, the sensitive entity words of the text data can be verified through the preset entity word set, and the sensitive entity words of the text data are deleted according to the verification result, so that the accuracy in obtaining the sensitive entity words is further improved.
In an embodiment of the present invention, the preset entity word set may include entity words that appear in the contexts of sensitive entity words. Taking text data in the medical field as an example, the context entity words may be chief complaint, present medical history, past history, person name, and the like. For example, sensitive entity words such as "patient", "doctor", and "hospitalization" are extracted from medical records by the above method. For each extracted sensitive entity word, if the corresponding preset context entity word appears in the context (a certain range within the sentence), the word is taken as a final sensitive entity word; otherwise, the sensitive entity word is deleted.
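A sketch of this context check; the function name, window size, and window mechanics are illustrative assumptions about "a certain range within the sentence":

```python
def verify_by_context(text, entity_start, entity_end, context_words, window=10):
    """Keep a candidate sensitive entity only if a preset context entity
    word (e.g. "chief complaint", "past history") appears within `window`
    characters before or after the candidate's span."""
    lo = max(0, entity_start - window)
    hi = min(len(text), entity_end + window)
    ctx = text[lo:entity_start] + text[entity_end:hi]
    return any(w in ctx for w in context_words)
```

A candidate that passes is kept as a final sensitive entity word; one that fails is deleted, per the verification step above.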
In addition to sensitive entities related to personal privacy, electronic medical records in the medical field also contain entities of important significance, such as diagnoses, drug names, and operation names, collectively referred to as "medically important non-sensitive entities"; these contain key information about the patient's diagnosis and treatment and need to be retained. Therefore, the invention reduces mistaken recognition of sensitive entity words by checking against the medically important non-sensitive entities. For example, the entity words among the "medically important non-sensitive entities" are added to a preset entity word set, such as diagnosis (disease) names, drug names, surgery names, physical examinations, and laboratory examination project names. These important names collectively constitute the "medically important non-sensitive entities".
In one embodiment of the present invention, these "medically important non-sensitive entities" are added to the tokenizer as minimum-granularity segmentation words to ensure that, when these entity words appear in the text data, they are not split into sub-sequences. For example, for "patient has Hodgkin lymphoma", because "Hodgkin lymphoma" is a diagnostic entity word, the text should be segmented as "patient / has / Hodgkin lymphoma" rather than "patient / has / Hodgkin / lymphoma". The extracted sensitive entities are then compared with the term word boundaries of the above word segmentation sequence, and a sensitive entity is deleted in the following case: the position of the sensitive entity word text intersects the boundary of a term from the set of medically important non-sensitive entities in the word segmentation sequence, and a non-first position coincides with the term boundary.
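A simplified sketch of the boundary check; this hedged reading drops any candidate sensitive span that overlaps a protected term span at all, which is coarser than the exact boundary condition described above, and the names are illustrative:

```python
def filter_against_protected(candidates, protected_spans):
    """Drop candidate sensitive (start, end) spans that overlap a protected
    ("medically important non-sensitive") term span, keeping the rest."""
    kept = []
    for start, end in candidates:
        overlaps = any(
            start < p_end and p_start < end
            for p_start, p_end in protected_spans
        )
        if not overlaps:
            kept.append((start, end))
    return kept
```

For example, a candidate falling inside the span of "Hodgkin lymphoma" would be discarded, so the diagnostic entity is never partially desensitized.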
In one embodiment of the invention, for the verification of identity card numbers, terms such as "identity card number" and "identity card" can be used as context entity words in the preset entity word set. If a determined sensitive entity word contains 18 digits or 17 digits + X, and the context contains "identity card number" or "identity card", the sensitive entity word can be determined as a final sensitive entity word. It can of course be verified in other ways. For example, the identity card number has definite generation logic, and the digits have the following specific meanings: 18-bit code = 6 bits + 8 bits + 3 bits + 1 bit. The meanings represented in the 18-bit code are shown in table 3:
Code Meaning
6 bits Region code of the province, city, and district
8 bits Year, month, and day of birth
3 bits Sequence code
1 bit Check code
Therefore, the invention checks the value of the 6-bit region code, the value of the 8-bit year-month-day, and the check code in the identity card number of an extracted sensitive entity word, and deletes extracted values that do not conform. For example, for the sensitive entity word identity card number "42099619942324", because the region represented by the fifth and sixth digits "96" does not exist, the entity word is not a legal identity card number entity word.
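A sketch of such validation for the date and check-code parts; the check-digit computation follows the national standard for citizen identity numbers (GB 11643-1999), while the region-code lookup is omitted here (a full check would also verify the 6-digit area code against an official region table), and the function name is illustrative:

```python
from datetime import datetime

# GB 11643-1999 check-digit weights for the first 17 digits
WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
CHECK_MAP = "10X98765432"  # weighted sum mod 11 -> check character

def valid_id_number(idno):
    """Validate an 18-character identity card number: digit layout,
    a real 8-digit birth date, and the standard check digit."""
    if len(idno) != 18 or not idno[:17].isdigit():
        return False
    try:  # digits 7-14 must form a real year/month/day
        datetime.strptime(idno[6:14], "%Y%m%d")
    except ValueError:
        return False
    s = sum(int(d) * w for d, w in zip(idno[:17], WEIGHTS))
    return CHECK_MAP[s % 11] == idno[17].upper()
```

Candidates failing any of these checks, like the too-short "42099619942324" above, are deleted rather than desensitized.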
Fig. 8 schematically shows a block diagram of a text data desensitization apparatus according to an embodiment of the present invention.
As shown in fig. 8, the text data desensitization apparatus 800 includes a first acquisition module 801, a first processing module 802, a second processing module 803, a first determination module 804, a second determination module 805, and a desensitization processing module 806.
Specifically, the first obtaining module 801 is configured to obtain text data.
The first processing module 802 is configured to process the text data through a preset dictionary tree and/or a preset regular expression, and obtain a first sensitive entity word of the text data.
The second processing module 803 is configured to process the text data through a preset model, and obtain a label of each character in the text data.
A first determining module 804, configured to determine a second sensitive entity word of the text data according to the label of each character in the text data.
A second determining module 805, configured to determine a sensitive entity word of the text data according to the first sensitive entity word of the text data and the second sensitive entity word of the text data.
A desensitization processing module 806, configured to perform desensitization processing on the sensitive entity words of the text data.
The text data desensitization device 800 can process text data through a preset dictionary tree and/or a preset regular expression, obtain a first sensitive entity word of the text data, process the text data through a preset model, obtain a label of each character in the text data, determine a second sensitive entity word of the text data according to the label of each character in the text data, and then determine the sensitive entity word of the text data according to the first sensitive entity word of the text data and the second sensitive entity word of the text data.
According to an embodiment of the present invention, the text data desensitization apparatus 800 may be used to implement the text data desensitization method described in the embodiment of fig. 2.
Fig. 9 schematically shows a block diagram of a text data desensitizing apparatus according to another embodiment of the present invention.
As shown in fig. 9, the text data desensitization apparatus 800 further includes a combination module 807, a first establishing module 808, and a second establishing module 809.
Specifically, the combining module 807 is configured to combine characters in the general dictionary to obtain a preset dictionary.
A first establishing module 808, configured to establish a dictionary tree based on the words in the preset dictionary, where each node in the dictionary tree is a character of each word in the preset dictionary.
The second establishing module 809 establishes an automaton in the dictionary tree to obtain the preset dictionary tree.
The text data desensitization apparatus 800 can combine characters in a general dictionary to obtain a preset dictionary, so that the problems of limited vocabulary in the dictionary and insufficient recall when recognizing sensitive entity words can be solved. After the preset dictionary is obtained, a dictionary tree can be established based on the words in the preset dictionary, where each node in the dictionary tree is one character of a word in the preset dictionary, so that keywords (such as sensitive entity words) can be extracted efficiently. An automaton is then established in the dictionary tree to obtain the preset dictionary tree, which can improve search efficiency.
According to an embodiment of the present invention, the text data desensitization apparatus 800 may be used to implement the text data desensitization method described in the embodiment of fig. 3A.
Fig. 10 schematically shows a block diagram of a text data desensitizing apparatus according to another embodiment of the present invention.
As shown in fig. 10, the text data desensitization apparatus 800 further includes a construction module 810.
Specifically, the building module 810 builds the regular expression based on characteristics of a preset entity vocabulary, where the preset entity vocabulary includes any one or more of the following items: identity card number, telephone number, bank card number, passport number, social security card number, house number, mailbox account number, organization name.
Fig. 11 schematically shows a block diagram of a text data desensitizing apparatus according to another embodiment of the present invention.
As shown in fig. 11, the text data desensitization apparatus 800 further includes a third processing module 811, a fourth processing module 812, and a generating module 813.
Specifically, the third processing module 811 is configured to process each character in the text data to obtain a feature vector of each character.
And a fourth processing module 812, configured to process each vocabulary in the text data to obtain a feature vector of each vocabulary.
The generating module 813 generates a feature vector sequence of the text data based on the feature vector of each character and the feature vector of each vocabulary.
The text data desensitization device 800 may generate the feature vector sequence of the text data based on the feature vector of each character and the feature vector of each vocabulary, so as to improve the accuracy of subsequently processing the feature vector sequence of the text data through the preset model to obtain the second sensitive entity word.
According to an embodiment of the present invention, the text data desensitization apparatus 800 may be used to implement the text data desensitization method described in the embodiment of fig. 4.
Fig. 12 schematically shows a block diagram of a text data desensitization apparatus according to another embodiment of the present invention.
As shown in fig. 12, the second processing module 803 includes a bidirectional long/short term memory network layer processing module 803-1, a conditional random field layer processing module 803-2 and a viterbi algorithm processing module 803-3.
Specifically, the bidirectional long-short term memory network layer processing module 803-1 is configured to process the feature vector sequence of the text data through the bidirectional long-short term memory network layer, and obtain the probability of the label corresponding to the character at each position in the text data.
And the conditional random field layer processing module 803-2 is configured to obtain a score of a probability of the label by performing probability processing on the label corresponding to the character at each position by the conditional random field layer.
And the viterbi algorithm processing module 803-3 is configured to obtain a label of each character in the text data by performing a scoring process on the probability of the label through a viterbi algorithm.
The second processing module 803 may process the feature vector sequence of the text data through the bidirectional long and short term memory network layer, obtain the probability of the label corresponding to the character at each position in the text data, obtain the score of the probability of the label through the probability processing of the label corresponding to the character at each position by the conditional random field layer, obtain the label of each character in the text data through the score processing of the probability of the label by the viterbi algorithm, and thus may match the optimal label to each character in the text data.
According to an embodiment of the present invention, the second processing module 803 may be used to implement the text data desensitization method described in the embodiment of fig. 5A.
Fig. 13 schematically shows a block diagram of a text data desensitizing apparatus according to another embodiment of the present invention.
As shown in fig. 13, the text data desensitization apparatus 800 further includes a second obtaining module 814, a fifth processing module 815, a sixth processing module 816, a third determining module 817, and a training module 818.
Specifically, the second obtaining module 814 is configured to obtain training data, where the training data includes text data of known sensitive entity words.
A fifth processing module 815, configured to process the training data through a word2vec model or a glove model, to obtain a feature vector of each character in the training data.
A sixth processing module 816 is configured to perform word segmentation processing on the training data to obtain a word segmentation feature sequence of the training data.
A third determining module 817, configured to determine, according to the word segmentation feature sequence of the training data, a label of a character at each position in each vocabulary of the training data.
The training module 818 obtains the preset model by training a model using the feature vector of each character in the training data and the label of the character at each position in each vocabulary.
The text data desensitization apparatus 800 may train a model using the feature vector of each character in the training data and the label of the character at each position in each vocabulary to obtain the preset model, so as to determine an accurate label probability using the trained model.
According to an embodiment of the present invention, the text data desensitization apparatus 800 may be used to implement the text data desensitization method described in the embodiment of fig. 6.
Fig. 14 schematically shows a block diagram of a text data desensitizing apparatus according to another embodiment of the present invention.
As shown in fig. 14, the text data desensitization apparatus 800 further includes a verification module 819 and a deletion module 820.
Specifically, the verification module 819 is configured to verify the sensitive entity words of the text data by using a preset entity word set.
The deleting module 820 is configured to delete the sensitive entity words of the text data according to the verification result.
The text data desensitization apparatus 800 verifies the sensitive entity words of the text data against the preset entity word set and deletes them according to the verification result, which further improves the accuracy of the obtained sensitive entity words.
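A minimal sketch of this verify-then-delete step, assuming the preset entity word set acts as a confirmation list (the patent does not specify whether the set confirms or excludes candidates, so the whitelist semantics here are an assumption):

```python
def verify_and_filter(candidates, preset_entity_words):
    """Keep candidate sensitive entity words confirmed by the preset set;
    the rest are deleted as likely false positives."""
    kept, dropped = [], []
    for word in candidates:
        (kept if word in preset_entity_words else dropped).append(word)
    return kept, dropped

kept, dropped = verify_and_filter(["张三", "感冒"], {"张三", "李四"})
print(kept, dropped)
# ['张三'] ['感冒']
```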
According to an embodiment of the present invention, the text data desensitization apparatus 800 may be used to implement the text data desensitization method described in the embodiment of fig. 7.
Since each module of the text data desensitization apparatus of the example embodiments of the present invention can be used to implement the steps of the example embodiments of the text data desensitization method described above with reference to fig. 2 to 7, for details not disclosed in the apparatus embodiments of the present invention, please refer to the embodiments of the text data desensitization method of the present invention described above.
It is understood that the first obtaining module 801, the first processing module 802, the second processing module 803, the two-way long-short term memory network layer processing module 803-1, the conditional random field layer processing module 803-2, the viterbi algorithm processing module 803-3, the first determining module 804, the second determining module 805, the desensitization processing module 806, the combining module 807, the first establishing module 808, the second establishing module 809, the constructing module 810, the third processing module 811, the fourth processing module 812, the generating module 813, the second obtaining module 814, the fifth processing module 815, the sixth processing module 816, the third determining module 817, the training module 818, the checking module 819, and the deleting module 820 may be combined to be implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. 
According to an embodiment of the present invention, at least one of the first obtaining module 801, the first processing module 802, the second processing module 803, the two-way long-short term memory network layer processing module 803-1, the conditional random field layer processing module 803-2, the viterbi algorithm processing module 803-3, the first determining module 804, the second determining module 805, the desensitization processing module 806, the combining module 807, the first establishing module 808, the second establishing module 809, the constructing module 810, the third processing module 811, the fourth processing module 812, the generating module 813, the second obtaining module 814, the fifth processing module 815, the sixth processing module 816, the third determining module 817, the training module 818, the checking module 819, and the deleting module 820 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or any other reasonable way of integrating or encapsulating a circuit, or any other reasonable combination of hardware and firmware. 
Alternatively, at least one of the first acquisition module 801, the first processing module 802, the second processing module 803, the two-way long-short term memory network layer processing module 803-1, the conditional random field layer processing module 803-2, the viterbi algorithm processing module 803-3, the first determination module 804, the second determination module 805, the desensitization processing module 806, the combining module 807, the first establishing module 808, the second establishing module 809, the constructing module 810, the third processing module 811, the fourth processing module 812, the generating module 813, the second acquisition module 814, the fifth processing module 815, the sixth processing module 816, the third determination module 817, the training module 818, the checking module 819, and the deleting module 820 may be at least partially implemented as a computer program module that, when executed by a computer, may perform the functions of the respective modules.
Referring now to FIG. 15, shown is a block diagram of a computer system 900 suitable for implementing an electronic device of an embodiment of the present invention. The computer system 900 of the electronic device shown in fig. 15 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 15, the computer system 900 includes a Central Processing Unit (CPU) 901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for system operation are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the present application are executed when the computer program is executed by a Central Processing Unit (CPU) 901.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may be separate and not incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method of desensitizing textual data as described in the embodiments above.
For example, the electronic device may implement the following as shown in fig. 2: in step S210, text data is acquired. In step S220, the text data is processed through a preset dictionary tree and/or a preset regular expression, and a first sensitive entity word of the text data is obtained. In step S230, the text data is processed through a preset model, and a label of each character in the text data is obtained. In step S240, a second sensitive entity word of the text data is determined according to the label of each character in the text data. In step S250, a sensitive entity word of the text data is determined according to the first sensitive entity word of the text data and the second sensitive entity word of the text data. In step S260, desensitization processing is performed on the sensitive entity words of the text data.
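The steps above can be sketched end to end. The union merge in step S250 and the asterisk masking in step S260 are assumptions (the patent leaves the concrete desensitization transform open), and the simple substring scan stands in for the dictionary-tree/automaton matching of step S220.

```python
import re

def rule_matches(text, dictionary, patterns):
    """S220: first sensitive entity words via dictionary and regular expressions.
    The substring scan stands in for the preset dictionary tree with automaton."""
    found = {w for w in dictionary if w in text}
    for p in patterns:
        found.update(re.findall(p, text))     # preset regular expressions
    return found

def desensitize(text, rule_words, model_words):
    """S250: merge rule-based and model-based words; S260: mask in place."""
    for w in sorted(rule_words | model_words, key=len, reverse=True):
        text = text.replace(w, "*" * len(w))  # longest words first to avoid partial overlap
    return text

text = "患者张三，电话13800138000"
first = rule_matches(text, {"张三"}, [r"1\d{10}"])   # hypothetical phone-number pattern
second = {"张三"}                                    # words the preset model would return
print(desensitize(text, first, second))
# 患者**，电话***********
```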
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the invention. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present invention.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (8)

1. A method of desensitizing textual data, comprising:
acquiring text data;
processing the text data through a preset dictionary tree and/or a preset regular expression to obtain a first sensitive entity word of the text data;
processing the text data through a preset model, and acquiring a label of each character in the text data, wherein the label of each character in the text data comprises a position category label of the character at each position in the text data;
determining a second sensitive entity word of the text data according to the label of each character in the text data;
determining a sensitive entity word of the text data according to a first sensitive entity word of the text data and a second sensitive entity word of the text data;
desensitizing sensitive entity words of the text data;
processing the text data through the preset model, and acquiring a label of each character in the text data comprises:
processing a feature vector sequence of text data through a bidirectional long-short term memory network layer to obtain the probability of a position category label corresponding to a character at each position in the text data;
processing the probabilities of the labels corresponding to the characters at each position through a conditional random field layer to obtain scores of the label probabilities;
processing the scores of the label probabilities through a Viterbi algorithm to obtain the label of each character in the text data;
wherein the preset dictionary tree is determined by:
combining characters in a general dictionary to obtain a preset dictionary;
establishing a dictionary tree based on the words in the preset dictionary, wherein each node in the dictionary tree is a character of each word in the preset dictionary;
and establishing an automaton in the dictionary tree to obtain the preset dictionary tree.
2. The method of claim 1, further comprising:
constructing the regular expression based on the characteristics of a preset entity vocabulary, wherein the preset entity vocabulary comprises any one or more of the following items: identity card number, telephone number, bank card number, passport number, social security card number, house number, mailbox account number, organization name.
3. The method according to claim 1, wherein before processing the text data through the preset model to obtain the labels of the words in the text data, the method comprises:
processing each character in the text data to obtain a feature vector of each character;
processing each vocabulary in the text data to obtain a feature vector of each vocabulary;
generating a sequence of feature vectors for the text data based on the feature vector for each character and the feature vector for each vocabulary.
4. The method of claim 1, wherein prior to obtaining the text data, the method further comprises:
acquiring training data, wherein the training data comprises text data of known sensitive entity words;
processing the training data through a word2vec model or a glove model to obtain a feature vector of each character in the training data;
performing word segmentation processing on the training data to obtain a word segmentation characteristic sequence of the training data;
determining a label of a character at each position in each vocabulary of the training data according to the word segmentation feature sequence of the training data;
and training a model by using the feature vector of each character in the training data and the label of the character at each position in each vocabulary to obtain the preset model.
5. The method of claim 1, further comprising:
verifying the sensitive entity words of the text data through a preset entity word set;
and deleting the sensitive entity words of the text data according to the verification result.
6. A text data desensitizing apparatus, comprising:
the first acquisition module is used for acquiring text data;
the first processing module is used for processing the text data through a preset dictionary tree and/or a preset regular expression to obtain a first sensitive entity word of the text data; wherein the preset dictionary tree is determined by: combining characters in the general dictionary to obtain a preset dictionary; establishing a dictionary tree based on the words in the preset dictionary, wherein each node in the dictionary tree is a character of each word in the preset dictionary; establishing an automaton in the dictionary tree to obtain the preset dictionary tree;
the second processing module is used for processing the text data through a preset model and acquiring a label of each character in the text data, wherein the label of each character in the text data comprises a position category label of the character at each position in the text data;
the second processing module is further configured to process the feature vector sequence of the text data through a bidirectional long-short term memory network layer to obtain the probability of the position category label corresponding to the character at each position in the text data; process the probabilities of the labels corresponding to the characters at each position through a conditional random field layer to obtain scores of the label probabilities; and process the scores of the label probabilities through a Viterbi algorithm to obtain the label of each character in the text data;
the first determining module is used for determining a second sensitive entity word of the text data according to the label of each character in the text data;
the second determining module is used for determining the sensitive entity words of the text data according to the first sensitive entity words of the text data and the second sensitive entity words of the text data;
and the desensitization processing module is used for desensitizing the sensitive entity words of the text data.
7. An electronic device, comprising:
one or more processors; and
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method according to any one of claims 1 to 5.
8. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, carries out the method according to any one of claims 1 to 5.
CN201911421350.4A 2019-12-31 2019-12-31 Text data desensitization method, device, medium and electronic equipment Active CN111159770B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911421350.4A CN111159770B (en) 2019-12-31 2019-12-31 Text data desensitization method, device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911421350.4A CN111159770B (en) 2019-12-31 2019-12-31 Text data desensitization method, device, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111159770A CN111159770A (en) 2020-05-15
CN111159770B true CN111159770B (en) 2022-12-13

Family

ID=70560566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911421350.4A Active CN111159770B (en) 2019-12-31 2019-12-31 Text data desensitization method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111159770B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709052B (en) * 2020-06-01 2021-05-25 支付宝(杭州)信息技术有限公司 Private data identification and processing method, device, equipment and readable medium
CN112434331B (en) * 2020-11-20 2023-08-18 百度在线网络技术(北京)有限公司 Data desensitization method, device, equipment and storage medium
CN114548103B (en) * 2020-11-25 2024-03-29 马上消费金融股份有限公司 Named entity recognition model training method and named entity recognition method
CN113420322B (en) * 2021-05-24 2023-09-01 阿里巴巴新加坡控股有限公司 Model training and desensitizing method and device, electronic equipment and storage medium
CN113360946B (en) * 2021-06-29 2024-01-30 招商局金融科技有限公司 News desensitization processing method, device, electronic equipment and readable storage medium
CN113591150B (en) * 2021-08-03 2024-04-26 浙江图盛输变电工程有限公司温州科技分公司 Desensitization processing method for sensitive data
CN116956356B (en) * 2023-09-21 2023-11-28 深圳北控信息发展有限公司 Information transmission method and equipment based on data desensitization processing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002254818A (en) * 2001-03-05 2002-09-11 Toshiba Tec Corp Multicolor heat sensitive recording medium and printing method
CN109739986A (en) * 2018-12-28 2019-05-10 合肥工业大学 A kind of complaint short text classification method based on Deep integrating study
CN110134966A (en) * 2019-05-21 2019-08-16 中电健康云科技有限公司 A kind of sensitive information determines method and device
CN110289059A (en) * 2019-06-13 2019-09-27 北京百度网讯科技有限公司 Medical data processing method, device, storage medium and electronic equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9790253B2 (en) * 2011-05-19 2017-10-17 President And Fellows Of Harvard College OSW-1 analogs and conjugates, and uses thereof
CN106354715B (en) * 2016-09-28 2019-04-16 医渡云(北京)技术有限公司 Medical vocabulary processing method and processing device
CN109582949B (en) * 2018-09-14 2022-11-22 创新先进技术有限公司 Event element extraction method and device, computing equipment and storage medium
CN109388803B (en) * 2018-10-12 2023-09-15 北京搜狐新动力信息技术有限公司 Chinese word segmentation method and system
CN109858280A (en) * 2019-01-21 2019-06-07 深圳昂楷科技有限公司 A kind of desensitization method based on machine learning, device and desensitization equipment
CN110175608A (en) * 2019-04-16 2019-08-27 中国平安财产保险股份有限公司 A kind of settlement of insurance claim attachment processing method and processing device
CN110444259B (en) * 2019-06-06 2022-09-23 昆明理工大学 Entity relation extracting method of traditional Chinese medicine electronic medical record based on entity relation labeling strategy

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002254818A (en) * 2001-03-05 2002-09-11 Toshiba Tec Corp Multicolor heat sensitive recording medium and printing method
CN109739986A (en) * 2018-12-28 2019-05-10 合肥工业大学 A kind of complaint short text classification method based on Deep integrating study
CN110134966A (en) * 2019-05-21 2019-08-16 中电健康云科技有限公司 A kind of sensitive information determines method and device
CN110289059A (en) * 2019-06-13 2019-09-27 北京百度网讯科技有限公司 Medical data processing method, device, storage medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Role-based Static Desensitization Protection Method; Tan Hu et al.; published online: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8823514; 2019-09-05; p. 1 *
Right Protection and Legal Regulation of Targeted Information Push under the User Profiling Mechanism; Hu Junfeng; Journal of Xidian University (Social Science Edition); 2019-04-02; Vol. 28, No. 4; pp. 36-43 *
An Efficient Pattern Matching Algorithm for PDF Text Content; Zhu Lingyu et al.; Communications Technology; 2018-04-12; Vol. 51, No. 3; pp. 641-646 *

Also Published As

Publication number Publication date
CN111159770A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
CN111159770B (en) Text data desensitization method, device, medium and electronic equipment
US10755048B2 (en) Artificial intelligence based method and apparatus for segmenting sentence
JP2021089705A (en) Method and device for evaluating translation quality
WO2020224219A1 (en) Chinese word segmentation method and apparatus, electronic device and readable storage medium
CN109522552B (en) Normalization method and device of medical information, medium and electronic equipment
CN112215008B (en) Entity identification method, device, computer equipment and medium based on semantic understanding
CN111309915A (en) Method, system, device and storage medium for training natural language of joint learning
JP2020520492A (en) Document abstract automatic extraction method, device, computer device and storage medium
CN110705301A (en) Entity relationship extraction method and device, storage medium and electronic equipment
CN107861954B (en) Information output method and device based on artificial intelligence
CN109522338A (en) Clinical term method for digging, device, electronic equipment and computer-readable medium
WO2022088671A1 (en) Automated question answering method and apparatus, device, and storage medium
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
WO2021174864A1 (en) Information extraction method and apparatus based on small number of training samples
CN113593709B (en) Disease coding method, system, readable storage medium and device
CN113707299A (en) Auxiliary diagnosis method and device based on inquiry session and computer equipment
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
US11669679B2 (en) Text sequence generating method and apparatus, device and medium
CN111353311A (en) Named entity identification method and device, computer equipment and storage medium
CN113657105A (en) Medical entity extraction method, device, equipment and medium based on vocabulary enhancement
CN112906361A (en) Text data labeling method and device, electronic equipment and storage medium
CN111415747A (en) Electronic medical record construction method and device
WO2022073341A1 (en) Disease entity matching method and apparatus based on voice semantics, and computer device
WO2021098491A1 (en) Knowledge graph generating method, apparatus, and terminal, and storage medium
CN114036921A (en) Policy information matching method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant