CN111191275A - Sensitive data identification method, system and device - Google Patents

Publication number
CN111191275A
Authority
CN
China
Prior art keywords
data
sensitive
word
text data
sensitive data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911194236.2A
Other languages
Chinese (zh)
Inventor
刘川意
方滨兴
韩培义
段少明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yun An Bao Technology Co ltd
Original Assignee
Shenzhen Yun An Bao Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yun An Bao Technology Co ltd filed Critical Shenzhen Yun An Bao Technology Co ltd
Priority to CN201911194236.2A priority Critical patent/CN111191275A/en
Publication of CN111191275A publication Critical patent/CN111191275A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Character Discrimination (AREA)

Abstract

An embodiment of the invention provides a method, a system and a device for identifying sensitive data. The method comprises: parsing unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words; inputting the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer and a conditional random field (CRF); and determining the position of the sensitive data in the text data according to the first labeling sequence. In the embodiment of the invention, the deep-learning-based language model learns a better representation of each word in the text data, and the CRF yields the labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, so that the position of the sensitive data in the unstructured data is determined and the identification accuracy is improved.

Description

Sensitive data identification method, system and device
Technical Field
The invention relates to the technical field of information security, in particular to a sensitive data identification method, a sensitive data identification system and a sensitive data identification device.
Background
As data security grows in importance, protecting enterprise data from leakage has drawn attention across society, and many companies place higher demands on the security of their internal sensitive data.
Unstructured data (including text, pictures, etc.) accounts for over 80% of enterprise data and grows at a rate of 55% to 65% per year. However, the prior art mostly addresses the identification and desensitization of structured data. How to identify and desensitize sensitive data in large-scale, diverse unstructured data is an urgent problem to be solved.
Disclosure of Invention
In view of this, an embodiment of the present invention provides a sensitive data identification method to address the prior-art difficulty of identifying sensitive data in unstructured data.
The technical solution adopted by the invention to solve this technical problem is as follows.
In a first aspect, a sensitive data identification method is provided, including:
parsing unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words;
inputting the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer and a conditional random field (CRF);
and determining the position of the sensitive data in the text data according to the first labeling sequence.
In a second aspect, a sensitive data identification system is provided, comprising a memory and a processor, wherein the memory is configured to store executable program code; the processor is connected with the memory and, by reading the executable program code stored in the memory, runs the corresponding program so as to execute the sensitive data identification method.
In a third aspect, a sensitive data identification device is provided, including:
a parsing unit, configured to parse unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words;
a recognition unit, configured to input the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer and a conditional random field (CRF);
and a determining unit, configured to determine the position of the sensitive data in the text data according to the first labeling sequence.
In the embodiment of the invention, the deep-learning-based language model learns a better representation of each word in the text data, and the conditional random field (CRF) yields the labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, so that the position of the sensitive data in the unstructured data is determined and the identification accuracy is improved.
Drawings
FIG. 1 is a flow chart of a sensitive data identification method according to an embodiment of the present invention;
FIG. 2 is a diagram of an SDK embedded application provided by an embodiment of the present invention;
FIG. 3 is a flowchart of sensitive data recognition model training according to a second embodiment of the present invention;
FIG. 4 is a schematic diagram of a sensitive data recognition model provided by an embodiment of the invention;
FIG. 5 is a flow chart of a process of passing text data through a sensitive data recognition model during a recognition phase according to an embodiment of the present invention;
FIG. 6 is a flowchart of parsing unstructured data according to a third embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a sensitive data identification device according to a fourth embodiment of the present invention;
FIG. 8 is a block diagram of a sensitive data identification system according to a fifth embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions and the advantageous effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
In the embodiment of the invention, the deep-learning-based language model provides a better representation of each word in the text data, and the CRF yields the labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, so that the position of the sensitive data in the unstructured data is determined and the identification accuracy is improved.
Embodiment One
Fig. 1 is a flowchart of a sensitive data identification method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
Step S101: parsing the unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words.
In the embodiment of the invention, the unstructured data is parsed and its content extracted to obtain the corresponding text data. The unstructured data includes but is not limited to Word, Excel, PPT, TXT, PDF and XML files, database text fields, pictures, and the like. The text data comprises a plurality of words and can be segmented at word (token) granularity.
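The parsing step can be pictured as a dispatch on the input format. The following is only a minimal sketch under stated assumptions: the function names `parse_unstructured` and `split_words` are hypothetical, only TXT and XML are handled (using the Python standard library), and whitespace splitting stands in for a real word segmenter, which Chinese text in particular would require.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePath
from typing import List


def parse_unstructured(filename: str, data: bytes) -> str:
    """Extract plain text from raw file content (hypothetical helper).

    Only two of the formats listed in the description are covered here;
    Word, Excel, PPT, PDF and pictures would need dedicated parsers.
    """
    suffix = PurePath(filename).suffix.lower()
    if suffix == ".txt":
        return data.decode("utf-8")
    if suffix == ".xml":
        root = ET.fromstring(data)
        # Collect the text of every element, dropping empty fragments.
        pieces = (text.strip() for text in root.itertext())
        return " ".join(piece for piece in pieces if piece)
    raise NotImplementedError(f"no parser registered for {suffix!r}")


def split_words(text: str) -> List[str]:
    """Whitespace tokenization, an assumed stand-in for word segmentation."""
    return text.split()
```

For example, `split_words(parse_unstructured("a.xml", data))` would turn a small XML record into a list of tokens ready for the recognition model.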
Step S102: inputting the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer and a CRF.
In the embodiment of the invention, the sensitive data recognition model comprises an unsupervised, pre-trained, deep-learning-based bidirectional language model, such as BERT, ELMo or GPT, through which the text data is converted into word vectors that carry context information. The word vectors pass through the fully connected layer and the CRF in sequence, yielding the probability that each word belongs to each sensitive entity attribute and the first labeling sequence with the maximum joint probability of the sensitive entity attributes of the words; the first labeling sequence is a sentence-level labeling sequence. Further processing such as Viterbi decoding and softmax normalization may also be applied to the output of the CRF.
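The Viterbi decoding mentioned above can be sketched as follows. This is an illustrative pure-Python version of the standard algorithm, not code from the patent; the function name `viterbi_decode` and the score layout (one row of per-tag emission scores per word, plus a tag-to-tag transition matrix) are assumptions.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the tag index sequence with the highest total score.

    emissions[t][j]: score of tag j for the word at position t
                     (for example, the fully connected layer output).
    transitions[i][j]: score of moving from tag i to tag j (CRF parameters).
    """
    if not emissions:
        return []
    n_tags = len(transitions)
    # best[j] holds the best score of any path ending in tag j so far.
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best = []
        pointers = []
        for j in range(n_tags):
            # Pick the previous tag that maximizes the score of reaching tag j.
            prev = max(range(n_tags), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full path.
    last = max(range(n_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path
```

With all transition scores set to zero the result is simply the per-word best tag; strongly negative transition scores make the decoder prefer consistent tag sequences even where a single word's emission score points elsewhere.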
Step S103: determining the position of the sensitive data in the text data according to the first labeling sequence.
The position of the sensitive data is identified from the sensitive entity attributes in the first labeling sequence.
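One way to read positions off a labeling sequence is to scan for begin/end or single labels. The sketch below assumes BIOES labels that optionally carry an attribute suffix such as `B-NAME`; the suffix format, the attribute names and the function name `bioes_to_spans` are illustrative assumptions, not details given in the description.

```python
from typing import List, Tuple


def bioes_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIOES labels into (start, end_exclusive, attribute) word spans."""
    spans = []
    start = None  # index where the currently open entity began, if any
    for i, tag in enumerate(tags):
        prefix, _, attr = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, attr))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((start, i + 1, attr))
            start = None
        elif prefix != "I":
            # "O" and special markers such as [CLS]/[SEP] close any open span.
            start = None
    return spans
```

Each returned span gives the word indices covered by one sensitive entity, which is the information a later desensitization step needs.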
Step S104: desensitizing the sensitive data according to the position of the sensitive data in the text data.
In the embodiment of the invention, after the position of the sensitive data is identified, the identified sensitive data is desensitized, for example by masking, replacement, erasure, format-preserving encryption, symmetric encryption, date generalization, numerical generalization or phrase generalization.
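Of the desensitization options listed, masking is the simplest to illustrate. The sketch below assumes the sensitive positions are available as (start, end_exclusive, attribute) word spans; the function name `mask_tokens` and the span format are hypothetical.

```python
from typing import List, Sequence, Tuple


def mask_tokens(tokens: Sequence[str],
                spans: Sequence[Tuple[int, int, str]],
                mask_char: str = "*") -> List[str]:
    """Replace every word inside a sensitive span with mask characters.

    The length of each word is preserved, so the layout of the text is
    unchanged; only the sensitive content is hidden.
    """
    masked = list(tokens)
    for start, end, _attr in spans:
        for i in range(start, end):
            masked[i] = mask_char * len(tokens[i])
    return masked
```

Other strategies from the list, such as replacement or generalization, would slot into the same loop by substituting a different transformation for the masking line.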
In the embodiment of the invention, the deep-learning-based language model provides a better representation of each word in the text data, and the CRF yields the labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, so that the position of the sensitive data in the unstructured data is determined and the identification accuracy is improved.
Preferably, the sensitive data identification and desensitization method of steps S101-S104 is packaged into a software development kit (SDK) and exposed through an application programming interface (API), for example RESTful or gRPC. The SDK is embedded into an application program; the application program calls the corresponding API in the SDK as required, and the corresponding service returns the result. Fig. 2 is a schematic diagram of an SDK embedded in an application. The API is as follows:
[Figures BDA0002294309940000041 and BDA0002294309940000051: API table, provided as images in the original publication and not reproduced here]
This approach has a short development cycle and is simple to embed in an application program, helping enterprises integrate the sensitive data identification and desensitization method into their products more easily and thereby improve their data protection capability.
Embodiment Two
As an embodiment of the invention, the sensitive data recognition model needs to be trained before sensitive data recognition is performed on the text data. Fig. 3 is a flowchart of sensitive data recognition model training according to a second embodiment of the present invention. As shown in fig. 3, before the unstructured data is parsed, the method includes:
Step S301: segmenting the training data into a plurality of words.
The sensitive data recognition model is trained on a large amount of training data. The training data comprises a plurality of words and can be segmented at word (token) granularity.
Step S302: labeling the sensitive entity attributes of the training data with preset identifiers to obtain a second labeling sequence.
In the embodiment of the invention, the sensitive entity attributes and their identifiers are defined first. Sensitive entity attributes include, for example, name, age, native place, identity card number, mobile phone number, e-mail address and organization name. The identifiers comprise direct identifiers and quasi-identifiers. A direct identifier, such as a name, identity card number or mobile phone number, can locate an individual directly; a single quasi-identifier cannot locate an individual directly, but a combination of multiple quasi-identifiers can. Desensitizing these two major types of identifiers resists attacks in most cases and greatly reduces the risk of privacy disclosure. A labeling scheme such as BIO or BIOES is adopted, with labels such as B, I, E, O and S: B marks the beginning of a sensitive entity attribute; I marks its continuation; E marks its end; O marks a non-sensitive word; and S marks a single-word sensitive entity. In addition, "[CLS]" and "[SEP]" are placed at the beginning and end of a sentence, respectively. For example, the training sentence "The capital of the People's Republic of China is Beijing" is segmented at word granularity as "China / People's / Republic / capital / is / Beijing". In this sentence, "China" is labeled B, "People's" is labeled I, "Republic" is labeled E, "capital" and "is" are both labeled O, and "Beijing" is labeled S, so the labeling sequence of the whole sentence is "[CLS] B I E O O S [SEP]".
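The BIOES labeling of the example sentence can be reproduced mechanically from the entity boundaries. In the sketch below, the function name `spans_to_bioes` and the (start, end_exclusive) span format are assumptions made for illustration; the labels themselves follow the scheme described above.

```python
from typing import List, Sequence, Tuple


def spans_to_bioes(n_words: int,
                   entity_spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Build a BIOES labeling sequence, wrapped in [CLS] and [SEP]."""
    tags = ["O"] * n_words
    for start, end in entity_spans:
        if end - start == 1:
            tags[start] = "S"          # single-word sensitive entity
        else:
            tags[start] = "B"          # beginning of the entity
            for i in range(start + 1, end - 1):
                tags[i] = "I"          # continuation of the entity
            tags[end - 1] = "E"        # end of the entity
    return ["[CLS]"] + tags + ["[SEP]"]


# The six-word example sentence: words 0-2 form one entity, word 5 another.
words = ["China", "People's", "Republic", "capital", "is", "Beijing"]
labels = spans_to_bioes(len(words), [(0, 3), (5, 6)])
```

Here `labels` is `["[CLS]", "B", "I", "E", "O", "O", "S", "[SEP]"]`, matching the labeling sequence given for the example sentence.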
The training data is text data whose second labeling sequence is known.
Step S303: inputting the training data into the sensitive data recognition model to obtain a third labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer and a CRF.
In an embodiment of the invention, the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer and a CRF. Viewed another way, as shown in FIG. 4, the sensitive data recognition model includes an input layer, a word representation layer and a decoding layer.
The input layer preprocesses the input training data to eliminate unreasonable character pairs and invalid characters in the training data.
The word representation layer is a bidirectional language model such as BERT, ELMo or GPT. It uses a Transformer-based encoder to obtain word vectors that carry context information, mines the text features of the training data as fully as possible and extracts richer word representations. This overcomes the shortcomings of traditional word vectors (such as word2vec and GloVe), which cannot represent context dynamically and cannot resolve polysemy.
The decoding layer comprises the fully connected layer and the CRF. The word vectors output by the word representation layer pass through the fully connected layer to obtain the probability that each word belongs to each sensitive entity attribute (including the no-category case). The probabilities for every word in the training data are then input into the CRF, which uses the transition probabilities between states and the emission probabilities of the states to obtain the third labeling sequence with the maximum joint probability of the sensitive entity attributes of the words.
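The description does not give formulas for the CRF. For orientation only, the standard linear-chain CRF formulation commonly paired with such a decoding layer is shown below; the symbols are assumptions introduced here: $E_{t,k}$ is the emission score of label $k$ for the $t$-th word (the fully connected layer output), $T_{i,j}$ is the transition score from label $i$ to label $j$, $x$ is the sentence of $n$ words and $y$ a candidate labeling sequence.

```latex
\begin{aligned}
s(x, y) &= \sum_{t=1}^{n} E_{t,\,y_t} + \sum_{t=2}^{n} T_{y_{t-1},\,y_t} \\
P(y \mid x) &= \frac{\exp\bigl(s(x, y)\bigr)}{\sum_{y'} \exp\bigl(s(x, y')\bigr)} \\
y^{*} &= \arg\max_{y} P(y \mid x)
\end{aligned}
```

Under this formulation, the labeling sequence with the maximum joint probability is $y^{*}$, and because the denominator does not depend on $y$, maximizing $P(y \mid x)$ is the same as maximizing the score $s(x, y)$.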
Step S304: comparing the third labeling sequence with the second labeling sequence, and, when the accuracy is greater than a preset threshold, stopping training to obtain the trained sensitive data recognition model.
The second labeling sequence is the known, manually labeled sequence, and the third labeling sequence is the sequence predicted by the sensitive data recognition model for the training data. In the training stage, the third labeling sequence is compared with the second labeling sequence; when the accuracy of the third labeling sequence exceeds a preset threshold, for example 95%, the predictions of the sensitive data recognition model are considered accurate and training stops. The trained sensitive data recognition model can then be used for sensitive data identification.
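The stopping criterion can be sketched as a per-label comparison of the two sequences. The description does not say how accuracy is measured, so the per-word accuracy used here is an assumption, as are the function names `tag_accuracy` and `should_stop`.

```python
from typing import Sequence


def tag_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


def should_stop(predicted: Sequence[str], reference: Sequence[str],
                threshold: float = 0.95) -> bool:
    """True when accuracy exceeds the preset threshold (95% in the example)."""
    return tag_accuracy(predicted, reference) > threshold
```

In a training loop, `predicted` would be the third labeling sequence and `reference` the second labeling sequence, with `should_stop` checked after each evaluation pass.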
Correspondingly, in the recognition stage, as shown in fig. 5, inputting the text data into the sensitive data recognition model to obtain the first labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer and a CRF, includes:
Step S501: inputting the text data into the deep-learning-based language model to obtain, for each word, a word vector that carries context information.
Step S502: inputting the word vectors into the fully connected layer and the CRF to obtain the first labeling sequence with the maximum joint probability of the sensitive entity attributes of the words.
In the recognition stage, the text data is processed by the sensitive data recognition model in the same way as in the training stage, and the resulting first labeling sequence combines contextual features with the dependencies between labels.
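The fully connected layer in step S502 maps each word vector to one score per sensitive entity attribute, and a softmax turns those scores into probabilities. The sketch below is a minimal pure-Python illustration; the function names, the tiny dimensions and the weight values are all hypothetical, and a real model would use a deep learning framework with learned parameters.

```python
import math
from typing import List, Sequence


def fully_connected(vec: Sequence[float],
                    weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> List[float]:
    """Compute one score per label: weights @ vec + bias."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Normalize scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# A 2-dimensional word vector scored against 2 hypothetical labels.
scores = fully_connected([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 1.0])
probabilities = softmax(scores)
```

The per-word scores produced this way play the role of emission scores for the CRF, which then chooses the labeling sequence jointly rather than word by word.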
In the embodiment of the invention, the deep-learning-based language model represents each word in the text data dynamically and learns the semantic information of the text data. This addresses the problems of pattern-matching-based methods, namely rigid patterns, insufficient accuracy and poor cross-platform recognition capability, as well as their usual need for expert guidance and substantial manual effort to write rules. Meanwhile, the CRF yields the labeling sequence with the maximum joint probability of the sensitive entity attributes of the words in the text data and thereby determines the position of the sensitive data in the unstructured data. This addresses entity boundary errors that arise when a sensitive entity is long or when one sentence contains several entities, and improves the identification accuracy.
Embodiment Three
As an embodiment of the present invention, the unstructured data may be a picture. Fig. 6 is a flowchart of parsing unstructured data according to a third embodiment of the present invention. As shown in fig. 6, the method includes:
Step S601: determining a text area in the picture to be desensitized through a first neural network.
In the embodiment of the invention, the picture is divided to generate a plurality of sub-text proposal boxes, while a convolutional neural network extracts the features of the picture to be desensitized. The picture features and the sub-text proposal boxes are input into a recurrent neural network, which analyzes the features and assigns each sub-text proposal box a score for containing text data. The sub-text proposal boxes likely to contain text data are thereby determined and connected to form a text area containing the text data.
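The final step, connecting high-scoring proposal boxes into text areas, can be sketched as follows. This is a deliberate simplification under stated assumptions: boxes are (x1, y1, x2, y2) tuples on a single line of text, the thresholds `min_score` and `max_gap` are hypothetical values, and the neural networks that produce the scores are not modeled.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def connect_proposals(proposals: Sequence[Box],
                      scores: Sequence[float],
                      min_score: float = 0.7,
                      max_gap: int = 10) -> List[Box]:
    """Merge neighbouring high-scoring proposal boxes into text areas."""
    # Keep only boxes likely to contain text, ordered left to right.
    kept = sorted(box for box, s in zip(proposals, scores) if s >= min_score)
    regions: List[Box] = []
    for x1, y1, x2, y2 in kept:
        if regions and x1 - regions[-1][2] <= max_gap:
            # Close enough to the previous area: extend it to cover this box.
            px1, py1, px2, py2 = regions[-1]
            regions[-1] = (px1, min(py1, y1), max(px2, x2), max(py2, y2))
        else:
            regions.append((x1, y1, x2, y2))
    return regions
```

A multi-line picture would additionally need a vertical-overlap check before two boxes are merged.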
Step S602: obtaining the text data in the text area through a second neural network.
In the embodiment of the invention, the picture within the text area is first input into a convolutional neural network for text recognition, which converts the picture pixels into feature vectors. A recurrent neural network for text recognition then analyzes the feature vectors to obtain a character sequence, namely the text data; the characters in the text area are thus converted into text data that a computer can process. Because the output of the recurrent neural network may contain repeated characters or blanks, a character sequence transcription step based on connectionist temporal classification (CTC) is preferably added after the recurrent neural network to obtain the final text data.
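The CTC-based transcription step can be illustrated with greedy decoding: consecutive repeated frame labels are merged, and the blank symbol is then removed. The function name `ctc_collapse` and the use of "-" as the blank symbol are assumptions for this sketch.

```python
from itertools import groupby


def ctc_collapse(frame_labels: str, blank: str = "-") -> str:
    """Collapse per-frame labels into the final character sequence.

    Runs of identical labels are merged first; blanks are dropped afterwards,
    so a blank between two identical characters keeps both of them.
    """
    return "".join(ch for ch, _ in groupby(frame_labels) if ch != blank)
```

For the frame sequence `"hh-e-ll-lo--"`, the blank between the two runs of `l` preserves the double letter, so the result is `"hello"`.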
The text data is input into the sensitive data recognition model described in the first and second embodiments to identify the sensitive data; combined with the text area from step S601, the sensitive data is then located and desensitized on the picture.
In the embodiment of the invention, a convolutional neural network and a recurrent neural network recognize the characters in the picture to be desensitized and convert the pixels of the picture into text data that a computer can process. The deep-learning-based language model then learns a better representation of each word in the text data, and the CRF yields the labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, so that the position of the sensitive data in the unstructured data is determined and the identification accuracy is improved.
Embodiment Four
Fig. 7 is a schematic structural diagram of a sensitive data identification device according to a fourth embodiment of the present invention. As shown in fig. 7, the sensitive data identification device includes a parsing unit 71, a recognition unit 72 and a determining unit 73.
The parsing unit 71 is configured to parse the unstructured data to obtain text data corresponding to the unstructured data, where the text data includes a plurality of words.
The recognition unit 72 is configured to input the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, where the sensitive data recognition model includes a deep-learning-based language model, a fully connected layer and a CRF.
The determining unit 73 is configured to determine the position of the sensitive data in the text data according to the first labeling sequence.
As an embodiment of the present invention, the sensitive data identification device further includes:
A segmentation unit, configured to segment the training data into a plurality of words.
A labeling unit, configured to label the sensitive entity attributes of the training data with preset identifiers to obtain a second labeling sequence.
A training unit, configured to input the training data into the sensitive data recognition model to obtain a third labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, where the sensitive data recognition model includes a deep-learning-based language model, a fully connected layer and a CRF.
A judging unit, configured to compare the third labeling sequence with the second labeling sequence and, when the accuracy is greater than a preset threshold, stop training to obtain the trained sensitive data recognition model.
Accordingly, the recognition unit 72 includes:
A word representation subunit, configured to input the text data into the deep-learning-based language model to obtain, for each word, a word vector that carries context information.
A decoding subunit, configured to input the word vectors into the fully connected layer and the CRF to obtain the first labeling sequence with the maximum joint probability of the sensitive entity attributes of the words.
Preferably, the preset identifiers include direct identifiers and quasi-identifiers.
As an embodiment of the present invention, the sensitive data identification device further includes a desensitization unit, configured to desensitize the sensitive data according to the position of the sensitive data in the text data.
Preferably, the parsing unit 71 includes:
A first subunit, configured to determine a text area in the picture to be desensitized through a first neural network.
A second subunit, configured to obtain the text data in the text area through a second neural network.
In the embodiment of the invention, the deep-learning-based language model learns a better representation of each word in the text data, and the CRF yields the labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, so that the position of the sensitive data in the unstructured data is determined and the identification accuracy is improved.
Embodiment Five
Fig. 8 is a block diagram of a sensitive data identification system according to a fifth embodiment of the present invention. As shown in fig. 8, the system includes a memory 81 and a processor 82, wherein the memory 81 is used for storing executable program codes; the processor 82 is connected to the memory 81, and executes a program corresponding to the executable program code by reading the executable program code stored in the memory 81, so as to perform the following steps:
parsing unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words;
inputting the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer and a conditional random field (CRF);
and determining the position of the sensitive data in the text data according to the first labeling sequence.
In the embodiment of the invention, the deep-learning-based language model learns a better representation of each word in the text data, and the CRF yields the labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, so that the position of the sensitive data in the unstructured data is determined and the identification accuracy is improved.
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and this description does not limit the scope of the invention. Those skilled in the art can implement the invention with various modifications without departing from its scope and spirit; for example, features of one embodiment can be used in another embodiment to yield a further embodiment. Any modification, equivalent replacement or improvement made within the technical idea of the present invention shall fall within the scope of the claims of the present invention.

Claims (11)

1. A method for sensitive data identification, the method comprising:
parsing unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words;
inputting the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint probability of the sensitive entity attributes of the words, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer and a conditional random field (CRF);
and determining the position of the sensitive data in the text data according to the first labeling sequence.
2. The method of claim 1, wherein prior to parsing the unstructured data, the method further comprises:
segmenting training data into a plurality of words;
labeling sensitive entity attributes in the training data using a preset identifier to obtain a second labeling sequence;
inputting the training data into the sensitive data recognition model to obtain, for each word, a third labeling sequence having the maximum joint distribution probability of the sensitive entity attributes, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer, and a conditional random field (CRF);
comparing the third labeling sequence with the second labeling sequence, and stopping training to obtain a trained sensitive data recognition model when the accuracy is greater than a preset threshold;
and wherein inputting the text data into the sensitive data recognition model to obtain, for each word, the first labeling sequence having the maximum joint distribution probability of the sensitive entity attributes comprises:
inputting the text data into the deep-learning-based language model to obtain a word vector with context information corresponding to each word;
and inputting the word vector into the fully connected layer and the conditional random field (CRF) to obtain, for each word, the first labeling sequence having the maximum joint distribution probability of the sensitive entity attributes.
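As a hedged sketch of the stopping rule in claim 2, the comparison between the predicted (third) and reference (second) labeling sequences can be written as below. The claim does not define how accuracy is computed; per-word label accuracy is assumed here. The names `label_accuracy` and `train_until_accurate`, the `train_step` and `predict` callables, and the `max_epochs` safety limit are all hypothetical stand-ins for the real model and training loop.

```python
from typing import Callable, Sequence, Tuple

Labels = Sequence[Sequence[str]]


def label_accuracy(predicted: Labels, reference: Labels) -> float:
    """Fraction of words whose predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("different number of sequences")
    total = correct = 0
    for pred_seq, ref_seq in zip(predicted, reference):
        if len(pred_seq) != len(ref_seq):
            raise ValueError("sequences must have the same length")
        total += len(ref_seq)
        correct += sum(p == r for p, r in zip(pred_seq, ref_seq))
    return correct / total if total else 0.0


def train_until_accurate(train_step: Callable[[], None],
                         predict: Callable[[], Labels],
                         reference: Labels,
                         threshold: float,
                         max_epochs: int = 100) -> Tuple[int, float]:
    """Run training epochs until accuracy exceeds the preset threshold.

    Returns (epochs_run, last_accuracy). max_epochs is an added safety
    limit, not something stated in the claim.
    """
    accuracy = 0.0
    for epoch in range(1, max_epochs + 1):
        train_step()                                   # one pass over the training data
        accuracy = label_accuracy(predict(), reference)  # third vs. second sequence
        if accuracy > threshold:
            return epoch, accuracy
    return max_epochs, accuracy
```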
3. The method of claim 2, wherein the preset identifier comprises a direct identifier and a quasi-identifier.
4. The method of claim 1, wherein when the unstructured data is a picture, parsing the unstructured data to obtain the text data corresponding to the unstructured data comprises:
determining a text region in the picture to be desensitized through a first neural network;
and acquiring the text data in the text region through a second neural network.
5. The method of any of claims 1-4, wherein after said determining the location of sensitive data in the text data from the first annotation sequence, the method further comprises:
and desensitizing the sensitive data according to the position of the sensitive data in the text data.
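The claim does not say which desensitization technique is applied. A minimal sketch, assuming simple character masking and assuming positions are given as word-index ranges `(start, end_exclusive, type)`, might look like this; `mask_spans` and `mask_char` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def mask_spans(words: Sequence[str],
               spans: Sequence[Tuple[int, int, str]],
               mask_char: str = "*") -> List[str]:
    """Replace every word inside a sensitive span with mask characters.

    Each masked word keeps its original length so the text layout is
    preserved; words outside the spans are returned unchanged.
    """
    masked = list(words)
    for start, end, _kind in spans:
        for i in range(start, end):
            masked[i] = mask_char * len(masked[i])
    return masked
```

For example, masking the spans `(1, 3, "NAME")` and `(4, 5, "PHONE")` in `["Call", "Alice", "Smith", "at", "555-0100"]` leaves only "Call" and "at" readable. Other strategies (replacement with surrogate values, hashing, removal) would use the same position information.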
6. A sensitive data identification system, the system comprising:
a memory and a processor,
wherein the memory is configured to store executable program code; the processor is connected to the memory and, by reading the executable program code stored in the memory, runs the corresponding program so as to perform the following steps:
parsing unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words;
inputting the text data into a sensitive data recognition model to obtain, for each word, a first labeling sequence having the maximum joint distribution probability of sensitive entity attributes, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer, and a conditional random field (CRF);
and determining the position of the sensitive data in the text data according to the first labeling sequence.
7. A sensitive data identification device, characterized in that it comprises:
a parsing unit, configured to parse unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words;
a recognition unit, configured to input the text data into a sensitive data recognition model to obtain, for each word, a first labeling sequence having the maximum joint distribution probability of sensitive entity attributes, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer, and a conditional random field (CRF);
and a determining unit, configured to determine the position of the sensitive data in the text data according to the first labeling sequence.
8. The device of claim 7, wherein the device further comprises:
a segmentation unit, configured to segment training data into a plurality of words;
a labeling unit, configured to label sensitive entity attributes in the training data using a preset identifier to obtain a second labeling sequence;
a training unit, configured to input the training data into the sensitive data recognition model to obtain, for each word, a third labeling sequence having the maximum joint distribution probability of the sensitive entity attributes, wherein the sensitive data recognition model comprises a deep-learning-based language model, a fully connected layer, and a conditional random field (CRF);
a judging unit, configured to compare the third labeling sequence with the second labeling sequence and, when the accuracy is greater than a preset threshold, stop training to obtain a trained sensitive data recognition model;
accordingly, the recognition unit comprises:
a word representation subunit, configured to input the text data into the deep-learning-based language model to obtain a word vector with context information corresponding to each word;
and a decoding subunit, configured to input the word vector into the fully connected layer and the conditional random field (CRF) to obtain, for each word, the first labeling sequence having the maximum joint distribution probability of the sensitive entity attributes.
9. The device of claim 8, wherein the preset identifier comprises a direct identifier and a quasi-identifier.
10. The device of claim 7, wherein the parsing unit comprises:
a first subunit, configured to determine a text region in the picture to be desensitized through a first neural network;
and a second subunit, configured to acquire the text data in the text region through a second neural network.
11. The device of any one of claims 7-10, further comprising:
a desensitization unit, configured to desensitize the sensitive data according to the position of the sensitive data in the text data.
CN201911194236.2A 2019-11-28 2019-11-28 Sensitive data identification method, system and device Pending CN111191275A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911194236.2A CN111191275A (en) 2019-11-28 2019-11-28 Sensitive data identification method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911194236.2A CN111191275A (en) 2019-11-28 2019-11-28 Sensitive data identification method, system and device

Publications (1)

Publication Number Publication Date
CN111191275A true CN111191275A (en) 2020-05-22

Family

ID=70707272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911194236.2A Pending CN111191275A (en) 2019-11-28 2019-11-28 Sensitive data identification method, system and device

Country Status (1)

Country Link
CN (1) CN111191275A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112104655A (en) * 2020-09-16 2020-12-18 安徽长泰信息安全服务有限公司 Protection system and method for preventing data leakage
CN112217841A (en) * 2020-12-09 2021-01-12 平安国际智慧城市科技股份有限公司 Live broadcast room management method and device, computer equipment and storage medium
CN112507628A (en) * 2021-02-03 2021-03-16 北京淇瑀信息科技有限公司 Risk prediction method and device based on deep bidirectional language model and electronic equipment
CN113177233A (en) * 2021-05-31 2021-07-27 上海英方软件股份有限公司 Sensitive data identification method and device
CN113392641A (en) * 2020-10-26 2021-09-14 腾讯科技(深圳)有限公司 Text processing method, device, storage medium and equipment
CN113420322A (en) * 2021-05-24 2021-09-21 阿里巴巴新加坡控股有限公司 Model training and desensitizing method and device, electronic equipment and storage medium
CN115618371A (en) * 2022-07-11 2023-01-17 上海期货信息技术有限公司 Desensitization method and device for non-text data and storage medium
CN116090006A (en) * 2023-02-01 2023-05-09 北京三维天地科技股份有限公司 Sensitive identification method and system based on deep learning
CN117009596A (en) * 2023-06-28 2023-11-07 国网冀北电力有限公司信息通信分公司 Identification method and device for power grid sensitive data
US11861039B1 (en) * 2020-09-28 2024-01-02 Amazon Technologies, Inc. Hierarchical system and method for identifying sensitive content in data
CN117556447A (en) * 2023-11-29 2024-02-13 金网络(北京)数字科技有限公司 Data encryption method and device based on classification recognition and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080075361A1 (en) * 2006-09-21 2008-03-27 Microsoft Corporation Object Recognition Using Textons and Shape Filters
CN109325326A (en) * 2018-08-16 2019-02-12 深圳云安宝科技有限公司 Data desensitization method, device, equipment and medium when unstructured data accesses
CN109800411A (en) * 2018-12-03 2019-05-24 哈尔滨工业大学(深圳) Clinical treatment entity and its attribute extraction method
CN109960727A (en) * 2019-02-28 2019-07-02 天津工业大学 For the individual privacy information automatic testing method and system of non-structured text
CN109977402A (en) * 2019-03-11 2019-07-05 北京明略软件系统有限公司 A kind of name entity recognition method and system

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112104655A (en) * 2020-09-16 2020-12-18 安徽长泰信息安全服务有限公司 Protection system and method for preventing data leakage
US11861039B1 (en) * 2020-09-28 2024-01-02 Amazon Technologies, Inc. Hierarchical system and method for identifying sensitive content in data
CN113392641A (en) * 2020-10-26 2021-09-14 腾讯科技(深圳)有限公司 Text processing method, device, storage medium and equipment
CN112217841A (en) * 2020-12-09 2021-01-12 平安国际智慧城市科技股份有限公司 Live broadcast room management method and device, computer equipment and storage medium
CN112507628A (en) * 2021-02-03 2021-03-16 北京淇瑀信息科技有限公司 Risk prediction method and device based on deep bidirectional language model and electronic equipment
CN113420322B (en) * 2021-05-24 2023-09-01 阿里巴巴新加坡控股有限公司 Model training and desensitizing method and device, electronic equipment and storage medium
CN113420322A (en) * 2021-05-24 2021-09-21 阿里巴巴新加坡控股有限公司 Model training and desensitizing method and device, electronic equipment and storage medium
CN113177233A (en) * 2021-05-31 2021-07-27 上海英方软件股份有限公司 Sensitive data identification method and device
CN115618371B (en) * 2022-07-11 2023-08-04 上海期货信息技术有限公司 Non-text data desensitization method, device and storage medium
CN115618371A (en) * 2022-07-11 2023-01-17 上海期货信息技术有限公司 Desensitization method and device for non-text data and storage medium
CN116090006A (en) * 2023-02-01 2023-05-09 北京三维天地科技股份有限公司 Sensitive identification method and system based on deep learning
CN116090006B (en) * 2023-02-01 2023-09-08 北京三维天地科技股份有限公司 Sensitive identification method and system based on deep learning
CN117009596A (en) * 2023-06-28 2023-11-07 国网冀北电力有限公司信息通信分公司 Identification method and device for power grid sensitive data
CN117556447A (en) * 2023-11-29 2024-02-13 金网络(北京)数字科技有限公司 Data encryption method and device based on classification recognition and storage medium

Similar Documents

Publication Publication Date Title
CN111191275A (en) Sensitive data identification method, system and device
CN110321432B (en) Text event information extraction method, electronic device and nonvolatile storage medium
CN109766540B (en) General text information extraction method and device, computer equipment and storage medium
US11734328B2 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
CN112417885A (en) Answer generation method and device based on artificial intelligence, computer equipment and medium
CN111198948A (en) Text classification correction method, device and equipment and computer readable storage medium
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN111324739B (en) Text emotion analysis method and system
CN108205524B (en) Text data processing method and device
CN114298035A (en) Text recognition desensitization method and system thereof
CN113821605A (en) Event extraction method
CN108763192B (en) Entity relation extraction method and device for text processing
CN111462752A (en) Client intention identification method based on attention mechanism, feature embedding and BI-LSTM
CN112132238A (en) Method, device, equipment and readable medium for identifying private data
CN115357699A (en) Text extraction method, device, equipment and storage medium
CN114003725A (en) Information annotation model construction method and information annotation generation method
CN110852082B (en) Synonym determination method and device
CN111754352A (en) Method, device, equipment and storage medium for judging correctness of viewpoint statement
CN111597302A (en) Text event acquisition method and device, electronic equipment and storage medium
CN111178080A (en) Named entity identification method and system based on structured information
CN113095073B (en) Corpus tag generation method and device, computer equipment and storage medium
CN112989820B (en) Legal document positioning method, device, equipment and storage medium
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200522