CN111143884A - Data desensitization method and device, electronic equipment and storage medium - Google Patents

Data desensitization method and device, electronic equipment and storage medium

Info

Publication number
CN111143884A
Authority
CN
China
Prior art keywords
sensitive data
data
word
words
desensitized
Prior art date
Legal status
Granted
Application number
CN201911421361.2A
Other languages
Chinese (zh)
Other versions
CN111143884B (en)
Inventor
丁琳
Current Assignee
Beijing Yiyiyun Technology Co ltd
Original Assignee
Beijing Yiyiyun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yiyiyun Technology Co ltd
Priority to CN201911421361.2A
Priority to CN202210693309.8A
Publication of CN111143884A
Application granted
Publication of CN111143884B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure relates to a data desensitization method and device, an electronic device and a computer-readable storage medium, and belongs to the technical field of data processing. The method comprises the following steps: obtaining a text to be desensitized, performing word segmentation on the text to be desensitized to obtain a plurality of words and their parts of speech, and filtering the words according to their parts of speech to obtain data to be desensitized; and desensitizing the words in the data to be desensitized according to a preset independent sensitive data identification model and a preset partially sensitive data identification model. The present disclosure can improve the efficiency and accuracy of data desensitization.

Description

Data desensitization method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data desensitization method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In every field, as informatization advances, the demand for data interconnection and interchange grows day by day, but the security of data also becomes an increasingly prominent problem. Data desensitization deforms certain sensitive information according to desensitization rules, so as to reliably protect sensitive private data. For example, in the medical field, medical data desensitization deforms data relating to patients' basic information and to sensitive information generated during medical treatment according to desensitization rules, thereby protecting the sensitive data. In the related art, the identification of sensitive data depends on manually arranged field information, black and white lists, or rules, and its efficiency and accuracy are low, so the efficiency and accuracy of data desensitization are also low.
Disclosure of Invention
An object of the present disclosure is to provide a data desensitization method and apparatus, an electronic device and a computer readable storage medium, which overcome the problems of low efficiency and accuracy of data desensitization due to the limitations and disadvantages of the related art, at least to some extent.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a data desensitization method, comprising:
obtaining a text to be desensitized, performing word segmentation on the text to be desensitized to obtain a plurality of words and parts of speech of the words, and performing filtering processing on the words according to the parts of speech of the words to obtain data to be desensitized;
desensitizing the words in the data to be desensitized according to a preset independent sensitive data identification model and a preset partially sensitive data identification model.
Optionally, desensitizing the words in the data to be desensitized through a preset independent sensitive data recognition model includes:
for each word in the data to be desensitized, converting the word and the attribute information of the word into a corresponding attribute vector;
processing the attribute vector through an independent sensitive data identification model to obtain an attribute value corresponding to the attribute vector;
determining whether the word is independently sensitive data based on the attribute values, and desensitizing the word when determined to be independently sensitive data.
Optionally, the independent sensitive data recognition model is obtained by training in the following way:
acquiring an original data set marked by sensitive data, and constructing a sensitive word bank according to the sensitive data marked in the original data set;
classifying the words in the sensitive word stock according to a preset rule, and determining the sensitive data type of each word in the sensitive word stock;
for each word of the independent sensitive data type, converting the word and the attribute information of the word into a corresponding attribute vector;
and establishing the independent sensitive data identification model through a logistic regression algorithm according to the attribute vector corresponding to the words of the independent sensitive data type and the preset attribute value.
Optionally, the partially sensitive data recognition model is obtained by training in the following manner:
for each word of a partial sensitive data type, determining target sensitive data corresponding to the word in the original data set;
acquiring similar words with similarity larger than a similarity threshold value with the word, and replacing the words in the target sensitive data with the similar words to obtain updated data;
when the updated data cannot be retrieved from the original data set, acquiring a sensitive data identification result input by a user aiming at the updated data;
and adding the updated data and the sensitive data identification result into the original data set, and taking the updated original data set as the partial sensitive data identification model.
Optionally, desensitizing the words in the data to be desensitized through a preset partially sensitive data recognition model includes:
for each word in the data to be desensitized, retrieving the word in the partially sensitive data identification model;
when the word is searched, judging whether the word is sensitive data;
when the word is sensitive data, the word is desensitized.
Optionally, desensitizing the words in the data to be desensitized according to a preset independent sensitive data identification model and a preset partially sensitive data identification model, including:
determining the sensitive data type of each word in the data to be desensitized;
desensitizing the words belonging to the independent sensitive data type through a preset independent sensitive data identification model;
and desensitizing the words belonging to the type of the partially sensitive data through a preset partially sensitive data recognition model.
Optionally, after the independent sensitive data recognition model is established, the method further includes:
acquiring a training data set marked by sensitive data, and selecting a first target training word with the sensitive data type as an independent sensitive data type from the sensitive data of the training data set;
identifying the first target training words through the independent sensitive data identification model to obtain a predicted value;
selecting a predicted value of which the difference value of the attribute values corresponding to the independent sensitive data types is smaller than a difference threshold value;
and updating the independent sensitive data recognition model according to the second target training words corresponding to the selected predicted values.
According to a second aspect of the present disclosure, there is provided a data desensitization apparatus comprising:
the preprocessing module is used for acquiring a text to be desensitized, segmenting the text to be desensitized to obtain a plurality of words and parts of speech of the words, and filtering the words according to the parts of speech of the words to obtain data to be desensitized;
and the desensitization module is used for desensitizing the words in the data to be desensitized according to a preset independent sensitive data identification model and a preset partially sensitive data identification model.
Optionally, the desensitization module comprises: a first desensitization sub-module, the first desensitization sub-module comprising:
the vector conversion unit is used for converting each word in the data to be desensitized and the attribute information of the word into a corresponding attribute vector;
the attribute value determining unit is used for processing the attribute vector through an independent sensitive data identification model to obtain an attribute value corresponding to the attribute vector;
and the independent sensitive data desensitization unit is used for determining whether the word is independent sensitive data according to the attribute value and desensitizing the word when the word is determined to be independent sensitive data.
Optionally, the data desensitization apparatus according to the embodiment of the present disclosure further includes:
the sensitive word stock building module is used for obtaining an original data set marked by sensitive data and building a sensitive word stock according to the sensitive data marked in the original data set;
the sensitive data type determining module is used for classifying the words in the sensitive word stock according to a preset rule and determining the sensitive data type of each word in the sensitive word stock;
the attribute vector determining module is used for converting the word and the attribute information of the word into a corresponding attribute vector aiming at each word of the independent sensitive data type;
and the independent sensitive data identification model establishing module is used for establishing the independent sensitive data identification model through a logistic regression algorithm according to the attribute vector corresponding to the words of the independent sensitive data type and the preset attribute value.
Optionally, the data desensitization apparatus according to the embodiment of the present disclosure further includes:
the target sensitive data determining module is used for determining target sensitive data corresponding to each word in the partial sensitive data type in the original data set;
the similar word replacing module is used for acquiring a similar word with the similarity larger than a similarity threshold, and replacing the word in the target sensitive data with the similar word to obtain updated data;
the identification result acquisition module is used for acquiring a sensitive data identification result input by a user aiming at the updated data when the updated data cannot be retrieved in the original data set;
and the partial sensitive data identification model establishing module is used for adding the updated data and the sensitive data identification result into the original data set and taking the updated original data set as the partial sensitive data identification model.
Optionally, the desensitization module comprises: a second desensitization sub-module, the second desensitization sub-module comprising:
a retrieval unit, configured to, for each word in the data to be desensitized, retrieve the word in the partially sensitive data identification model;
the sensitive data judging unit is used for judging whether the word is sensitive data or not when the word is searched;
and the sensitive data desensitization unit is used for desensitizing the word when the word is sensitive data.
Optionally, the desensitization module is specifically configured to determine a sensitive data type of each word in the data to be desensitized; desensitizing the words belonging to the independent sensitive data type through a preset independent sensitive data identification model; and desensitizing the words belonging to the type of the partially sensitive data through a preset partially sensitive data recognition model.
Optionally, the data desensitization apparatus according to the embodiment of the present disclosure further includes:
the model updating module is used for acquiring a training data set marked by sensitive data and selecting a first target training word with a sensitive data type as an independent sensitive data type from sensitive data of the training data set; identifying the first target training words through the independent sensitive data identification model to obtain a predicted value; selecting a predicted value of which the difference value of the attribute values corresponding to the independent sensitive data types is smaller than a difference threshold value; and updating the independent sensitive data recognition model according to the second target training words corresponding to the selected predicted values.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any one of the above via execution of the executable instructions.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
Exemplary embodiments of the present disclosure have the following advantageous effects:
in the data desensitization method and device provided by the exemplary embodiment of the disclosure, words in the data to be desensitized are desensitized through the independent sensitive data identification model and the partial sensitive data identification model, and the data desensitization can be automatically performed on the data to be desensitized, so that the labor cost can be reduced, and the data desensitization efficiency is improved. Moreover, different models, namely an independent sensitive data identification model and a partial sensitive data identification model, can be established for words of different sensitive data types, and data desensitization is carried out according to a plurality of different models, so that the accuracy of data desensitization can be improved. In addition, word segmentation and filtering processing are carried out on the text to be desensitized to obtain data to be desensitized, so that the complexity of calculation can be reduced, and the data desensitization efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 shows a schematic diagram of identifying and labeling data in semi-structured data through an XML path language;
FIG. 2 illustrates a flow chart of a method of data desensitization in an embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating a method for building an independent sensitive data recognition model and a partially sensitive data recognition model according to an embodiment of the present disclosure;
FIG. 4 illustrates a dependency graph;
FIG. 5 shows a flowchart of a method for desensitizing words in data to be desensitized by an independent sensitive data identification model in an embodiment of the present disclosure;
FIG. 6 shows a flowchart of a method for desensitizing words in data to be desensitized by a partially sensitive data identification model in an embodiment of the present disclosure;
FIG. 7 illustrates a flow chart of a method for data desensitization via independent sensitive data identification models and partially sensitive data identification models in an embodiment of the present disclosure;
FIG. 8 illustrates a flow chart for updating the independent sensitive data recognition model in an embodiment of the present disclosure;
FIG. 9 is a schematic diagram showing one configuration of a data desensitization apparatus according to an embodiment of the present disclosure;
fig. 10 shows a schematic structural diagram of a computer system of an electronic device for implementing an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In the present disclosure, the terms "include", "comprise", "have" and the like are open-ended and mean that elements/components/etc. other than those listed may also be present; the terms "first", "second" and the like are used merely as labels and do not limit the number or order of their objects.
Sensitive data refers to data relating to the security and privacy of a user, including identification card numbers, telephone numbers, addresses, bank card numbers, etc.; sensitive data in the medical field further includes drug allergies, infectious disease history, eating habits, genetic information, etc. The leakage of sensitive data not only seriously affects an enterprise's core secrets, competitiveness and market reputation, but also harms, to different degrees, users' privacy and personal information security. Therefore, where customer security data or business-sensitive data is involved, the real data may be modified; for example, data desensitization may be performed on personally sensitive information such as identification numbers, mobile phone numbers, card numbers and customer numbers.
The existing sensitive data identification modes mainly include the following three types:
1) Manually inspect the original data and specify the data column where the sensitive data is located, or specify a special data node in semi-structured data. Referring to FIG. 1, FIG. 1 shows a schematic diagram of identifying and labeling data in semi-structured data through the XML Path Language (XPath). XPath is a language for locating parts of an XML document; as the figure shows, an identification card number can be recognized through an XPath expression.
2) Manually provide the full set of data requiring desensitization (a white list) and an exception list (a black list), and scan all data to find data matching the white list.
3) Manually customize rules (e.g., regular expressions) according to the characteristics of the data requiring desensitization, and find sensitive data through rule configuration. A regular expression is a logical formula for operating on character strings: a "regular string" is built from predefined specific characters and combinations of them, and this "regular string" expresses the filtering logic applied to character strings.
It can be seen that all of the existing sensitive data identification modes identify sensitive data manually and require a large amount of manual labeling work on the data, so their identification efficiency and accuracy are low. Accordingly, the efficiency and accuracy of data desensitization are low.
In order to solve the above problems, embodiments of the present disclosure provide a data desensitization method and apparatus, an electronic device, and a computer-readable storage medium, which can improve efficiency and accuracy of data desensitization.
The data desensitization method provided by the embodiments of the present disclosure is first described below.
Wherein the types of data include:
1. structuring data: relational database data, Excel, CSV (comma separated value) and other table data.
2. Semi-structured data: data stored in JSON (JSON object notation), XML, Html (hypertext markup language), and the like.
3. Unstructured data: plain text, text scans, medical image data, and the like.
The data types supported by various sensitive data identification modes can be seen in table 1:
TABLE 1
(Table 1, which lists the data types supported by each sensitive data identification mode, appears as an image in the original publication and is not reproduced here.)
For the manually specified semi-structured data in Table 1, there are certain limitations on the data format and specification. For unstructured data, sensitive information cannot be accurately marked by row and column numbers and is often mixed with non-sensitive information; for such a huge amount of information, manual marking is almost impracticable. Although rule matching can partially identify sensitive data in unstructured data, a large amount of induction and abstraction must first be performed on the existing sensitive data before it can be translated into a rule language, which places high demands on data processors.
Medical sensitive data is not limited to data in a certain mode, and therefore, the sensitive data identification in the medical field includes structured data, semi-structured data, data stored in a text mode in unstructured data, and the like.
Referring to fig. 2, fig. 2 shows a flow chart of a data desensitization method in an embodiment of the present disclosure, which may include the following steps:
step S210, obtaining a text to be desensitized, performing word segmentation on the text to be desensitized to obtain a plurality of words and parts of speech of the words, and performing filtering processing on the words according to the parts of speech of the words to obtain data to be desensitized.
And step S220, desensitizing the words in the data to be desensitized according to a preset independent sensitive data identification model and a preset partially sensitive data identification model.
Through the independent sensitive data identification model and the partial sensitive data identification model, the words in the data to be desensitized are desensitized, the data desensitization can be automatically performed on the data to be desensitized, the labor cost can be reduced, and the data desensitization efficiency is improved. Moreover, different models, namely an independent sensitive data identification model and a partial sensitive data identification model, can be established for words of different sensitive data types, and data desensitization is carried out according to a plurality of different models, so that the accuracy of data desensitization can be improved. In addition, word segmentation and filtering processing are carried out on the text to be desensitized to obtain data to be desensitized, so that the complexity of calculation can be reduced, and the data desensitization efficiency is improved.
The data desensitization method of the disclosed embodiments is described in more detail below:
in the embodiment of the present disclosure, since the independent sensitive data identification model and the partially sensitive data identification model used in the data desensitization are established in advance, first, a method for establishing the independent sensitive data identification model and the partially sensitive data identification model is described below.
Referring to fig. 3, fig. 3 is a flow chart illustrating a method for building an independent sensitive data recognition model and a partial sensitive data recognition model according to an embodiment of the present disclosure, which may include the following steps:
step S310, an original data set marked by sensitive data is obtained, and a sensitive word bank is constructed according to the sensitive data marked in the original data set.
In the embodiment of the present disclosure, the original data set may be a data set marked with sensitive data in a conventional manner, for example, the sensitive data in the original data set may be marked with red or in another manner, so as to obtain the sensitive data. The text data may be divided into long texts, sentences, words, etc. according to the granularity of the contained information, wherein the words are the minimum units representing semantics, and the labeled sensitive data may be the long texts, sentences, words, etc.
In an exemplary embodiment of the present disclosure, the sensitive data marked in the original data set may be subjected to word segmentation processing, resulting in a plurality of words. Because the obtained multiple words may have non-sensitive words, the multiple words can be filtered, and the set of filtered words is used as a sensitive word bank.
Optionally, part-of-speech tagging may be performed on each of the plurality of words to obtain a part-of-speech of each word. For example, words may be part-of-speech tagged by a linguistic analysis tool. The parts of speech of Chinese can be divided into 12 categories:
1) Nouns: words that name people and things.
2) Verbs: words that represent the actions, behaviors, developments and changes of people and things.
3) Adjectives: words that represent the shape, nature, color, state, etc. of things.
4) Adverbs: words that modify and limit verbs to indicate time, frequency, range, degree, tone, etc.
5) Pronouns: words that stand in for an unspecified person or thing, such as you, me, he, who, etc.
6) Prepositions: words that combine with other words to form prepositional phrases.
7) Quantifiers: words that represent units of things or actions.
8) Conjunctions: words that connect words and sentences.
9) Auxiliary words: words attached to other words or sentences to assist them, such as 地, and the like.
10) Numerals: words that represent numbers.
11) Interjections: words that represent emotions, exclamations, calls and responses.
12) Onomatopoeia: words that represent sounds.
Nouns are the words that contain the most sensitive information; therefore, nouns can be subdivided into proper nouns, person names, place names, organization names, etc., and sensitive information in nouns such as telephone numbers, identity card numbers and addresses can be analyzed more finely through related rules. Adjectives, adverbs, pronouns, prepositions, interjections, onomatopoeia and the like do not usually contain sensitive information, so the words can be filtered according to their parts of speech. Specifically, the words that are unlikely to contain sensitive information, such as adjectives, adverbs, pronouns, prepositions, interjections and onomatopoeia, can be filtered out, so that the sensitive word bank is obtained. Of course, the filtering may also be performed by other methods, which is not limited in the present disclosure.
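As an illustration only, the following minimal sketch (not part of the original disclosure) shows one possible way to implement the word segmentation and part-of-speech filtering described above; the jieba toolkit and the particular part-of-speech tag set are assumptions, since the disclosure does not name a specific linguistic analysis tool.

```python
# Minimal sketch: word segmentation + part-of-speech filtering.
# jieba is only one possible linguistic analysis tool (an assumption).
import jieba.posseg as pseg

# Part-of-speech tags that rarely carry sensitive information:
# a = adjective, d = adverb, r = pronoun, p = preposition,
# u = auxiliary word, e = interjection, o = onomatopoeia, c = conjunction.
NON_SENSITIVE_POS = {"a", "d", "r", "p", "u", "e", "o", "c"}

def segment_and_filter(text):
    """Return (word, pos) pairs with low-information parts of speech removed."""
    return [(pair.word, pair.flag)
            for pair in pseg.cut(text)
            if pair.flag[0] not in NON_SENSITIVE_POS]

# The surviving words form the candidate sensitive word bank (or, in step S210,
# the data to be desensitized). The sentence below is a hypothetical example.
candidates = segment_and_filter("患者王某某，电话13800000000，家住北京市朝阳区。")
```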
Step S320, classifying the words in the sensitive word bank according to a preset rule, and determining a sensitive data type to which each word in the sensitive word bank belongs.
The words may include non-sensitive words (i.e., words unrelated to sensitive data) and sensitive words, and the sensitive words include words of the independent sensitive data type and words of the partially sensitive data type. A word of the independent sensitive data type is sensitive data on its own, without being modified by other words, such as a telephone number, an identification number or a home address. A word of the partially sensitive data type becomes sensitive data only when it is combined with other words into a compound word or phrase. For example, consider a patient surnamed Li who contracted a disease while serving in an armed forces unit: a specific medical history that involves a national security unit must be desensitized, but if no national security unit is involved, the information does not need to be desensitized.
In the embodiment of the present disclosure, the preset rules may include a plurality of rules. For example, the rule for identification card numbers may be: 18 consecutive digits whose 7th to 10th digits fall in the range 1900 to 2019 are identified as the independent sensitive data type. The rule for addresses may be: if the address is determined to be a specific home address, it is identified as the independent sensitive data type. Words of some partially sensitive data types can be identified directly against a pre-established database; if a word is matched, it is determined to be of the partially sensitive data type.
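The identification-card-number rule quoted above (18 consecutive digits whose 7th to 10th digits fall in 1900 to 2019) can be expressed as a short preset rule; the sketch below is an assumed illustration, and the example value is hypothetical.

```python
# Sketch of one preset rule from the text: 18 consecutive digits whose 7th-10th
# digits (the birth year) lie in 1900-2019 are treated as an identification card
# number, i.e. a word of the independent sensitive data type.
import re

ID_PATTERN = re.compile(r"\d{18}")

def is_id_number(word: str) -> bool:
    if not ID_PATTERN.fullmatch(word):
        return False
    birth_year = int(word[6:10])            # 7th to 10th digits (1-based)
    return 1900 <= birth_year <= 2019

print(is_id_number("110101199003074518"))   # hypothetical value -> True
```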
It should be noted that some words in the sensitive word bank may not be recognizable according to the preset rules. For these, an interface can be provided for the user, and the unrecognized words are identified manually in an interactive manner, so that the sensitive data type of each word is determined.
In embodiments of the present disclosure, for a word of an independently sensitive data type, desensitization may be performed directly according to the word. For words of the partially sensitive data type, which cannot be desensitized by an independent word, desensitization can be based on the relationship between the word and the context word. Therefore, in order to improve the accuracy of data desensitization, a plurality of different models can be established according to words of different sensitive data types, and data desensitization can be carried out through the plurality of different models. The following steps S330 to S340 are processes of establishing an independent sensitive data recognition model according to the words of the independent sensitive data type, and the following steps S350 to S380 are processes of establishing a partial sensitive data recognition model according to the words of the partial sensitive data type.
Step S330, for each word of the independent sensitive data type, converting the word and the attribute information of the word into a corresponding attribute vector.
In the embodiment of the present disclosure, the attribute information of a word includes word frequency, part of speech, context words, dependency relationships with context words, and the like. The word frequency is the frequency of occurrence of the word; the part of speech refers to the aforementioned noun, adjective, etc.; and the context words are the other words in the same sentence as the word, which may or may not be adjacent to it. A dependency relationship is the syntactic structure of a sentence, i.e., the dependency between the words of the sentence; the dependency relationships between words can be seen in Table 2.
TABLE 2
(Table 2, which lists the dependency relation types between words, appears as images in the original publication and is not reproduced here.)
The dependency relationships between a word and its context words can be determined by a linguistic analysis tool. For example, for a sentence describing that a patient surnamed Wang visited the doctor at 12 weeks of pregnancy and that an ultrasound examination indicated an intrauterine pregnancy, the dependency relationships between each word and its context words obtained by the linguistic analysis tool can be seen in FIG. 4.
Note that non-numerical attribute information, such as parts of speech and dependency relationships, can be quantized into numerical values, and the attribute information can then be converted into attribute vectors based on the quantization results. For example, the subject-predicate relation may be represented as 1, the verb-object relation as 2, and so on. The word itself may be converted into a corresponding word vector through word2vec. It can be seen that the resulting attribute vector is a vector composed of several vectors, and different words correspond to different attribute vectors.
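A minimal sketch of the attribute-vector construction described above is given below. The numeric codes chosen for parts of speech and dependency relations, and the use of gensim's word2vec with a toy corpus, are assumptions for illustration; the disclosure only requires that non-numerical attributes be quantized and combined with the word vector.

```python
# Sketch: convert a word plus its attribute information (word frequency, part of
# speech, context word, dependency relation) into a single attribute vector.
import numpy as np
from gensim.models import Word2Vec

POS_CODE = {"n": 1, "nr": 2, "ns": 3, "nt": 4, "v": 5, "m": 6}   # assumed coding
DEP_CODE = {"SBV": 1, "VOB": 2, "ATT": 3, "ADV": 4, "HED": 5}    # assumed coding

toy_corpus = [["王某某", "孕", "12周", "就诊"], ["超声", "提示", "宫内", "妊娠"]]
w2v = Word2Vec(sentences=toy_corpus, vector_size=50, min_count=1)

def attribute_vector(word, freq, pos, context_word, dep_rel):
    """Concatenate the word vector, the context-word vector and the quantized attributes."""
    extra = np.array([freq, POS_CODE.get(pos, 0), DEP_CODE.get(dep_rel, 0)], dtype=float)
    return np.concatenate([w2v.wv[word], w2v.wv[context_word], extra])

vec = attribute_vector("王某某", 1, "nr", "就诊", "SBV")   # 103-dimensional attribute vector
```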
Step S340, establishing an independent sensitive data recognition model through a logistic regression algorithm according to the attribute vectors corresponding to the words of the independent sensitive data types and preset attribute values.
It should be noted that the attribute values corresponding to the independent sensitive data type and the partial sensitive data type may be preset values, for example, 1 and 0, respectively. Of course, other values are also possible, and are not limited herein. Thus, for each word belonging to the independent sensitive data type in the sensitive word stock, the attribute vector and the attribute value corresponding to the word can be obtained. The attribute values of different words are the same, so that the corresponding relation between each attribute vector and the attribute value can be established, and the independent sensitive data identification model is established through a logistic regression algorithm according to the corresponding relation.
In the embodiment of the present disclosure, the data set A (i.e., the correspondence between each attribute vector and the attribute value) may be randomly divided into k packets through k-fold cross validation, where each packet in turn serves as the test set and the remaining k-1 packets serve as the training set; that is, k-1 packets are used for the regression operation. The regression operation may be logistic regression or another regression method. Logistic regression is a generalized linear regression analysis model commonly used in fields such as data mining, automatic disease diagnosis and economic prediction; for example, it can be used to study the risk factors that cause a disease and to predict the probability of the disease occurring from those risk factors.
Logistic regression has much in common with multiple linear regression analysis: both have essentially the same model form w'x + b, where w and b are the parameters to be solved. They differ in the dependent variable: multiple linear regression uses w'x + b directly as the dependent variable, i.e., y = w'x + b, whereas logistic regression maps w'x + b to a hidden state p through a function L, i.e., p = L(w'x + b), and then determines the value of the dependent variable according to the magnitudes of p and 1 - p. If L is the logistic function, this is logistic regression; if L is a polynomial function, it is polynomial regression.
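The following sketch illustrates the k-fold cross validation and logistic regression training described above, using scikit-learn as an assumed library and randomly generated placeholder attribute vectors; note that a logistic regression needs samples of both attribute values (1 for the independent sensitive type and 0 otherwise) to estimate its parameters.

```python
# Sketch: train the independent sensitive data identification model with logistic
# regression and k-fold cross validation. X/y are random placeholders standing in
# for attribute vectors and preset attribute values.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 103))        # 40 placeholder attribute vectors
y = rng.integers(0, 2, size=40)       # placeholder attribute values (1 = independent sensitive)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # k = 5 packets
for train_idx, test_idx in kf.split(X):
    fold_model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    print("fold accuracy:", fold_model.score(X[test_idx], y[test_idx]))

independent_model = LogisticRegression(max_iter=1000).fit(X, y)   # final model
```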
Step S350, for each word of the partial sensitive data type, determining target sensitive data corresponding to the word in the original data set.
In the embodiment of the disclosure, because the words of the partially sensitive data type need to be combined with other words to form a combined word or a short sentence to be sensitive data, for each word of the partially sensitive data type, target sensitive data corresponding to the word in the original data set can be determined, and the partially sensitive data identification model is determined according to the dependency relationship between the word and other words in the target sensitive data. The target sensitive data corresponding to the word refers to a sentence or paragraph where the word is located, and the like, and the dependency relationship between the word and other words can be analyzed according to the target sensitive data, and the partial sensitive data identification model is determined according to the dependency relationship.
And step S360, obtaining similar words with the similarity larger than the similarity threshold, and replacing the words in the target sensitive data with the similar words to obtain updated data.
Generally, data documents in various fields (such as the medical field) are written in a rigorous style, and although their storage formats vary, most document types can refer to related standards. Therefore, the format and content order of most document data tend to converge. That is, once the writing specification is determined, even though the data is written as free text, the data dependency structures describing the same class of content are similar. If similar or related terms appear in the same dependency structure, the information described by the phrase is likely to be similar.
Based on this principle, the words in the target sensitive data can be replaced with similar words to obtain updated data. At present, the relatedness of words can be analyzed with several related tools; for example, word2vec can vectorize words so that the relationships between words can be measured quantitatively and mined. By sorting all the data marked as sensitive data, analyzing each dependency structure S and the words W that form it, keeping the dependency structure S unchanged, and continuously substituting words W' closely related to W into the sensitive data, the updated data can be obtained, and the probability that the updated data is sensitive data is generally high.
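A minimal sketch of this similar-word replacement step follows, assuming a word2vec model (trained here on a tiny toy corpus) and an assumed similarity threshold of 0.7; the disclosure does not fix a particular tool or threshold.

```python
# Sketch: keep the dependency structure fixed and replace a word W in the target
# sensitive data with closely related words W' to obtain updated data.
from gensim.models import Word2Vec

toy_corpus = [["李某", "服役", "期间", "患", "某病"],
              ["王某", "就诊", "期间", "患", "感冒"]]
w2v = Word2Vec(sentences=toy_corpus, vector_size=50, min_count=1)   # stand-in for a real model

SIMILARITY_THRESHOLD = 0.7   # assumed value

def generate_updated_data(target_sentence, word):
    """Return copies of the target sensitive data with `word` swapped for similar words."""
    updated = []
    for similar_word, score in w2v.wv.most_similar(word, topn=20):
        if score > SIMILARITY_THRESHOLD:
            updated.append(target_sentence.replace(word, similar_word))
    return updated
```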
Step S370, when the updated data cannot be retrieved from the original data set, acquiring a sensitive data identification result input by the user for the updated data.
Specifically, the updated data may be retrieved from the original data set, and if the updated data is retrieved, whether the updated data is sensitive data may be determined directly according to the sensitive data marking result. If the updated data is not retrieved, an interface can be provided for a user, and the identification is carried out in a manual mode, namely, the marking is carried out in a manual mode, so that the identification result of the sensitive data is obtained.
And step S380, adding the updated data and the sensitive data identification result into the original data set, and taking the updated original data set as a partial sensitive data identification model.
In the embodiment of the present disclosure, the sensitive data identification result indicates whether the updated data is sensitive data. The updated data and its corresponding sensitive data identification result can be added to the original data set, so that the data in the original data set is expanded and the partially sensitive data identification model is obtained. It can be seen that the partially sensitive data identification model is a data set marked with sensitive data; the more data the model contains, the more accurately and quickly sensitive data can be identified.
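Since the partially sensitive data identification model is simply the expanded, labeled data set, one minimal representation (an assumption; the disclosure does not prescribe a storage structure) is a mapping from text fragments to their sensitive-data marks, as sketched below.

```python
# Sketch: the partially sensitive data identification model as an expanding,
# labeled data set with a simple lookup. A dict is an assumed representation.
partially_sensitive_model = {}   # text fragment -> True/False (is sensitive data)

def add_identification_result(fragment, is_sensitive):
    """Add updated data and its sensitive data identification result to the set."""
    partially_sensitive_model[fragment] = is_sensitive

def lookup(fragment):
    """True/False if the fragment can be retrieved; None means it is unknown and
    must be handed to a user for manual identification."""
    return partially_sensitive_model.get(fragment)

add_identification_result("国家安全单位 服役 期间 患病", True)    # hypothetical entry
print(lookup("国家安全单位 服役 期间 患病"))
```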
Thus, an independent sensitive data identification model and a partially sensitive data identification model are established, and a specific method for carrying out data desensitization according to the independent sensitive data identification model and the partially sensitive data identification model is as follows.
In step S210, a text to be desensitized is obtained, words of the text to be desensitized are segmented to obtain a plurality of words and parts of speech of the words, and the words are filtered according to the parts of speech of the words to obtain data to be desensitized.
In the embodiment of the disclosure, the text to be desensitized is a text with desensitization requirements, and the text to be desensitized usually contains sensitive data. The text to be desensitized may be a word document, a txt document, etc., and since the processing procedure of this step is the same as step S310, it is not described in detail here. By carrying out word segmentation and filtering on the text to be desensitized, the data to be desensitized is obtained, the complexity of calculation can be reduced, and the desensitization efficiency is improved.
In step S220, desensitizing the words in the data to be desensitized according to the preset independent sensitive data identification model and the partially sensitive data identification model.
Specifically, the data recognition model includes: the independent sensitive data recognition model and the partially sensitive data recognition model can be desensitized in a parallel manner, that is, desensitized by the independent sensitive data recognition model and the partially sensitive data recognition model, respectively. In an exemplary embodiment of the present disclosure, each word may also be first identified by an independent sensitive data identification model, and if the word is of an independent sensitive data type, the word may be desensitized. And then, identifying each remaining word through a partially sensitive data identification model. Therefore, repeated calculation can be avoided, and the recognition efficiency is improved.
The method for desensitizing the words in the data to be desensitized through the independent sensitive data identification model can be seen in fig. 5, and includes the following steps:
step S510, for each word in the data to be desensitized, the word and the attribute information of the word are converted into corresponding attribute vectors.
Since the processing procedure of this step is the same as that of step S330, detailed description thereof will be omitted.
Step S520, processing the attribute vector through the independent sensitive data identification model to obtain an attribute value corresponding to the attribute vector.
Specifically, after the independent sensitive data identification model is established, the parameter values in the independent sensitive data identification model are obtained. Then, the attribute vector is processed through the independent sensitive data identification model, that is, the attribute vector is substituted into the independent sensitive data identification model, and the attribute value corresponding to the attribute vector can be obtained.
Step S530, whether the word is independent sensitive data or not is determined according to the attribute value, and when the word is determined to be independent sensitive data, the word is desensitized.
In the embodiment of the present disclosure, if the obtained attribute value is close to the attribute value corresponding to the independent sensitive data type, the term may be considered to belong to the independent sensitive data type, and desensitize the term. For example, when the attribute values of the independent sensitive data type and the partially sensitive data type are 1 and 0, respectively, if the attribute value is close to 1, for example, 0.95, then the word may be considered as the independent sensitive data type.
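A sketch of steps S510 to S530 follows, assuming the logistic regression model from the earlier sketch exposes scikit-learn's predict_proba interface; the 0.9 cut-off and the mask string are illustrative assumptions in the spirit of the 0.95 example above.

```python
# Sketch of steps S510-S530: run a word's attribute vector through the trained
# model and desensitize when the predicted value is close to the attribute value
# of the independent sensitive data type (1).
import numpy as np

def desensitize_if_independent(word, attr_vec, model, cutoff=0.9, mask="***"):
    predicted = model.predict_proba(np.asarray(attr_vec).reshape(1, -1))[0, 1]
    if predicted >= cutoff:        # attribute value close to 1
        return mask                # word judged to be independent sensitive data
    return word                    # unchanged; may still be checked by the partial model
```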
Then, the judgment can be continued through the partially sensitive data identification model, and a method for desensitizing the words in the data to be desensitized through the partially sensitive data identification model can be seen in fig. 6, and includes the following steps:
step S610, for each word in the data to be desensitized, the word is retrieved in the partially sensitive data identification model.
Specifically, a combined word or phrase composed of the word and a context word of the word may be retrieved in the partial sensitive data recognition model, and if retrieved, it may be directly determined whether the word is sensitive data. If not, the user can be identified manually by providing an interface to the user.
In step S620, when the word is retrieved, it is determined whether the word is sensitive data.
When the word is retrieved, the marking result corresponding to the word is contained in the partial sensitive data identification model, and whether the word is sensitive data can be determined directly according to the marking result.
In step S630, when the word is sensitive data, desensitizing the word.
After the sensitive data is identified, it may be desensitized. Desensitization methods include a digital digest, a character mask (e.g., the digits of a telephone number may be replaced with the fixed character 1), partial desensitization (only the critical part is desensitized, e.g., an address is desensitized to "Beijing xxxxxx"), and the like. A digital digest is a fixed-length (128-bit) ciphertext produced by applying a one-way hash function to the plaintext to be encrypted; digests of different plaintexts are always different, while digests of the same plaintext are necessarily identical.
In addition, when the word is not sensitive data, no processing is needed, that is, the processing flow of the word is ended.
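The three desensitization methods listed above can be sketched as follows; MD5 is used here as one common 128-bit one-way hash, and the masking and truncation choices are assumptions based on the examples in the text.

```python
# Sketch of the listed desensitization methods: digital digest (one-way hash),
# character mask, and partial desensitization.
import hashlib
import re

def digital_digest(value):
    """Fixed-length 128-bit digest of the plaintext (MD5 as an example hash)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

def character_mask(phone):
    """Replace every digit of a telephone number with the fixed character 1."""
    return re.sub(r"\d", "1", phone)

def partial_desensitize(address, keep=3):
    """Keep only the coarse part of an address, e.g. '北京市xxxxxx'."""
    return address[:keep] + "xxxxxx"

print(digital_digest("110101199003074518"))
print(character_mask("13800000000"))        # -> "11111111111"
print(partial_desensitize("北京市朝阳区某某街道1号"))
```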
In an exemplary embodiment of the present disclosure, a method of data desensitization may also be seen in fig. 7, comprising the steps of:
step S710, determining the sensitive data type of each word in the data to be desensitized.
In the embodiment of the disclosure, the sensitive data type of each word in the data to be desensitized can be determined by the attribute of the word. For example, it is predefined which data is required to be fully desensitized, and this part of data is independently sensitive data. It is predefined which data needs to be combined with other data to be desensitized, and the part of data is part of sensitive data. And then, selecting a corresponding model for desensitization according to the sensitive data types of the words.
And S720, desensitizing the words belonging to the independent sensitive data type through a preset independent sensitive data recognition model.
And step S730, desensitizing the words belonging to the type of the partially sensitive data through a preset partially sensitive data recognition model.
It should be noted that the specific desensitization methods of step S720 and step S730 have been described in detail in the foregoing, and are not described herein again.
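A sketch of the routing logic of steps S710 to S730 is given below; the two callbacks stand in for the independent and partially sensitive data identification models built in the earlier sketches and are assumptions for illustration.

```python
# Sketch of steps S710-S730: route each word to the model matching its sensitive
# data type.
def desensitize_data(words, word_types, independent_predict, partial_lookup, mask="***"):
    """word_types[word] is 'independent', 'partial' or None.
    independent_predict(word) -> True if the independent model judges it sensitive.
    partial_lookup(word)      -> True/False/None from the partially sensitive data set."""
    out = []
    for word in words:
        kind = word_types.get(word)
        if kind == "independent":
            out.append(mask if independent_predict(word) else word)
        elif kind == "partial":
            out.append(mask if partial_lookup(word) else word)
        else:
            out.append(word)
    return out
```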
In an exemplary embodiment of the present disclosure, a new data set may also be obtained, and the independent sensitive data recognition model is updated to improve the accuracy of the independent sensitive data recognition model. See fig. 8, comprising the steps of:
step S810, a training data set marked by sensitive data is obtained, and a first target training word with the sensitive data type as an independent sensitive data type is selected from the sensitive data of the training data set.
In the embodiment of the present disclosure, the training data set is similar to the original data set: it is a data set marked with sensitive data. Similarly, the sensitive data in the training data set may be segmented into words, and a sensitive word bank corresponding to the training data set is obtained through filtering. Because the independent sensitive data identification model is a model for the independent sensitive data type, after the words in this sensitive word bank are classified, the first target training words whose sensitive data type is the independent sensitive data type can be selected; that is, the independent sensitive data identification model is updated through the first target training words.
And S820, identifying the first target training words through the independent sensitive data identification model to obtain a predicted value.
In the embodiment of the disclosure, identifying the first target training words through the independent sensitive data identification model may specifically be: converting the first target training words into corresponding attribute vectors according to the method described above, and substituting the attribute vectors into the independent sensitive data identification model to obtain predicted values. In general, there is a deviation between a predicted value and the actual value: the smaller the deviation, the higher the accuracy of the independent sensitive data identification model; the greater the deviation, the lower its accuracy. Therefore, the accuracy of the model can also be verified while the first target training words are being identified.
Step S830, selecting a predicted value of which the difference value of the attribute values corresponding to the independent sensitive data types is smaller than a difference threshold value; and updating the independent sensitive data recognition model according to the second target training words corresponding to the selected predicted values.
It is to be understood that the first target training words are labeled words, so the attribute value corresponding to a first target training word is the attribute value corresponding to the independent sensitive data type. The predicted value is compared with this actual attribute value; if the difference is smaller than the difference threshold, the independent sensitive data identification model recognizes the first target training word with high accuracy, i.e., the word can be recognized through the model. Accordingly, the model can be updated with such first target training words to refine the independent sensitive data identification model and improve its accuracy. Otherwise, the first target training word is not suitable for recognition through the independent sensitive data identification model. The difference between the predicted value and the attribute value refers to the absolute value of the difference, and the difference threshold may be an empirically set value, for example, 0.1 or 0.05, which is not limited here.
In the embodiment of the disclosure, a word in which a difference value between a predicted value corresponding to the first target training word and an attribute value corresponding to the independent sensitive data type is smaller than a difference threshold value may be used as a second target training word, and the independent sensitive data recognition model is updated through the second target training word. The specific method can be as follows: and converting the second target training words into corresponding attribute vectors according to the method, and adjusting parameter values in the independent sensitive data recognition model according to the attribute vectors and the attribute values so as to further optimize the independent sensitive data recognition model and improve the accuracy of the independent sensitive data recognition model. Of course, the more the second target training words, the higher the accuracy of the updated independent sensitive data recognition model. It should be noted that, later, the independent sensitive data recognition model may be continuously updated by continuously acquiring a new training data set. Along with the accumulation of the data set, the accuracy of the independent sensitive data identification model can be continuously improved.
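A sketch of the update flow of FIG. 8 follows, assuming the independent sensitive data identification model is a fitted scikit-learn logistic regression; the 0.1 difference threshold is the example value mentioned above.

```python
# Sketch of the update flow of FIG. 8: keep the first target training words whose
# predicted value differs from the attribute value of the independent sensitive
# type (1) by less than the difference threshold, then refit on the enlarged set.
import numpy as np

DIFF_THRESHOLD = 0.1   # example value from the text

def update_model(model, X_old, y_old, X_first_target):
    preds = model.predict_proba(X_first_target)[:, 1]
    keep = np.abs(preds - 1.0) < DIFF_THRESHOLD           # second target training words
    X_new = np.vstack([X_old, X_first_target[keep]])
    y_new = np.concatenate([y_old, np.ones(int(keep.sum()))])
    model.fit(X_new, y_new)                               # re-estimate w and b
    return model
```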
In the data desensitization method of the embodiment of the disclosure, the independent sensitive data identification model and the partially sensitive data identification model are established separately from the original data set marked with sensitive data, and the data to be desensitized can be desensitized automatically through these two models, which reduces labor cost and improves the efficiency and accuracy of desensitization. In addition, both models can be updated through data accumulation, which further improves the accuracy of their data desensitization. Moreover, because the independent sensitive data identification model vectorizes words, there is no need to worry that the model itself leaks private data.
It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Further, in this exemplary embodiment, there is also provided a data desensitization apparatus 900, referring to fig. 9, fig. 9 shows a schematic structural diagram of the data desensitization apparatus in the embodiment of the present disclosure, including:
the preprocessing module 910 is configured to obtain a text to be desensitized, perform word segmentation on the text to be desensitized to obtain a plurality of words and parts of speech of the words, and perform filtering processing on the words according to the parts of speech of the words to obtain data to be desensitized;
and a desensitization module 920, configured to desensitize the words in the data to be desensitized according to a preset independent sensitive data identification model and a preset partially sensitive data identification model.
Optionally, a desensitization module, comprising: a first desensitization sub-module, the first desensitization sub-module comprising:
the vector conversion unit is used for converting each word in the data to be desensitized and the attribute information of the word into a corresponding attribute vector;
the attribute value determining unit is used for processing the attribute vector through the independent sensitive data identification model to obtain an attribute value corresponding to the attribute vector;
and the independent sensitive data desensitization unit is used for determining whether the word is independent sensitive data according to the attribute value and desensitizing the word when the word is determined to be independent sensitive data.
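A minimal sketch of the first desensitization sub-module is given below. The attribute features chosen for the attribute vector and the 0.5 decision threshold are assumptions made for illustration; the model may be any classifier exposing predict_proba, such as the logistic-regression model sketched later.

```python
# Sketch only: vector conversion unit, attribute value determining unit and
# independent sensitive data desensitization unit in one pass. The features
# below are hypothetical; the disclosure does not fix a particular feature set.
import numpy as np

def to_attribute_vector(word, pos_flag):
    """Hypothetical attribute vector: length, digit ratio, letter ratio,
    and an indicator for a numeral part of speech."""
    digits = sum(ch.isdigit() for ch in word)
    letters = sum(ch.isalpha() for ch in word)
    n = max(len(word), 1)
    return np.array([len(word), digits / n, letters / n,
                     1.0 if pos_flag == "m" else 0.0])

def desensitize_independent(data_to_desensitize, model, mask="***", threshold=0.5):
    """Replace words judged to be independent sensitive data with a mask."""
    output = []
    for word, flag in data_to_desensitize:
        vec = to_attribute_vector(word, flag).reshape(1, -1)
        attribute_value = model.predict_proba(vec)[0, 1]   # attribute value
        output.append(mask if attribute_value >= threshold else word)
    return output
```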
Optionally, the data desensitization apparatus according to the embodiment of the present disclosure further includes:
the sensitive word stock building module is used for obtaining an original data set marked by sensitive data and building a sensitive word stock according to the sensitive data marked in the original data set;
the sensitive data type determining module is used for classifying the words in the sensitive word stock according to a preset rule and determining the sensitive data type of each word in the sensitive word stock;
the attribute vector determining module is used for converting the word and the attribute information of the word into a corresponding attribute vector aiming at each word of the independent sensitive data type;
and the independent sensitive data identification model establishing module is used for establishing an independent sensitive data identification model through a logistic regression algorithm according to the attribute vector corresponding to the words of the independent sensitive data type and the preset attribute value.
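For illustration, the model establishing path above can be sketched as follows, with scikit-learn's SGDClassifier using logistic loss standing in for the logistic regression algorithm; the annotated word list and the to_attribute_vector helper (sketched earlier) are assumed inputs rather than elements defined by the disclosure.

```python
# Sketch only: build the independent sensitive data recognition model by
# logistic regression from words of the independent sensitive data type and
# their preset attribute values (1 = sensitive, 0 = not sensitive).
import numpy as np
from sklearn.linear_model import SGDClassifier

def build_independent_model(labelled_words, to_attribute_vector):
    """labelled_words: iterable of (word, pos_flag, attribute_value) triples."""
    X = np.vstack([to_attribute_vector(w, f) for w, f, _ in labelled_words])
    y = np.array([value for _, _, value in labelled_words], dtype=int)
    model = SGDClassifier(loss="log_loss", max_iter=1000)
    model.fit(X, y)                                     # logistic regression fit
    return model
```

Using an SGD-based logistic regression here keeps the later incremental update (sketched earlier with partial_fit) straightforward; this is a design choice of the sketch, not of the disclosure.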
Optionally, the data desensitization apparatus according to the embodiment of the present disclosure further includes:
the target sensitive data determining module is used for determining target sensitive data corresponding to each word in the partial sensitive data type in the original data set;
the similar word replacing module is used for acquiring similar words whose similarity to the word is larger than a similarity threshold, and replacing the word in the target sensitive data with the similar words to obtain updated data;
the identification result acquisition module is used for acquiring a sensitive data identification result input by a user aiming at the updated data when the updated data cannot be retrieved in the original data set;
and the partially sensitive data identification model establishing module is used for adding the updated data and the sensitive data identification result into the original data set, and taking the updated original data set as the partially sensitive data identification model.
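The construction of the partially sensitive data identification model can likewise be sketched. In the fragment below, get_similar_words (for example, nearest neighbours from a word-vector model) and ask_user are assumed helpers, and the original data set is simplified to a dictionary mapping each annotated text to its sensitive data identification result; these simplifications are made only for illustration.

```python
# Sketch only: extend the annotated original data set with similar-word
# substitutions and user-confirmed identification results; the updated data
# set then serves as the partially sensitive data identification model.
def build_partial_model(original_data_set, partial_type_words,
                        get_similar_words, ask_user, similarity_threshold=0.8):
    """original_data_set: dict mapping annotated text -> identification result."""
    for word in partial_type_words:
        # Target sensitive data: annotated entries containing the word.
        targets = [text for text in original_data_set if word in text]
        for text in targets:
            for similar_word, similarity in get_similar_words(word):
                if similarity <= similarity_threshold:
                    continue
                updated = text.replace(word, similar_word)
                if updated in original_data_set:        # already retrievable
                    continue
                # Not retrievable: fall back to a user-supplied identification result.
                original_data_set[updated] = ask_user(updated)
    return original_data_set                            # the partially sensitive model
```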
Optionally, the desensitization module includes a second desensitization sub-module, and the second desensitization sub-module includes:
the retrieval unit is used for retrieving each word in the data to be desensitized in the partially sensitive data identification model;
the sensitive data judging unit is used for judging whether the word is sensitive data when the word is found by the retrieval;
and the sensitive data desensitization unit is used for desensitizing the word when the word is sensitive data.
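A minimal sketch of the second desensitization sub-module follows, again treating the partially sensitive data identification model as the dictionary of annotated entries built above; the substring-based retrieval is an assumption made for brevity.

```python
# Sketch only: retrieval unit, sensitive data judging unit and sensitive data
# desensitization unit in one pass over the data to be desensitized.
def desensitize_partial(data_to_desensitize, partial_model, mask="***"):
    output = []
    for word, _flag in data_to_desensitize:
        # Retrieval: entries of the model that contain the word.
        hits = [result for text, result in partial_model.items() if word in text]
        if hits and any(hits):                          # retrieved and judged sensitive
            output.append(mask)
        else:
            output.append(word)
    return output
```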
Optionally, the desensitization module is specifically configured to determine the sensitive data type of each word in the data to be desensitized, desensitize the words belonging to the independent sensitive data type through the preset independent sensitive data identification model, and desensitize the words belonging to the partially sensitive data type through the preset partially sensitive data recognition model.
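Purely for illustration, this type-based routing can be written as a thin dispatcher over the two sketches above; classify_type is an assumed helper, for instance a lookup of the word's sensitive data type in the sensitive word stock.

```python
# Sketch only: route each word to the independent or partially sensitive path
# according to its sensitive data type, reusing desensitize_independent and
# desensitize_partial from the earlier sketches.
def desensitize(data_to_desensitize, independent_model, partial_model,
                classify_type, mask="***"):
    output = []
    for word, flag in data_to_desensitize:
        data_type = classify_type(word)                 # "independent", "partial" or None
        if data_type == "independent":
            output += desensitize_independent([(word, flag)], independent_model, mask)
        elif data_type == "partial":
            output += desensitize_partial([(word, flag)], partial_model, mask)
        else:
            output.append(word)
    return output
```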
Optionally, the data desensitization apparatus according to the embodiment of the present disclosure further includes:
the model updating module is used for acquiring a training data set annotated with sensitive data, and selecting, from the sensitive data of the training data set, first target training words whose sensitive data type is the independent sensitive data type; identifying the first target training words through the independent sensitive data identification model to obtain predicted values; selecting the predicted values whose difference from the attribute value corresponding to the independent sensitive data type is smaller than the difference threshold; and updating the independent sensitive data recognition model according to the second target training words corresponding to the selected predicted values.
The details of each module/unit in the above-mentioned apparatus have been described in detail in the embodiments of the method section, and thus are not described again.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by a plurality of modules or units.
In an exemplary embodiment of the present disclosure, there is also provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the method of any of the example embodiments.
Fig. 10 shows a schematic structural diagram of a computer system of an electronic device for implementing an embodiment of the present disclosure. It should be noted that the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the computer system 1000 includes a central processing unit 1001 which can perform various appropriate actions and processes in accordance with a program stored in a read only memory 1002 or a program loaded from a storage section 1008 into a random access memory 1003. In the random access memory 1003, various programs and data necessary for system operation are also stored. The central processing unit 1001, the read only memory 1002, and the random access memory 1003 are connected to each other by a bus 1004. An input/output interface 1005 is also connected to the bus 1004.
The following components are connected to the input/output interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a Local Area Network (LAN) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. A drive 1010 is also connected to the input/output interface 1005 as necessary. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as necessary, so that a computer program read therefrom is installed into the storage section 1008 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009 and/or installed from the removable medium 1011. When the computer program is executed by the central processing unit 1001, various functions defined in the apparatus of the present application are executed.
In an exemplary embodiment of the disclosure, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
It should be noted that the computer readable storage medium shown in the present disclosure can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, radio frequency, etc., or any suitable combination of the foregoing.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (10)

1. A method of data desensitization, the method comprising:
obtaining a text to be desensitized, performing word segmentation on the text to be desensitized to obtain a plurality of words and parts of speech of the words, and performing filtering processing on the words according to the parts of speech of the words to obtain data to be desensitized;
desensitizing the words in the data to be desensitized according to a preset independent sensitive data identification model and a partially sensitive data identification model.
2. The method according to claim 1, wherein desensitizing words in the data to be desensitized through a preset independent sensitive data recognition model comprises:
for each word in the data to be desensitized, converting the word and the attribute information of the word into a corresponding attribute vector;
processing the attribute vector through an independent sensitive data identification model to obtain an attribute value corresponding to the attribute vector;
determining whether the word is independently sensitive data based on the attribute values, and desensitizing the word when determined to be independently sensitive data.
3. The method of claim 2, wherein the independent sensitive data recognition model is trained by:
acquiring an original data set marked by sensitive data, and constructing a sensitive word stock according to the sensitive data marked in the original data set;
classifying the words in the sensitive word stock according to a preset rule, and determining the sensitive data type of each word in the sensitive word stock;
for each word of the independent sensitive data type, converting the word and the attribute information of the word into a corresponding attribute vector;
and establishing the independent sensitive data identification model through a logistic regression algorithm according to the attribute vector corresponding to the words of the independent sensitive data type and the preset attribute value.
4. The method of claim 3, wherein the partially sensitive data recognition model is trained by:
for each word of a partial sensitive data type, determining target sensitive data corresponding to the word in the original data set;
acquiring similar words whose similarity to the word is larger than a similarity threshold, and replacing the word in the target sensitive data with the similar words to obtain updated data;
when the updated data cannot be retrieved from the original data set, acquiring a sensitive data identification result input by a user aiming at the updated data;
and adding the updated data and the sensitive data identification result into the original data set, and taking the updated original data set as the partial sensitive data identification model.
5. The method according to claim 4, wherein desensitizing words in the data to be desensitized by a preset partially sensitive data recognition model comprises:
for each word in the data to be desensitized, retrieving the word in the partially sensitive data identification model;
when the word is found by the retrieval, judging whether the word is sensitive data;
when the word is sensitive data, the word is desensitized.
6. The method according to claim 1, wherein desensitizing the words in the data to be desensitized according to a preset independent sensitive data identification model and a partially sensitive data identification model comprises:
determining the sensitive data type of each word in the data to be desensitized;
desensitizing the words belonging to the independent sensitive data type through a preset independent sensitive data identification model;
and desensitizing the words belonging to the type of the partially sensitive data through a preset partially sensitive data recognition model.
7. The method of claim 3, wherein after establishing the independent sensitive data recognition model, the method further comprises:
acquiring a training data set marked by sensitive data, and selecting a first target training word with the sensitive data type as an independent sensitive data type from the sensitive data of the training data set;
identifying the first target training words through the independent sensitive data identification model to obtain a predicted value;
selecting the predicted values whose difference from the attribute value corresponding to the independent sensitive data type is smaller than a difference threshold;
and updating the independent sensitive data recognition model according to the second target training words corresponding to the selected predicted values.
8. A data desensitization apparatus, characterized in that the apparatus comprises:
the preprocessing module is used for acquiring a text to be desensitized, segmenting the text to be desensitized to obtain a plurality of words and parts of speech of the words, and filtering the words according to the parts of speech of the words to obtain data to be desensitized;
and the desensitization module is used for desensitizing the words in the data to be desensitized according to a preset independent sensitive data identification model and a partially sensitive data identification model.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-7 via execution of the executable instructions.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN201911421361.2A 2019-12-31 2019-12-31 Data desensitization method and device, electronic equipment and storage medium Active CN111143884B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911421361.2A CN111143884B (en) 2019-12-31 2019-12-31 Data desensitization method and device, electronic equipment and storage medium
CN202210693309.8A CN115062338A (en) 2019-12-31 2019-12-31 Data desensitization method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911421361.2A CN111143884B (en) 2019-12-31 2019-12-31 Data desensitization method and device, electronic equipment and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202210693309.8A Division CN115062338A (en) 2019-12-31 2019-12-31 Data desensitization method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111143884A true CN111143884A (en) 2020-05-12
CN111143884B CN111143884B (en) 2022-07-12

Family

ID=70522946

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201911421361.2A Active CN111143884B (en) 2019-12-31 2019-12-31 Data desensitization method and device, electronic equipment and storage medium
CN202210693309.8A Pending CN115062338A (en) 2019-12-31 2019-12-31 Data desensitization method and device, electronic equipment and storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202210693309.8A Pending CN115062338A (en) 2019-12-31 2019-12-31 Data desensitization method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (2) CN111143884B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016127A (en) * 2020-09-30 2020-12-01 深圳潮数软件科技有限公司 Method and device for identifying and separating sensitive data of backup system
CN112104655A (en) * 2020-09-16 2020-12-18 安徽长泰信息安全服务有限公司 Protection system and method for preventing data leakage
CN112434331A (en) * 2020-11-20 2021-03-02 百度在线网络技术(北京)有限公司 Data desensitization method, device, equipment and storage medium
CN113254995A (en) * 2021-05-31 2021-08-13 中国光大银行股份有限公司 Data desensitization method, device, system and computer readable medium
CN113569293A (en) * 2021-08-12 2021-10-29 明品云(北京)数据科技有限公司 Similar user acquisition method, system, electronic device and medium
CN114417883A (en) * 2022-01-10 2022-04-29 马上消费金融股份有限公司 Data processing method, device and equipment
CN114666113A (en) * 2022-03-14 2022-06-24 北京计算机技术及应用研究所 Dynamic response data desensitization method based on API gateway
CN115455179A (en) * 2022-08-22 2022-12-09 深圳行星网络科技有限公司 Sensitive vocabulary detection method, device, equipment and storage medium
WO2023065632A1 (en) * 2021-10-21 2023-04-27 平安科技(深圳)有限公司 Data desensitization method, data desensitization apparatus, device, and storage medium
CN117094033A (en) * 2023-10-19 2023-11-21 南京怡晟安全技术研究院有限公司 Security destruction evaluation system and method based on key data sensitivity
WO2024042350A1 (en) * 2022-08-24 2024-02-29 Evyd科技有限公司 Medical text data masking method and apparatus, and medium and electronic device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272996B (en) * 2023-11-23 2024-02-27 山东网安安全技术有限公司 Data desensitization system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013137740A (en) * 2011-11-28 2013-07-11 Internatl Business Mach Corp <Ibm> Secret information identification method, information processor, and program
CN107515877A (en) * 2016-06-16 2017-12-26 百度在线网络技术(北京)有限公司 The generation method and device of sensitive theme word set
CN107480549A (en) * 2017-06-28 2017-12-15 银江股份有限公司 A kind of shared sensitive information desensitization method of data-oriented and system
CN109657243A (en) * 2018-12-17 2019-04-19 江苏满运软件科技有限公司 Sensitive information recognition methods, system, equipment and storage medium
CN109977416A (en) * 2019-04-03 2019-07-05 中山大学 A kind of multi-level natural language anti-spam text method and system
CN110580416A (en) * 2019-09-11 2019-12-17 国网浙江省电力有限公司信息通信分公司 sensitive data automatic identification method based on artificial intelligence

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112104655A (en) * 2020-09-16 2020-12-18 安徽长泰信息安全服务有限公司 Protection system and method for preventing data leakage
CN112016127A (en) * 2020-09-30 2020-12-01 深圳潮数软件科技有限公司 Method and device for identifying and separating sensitive data of backup system
CN112434331A (en) * 2020-11-20 2021-03-02 百度在线网络技术(北京)有限公司 Data desensitization method, device, equipment and storage medium
CN112434331B (en) * 2020-11-20 2023-08-18 百度在线网络技术(北京)有限公司 Data desensitization method, device, equipment and storage medium
CN113254995A (en) * 2021-05-31 2021-08-13 中国光大银行股份有限公司 Data desensitization method, device, system and computer readable medium
CN113254995B (en) * 2021-05-31 2023-06-23 中国光大银行股份有限公司 Data desensitization method, device, system and computer readable medium
CN113569293A (en) * 2021-08-12 2021-10-29 明品云(北京)数据科技有限公司 Similar user acquisition method, system, electronic device and medium
WO2023065632A1 (en) * 2021-10-21 2023-04-27 平安科技(深圳)有限公司 Data desensitization method, data desensitization apparatus, device, and storage medium
CN114417883B (en) * 2022-01-10 2022-10-25 马上消费金融股份有限公司 Data processing method, device and equipment
CN114417883A (en) * 2022-01-10 2022-04-29 马上消费金融股份有限公司 Data processing method, device and equipment
CN114666113A (en) * 2022-03-14 2022-06-24 北京计算机技术及应用研究所 Dynamic response data desensitization method based on API gateway
CN115455179A (en) * 2022-08-22 2022-12-09 深圳行星网络科技有限公司 Sensitive vocabulary detection method, device, equipment and storage medium
WO2024042350A1 (en) * 2022-08-24 2024-02-29 Evyd科技有限公司 Medical text data masking method and apparatus, and medium and electronic device
CN117094033A (en) * 2023-10-19 2023-11-21 南京怡晟安全技术研究院有限公司 Security destruction evaluation system and method based on key data sensitivity
CN117094033B (en) * 2023-10-19 2024-01-09 南京怡晟安全技术研究院有限公司 Security destruction evaluation system and method based on key data sensitivity

Also Published As

Publication number Publication date
CN111143884B (en) 2022-07-12
CN115062338A (en) 2022-09-16

Similar Documents

Publication Publication Date Title
CN111143884B (en) Data desensitization method and device, electronic equipment and storage medium
US11397762B2 (en) Automatically generating natural language responses to users&#39; questions
US11182562B2 (en) Deep embedding for natural language content based on semantic dependencies
US20240078386A1 (en) Methods and systems for language-agnostic machine learning in natural language processing using feature extraction
US20200334381A1 (en) Systems and methods for natural pseudonymization of text
US10303683B2 (en) Translation of natural language questions and requests to a structured query format
US20210191925A1 (en) Methods and apparatus for using machine learning to securely and efficiently retrieve and present search results
US20160188568A1 (en) System and method for determining the meaning of a document with respect to a concept
US20150120738A1 (en) System and method for document classification based on semantic analysis of the document
CN112215008B (en) Entity identification method, device, computer equipment and medium based on semantic understanding
US10650190B2 (en) System and method for rule creation from natural language text
US20220179892A1 (en) Methods, systems and computer program products for implementing neural network based optimization of database search functionality
US9632998B2 (en) Claim polarity identification
WO2020167557A1 (en) Natural language querying of a data lake using contextualized knowledge bases
CN113282762B (en) Knowledge graph construction method, knowledge graph construction device, electronic equipment and storage medium
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
JP2020190970A (en) Document processing device, method therefor, and program
US20220238103A1 (en) Domain-aware vector encoding (dave) system for a natural language understanding (nlu) framework
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
CN114365144A (en) Selective deep parsing of natural language content
US11017172B2 (en) Proposition identification in natural language and usage thereof for search and retrieval
DeVille et al. Text as Data: Computational Methods of Understanding Written Expression Using SAS
Pandit et al. Ontology-guided extraction of complex nested relationships
US11973792B1 (en) Generating vulnerability check information for performing vulnerability assessments
KR102540564B1 (en) Method for data augmentation for natural language processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant