CN112132238A

CN112132238A - Method, device, equipment and readable medium for identifying private data

Info

Publication number: CN112132238A
Application number: CN202011322577.6A
Authority: CN
Inventors: 王德胜; 刘佳伟; 章鹏
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-11-23
Filing date: 2020-11-23
Publication date: 2020-12-25

Abstract

The embodiments of this specification disclose a method, an apparatus, a device, and a readable medium for identifying private data. The method comprises: acquiring metadata of data to be identified; inputting the metadata into a first multi-classification model to identify the data type of the data to be identified and obtain a first identification result, the first multi-classification model being trained on metadata corresponding to privacy type data; if the first identification result indicates that the data to be identified is private data, determining the privacy type of the data to be identified according to the first identification result; if the first identification result indicates that the data to be identified is not private data, inputting the metadata and the data to be identified into a second multi-classification model to obtain a second identification result, and determining the privacy type of the data to be identified according to the second identification result.

Description

Method, device, equipment and readable medium for identifying private data

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a readable medium for identifying private data.

Background

In the prior art, private data is generally identified either with built-in rules corresponding to each data type or with a machine-learning-based multi-classification model. The built-in rules may be specific regular expressions, or identification rules constructed from the structural features of the sensitive data itself. However, private data comes in many types, and the prepared built-in rules cannot cover all of them. When no preset built-in rule exists for the data a user wants to identify, the target data cannot be identified. Most existing machine-learning-based multi-classification models are trained on, and perform identification from, the specific content of the data. A model built only on data content has a single feature dimension and cannot fully exploit the multi-dimensional attributes of the data to be identified, so identification accuracy is low. In addition, because the specific content of the data to be identified is large in volume, the model is costly in both the training stage and the testing stage.

Therefore, how to provide a method for identifying private data with high accuracy and efficiency becomes an urgent technical problem to be solved.

Disclosure of Invention

The embodiment of the specification provides a method, a device, equipment and a readable medium for identifying private data, so that the accuracy and efficiency of private data identification are improved.

In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:

An embodiment of the present specification provides a method for identifying private data, including:

acquiring metadata of data to be identified;

inputting the metadata into a first multi-classification model to identify the data type of the data to be identified to obtain a first identification result; the first multi-classification model is obtained by training based on metadata corresponding to privacy type data;

if the first identification result shows that the data to be identified belong to private data, determining the privacy type of the data to be identified according to the first identification result;

if the first identification result shows that the data to be identified do not belong to private data, inputting the metadata and the data to be identified into a second multi-classification model to obtain a second identification result; and determining the privacy type of the data to be identified according to the second identification result.

An apparatus for identifying private data provided by an embodiment of the present specification includes:

the data acquisition module is used for acquiring metadata of the data to be identified;

the first identification result determining module is used for inputting the metadata into a first multi-classification model so as to identify the data type of the data to be identified and obtain a first identification result; the first multi-classification model is obtained by training based on metadata corresponding to privacy type data; if the first identification result shows that the data to be identified belong to private data, determining the privacy type of the data to be identified according to the first identification result;

the second identification result determining module is used for inputting the metadata and the data to be identified into a second multi-classification model to obtain a second identification result if the first identification result indicates that the data to be identified does not belong to private data; and determining the privacy type of the data to be identified according to the second identification result.

A device for identifying private data provided by an embodiment of the present specification includes:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:

acquiring metadata of data to be identified;

Embodiments of the present specification provide a computer readable medium having stored thereon computer readable instructions executable by a processor to implement a method of identifying private data.

At least one embodiment provided in this specification can achieve the following advantageous effects:

The embodiments of this specification first use a first multi-classification model, trained on the metadata of data of known privacy types, to judge whether the data to be identified is private data and which privacy type it belongs to. Only if the first model judges that the data is not private data is a second multi-classification model, trained on the combination of the metadata of data samples of known privacy types and the data samples themselves, used to judge the data further. Because the first model works on metadata alone, the content of the data needs to be processed only when the metadata is inconclusive, which saves computing resources, shortens computing time, and improves the overall efficiency of private data identification.

Drawings

To illustrate the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some of the embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of an overall scheme of a method for identifying private data in an embodiment of the present disclosure;

FIG. 2 is a schematic flowchart of an overall scheme of another method for identifying private data in an embodiment of the present disclosure;

FIG. 3 is a flowchart illustrating a method for identifying private data according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of an apparatus for identifying private data, corresponding to FIG. 3, provided in an embodiment of the present specification;

FIG. 5 is a schematic structural diagram of a device for identifying private data, corresponding to FIG. 3, provided in an embodiment of this specification.

Detailed Description

To make the objects, technical solutions and advantages of one or more embodiments of the present disclosure more apparent, the technical solutions of one or more embodiments of the present disclosure will be described in detail and completely with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present specification, and not all embodiments. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without making any creative effort fall within the scope of protection of one or more embodiments of the present specification.

The purpose of identifying private data is to protect it more effectively. Protecting private data requires, first, identifying potential private data fields in a massive number of data tables and, second, desensitizing the identified fields by appropriate means, so that leakage of private data is effectively prevented.

At present, when the private data is identified, a user can identify the private data by adopting a corresponding preset regular expression or a corresponding multi-classification model trained in advance according to the type of the private data to be identified.

A regular expression is a single character string, written according to agreed grammar rules, that describes and matches a set of character strings conforming to a certain syntax. For example, a mobile phone number can be represented by the regular expression "^1[3-9][0-9]{9}$", and a field whose values match this expression can be identified as a mobile phone number field. In a rule-based private data identification scheme, when a database is scanned, each regular expression is applied to the data sampled from the database to judge its privacy type, and the identification results for all sampled data are then summarized to reach a final judgment. This approach has three drawbacks. First, because every rule must be tried in turn, matching is very inefficient when there are many privacy types. Second, the built-in rules cannot cover all privacy types, so the scope of application is limited: when no preset rule exists for the privacy type the user wants to identify, the user's need cannot be met. Third, writing built-in rules requires specialists and consumes considerable human resources.
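The rule-based scheme described above can be sketched roughly as follows. This is only an illustration: the IPv4 rule, the function names, and the `min_ratio` cutoff used to summarize per-sample results are assumptions, since the text does not specify how the results are summarized.

```python
import re

# Hypothetical built-in rules: one regular expression per privacy type.
# The mobile phone pattern is the one quoted in the text; the IPv4 pattern is illustrative.
RULES = {
    "mobile_phone_number": re.compile(r"^1[3-9][0-9]{9}$"),
    "ipv4_address": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
}

def match_ratios(samples):
    """Return, for each rule, the fraction of sampled values that match it."""
    if not samples:
        return {name: 0.0 for name in RULES}
    return {
        name: sum(1 for value in samples if pattern.fullmatch(value)) / len(samples)
        for name, pattern in RULES.items()
    }

def classify_field(samples, min_ratio=0.8):
    """Summarize per-sample matches into one decision for the whole field.

    min_ratio is an assumed cutoff, not a value taken from the text.
    """
    ratios = match_ratios(samples)
    best = max(ratios, key=ratios.get)
    return best if ratios[best] >= min_ratio else None

samples = ["13812345678", "15900001111", "18655556666", "n/a"]
print(classify_field(samples, min_ratio=0.7))  # mobile_phone_number
```

Every rule is evaluated against every sampled value, which is the inefficiency the paragraph above points out when the number of privacy types grows.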

Private data identification based on a multi-classification model is a supervised learning method. Most existing machine-learning-based multi-classification models are trained on, and perform identification from, the specific content of the data. A model built only on data content has a single feature dimension and cannot fully exploit the multi-dimensional attributes of the data to be identified, so identification accuracy is low. Moreover, because the specific content of the data to be identified is large in volume, the model is costly in both the training stage and the testing stage and consumes more GPU or CPU resources.

In the method for identifying private data provided by this specification, a first multi-classification model, trained on the metadata of data of known privacy types, is used first to judge whether the data to be identified is private data and which privacy type it belongs to. If the first model judges that the data is not private data, a second multi-classification model, trained on the combination of the metadata of data samples of known privacy types and the data samples themselves, is used to judge the data further. Computing resources are thereby saved, computing time is shortened, and the overall efficiency of private data identification is improved.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms, which are used only to distinguish one type of information from another.

The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of an overall scheme of a method for identifying private data in an embodiment of the present specification. As shown in Fig. 1, when a user queries multiple pieces of data under a certain field in a database, data preprocessing is performed first: either a predetermined proportion of the data or a predetermined number of pieces of data is sampled from all data of the field to which the data to be identified belongs, yielding sampled data, and the metadata of that field is acquired at the same time.

After the preprocessing stage, a first multi-classification model, trained on the metadata of data of known privacy types, is used to judge whether the sampled data is private data and which privacy type it belongs to. If the first multi-classification model can determine the privacy type of the sampled data, all data of the field to which the sampled data belongs is first marked with that privacy type, then desensitized using the desensitization rule corresponding to that privacy type, and the query result is then output to the user.

If the first multi-classification model cannot determine whether the sampled data is private data and which privacy type it belongs to, a second multi-classification model, trained on the combination of the metadata of data samples of known privacy types and the data samples themselves, is used to judge the sampled data further. If the second multi-classification model determines that the sampled data is private data of a certain privacy type, secondary verification is performed on the sampled data to improve identification accuracy. In this embodiment, multiple verification rules (rule 1, rule 2, …, rule n) are preset based on experience with known privacy data types, and the verification rules corresponding to the privacy type of the sampled data are applied to the sampled data. Proportion analysis is then performed on the verification result: if the proportion of the sampled data that passes the secondary verification is greater than a preset first threshold, and the probability that the sampled data belongs to the privacy type is greater than a preset second threshold, all data of the field to which the sampled data belongs is desensitized using the desensitization rule corresponding to that privacy type.

It should be noted that the scheme samples from the whole field in the preprocessing stage because the number of pieces of data queried by the user may be small, for example fewer than 50. If the queried data were instead passed directly to the first multi-classification model and, depending on its result, to the second multi-classification model, then in the secondary verification stage the proportion of data passing the verification would be computed over too few pieces of data to be statistically meaningful, and so could not indicate whether the data is private data.

When the number of pieces of data queried by the user is large, for example greater than or equal to 50, no sampling from the field is needed in the preprocessing stage, but the metadata of the field to which the data to be identified belongs must still be acquired. Identification by the first multi-classification model, the decision whether to proceed to the second multi-classification model, and the secondary verification are then performed as described above and are not repeated here.
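The flow of Fig. 1 can be sketched as below, under several assumptions that the text leaves open: the models are passed in as plain callables returning a (privacy type, probability) pair, the threshold values are placeholders, a value counts as verified only if it satisfies every rule for its type, and all function and variable names are hypothetical.

```python
import random
import re

MIN_ROWS_WITHOUT_SAMPLING = 50  # example figure from the text

def preprocess(queried_rows, field_rows, sample_size=100, seed=0):
    """Sample from the whole field when the query returns too few rows."""
    if len(queried_rows) >= MIN_ROWS_WITHOUT_SAMPLING:
        return list(queried_rows)
    rng = random.Random(seed)
    return rng.sample(list(field_rows), min(sample_size, len(field_rows)))

# Hypothetical verification rules (rule 1 ... rule n), keyed by privacy type.
VERIFICATION_RULES = {
    "mobile_phone_number": [lambda v: re.fullmatch(r"1[3-9][0-9]{9}", v) is not None],
}

def passes_secondary_verification(rows, privacy_type, probability,
                                  first_threshold=0.9, second_threshold=0.5):
    """Proportion analysis: pass ratio and model probability must both exceed their thresholds."""
    rules = VERIFICATION_RULES.get(privacy_type, [])
    if not rows or not rules:
        return False
    # Assumption: a value passes only if it satisfies all rules for the type.
    passed = sum(1 for v in rows if all(rule(v) for rule in rules))
    return passed / len(rows) > first_threshold and probability > second_threshold

def identify_field(metadata, rows, first_model, second_model):
    """Two-stage cascade. Each model returns (privacy_type or None, probability)."""
    privacy_type, _ = first_model(metadata)
    if privacy_type is not None:
        return privacy_type  # metadata alone was enough
    privacy_type, probability = second_model(metadata, rows)
    if privacy_type is not None and passes_secondary_verification(rows, privacy_type, probability):
        return privacy_type
    return None

# Usage with stub models standing in for the trained classifiers.
field_rows = [f"138{n:08d}" for n in range(200)]
rows = preprocess(field_rows[:10], field_rows)       # only 10 rows queried, so sample the field
first = lambda metadata: (None, 0.0)                  # metadata is inconclusive
second = lambda metadata, rows: ("mobile_phone_number", 0.8)
print(identify_field({"field_name": "c1"}, rows, first, second))  # mobile_phone_number
```

Marking and desensitizing the whole field would follow a non-None result; those steps are omitted here because the text does not describe the desensitization rules themselves.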

Fig. 2 is a schematic flowchart of an overall scheme of another method for identifying private data in an embodiment of the present specification. In this scheme, when a user queries multiple pieces of data under a certain field in a database, data preprocessing is likewise performed first: either a predetermined proportion of the data or a predetermined number of pieces of data is sampled from all data of the field to which the data to be identified belongs, yielding sampled data. A first multi-classification model, trained on the metadata of data of known privacy types, is then used to judge whether the sampled data is private data and which privacy type it belongs to. If the first multi-classification model can determine the privacy type of the sampled data, all data of the field to which the sampled data belongs is first marked with that privacy type, then desensitized using the desensitization rule corresponding to that privacy type, and the processing result is then output to the user.

If the first multi-classification model cannot determine whether the sampled data is private data and which privacy type it belongs to, a second multi-classification model, trained on the combination of the metadata of data samples of known privacy types and the data samples themselves, is used to judge the sampled data further. If the second multi-classification model determines that the sampled data is private data of a certain privacy type, all data in the field to which the sampled data belongs is marked with the privacy type identifier corresponding to that privacy type, then desensitized using the desensitization rule corresponding to that privacy type, and the query result is then output to the user.

The difference between the scheme of Fig. 2 and that of Fig. 1 is that when the second multi-classification model determines that the sampled data is private data of a certain privacy type, no secondary verification is performed. Instead, the sampled data, and with it all data in the field to which it belongs, is directly regarded as private data; privacy type marking and desensitization follow, and the desensitized data is returned to the user.

Fig. 3 is a flowchart illustrating a method for identifying private data according to an embodiment of the present disclosure. From the viewpoint of the program, the execution subject of the flow may be a program installed in an application server or an application terminal.

As shown in fig. 3, the process may include the following steps:

Step 302: acquiring metadata of the data to be identified.

Data, especially data in large volumes, is most often stored in structured form, for example in a database organized as tables based on the relational model. A database contains many tables, which serve as the organizational units for storing data, and each table has one or more fields. A user interacts with the database system through a database query language to obtain the required data. It should be noted that the data to be identified may refer to a piece of data that a user wishes to query from a field of a database table. For example, a database table may contain fields such as "name", "age", "mobile phone number", and "identification number", and each field may correspond to multiple pieces of data. In practice, all data under one field should share the same field attribute; for example, all data in the field "mobile phone number" should be users' mobile phone numbers. In this embodiment, a user retrieves data stored in a database through a database query language, and before the retrieved data is provided to the user, it must be determined whether the data to be displayed is private data and which privacy type it belongs to. If the data is judged to be private data, appropriate desensitization must be applied to avoid the risks that leakage of private data may bring.

Metadata is structured, encoded data that describes an information resource and can be used to assist in identifying, discovering, evaluating, and managing the described entity. For example, right-clicking a Word document and viewing its properties reveals a great deal of attribute information, such as file type, the program it opens with, location, size, size on disk, creation time, modification time, access time, author, the person who last saved it, and whether it is read-only. Even without opening the document to view its content, some important information about it can be obtained from these attributes. In the database field, metadata is data that describes the structure and construction of the data stored in a data warehouse, and generally represents attribute information of databases, tables, and fields, such as item names, database descriptions, table names, field names, comments, and field data types (integer, floating point, character). The most basic function of metadata is to describe the information resource objects stored in a database, that is, to explain the data: it can describe the subject, content, attributes, and characteristics of an information resource, so that the attributes of the described data object can be known to a certain extent even without viewing the specific content of the data.
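As a concrete illustration of field-level metadata, the attributes listed above could be held in a small record and flattened into text for a classifier. The class name, attribute names, and example values below are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldMetadata:
    """Attribute information describing one field of a database table."""
    database_name: str
    table_name: str
    field_name: str
    comment: str
    data_type: str

    def to_text(self):
        """Join the attributes into one lowercase string a text classifier could consume."""
        parts = [self.database_name, self.table_name, self.field_name,
                 self.comment, self.data_type]
        return " ".join(parts).lower()

meta = FieldMetadata("crm", "user_profile", "mobile_no",
                     "Contact phone of the user", "varchar")
print(meta.to_text())
# crm user_profile mobile_no contact phone of the user varchar
```

Note that nothing here touches the values stored in the field: the record is built from the schema alone, which is what makes a metadata-only first pass cheap.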

Step 304: inputting the metadata into a first multi-classification model to identify the data type of the data to be identified to obtain a first identification result; the first multi-classification model is obtained by training based on metadata corresponding to privacy type data.

The task of supervised learning in statistical learning is to learn a model that makes a good prediction of the output for any given input. The first multi-classification model may be a supervised learning algorithm that determines which known sample type a new sample belongs to based on features of the known samples. More specifically, a multi-classification model computes and selects feature parameters from the sample data of a known training set and builds a discriminant function with which it classifies samples. The first multi-classification model is trained on metadata corresponding to privacy type data: metadata carrying class labels is used as training samples, the model learns the vector features contained in these labeled metadata samples, and the trained first multi-classification model is finally obtained. The metadata contains feature information that characterizes the semantics of the data to be identified. Therefore, when metadata with an unknown class label is encountered, the metadata of the data to be identified can be input into the trained first multi-classification model to determine the privacy type of the data. It should be noted that, in this embodiment, only the part of the metadata that is closely related to determining the privacy attribute of the data to be identified needs to be selected; it is not necessary to select all of the metadata. The first multi-classification model may include: a multi-classification model based on a decision tree algorithm, a multi-classification model based on a random forest algorithm, a multi-classification model based on logistic regression, a multi-classification model based on the XGBoost algorithm, a multi-classification model based on a gradient boosting tree algorithm, a multi-classification model based on a maximum entropy algorithm, a multi-classification model based on a Convolutional Neural Network (CNN), a multi-classification model based on a Recurrent Neural Network (RNN), and the like.
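To make the training step concrete, the sketch below trains a tiny classifier on labeled metadata text. It uses multinomial naive Bayes with add-one smoothing purely as a dependency-free stand-in; naive Bayes is not one of the algorithms listed above, and the training texts, labels, and class name are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split metadata text such as 'user_profile mobile_no' into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class MetadataClassifier:
    """Multinomial naive Bayes with add-one smoothing over metadata tokens."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocabulary = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict_proba(self, text):
        # Tokens never seen in training carry no information and are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            counts = self.token_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(count / total)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            log_scores[label] = score
        # Normalize log scores into probabilities, shifting by the maximum for stability.
        peak = max(log_scores.values())
        exp_scores = {label: math.exp(s - peak) for label, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {label: value / norm for label, value in exp_scores.items()}

# Invented labeled metadata samples: table/field names, comments, data types.
train_texts = [
    "user mobile_no contact phone varchar",
    "customer phone_number mobile varchar",
    "account id_card identity number varchar",
    "member identity_no id card number char",
    "order item_count quantity int",
    "product unit_price amount decimal",
]
train_labels = ["mobile_phone_number", "mobile_phone_number",
                "id_number", "id_number", "not_private", "not_private"]

model = MetadataClassifier().fit(train_texts, train_labels)
probs = model.predict_proba("staff mobile phone varchar")
print(max(probs, key=probs.get))  # mobile_phone_number
```

In practice any of the listed model families could take the place of this class, provided it exposes per-type probabilities for the later selection step.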

It should be noted that, in this embodiment, the first multi-classification model can identify multiple privacy types, for example identification numbers, bank card numbers, mobile phone numbers, IP addresses, and system account numbers. Because the metadata contains feature information that characterizes the semantics of the data to be identified, this information can be analyzed to determine whether the data is private data and which type of private data it is. After the metadata of the data to be identified is input into the first multi-classification model, the probability that the data belongs to each privacy type is obtained. For example, to identify the privacy type of data A, the metadata of data A is input into the first multi-classification model, which may yield the following set of candidate privacy types: mobile phone number, system account number, and email address, with probabilities of 60%, 30%, and 10% respectively. This privacy type set may contain one privacy type, multiple privacy types, or no privacy type. In this embodiment, the privacy type with the highest probability is taken as the privacy type of the data to be identified; that is, data A is determined to be private data whose privacy type is mobile phone number.
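The selection of the most probable privacy type for data A can be written as a short function. The function name and the dictionary representation of the privacy type set are assumptions, as is the choice to treat an empty set as "no privacy type identified".

```python
def pick_privacy_type(probabilities):
    """Return (privacy_type, probability) for the most likely type.

    An empty privacy type set yields (None, 0.0); treating this as
    "not identified as private data" is an assumption of this sketch.
    """
    if not probabilities:
        return None, 0.0
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]

# The example from the text: data A.
result = {"mobile_phone_number": 0.60, "system_account": 0.30, "email_address": 0.10}
print(pick_privacy_type(result))  # ('mobile_phone_number', 0.6)
print(pick_privacy_type({}))      # (None, 0.0)
```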

Private data, that is, confidential data, may refer to data that its owner does not want others or unrelated persons to know. From the perspective of the owner, private data may be divided into individual private data and shared private data. In the embodiments of the present application, any data that a user wants to identify and protect may be referred to as private data. For example, private data may include personal characteristic information used to locate or identify an individual (e.g., phone number, address, credit card number), sensitive information (e.g., personal health status, financial information, important company documents), family private data (e.g., annual household income), corporate private data, and the like.

The private data may include personal basic information, personal identification information, personal biometric information, network identification information, personal health and physiology information, personal educational work information, personal property information, personal communication information, contact information, personal internet records, personal common device information, personal location information, and the like.

Personal basic information may include specific information types such as personal name, birthday, gender, nationality, family relationships, address, personal telephone number, and email address. Personal identity information may include identity cards, military officer certificates, passports, driving licenses, work ID cards, access cards, social security cards, and residence permits. Personal biometric information may include personal genes, fingerprints, voiceprints, eye prints, palm prints, ear shape (pinna), irises, and facial features. Network identity information may include system accounts, IP addresses, and mailbox addresses, together with the passwords, password protection answers, and personal digital certificates associated with them. Personal health and physiological information may include records related to personal medical treatment, such as disease symptoms, hospitalization records, doctor's orders, test reports, operation and anesthesia records, nursing records, medication records, drug and food allergy information, birth information, past medical history, diagnosis and treatment conditions, family medical history, current medical history, and history of infectious diseases, as well as other information related to the person's physical health, such as weight, height, and vital capacity. Personal education and work information may include personal occupation, position, employer, educational background, academic degree, education experience, work experience, training records, and transcripts. Personal property information may include bank account numbers, authentication information (passwords), deposit information (including fund amounts and payment and collection records), property information, credit records, credit reference information, transaction and consumption records, and account statement records, as well as virtual property information such as virtual currency, virtual transactions, and game redemption codes. Personal communication information may include communication records and content, short messages, multimedia messages, e-mails, and data describing personal communications (often referred to as metadata). Contact information may include address books, friend lists, group lists, and e-mail address lists. Personal internet records refer to operation records stored in logs, and may include website browsing records, software usage records, and click records. Personal common device information refers to information describing the basic condition of a person's commonly used devices, and may include the hardware serial number, device MAC address, software list, and unique device identifiers (e.g., IMEI, Android ID, IDFA, OpenUDID, GUID, and SIM card IMSI information). Personal location information may include movement tracks, precise positioning information, accommodation information, and longitude and latitude. In addition, private data may also include specific information types such as marital history, religious beliefs, sexual orientation, and undisclosed criminal records.

The information listed above is only an example of the private data recognizable by the embodiments of the present application; the recognizable private data is not limited to these examples.

Step 306: and if the first identification result shows that the data to be identified belongs to private data, determining the privacy type of the data to be identified according to the first identification result.

The first multi-classification model has a function of identifying a plurality of privacy types, and if the data to be identified can be identified as the privacy data in step 304, the privacy type to which the data to be identified belongs can be further determined in this step 306.

In step 304 and step 306, the metadata of the data to be recognized is input into the pre-trained first multi-classification model for judgment. The data volume of the metadata is much smaller than that of the specific text of the data to be recognized, yet the metadata contains a large amount of key attribute information about the data it describes. From the viewpoint of computational complexity, the computational cost and time consumption of this scheme are therefore much smaller than those of the prior-art multi-classification models based on field content, and most of the data to be identified that is actually private data can be accurately identified in steps 304 and 306. In practice, however, metadata only describes attribute information of the data at a macro level and does not include the specific content of the data, and the metadata of some table structures may not contain semantic feature information representing the data to be identified. To further improve the accuracy of the overall technical solution of this embodiment in identifying private data, the data determined not to be private data in step 304 is further identified as described in step 308.

Step 308: if the first recognition result shows that the data to be recognized do not belong to private data, inputting the metadata and the data to be recognized into a second multi-classification model to obtain a second recognition result; and determining the privacy type of the data to be identified according to the second identification result.

In step 308, the data to be recognized that was determined to be non-private data in step 304 is further recognized; that is, the data to be recognized and its metadata are input into the second multi-classification model for recognition. The second multi-classification model is obtained in advance as follows: metadata is combined with the text of the data to which the metadata belongs to obtain a combined result; a type label is determined for the combined result (if the data corresponding to the metadata is private data, the type label is the specific privacy type of that data; if it is not private data, the type label indicates that the data does not belong to private data); and the second multi-classification model is trained using the labeled combined results as training samples, so that it learns the vector features contained in the labeled combined results. The data to be recognized that the first multi-classification model judged not to be private data is then further judged by the trained second multi-classification model to obtain a second recognition result. Because the second multi-classification model has learned, in the training stage, the vector features contained in the labeled metadata and in the data corresponding to the metadata, the second recognition result can be used to determine whether the data to be recognized belongs to private data and, if so, its privacy type. It should be noted that the second multi-classification model in this step may, as needed, also adopt one of the multi-classification model types described for step 304; details are not repeated here.

Most private data can be recognized in step 304. Where the first multi-classification model recognizes the data to be recognized as non-private data, step 308 further recognizes that data using a second multi-classification model trained on the combination of data samples of known privacy types and the metadata of those samples. Data that is actually private data but was determined to be non-private data by the first multi-classification model in step 304 can thus still be recognized, so the overall technical scheme of this embodiment identifies private data with high accuracy and efficiency.
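The two-stage flow of steps 304 to 308 can be sketched as follows. This is a minimal illustration only: the function names, the `NON_PRIVATE` label, and the toy stand-in models are assumptions introduced here and do not appear in the specification.

```python
from typing import Callable, Optional

# Hypothetical label meaning "not private data"; the source does not name one.
NON_PRIVATE = "non_private"

FirstModel = Callable[[str], str]        # metadata -> label
SecondModel = Callable[[str, str], str]  # (metadata, data text) -> label


def identify_privacy_type(
    metadata: str,
    data_text: str,
    first_model: FirstModel,
    second_model: SecondModel,
) -> Optional[str]:
    """Return the privacy type, or None if both stages say non-private."""
    # Steps 304/306: cheap classification on metadata only.
    first_result = first_model(metadata)
    if first_result != NON_PRIVATE:
        return first_result
    # Step 308: fall back to metadata combined with the data itself.
    second_result = second_model(metadata, data_text)
    return None if second_result == NON_PRIVATE else second_result


# Toy stand-ins for trained models, for illustration only.
def toy_first_model(metadata: str) -> str:
    return "mobile phone number" if "phone" in metadata else NON_PRIVATE


def toy_second_model(metadata: str, data_text: str) -> str:
    return "email" if "@" in data_text else NON_PRIVATE
```

The second model is only invoked when the metadata-only model reports non-private data, which is what keeps the common case cheap.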

It should be understood that in the method described in one or more embodiments of the present disclosure, the order of some steps may be adjusted according to actual needs, or some steps may be omitted.

Based on the method of fig. 3, the embodiments of the present specification also provide some specific implementations of the method. The following description is made.

In step 304, the metadata of the data to be recognized is input into the first multi-classification model. Before that, word segmentation processing is performed on the metadata and feature extraction is performed on the segmentation result; the extracted features form a first feature vector. The first feature vector is then marked with a label corresponding to the privacy type of the data to which the metadata belongs, and the first feature vector with the type label is input into the first multi-classification model for recognition to obtain a first recognition result.

Specifically, in this embodiment, after word segmentation processing and feature extraction the metadata may be formally represented as x = [x_1, x_2, ..., x_n]^T, where n represents the number of features and x_i represents the i-th feature of the sample x. The feature extraction methods for extracting these features include, but are not limited to: performing feature extraction on the word segmentation result with One-hot coding, with a word frequency feature method, or with the tf-idf method. For example, a database table structure includes several fields, and part of the metadata information of this table structure comprises: table name: contact_info; table comment: contact information table; field name 1: name; comment on field 1: contact name; field name 2: phone_num; comment on field 2: contact mobile phone number. The field names of the two fields of the table structure are name and phone_num respectively, and part of the corresponding field content is as follows: name: Zhang San, Li Si, Wang Wu; phone_num: 1861X898293, 1861X898294, 1861X898295.
Assume that the data to be identified is a piece of data in the phone_num field. Since it is only necessary to determine whether the data in the phone_num field is private data, only the metadata related to the phone_num field needs to be selected from the table structure, that is: contact information table, contact mobile phone number, contact_info and phone_num. Word segmentation of this metadata gives: contact, information table, contact, mobile phone number, contact, info, phone and num. The word segmentation result forms the first feature vector x = [contact, information table, contact, mobile phone number, contact, info, phone, num]^T, which is then input into the trained first multi-classification model. The model recognizes that the data to be recognized belongs to private data and that its privacy type is "mobile phone number".
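The word segmentation and the three feature extraction options named above can be illustrated with a short sketch. It assumes English-language metadata and a naive splitter; the function names are hypothetical, "One-hot" is interpreted here as a binary word-presence vector, and the smoothed idf formula is one common variant rather than anything the specification prescribes.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def segment(metadata: str) -> List[str]:
    """Naive word segmentation: split on anything that is not a letter or digit."""
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", metadata) if tok]


def build_vocab(docs: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an index in first-seen order."""
    vocab: Dict[str, int] = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def one_hot(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Binary presence vector: 1 if the vocabulary word occurs in tokens."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec


def term_frequency(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Raw count of each vocabulary word (dicts keep insertion order)."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]


def tf_idf(tokens: List[str], vocab: Dict[str, int], docs: List[List[str]]) -> List[float]:
    """Term frequency weighted by a smoothed inverse document frequency."""
    n_docs = len(docs)
    vec = []
    for tok, count in zip(vocab, term_frequency(tokens, vocab)):
        df = sum(1 for doc in docs if tok in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        vec.append(count * idf)
    return vec


tokens = segment("contact_info contact information table phone_num contact mobile phone number")
vocab = build_vocab([tokens])
```

Metadata written in Chinese would need a real word segmenter in place of `segment`; the rest of the sketch is unchanged.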

In step 304 and step 306, if the first multi-classification model cannot determine the privacy type of the data to be recognized, or recognizes the data to be recognized as non-private data, further recognition by the second multi-classification model in step 308 is required. Specifically, in step 308, the input of the second multi-classification model is the word segmentation result obtained after combining the metadata of the data to be recognized with the text of the data to be recognized. If there is a single piece of data to be recognized, its metadata and its text are combined, word segmentation is performed, and the word segmentation result is input into the second multi-classification model for recognition. If the data to be identified is a plurality of pieces of data in one field of a database table structure, the pieces may be randomly sampled; the metadata of the pieces and the text of the sampled pieces are combined to obtain a combined result, word segmentation processing is performed on the combined result, and the word segmentation result is input into the second multi-classification model for recognition.
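A minimal sketch of building the combined input for the second model is shown below. The function name, the `sample_size` and `seed` parameters, and the choice of a space as the joining character are assumptions made for illustration; the specification only states that metadata and (sampled) data text are combined.

```python
import random
from typing import List, Optional, Sequence


def combine_for_second_model(
    metadata: Sequence[str],
    values: Sequence[str],
    sample_size: int = 3,
    seed: Optional[int] = None,
) -> str:
    """Join field metadata with the field values, sampling when there are many."""
    rng = random.Random(seed)
    if len(values) > sample_size:
        # Many values in one field: keep only a random subset.
        sampled: List[str] = rng.sample(list(values), sample_size)
    else:
        # A single value (or a handful): use everything.
        sampled = list(values)
    return " ".join(list(metadata) + sampled)


combined = combine_for_second_model(
    ["contact_info", "phone_num"],
    ["1861X898293", "1861X898294", "1861X898295"],
    sample_size=2,
    seed=0,
)
```

The combined string would then go through word segmentation and feature extraction before being passed to the second multi-classification model.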

It should be noted that, in the field of machine learning, a traditional machine learning pipeline often consists of several independent modules. For example, a typical Natural Language Processing (NLP) problem includes several independent steps such as word segmentation, part-of-speech tagging, syntactic analysis, and semantic analysis. Each step is an independent task, and the quality of its result affects the next step and therefore the result of the whole training; such a pipeline is not end-to-end. In deep learning, by contrast, the training process obtains a predicted result from the input end (input data) to the output end, compares the predicted result with the real result to obtain an error, propagates the error back through each layer of the model (back propagation), and adjusts the representation of each layer according to the error until the model converges or achieves the expected effect; this is end-to-end. Therefore, in the technical scheme of this embodiment, if the first multi-classification model is obtained by training a deep learning model, then in step 304 the metadata of the data to be recognized can be converted directly into One-hot feature vectors according to its character content and then input into the first multi-classification model for recognition. Similarly, in step 308, if the second multi-classification model is obtained by training a deep learning model, then after the metadata of the data to be recognized and the text of the data to be recognized are combined, word segmentation processing may be skipped: the character content of the combined result is converted directly into One-hot feature vectors and then input into the second multi-classification model for recognition.
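The character-level One-hot conversion mentioned for the deep learning case can be sketched as follows. The helper names are hypothetical, and representing unseen characters as an all-zero row is an assumption of this sketch, not something the specification states.

```python
from typing import Dict, Iterable, List


def build_char_index(texts: Iterable[str]) -> Dict[str, int]:
    """Map every distinct character in the training texts to a column index."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i for i, ch in enumerate(chars)}


def char_one_hot(text: str, char_index: Dict[str, int]) -> List[List[int]]:
    """Encode text as one row per character, with a single 1 in that character's column."""
    rows = []
    for ch in text:
        row = [0] * len(char_index)
        idx = char_index.get(ch)
        if idx is not None:
            row[idx] = 1  # characters outside the index stay all-zero
        rows.append(row)
    return rows


index = build_char_index(["phone_num"])
matrix = char_one_hot("phone", index)
```

No word segmentation is needed here: the matrix is built directly from the character content, one row per character.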

In the scheme of this embodiment, the purpose of identifying the private data is to perform desensitization processing on corresponding private data to prevent leakage of the private data, and data of different privacy types may correspond to different desensitization methods. Therefore, after the privacy type of the data to be identified is determined, a desensitization method corresponding to the privacy type can be determined, and desensitization processing is performed on the data to be identified by adopting the corresponding desensitization method.

Specifically, after the privacy type to which the data to be identified belongs is determined according to the first identification result, desensitization processing is performed on the data to be identified by adopting a processing mode corresponding to the privacy type to which the data to be identified belongs; or after the privacy type of the data to be identified is determined according to the second identification result, desensitizing the data to be identified by adopting a processing mode corresponding to the privacy type of the data to be identified.

For example, part of the information in the data that needs desensitization processing may be masked. When desensitization processing is required for a user's identification number or mobile phone number, a symbol (such as "*") may be used directly to replace part of the digits, for example: Zhang San, ID card number: 5303******12. In this way, data desensitization hides the information and thereby protects its security.
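Masking by privacy type can be illustrated with the sketch below. The type names, the number of leading and trailing characters kept for each type, the fallback of masking everything for unknown types, and the sample values are all assumptions chosen for illustration.

```python
from typing import Callable, Dict


def mask_middle(value: str, keep_head: int, keep_tail: int, symbol: str = "*") -> str:
    """Replace everything except the first keep_head and last keep_tail characters."""
    if len(value) <= keep_head + keep_tail:
        return symbol * len(value)
    hidden = len(value) - keep_head - keep_tail
    return value[:keep_head] + symbol * hidden + value[len(value) - keep_tail:]


# Hypothetical mapping from privacy type to desensitization method.
DESENSITIZERS: Dict[str, Callable[[str], str]] = {
    "identity card number": lambda v: mask_middle(v, 4, 2),
    "mobile phone number": lambda v: mask_middle(v, 3, 4),
}


def desensitize(value: str, privacy_type: str) -> str:
    """Apply the method registered for the privacy type; mask fully if none is registered."""
    handler = DESENSITIZERS.get(privacy_type)
    return handler(value) if handler else mask_middle(value, 0, 0)


masked_id = desensitize("530312345612", "identity card number")  # made-up sample value
```

Keeping the mapping in a dictionary lets each privacy type be given its own processing mode without changing the calling code.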

In practical applications, the first multi-classification model and the second multi-classification model used in the above steps may be obtained by training in advance.

Specifically, before the inputting the metadata into the first multi-classification model to identify whether the data to be identified belongs to the private data, and obtaining the first identification result, the method may further include:

acquiring a metadata sample of data to be identified, wherein the metadata sample contains semantic feature information representing the data to be identified; and training the initial first multi-classification model according to the metadata sample to obtain the trained first multi-classification model.

Before the metadata and the data to be recognized are input into the second multi-classification model to obtain a second recognition result, the method may further comprise: acquiring a data sample of a known privacy type and the metadata of that data sample; combining the metadata of the data sample of the known privacy type with the data sample itself; and training an initial second multi-classification model according to the combined result to obtain a trained second multi-classification model.
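The training of either model on labeled token lists can be sketched as below. A small multinomial Naive Bayes classifier is swapped in purely for illustration; the specification does not prescribe this model, and the class name, labels, and training samples are made up.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over token lists."""

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()
        self.token_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()

    def fit(self, samples: List[Tuple[List[str], str]]) -> "TinyNaiveBayes":
        for tokens, label in samples:
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens: List[str]) -> str:
        total = sum(self.class_counts.values())
        best_label, best_score = "", -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up labeled samples: segmented metadata (first model) or segmented
# metadata-plus-data combinations (second model) paired with a type label.
training_samples = [
    (["contact", "phone", "num", "mobile", "number"], "mobile phone number"),
    (["user", "email", "address", "mailbox"], "email"),
    (["order", "amount", "total", "price"], "non_private"),
]
model = TinyNaiveBayes().fit(training_samples)
```

The same training loop serves both models; only the content of the token lists differs.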

Based on the same idea, the embodiment of the present specification further provides a device corresponding to the above method. Fig. 4 is a schematic structural diagram of an identification apparatus for private data corresponding to fig. 3 provided in an embodiment of this specification. As shown in fig. 4, the apparatus may include:

a data obtaining module 402, configured to obtain metadata of data to be identified;

a first identification result determining module 404, configured to input the metadata into a first multi-classification model to identify a data type of the data to be identified, so as to obtain a first identification result; the first multi-classification model is obtained by training based on metadata corresponding to privacy type data; if the first identification result shows that the data to be identified belong to private data, determining the privacy type of the data to be identified according to the first identification result;

a second recognition result determining module 406, configured to, if the first recognition result indicates that the data to be recognized does not belong to private data, input the metadata and the data to be recognized into a second multi-class model to obtain a second recognition result; and determining the privacy type of the data to be identified according to the second identification result.

Based on the apparatus of fig. 4, the embodiments of this specification also provide some specific implementations of the apparatus, which are described below.

In at least one embodiment of the present application, the first recognition result determining module 404 is specifically configured to: performing word segmentation processing on the metadata, and performing feature extraction on a result after the word segmentation processing to obtain a first feature vector; and inputting the first feature vector into the first multi-classification model for recognition to obtain a first recognition result.

In at least one embodiment of the present application, the second recognition result determining module 406 is specifically configured to: combining the metadata and the text of the data to be identified to obtain a combined result; performing word segmentation on the combined result, and performing feature extraction on the result subjected to word segmentation to obtain a second feature vector; and inputting the second feature vector into the second multi-classification model for recognition to obtain a second recognition result.

The device further comprises:

a desensitization module to: after the privacy type of the data to be identified is determined according to the first identification result, desensitization processing is carried out on the data to be identified by adopting a processing mode corresponding to the privacy type of the data to be identified; or after the privacy type of the data to be identified is determined according to the second identification result, desensitizing the data to be identified by adopting a processing mode corresponding to the privacy type of the data to be identified.

a first multi-classification model training module, configured to acquire a metadata sample of data to be identified, wherein the metadata sample contains semantic feature information representing the data to be identified; and to train an initial first multi-classification model according to the metadata sample to obtain the trained first multi-classification model.

a second multi-classification model training module, configured to acquire data samples of known privacy types and metadata of those data samples; to combine the metadata of the data samples of known privacy types with the data samples themselves; and to train an initial second multi-classification model according to the combined result to obtain a trained second multi-classification model.

It will be appreciated that the modules described above refer to computer programs or program segments for performing a certain function or functions. In addition, the distinction between the above-described modules does not mean that the actual program code must also be separated.

Based on the same idea, the embodiment of the present specification further provides a device corresponding to the above method. Fig. 5 is a schematic structural diagram of an apparatus for identifying private data, corresponding to fig. 3, provided in an embodiment of this specification. As shown in fig. 5, the apparatus 500 may include:

at least one processor 510; and,

a memory 530 communicatively coupled to the at least one processor; wherein,

the memory 530 stores instructions 520 executable by the at least one processor 510, the instructions being executed by the at least one processor 510.

The instructions may enable the at least one processor 510 to:

acquiring metadata of data to be identified;

Based on the same idea, the embodiment of the present specification further provides a computer-readable medium corresponding to the above method. The computer readable medium has computer readable instructions stored thereon that are executable by a processor to implement the method of:

acquiring metadata of data to be identified;

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

In the 1990s, an improvement in a technology could be clearly distinguished as either an improvement in hardware (e.g., an improvement in a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement in a method flow). However, as technology has advanced, many of today's method flow improvements can be regarded as direct improvements in hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement in a method flow cannot be realized by hardware physical modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer "integrates" a digital system onto a PLD through his or her own programming, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Furthermore, nowadays, instead of manually making an integrated circuit chip, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained merely by lightly logic-programming the method flow into an integrated circuit using one of the hardware description languages above.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be considered a hardware component, and the means included in it for performing the various functions may also be considered structures within the hardware component. The means for performing the functions may even be regarded both as software modules for performing the method and as structures within the hardware component.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method of identifying private data, comprising:

acquiring metadata of data to be identified;
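
The two-stage flow summarized in the abstract (a metadata-only model first, with a second model over metadata plus data as the fallback) can be sketched as below. This is a minimal illustration, not the claimed implementation: the label `NOT_PRIVATE`, the function `identify_privacy_type`, and the toy stand-in models are all hypothetical names introduced here.

```python
from typing import Callable, Dict

# Hypothetical label meaning "not private data"; the source does not name one.
NOT_PRIVATE = "non_private"


def identify_privacy_type(
    metadata: Dict[str, str],
    data: str,
    first_model: Callable[[Dict[str, str]], str],
    second_model: Callable[[Dict[str, str], str], str],
) -> str:
    """Try the metadata-only model first; fall back to the second model."""
    first_result = first_model(metadata)
    if first_result != NOT_PRIVATE:
        # The first identification result already names a privacy type.
        return first_result
    # Otherwise use both the metadata and the data itself.
    return second_model(metadata, data)


# Toy stand-ins for trained classifiers (assumptions for illustration only).
def toy_first_model(metadata: Dict[str, str]) -> str:
    return "phone_number" if "phone" in metadata.get("field_name", "") else NOT_PRIVATE


def toy_second_model(metadata: Dict[str, str], data: str) -> str:
    return "email" if "@" in data else NOT_PRIVATE
```

The cheaper metadata-only model handles the cases where a field name or description is already informative, so the data content is only examined when the metadata alone is inconclusive.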

2. The method of claim 1, further comprising:

after the privacy type of the data to be identified is determined according to the first identification result, desensitization processing is carried out on the data to be identified by adopting a processing mode corresponding to the privacy type of the data to be identified;

or after the privacy type of the data to be identified is determined according to the second identification result, desensitizing the data to be identified by adopting a processing mode corresponding to the privacy type of the data to be identified.
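
As an illustration of a processing mode that corresponds to the privacy type, the sketch below dispatches each value to a masking function chosen by its privacy type. The type labels, the masking rules, and the names `DESENSITIZERS` and `desensitize` are assumptions made for this example; the source does not specify any particular desensitization rule.

```python
from typing import Callable, Dict


def mask_phone(value: str) -> str:
    """Keep the first 3 and last 4 characters and mask the rest."""
    if len(value) <= 7:
        return "*" * len(value)
    return value[:3] + "*" * (len(value) - 7) + value[-4:]


def mask_email(value: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, sep, domain = value.partition("@")
    if not sep:
        return "*" * len(value)
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain


# Hypothetical mapping from privacy type to desensitization handler.
DESENSITIZERS: Dict[str, Callable[[str], str]] = {
    "phone_number": mask_phone,
    "email": mask_email,
}


def desensitize(value: str, privacy_type: str) -> str:
    """Apply the handler for the given privacy type; leave unknown types as-is."""
    handler = DESENSITIZERS.get(privacy_type)
    return handler(value) if handler else value
```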

3. The method according to claim 1, wherein the inputting the metadata into a first multi-classification model to identify a data type of the data to be identified to obtain a first identification result includes:

performing word segmentation processing on the metadata, and performing feature extraction on a result after the word segmentation processing to obtain a first feature vector;

and inputting the first feature vector into the first multi-classification model for recognition to obtain a first recognition result.

4. The method according to claim 1, wherein the step of inputting the metadata and the data to be recognized into a second multi-classification model to obtain a second recognition result specifically comprises:

combining the metadata and the text of the data to be identified to obtain a combined result;

performing word segmentation on the combined result, and performing feature extraction on the result subjected to word segmentation to obtain a second feature vector;

and inputting the second feature vector into the second multi-classification model for recognition to obtain a second recognition result.
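
The combining and word segmentation steps of this claim might look like the following sketch. The key ordering, the space separator, and the regular-expression tokenizer are assumptions; in particular, this simple tokenizer only handles ASCII letters and digits, and Chinese text would need a dedicated word segmenter.

```python
import re
from typing import Dict, List


def combine(metadata: Dict[str, str], text: str) -> str:
    """Join metadata values in a fixed key order, then append the data text."""
    parts = [metadata[key] for key in sorted(metadata)]
    return " ".join(parts + [text])


def segment(combined: str) -> List[str]:
    """Split on anything that is not an ASCII letter or digit; lower-case tokens."""
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", combined) if tok]
```

The resulting token list is what a feature extraction step would turn into the second feature vector.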

5. The method according to claim 3 or 4, wherein performing feature extraction on the result after word segmentation specifically comprises:

performing feature extraction on the result after word segmentation by adopting a One-hot coding method;

or, extracting the features of the result after word segmentation by adopting a word frequency feature method;

or, performing feature extraction on the result after word segmentation by adopting a tf-idf method.
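
The three feature extraction options named in this claim can be compared on a toy corpus with the standard-library sketch below. The function names, the sorted vocabulary, and the unsmoothed tf-idf formula (term frequency divided by document length, times the natural log of N over document frequency) are assumptions; libraries commonly use smoothed variants, so values will differ from theirs.

```python
import math
from collections import Counter
from typing import List


def build_vocab(docs: List[List[str]]) -> List[str]:
    """Sorted list of distinct tokens across all segmented documents."""
    return sorted({tok for doc in docs for tok in doc})


def one_hot(tokens: List[str], vocab: List[str]) -> List[int]:
    """1 if the vocabulary word occurs in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


def term_frequency(tokens: List[str], vocab: List[str]) -> List[int]:
    """Raw count of each vocabulary word in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def tf_idf(tokens: List[str], vocab: List[str], docs: List[List[str]]) -> List[float]:
    """Unsmoothed tf-idf: (count / doc length) * ln(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(docs)
    vector = []
    for word in vocab:
        df = sum(1 for doc in docs if word in doc)
        idf = math.log(n_docs / df) if df else 0.0
        tf = counts[word] / len(tokens) if tokens else 0.0
        vector.append(tf * idf)
    return vector
```

With this formula, a token that appears in every document (such as a shared table prefix) gets weight zero, while tokens that distinguish one field from another are weighted up.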

6. The method of claim 1, the metadata specifically comprising: the database name of the database to which the data to be identified belongs, the database description information of the database to which the data to be identified belongs, the table structure name of the database to which the data to be identified belongs, the field name of the table structure of the database to which the data to be identified belongs, or the field type of the table structure of the database to which the data to be identified belongs.

7. The method of claim 1, wherein the multi-classification model comprises: a multi-classification model based on a decision tree algorithm, a multi-classification model based on a random forest algorithm, a multi-classification model based on logistic regression, a multi-classification model based on the XGBoost algorithm, a multi-classification model based on a gradient boosting tree algorithm, a multi-classification model based on a maximum entropy algorithm, a multi-classification model based on a convolutional neural network (CNN), or a multi-classification model based on a recurrent neural network (RNN).

8. The method of claim 1, before inputting the metadata into the first multi-classification model to identify the data type of the data to be identified, and obtaining the first identification result, further comprising:

acquiring a metadata sample of data to be identified, wherein the metadata sample contains semantic feature information representing the data to be identified;

and training the initial first multi-classification model according to the metadata sample to obtain the trained first multi-classification model.

9. The method of claim 1, before inputting the metadata and the data to be recognized into the second multi-classification model and obtaining the second recognition result, further comprising:

acquiring a data sample with a known privacy type and metadata of the data sample with the known privacy type;

and combining the data sample with the known privacy type with the metadata of that data sample, and training an initial second multi-classification model according to the combined result to obtain a trained second multi-classification model.

10. The method of claim 1, if the data to be identified is a plurality of pieces of data in one field of a database table structure, after determining the privacy type of the data to be identified according to the first identification result, further comprising:

and marking all data in the field to which the data to be recognized belongs by using the privacy type identification corresponding to the privacy type to which the data to be recognized belongs.

11. The method according to claim 10, after the marking all data in the field to which the data to be recognized belongs with the privacy type identifier corresponding to the privacy type to which the data to be recognized belongs, further comprising:

and desensitizing all data in the field to which the data to be identified belongs by adopting a desensitizing mode corresponding to the privacy type to which the data to be identified belongs.

12. The method according to claim 1, if the data to be identified is a plurality of pieces of data in one field of a database table structure, after determining the privacy type to which the data to be identified belongs according to the second identification result, further comprising:

when the probability that the data to be identified belongs to the privacy type, obtained according to the second identification result, is greater than a preset first threshold, performing secondary verification on the plurality of pieces of data by adopting a verification rule corresponding to the privacy type to which the data to be identified belongs; counting the proportion of the pieces of data that pass the secondary verification relative to all of the pieces of data; and when the proportion is greater than a preset second threshold, marking all data in the field to which the data to be identified belongs by adopting the privacy type identification corresponding to the privacy type to which the data to be identified belongs.
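
One reading of this claim is a two-threshold check: the model probability is compared with the first threshold, and the share of values passing a type-specific verification rule is compared with the second threshold. The sketch below follows that reading. The threshold defaults, the `VERIFIERS` table, the phone-number pattern, and the function `label_field` are hypothetical and not taken from the source.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical verification rules keyed by privacy type.
VERIFIERS: Dict[str, Callable[[str], bool]] = {
    # Assumed pattern: the digit 1 followed by ten more digits.
    "phone_number": lambda v: re.fullmatch(r"1\d{10}", v) is not None,
}


def label_field(
    values: List[str],
    privacy_type: str,
    probability: float,
    first_threshold: float = 0.5,
    second_threshold: float = 0.8,
) -> Optional[str]:
    """Return the privacy type to mark the whole field with, or None."""
    # Step 1: the second model's probability must exceed the first threshold.
    if probability <= first_threshold or not values:
        return None
    verify = VERIFIERS.get(privacy_type)
    if verify is None:
        return None
    # Step 2: secondary verification of every value in the field.
    passed = sum(1 for value in values if verify(value))
    ratio = passed / len(values)
    # Step 3: mark the field only if enough values pass.
    return privacy_type if ratio > second_threshold else None
```

Checking the pass ratio rather than requiring every value to pass tolerates a few dirty or empty entries in an otherwise private field.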

13. The method according to claim 12, further comprising, after the marking all data in the field to which the data to be recognized belongs with the privacy type identifier corresponding to the privacy type to which the data to be recognized belongs, the following:

14. An apparatus to identify private data, comprising:

15. The device according to claim 14, further comprising a desensitization module, configured to perform desensitization processing on the data to be identified by using a processing manner corresponding to the privacy type to which the data to be identified belongs after determining the privacy type to which the data to be identified belongs according to the first identification result;

16. The apparatus of claim 14, wherein the first recognition result determining module is specifically configured to:

performing word segmentation on the metadata, and performing feature extraction on a result after the word segmentation to obtain a first feature vector;

17. The apparatus of claim 14, wherein the second recognition result determining module is specifically configured to:

18. The apparatus of claim 14, the apparatus further comprising:

a first multi-classification model training module, configured to acquire a metadata sample of data to be identified, wherein the metadata sample contains semantic feature information representing the data to be identified;

19. The apparatus of claim 14, the apparatus further comprising:

the second multi-classification model training module is used for acquiring data samples of known privacy types and metadata of the data samples of the known privacy types;

20. An apparatus to identify private data, comprising:

at least one processor; and

acquiring metadata of data to be identified;

21. A computer readable medium having stored thereon computer readable instructions executable by a processor to implement the method of identifying private data of any one of claims 1-13.