CN112528315A

CN112528315A - Method and device for identifying sensitive data

Info

Publication number: CN112528315A
Application number: CN201910888348.1A
Authority: CN
Inventors: 余吉文
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2019-09-19
Filing date: 2019-09-19
Publication date: 2021-03-19

Abstract

The application discloses a method and a device for identifying sensitive data, relates to the technical field of computers, and helps improve the accuracy of sensitive data identification. The method comprises: obtaining field names of a plurality of fields in a data table to be processed; calculating vectors of the field names of the plurality of fields; calculating a composite vector of the vectors of the field names of the plurality of fields; calculating the distance between the composite vector and the preset vector of each of at least two candidate sensitive categories, wherein the candidate sensitive categories are candidate categories of a target field in the data table to be processed, and each candidate sensitive category has one preset vector; and determining, among the at least two candidate sensitive categories, the candidate sensitive category whose preset vector is at the shortest distance from the composite vector as the sensitive category of the content corresponding to the target field.

Description

Method and device for identifying sensitive data

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for identifying sensitive data.

Background

The Internet and smart devices give enterprises and individuals more convenient channels for sharing data, but they also make it easier for criminals to illegally obtain the private data of enterprises and individuals and to use it maliciously for activities such as fraud, causing serious economic losses. Therefore, in practical application scenarios, once data involves sensitive information of an enterprise or an individual, the data must be desensitized to ensure its security. Data desensitization (data masking) refers to transforming sensitive data according to desensitization rules so that sensitive private data is reliably protected.

However, desensitization removes or masks some sensitive attributes, which can cause important attributes or statistical information of the data to be lost. In technical fields such as big data analysis and artificial intelligence, data must be of high quality, that is, important attributes must not be lost and important statistical information must not change, so that patterns can be discovered and value can be mined.

To balance data security and data quality, how to identify sensitive data with higher accuracy has become an urgent technical problem.

Disclosure of Invention

The embodiment of the application provides a method and a device for identifying sensitive data, which are beneficial to improving the identification accuracy of the sensitive data.

In a first aspect, a method for identifying sensitive data is provided, including: acquiring field names of a plurality of fields in a data table to be processed; calculating vectors of the field names of the plurality of fields; calculating a composite vector of the vectors of the field names of the plurality of fields; calculating the distance between the composite vector and the preset vector of each of at least two candidate sensitive categories, wherein the candidate sensitive categories are candidate categories of a target field in the data table to be processed, and each candidate sensitive category has one preset vector; and determining, among the at least two candidate sensitive categories, the candidate sensitive category whose preset vector is at the shortest distance from the composite vector as the sensitive category of the content corresponding to the target field. In this technical solution, the sensitive category of the content corresponding to the target field is identified based on the context information of the target field. The same field has different context information in different data tables, and that context information can affect the sensitive category of the content corresponding to the field. Determining the sensitive category based on the field names of the fields in the data table to which the target field belongs therefore helps improve the accuracy of sensitive category identification. In addition, the sensitive category of the content corresponding to the target field is determined from the distribution characteristics of the context information of the target field in the vector space.
Therefore, compared with a technical solution that identifies the sensitive category of the content corresponding to the target field based on regular expressions, on the one hand, the cost of writing and maintaining a rule base is avoided; on the other hand, even if the target field was not used in the training phase, its sensitive category can still be identified from its context information, which improves the robustness and generality of the algorithm for identifying sensitive data.
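The steps of the first aspect can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the two-dimensional vectors, the category names, the use of the component-wise mean as the composite vector, and the use of Euclidean distance. The application itself does not fix these choices.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Component-wise mean, used here as a simple stand-in for the composite vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def classify_target_field(field_names, name_to_vector, preset_vectors):
    """Return the candidate category whose preset vector is nearest to the
    composite vector of the given field names."""
    vectors = [name_to_vector[name] for name in field_names]
    composite = centroid(vectors)
    return min(preset_vectors, key=lambda c: euclidean(composite, preset_vectors[c]))

# Hypothetical 2-D vectors and preset vectors, invented purely for illustration.
name_to_vector = {
    "name": [1.0, 2.0],
    "gender": [1.0, 3.0],
    "address": [2.0, 2.0],
    "telephone": [2.0, 3.0],
}
preset_vectors = {"personal": [1.5, 2.5], "organization": [6.0, 1.0]}

category = classify_target_field(
    ["name", "gender", "address", "telephone"], name_to_vector, preset_vectors
)
```

With these invented values, the composite vector coincides with the "personal" preset vector, so that category is selected.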

The vector of a field name is a vector obtained by mapping the field name to the real number domain. The composite vector may be understood as a single vector that characterizes the field names of the plurality of fields. The preset vector of a candidate sensitive category may be regarded as the representation of that category in the vector space formed by the vectors of N field names.

In one possible design, the plurality of fields are those fields in the data table to be processed that are separated from the target field by a number of fields less than or equal to a threshold. This design takes into account that fields with dependencies are usually located close together in a data table, and it helps improve the accuracy of the identification result.
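A short sketch of this selection follows. It assumes that the "number of fields spaced from the target field" means the count of fields lying strictly between a field and the target, and that the target field itself is included by default; both are interpretations, and the function name and column names are hypothetical.

```python
def context_fields(field_names, target, threshold, include_target=True):
    """Select fields separated from `target` by at most `threshold` other fields.

    Assumption: the separation of two fields is the number of fields lying
    strictly between them in the table's column order.
    """
    t = field_names.index(target)
    selected = []
    for i, name in enumerate(field_names):
        if i == t:
            if include_target:
                selected.append(name)
            continue
        if abs(i - t) - 1 <= threshold:
            selected.append(name)
    return selected

# Hypothetical column order of a data table.
columns = ["id", "name", "gender", "address", "telephone", "salary"]
nearby = context_fields(columns, "address", threshold=1)
```

Here `nearby` keeps every column except "id", which is separated from "address" by two fields.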

In one possible design, calculating a composite vector of the vectors of the field names of the plurality of fields includes: calculating the composite vector according to a first preset algorithm and the vectors of the field names of the plurality of fields, wherein the composite vector is the vector that minimizes the sum of the absolute values of the distances between it and each of the vectors of the field names of the plurality of fields.

For example, the first preset algorithm may include: a K-means algorithm or a mean shift clustering algorithm, etc.
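One way to make the stated objective concrete is the sketch below. Note the substitution: instead of K-means or mean shift, it uses Weiszfeld's iteration, which directly approximates the point minimizing the sum of Euclidean distances (the geometric median). The choice of Euclidean distance, the function names, and the iteration limits are assumptions.

```python
import math

def total_distance(point, vectors):
    """Sum of Euclidean distances from `point` to every vector."""
    return sum(math.sqrt(sum((a - b) ** 2 for a, b in zip(point, v))) for v in vectors)

def geometric_median(vectors, iterations=200, eps=1e-9):
    """Approximate the point minimizing the sum of Euclidean distances to `vectors`."""
    dim = len(vectors[0])
    # Start from the centroid, which is what one-cluster k-means would return.
    point = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    for _ in range(iterations):
        weights = []
        for v in vectors:
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, v)))
            weights.append(1.0 / max(dist, eps))  # clamp to avoid division by zero
        total = sum(weights)
        new_point = [
            sum(w * v[d] for w, v in zip(weights, vectors)) / total
            for d in range(dim)
        ]
        if max(abs(a - b) for a, b in zip(point, new_point)) < eps:
            return new_point
        point = new_point
    return point

# Hypothetical field-name vectors.
points = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
composite = geometric_median(points)
```

For comparison, the centroid of the same points minimizes the sum of squared distances, so it is pulled further toward the outlying vector than `composite` is.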

In one possible design, the at least two candidate sensitive categories include a first candidate sensitive category, which may be any one of the at least two candidate sensitive categories. The method further includes: acquiring a plurality of related field names of the first candidate sensitive category, wherein the related field names are field names used to characterize the first candidate sensitive category; calculating vectors of the plurality of related field names; and calculating the preset vector of the first candidate sensitive category according to the first preset algorithm and the vectors of the plurality of related field names, wherein the preset vector of the first candidate sensitive category is the vector that minimizes the sum of the absolute values of the distances between it and each of the vectors of the plurality of related field names.

In one possible design, the method further includes: acquiring field names of N fields in a plurality of data tables, where N is an integer greater than or equal to 2; and training on the field names of the N fields according to a second preset algorithm to obtain vectors of the field names of the N fields, wherein the higher the probability that different fields appear in the same data table, the shorter the distance between the vectors of their field names. In this case, calculating the vectors of the field names of the plurality of fields includes: obtaining the vectors of the field names of the plurality of fields from the vectors of the field names of the N fields.

For example, the second preset algorithm may include: skip-gram algorithm or CBOW algorithm, etc.

In a second aspect, there is provided an apparatus for identifying sensitive data, which may be used to perform any of the methods provided in the first aspect.

In one possible design, the apparatus may be divided into functional blocks according to any one of the methods provided in the first aspect. For example, the functional blocks may be divided for the respective functions, or two or more functions may be integrated into one processing block.

In one possible design, the apparatus may include a memory and a processor, the memory being used to store a computer program that, when executed by the processor, causes any one of the methods provided by the first aspect to be performed. By way of example, the apparatus may be a computer device or a chip.

In a third aspect, a computer-readable storage medium is provided, which contains instructions that, when executed on a computer, cause the computer to perform any one of the methods provided in the first aspect.

In a fourth aspect, a computer program product is provided which, when run on a computer, causes any of the methods provided by the first aspect to be performed.

It is understood that any one of the above-mentioned apparatuses, computer-readable storage media, or computer program products for identifying sensitive data is used to execute the corresponding method provided above, and therefore, the beneficial effects achieved by the above-mentioned apparatuses, computer-readable storage media, or computer program products can refer to the beneficial effects in the corresponding method, and are not described herein again.

Drawings

FIG. 1 is a schematic diagram of a vector space according to an embodiment of the present application;

FIG. 2 is a schematic diagram of categories of sensitive data according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a specific example of the categories of sensitive data shown in FIG. 2 according to an embodiment of the present application;

FIG. 4 is a schematic architecture diagram of a computer system according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application;

FIG. 6 is a schematic flowchart of a training method according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a trained vector space according to an embodiment of the present application;

FIG. 8 is a schematic flowchart of a method for obtaining a preset vector according to an embodiment of the present application;

FIG. 9 is a schematic diagram of a preset vector of a candidate sensitive category according to an embodiment of the present application;

FIG. 10 is a schematic flowchart of a method for identifying sensitive data according to an embodiment of the present application;

FIG. 11 is a schematic diagram of a logical structure of a computer device according to an embodiment of the present application;

FIG. 12 is a schematic diagram of a composite vector according to an embodiment of the present application;

FIG. 13 is a schematic structural diagram of an apparatus for identifying sensitive data according to an embodiment of the present application.

Detailed Description

Hereinafter, terms referred to in the examples of the present application are explained:

1) data, data table, field name, content corresponding to field

Data in this application refers to structured data. Structured data is data that is logically expressed and implemented by a two-dimensional table structure (that is, a data table), strictly conforms to data format and length specifications, and is mainly stored and managed in relational databases. Structured data, also referred to as row data, generally has the following characteristics: data is organized in rows, one row represents the information of one entity, and every row has the same attributes. An example data table is shown in Table 1:

TABLE 1

Name	Gender	Address	Telephone
Zhang Yi	Female	Xi'an, Shaanxi Province	012345678900
Wang San	Male	Shenzhen, Guangdong Province	012345678901
Li San	Male	Shenzhen, Guangdong Province	012345678903

Each column in the data table represents a field; for example, Table 1 contains 4 fields. Each field has a field name; for example, the field names of the fields in Table 1 are "name", "gender", "address", and "telephone". The information in the cells of a field other than the cell holding the field name is referred to as the content corresponding to the field, and the field name is usually in the first row of the data table. For example, in Table 1, the content corresponding to the field "name" is "Zhang Yi, Wang San, and Li San", and the content corresponding to the field "address" is "Xi'an, Shaanxi Province; Shenzhen, Guangdong Province; and Shenzhen, Guangdong Province".
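The distinction between field names and the content corresponding to a field can be shown with a small sketch. The dictionary layout and the helper name `field_content` are hypothetical; telephone numbers are kept as strings so that leading zeros survive.

```python
# A data table modeled as a header (field names) plus rows of cell values.
table = {
    "fields": ["name", "gender", "address", "telephone"],
    "rows": [
        ["Zhang Yi", "female", "Xi'an, Shaanxi Province", "012345678900"],
        ["Wang San", "male", "Shenzhen, Guangdong Province", "012345678901"],
        ["Li San", "male", "Shenzhen, Guangdong Province", "012345678903"],
    ],
}

def field_content(table, field_name):
    """Return the content corresponding to a field: every cell in its column
    except the field name itself."""
    column = table["fields"].index(field_name)
    return [row[column] for row in table["rows"]]

names = field_content(table, "name")
```

The identification method described in this application works from `table["fields"]`, that is, from the field names rather than from the cell contents.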

2) Vector of a field name, vector space

Before explaining the vector of field names, the word vector is first briefly introduced:

Word embedding is a general term for language-modeling and representation-learning techniques in natural language processing (NLP). Conceptually, it embeds a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, so that each word or phrase is mapped to a vector over the real numbers. Its aim is to quantify and classify semantic similarity between linguistic items based on their distributional properties in large samples of language data.

Training a model on training data (that is, multiple words or phrases) yields a vector for each word or phrase. The embodiments of the application do not limit the training model or the way it is trained. For example, the training model may be a continuous bag-of-words (CBOW) model or a skip-gram model.

The vector of a field name is a vector obtained by mapping the field name to the real number domain.

The vector space is the space formed by the vectors of a plurality of field names. The vector of each field name is an element of the vector space. Taking a vector dimension of 3 as an example, any point in the three-dimensional vector space can be characterized by coordinates (x, y, z). Assuming that the point (1,2,1) in three-dimensional space represents the field name "name", the vector of "name" is (1,2,1), as shown in FIG. 1. Assuming that the point (2,1,0) in three-dimensional space represents the field name "address", the vector of "address" is (2,1,0), as shown in FIG. 1. Similarly, every field name can be characterized by a point in three-dimensional space.

3) Sensitive data, sensitive categories

The sensitive data refers to data related to sensitive information such as privacy and security in a data table. For example, addresses, contact details, credential information, account information, and the like are generally considered sensitive data.

The sensitive category refers to a category obtained by classifying sensitive data. The sensitivity category may be predefined and may be updated after the predefined. When data desensitization is performed on sensitive data, desensitization processing may be performed based on the category of the sensitive data.

In some embodiments of the present application, the sensitive categories may include a primary category and a secondary category.

The primary category is the result of classifying sensitive data according to its content. The primary category may be referred to as a coarse category of sensitive data. As shown in FIG. 2, C-i represents the i-th primary category, where i is an integer greater than or equal to 2.

The secondary category is the result of further classifying a primary category. The secondary category may be referred to as a fine category of sensitive data. As shown in FIG. 2, C-i-j represents the j-th secondary category in the i-th primary category, where j is an integer greater than or equal to 2.

It should be noted that which primary categories the sensitive categories include, and which secondary categories each primary category includes, may be predefined, for example based on statistical analysis of a large amount of data. The primary and secondary categories of sensitive data may also be updated after they are predefined.

Optionally, before performing the sensitive data identification, the sensitivity level corresponding to each secondary category may be predefined, for example, the sensitivity level of C-i-j may be defined as P-i-j, and in one example, P-i-j may be one value in the set {0,1,2,3,4 }. The larger the value of P-i-j is, the higher the sensitivity level of C-i-j is. When data desensitization processing is performed on sensitive data, the higher the sensitivity level is, the higher the degree of desensitization performed on the data is.
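The relationship between sensitivity level and degree of desensitization can be sketched as follows. The level assignments, the category keys, and the masking rule (hide a share of trailing characters proportional to the level) are all hypothetical; the application only states that a higher level implies a higher degree of desensitization.

```python
# Hypothetical sensitivity levels P-i-j for two secondary categories C-i-j.
SENSITIVITY_LEVEL = {"C-1-1": 3, "C-1-2": 1}

def mask(value, level, max_level=4):
    """Replace a share of trailing characters with '*'; the share grows with
    the sensitivity level (0 hides nothing, max_level hides everything)."""
    hidden = (len(value) * level) // max_level
    return value[: len(value) - hidden] + "*" * hidden

masked_high = mask("012345678900", SENSITIVITY_LEVEL["C-1-1"])
masked_low = mask("012345678900", SENSITIVITY_LEVEL["C-1-2"])
```

A category with level 3 hides three quarters of the value, while a category with level 1 hides one quarter.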

FIG. 3 shows a specific example of the categories of sensitive data. In FIG. 3, the primary categories of sensitive data may include address information, certificate numbers, and account information. The secondary categories of address information may include a personal category and an organization category; that is, address information may be further divided into personal address information and organization address information, and personal address information is generally more sensitive than organization address information. The secondary categories of certificate numbers may include a personal category and an organization category; that is, certificate numbers may be further divided into personal certificate numbers and organization certificate numbers, and personal certificate numbers are generally more sensitive than organization certificate numbers. The secondary categories of account information may include a bank card category and an instant messaging software category; that is, account information may be further divided into bank card account information and instant messaging software accounts (such as WeChat IDs), and bank card account information is generally more sensitive than instant messaging software accounts.

It should be noted that the above description is only an example and does not limit the sensitive categories to which the embodiments of the present application are applicable. Example 1: the sensitive categories may include a plurality of primary categories, and some or all of the primary categories may be further divided into a plurality of secondary categories. For example, the account information shown in FIG. 3 need not be divided into secondary categories. Example 2: the sensitive categories may include a plurality of primary categories, some or all of which may be further divided into a plurality of secondary categories, and any one or more of the secondary categories may be further divided into a plurality of tertiary categories. For example, the instant messaging software accounts in FIG. 3 may be further divided into mailbox accounts, WeChat IDs, and so on.

4) Candidate sensitive categories

A candidate sensitive category is a possible sensitive category of a field. For example, referring to FIG. 3, if the primary category of a field has been determined to be address information, the candidate sensitive categories of the field are the personal category and the organization category. For another example, referring to Example 2 above, if the secondary category of a field has been determined to be the instant messaging software category, the candidate sensitive categories of the field may be the mailbox category and the WeChat category.
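The lookup of candidate sensitive categories can be sketched as a walk down a category tree. The nested-dictionary layout and the function name are assumptions; the category names follow the FIG. 3 example and Example 2 as described in the text.

```python
# Category hierarchy following the FIG. 3 example, with the tertiary
# categories of Example 2 under instant messaging software.
CATEGORY_TREE = {
    "address information": {"personal": {}, "organization": {}},
    "certificate number": {"personal": {}, "organization": {}},
    "account information": {
        "bank card": {},
        "instant messaging software": {"mailbox": {}, "WeChat": {}},
    },
}

def candidate_categories(path=()):
    """Return the categories one level below the already-determined path."""
    node = CATEGORY_TREE
    for name in path:
        node = node[name]
    return list(node)

after_primary = candidate_categories(("address information",))
after_secondary = candidate_categories(
    ("account information", "instant messaging software")
)
```

An empty path returns the primary categories, which are the candidates when nothing has been determined yet.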

5) Other terms

In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.

In the embodiments of the present application, "at least one" means one or more. "plurality" means two or more.

In the embodiments of the present application, "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

The technical scheme provided by the embodiment of the application can be applied to a computer system. As shown in fig. 4, the computer system comprises a database 1 and a computer device 2. Wherein the database 1 is used for storing data tables containing sensitive data. The computer device 2 is used for executing the method provided by the embodiment of the application. Optionally, the computer device 2 is also used to perform a data desensitization process.

In one example, the database 1 may be integrated in the computer device 2. In another example, the database 1 may be located on a device other than the computer device 2.

In one example, as shown in FIG. 5, the computer device 2 includes: at least one processor 201, a communication line 202, a memory 203, and at least one communication interface 204.

The processor 201 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs of the solutions of the present application.

The communication line 202 may include a path for transmitting information between the aforementioned components.

The communication interface 204 may be any transceiver-like device for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).

The memory 203 may be, but is not limited to, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be separate and coupled to the processor via the communication line 202. The memory may also be integrated with the processor. The memory provided by the embodiments of the application may generally be non-volatile. The memory 203 is used for storing computer-executable instructions for executing the solutions of the application, and their execution is controlled by the processor 201. The processor 201 is configured to execute the computer-executable instructions stored in the memory 203, thereby implementing the methods provided by the embodiments described below.

Optionally, the computer-executable instructions in the embodiments of the present application may also be referred to as application program codes, which are not specifically limited in the embodiments of the present application.

In a specific implementation, as an embodiment, the computer device 2 may include multiple processors, such as the processor 201 and the processor 207 in FIG. 5. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).

In a specific implementation, as an embodiment, the computer device 2 may also include an output device 205 and an input device 206. The output device 205 communicates with the processor 201 and may display information in a variety of ways. For example, the output device 205 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like. The input device 206 communicates with the processor 201 and may receive user input in a variety of ways. For example, the input device 206 may be a mouse, a keyboard, a touch screen device, or a sensing device, among others.

It should be noted that, in order to describe the technical solution provided by the embodiment of the present application more clearly, the following description is made based on a training phase, a preset vector obtaining phase, and a sensitive data identification phase. The training phase and the preset vector obtaining phase may be regarded as a preprocessing process for executing the sensitive data identification phase, and these two phases may be executed online or offline.

Training phase:

As shown in FIG. 6, the training phase may include the following steps:

S101: The computer device obtains the field names of N fields in a plurality of data tables, where N is an integer greater than or equal to 2.

In particular, the computer device may retrieve a plurality of data tables from a database. The database may be a database internal to the computer device or may be a database external to the computer device. The plurality of data tables may be any plurality of data tables stored in a database, for example, the plurality of data tables may include: payroll tables, sales tables, personal health information tables, personal information registration tables, enterprise base case tables, and the like. The N fields may be some or all of the fields in the plurality of data tables.

S102: The computer device trains on the field names of the N fields based on a training model (such as a skip-gram model or a CBOW model) to obtain a vector for each of the N field names. The higher the probability that different fields appear in the same data table, the shorter the distance between the vectors of their field names.

For the specific implementation of step S102, reference may be made to the prior art.

In the following, the training model is briefly described:

The input information of the training model may include the N field names and identification information of the data table to which each field name belongs. For example, the input information may include: a personal health information table containing field names such as name, gender, date of birth, address, contact information, weight, and eyesight; a personal information registration table containing field names such as name, gender, date of birth, address, contact information, and education background; and a payroll table containing field names such as name, payroll number, level, basic salary, commission, tax deduction, and net pay.
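A sketch of how such input might be turned into training samples follows. It assumes that each data table acts as one context window, so every ordered pair of distinct field names in the same table becomes a (center, context) pair of the kind a skip-gram model consumes. The table identifiers, the shortened field lists, and the function name are hypothetical.

```python
from itertools import permutations

# Hypothetical training input: table identifier -> field names in that table.
tables = {
    "personal_health_information": ["name", "gender", "address", "weight", "eyesight"],
    "personal_information_registration": ["name", "gender", "address", "education background"],
    "payroll": ["name", "payroll number", "level"],
}

def training_pairs(tables):
    """Build (center, context) pairs from field names that share a data table."""
    pairs = []
    for field_names in tables.values():
        # Assumption: the whole table is the context window of each field name.
        pairs.extend(permutations(field_names, 2))
    return pairs

pairs = training_pairs(tables)
```

Field names that never share a table, such as "weight" and "level", produce no pair, which is what lets training place them far apart.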

Optionally, the input information of the training model may further include the dimension of the vectors of the field names. The dimension of a vector is the number of elements it contains, and it may be several, tens, hundreds, or even thousands. The higher the dimension, the more accurately a vector can reflect the relationship between the field name it represents and other field names, and therefore the more accurate the result of using the training result for sensitive data identification. However, the higher the dimension, the more complex the training process and the subsequent process of using the training result for sensitive data identification. The dimension can therefore be chosen by weighing the accuracy of the identification result against the computational complexity. The dimension may be predefined or may be a custom value (e.g., a value entered by a user). In practical applications, the dimension may be selected based on the total amount and characteristics of the training data.

The training objective is to obtain the vectors of the N field names subject to the condition that the higher the probability that different fields appear in the same data table, the shorter the distance between the vectors of their field names.

The output information of the training model includes the vectors of the N field names.
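The training objective can be illustrated without a neural model. The sketch below is a deliberately simple stand-in, not skip-gram or CBOW: it represents each field name by its co-occurrence counts with every field name across tables, so field names that tend to share tables end up with similar vectors. The tables, names, and functions are hypothetical, and field names are assumed to be unique within a table.

```python
import math

# Hypothetical training input: table identifier -> field names in that table.
tables = {
    "personal_health_information": ["name", "gender", "address", "weight", "eyesight"],
    "personal_information_registration": ["name", "gender", "address", "education background"],
    "payroll": ["name", "payroll number", "level"],
}

def cooccurrence_vectors(tables):
    """Map each field name to a vector of co-occurrence counts over all field names."""
    vocabulary = sorted({name for names in tables.values() for name in names})
    index = {name: i for i, name in enumerate(vocabulary)}
    vectors = {name: [0.0] * len(vocabulary) for name in vocabulary}
    for names in tables.values():
        for a in names:
            for b in names:
                vectors[a][index[b]] += 1.0
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vectors = cooccurrence_vectors(tables)
same_tables = cosine_similarity(vectors["gender"], vectors["address"])
never_together = cosine_similarity(vectors["gender"], vectors["level"])
```

In this toy data, "gender" and "address" always appear together and are therefore more similar than "gender" and "level", which never share a table. A trained skip-gram or CBOW model would additionally compress the vectors to a chosen dimension.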

FIG. 7 shows an example of a trained vector space provided in an embodiment of the present application. The a-dimensional vector space shown in FIG. 7 (represented by coordinate axes X1 to Xa) includes the vectors of the N field names, and each point in FIG. 7 represents the vector of one field name. a is an integer greater than or equal to 3.

It will be appreciated that mapping field names into the word vector space yields vectors of field names, so that the similarity between field names can be computed from the distances between their vectors. Common measures of similarity include the Euclidean distance, the Manhattan distance, the Chebyshev distance, and the cosine distance. Taking the cosine distance as an example, the cosine of the included angle in geometry can be used to measure the difference between the directions of two vectors.

The cosine of the angle between vector A (x1, y1) and vector B (x2, y2) in two-dimensional space is:

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$

By extension, the cosine of the angle between two N-dimensional vectors (x11, x12, …, x1N) and (x21, x22, …, x2N) is:

$$\cos\theta = \frac{\sum_{k=1}^{N} x_{1k} x_{2k}}{\sqrt{\sum_{k=1}^{N} x_{1k}^2}\,\sqrt{\sum_{k=1}^{N} x_{2k}^2}}$$

The cosine of the included angle takes values in [-1, 1]. In practical implementation, by applying a constraint, the components of the vectors of the N field names can all be made non-negative, so that the cosine value falls within the range [0, 1]. The closer the value is to 1, the higher the similarity between the two vectors; the closer it is to 0, the less related the two vectors are.
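The measures named above can be written out in a few lines of Python. The function names are illustrative, and the example vectors are the ones used for "name" and "address" in the FIG. 1 description; the second distance listed is the Manhattan (city-block) distance.

```python
import math

def euclidean(u, v):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute component differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    """Largest absolute component difference."""
    return max(abs(a - b) for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the included angle; 1 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

name_vector = (1.0, 2.0, 1.0)     # vector of "name" from the FIG. 1 example
address_vector = (2.0, 1.0, 0.0)  # vector of "address" from the FIG. 1 example

similarity = cosine_similarity(name_vector, address_vector)
```

Both example vectors have only non-negative components, so `similarity` lies in [0, 1], consistent with the constraint described above.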

Preset vector obtaining phase:

As shown in FIG. 8, the method for obtaining the preset vector may include the following steps:

S201: The computer device obtains a plurality of related field names of a candidate sensitive category to be processed.

The plurality of related field names of the candidate sensitive categories to be processed are field names of a plurality of fields for characterizing the candidate sensitive categories to be processed. For example, when the candidate sensitive category to be processed is a personal category, since the most commonly used fields characterizing personal information are "name, gender, date of birth, phone, mailbox", etc., the field names of these several fields can be taken as the related field names of the personal category. For another example, when the candidate sensitive category to be processed is a business category, since the fields that represent business information most commonly are "name, address, telephone, creation time and scale", etc., the field names of these several fields may be taken as the relevant field names of the business category. For another example, when the first candidate sensitive type is a bank card category, since the most commonly used field for representing the information of the bank card is "a name of a bank (such as a chinese bank, a transportation bank, etc.)" of the bank card, the name of the bank to which the bank card belongs may be used as a related field name of the bank card category.

The candidate sensitive category to be processed may be any one of the predefined candidate sensitive categories. For example, it may be any one of the personal category, the organization category, the bank card category, and the instant messaging software category in Fig. 2.

The related field names of the candidate sensitive category to be processed may be predefined or indicated by a user. For example, they may be predefined based on statistical analysis of the field names in a large number of data tables.

S202: the computer device calculates a vector of the plurality of related field names.

Specifically, after the training phase is finished, the computer device may store a correspondence between each field name of the N field names and the vector corresponding to the field name; in S202, for each related field name, the computer device may search the N field names for the related field name, and then obtain a vector corresponding to the related field name.

For example, based on the example in S201, assuming that the candidate sensitive category to be processed is the personal category, the computer device may obtain the vectors of the field names "name, gender, age, address, telephone, mailbox" from the vectors of the N field names obtained in the training stage, for example by looking up Table 5.
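The lookup in S202 can be pictured as a dictionary access over the stored correspondence between field names and vectors. The sketch below is illustrative only: the table `FIELD_VECTORS`, its three-dimensional values, and the function `lookup_vectors` are hypothetical, and reporting unknown names separately is an assumption rather than something the application specifies.

```python
# Hypothetical output of the training stage: field name -> vector.
# The numbers are made up for illustration and are not taken from Table 5.
FIELD_VECTORS = {
    "name": [0.8, 0.3, 0.1],
    "gender": [0.9, 0.1, 0.2],
    "age": [0.9, 0.2, 0.1],
    "address": [0.6, 0.6, 0.1],
    "telephone": [0.7, 0.5, 0.2],
    "mailbox": [0.8, 0.2, 0.3],
}


def lookup_vectors(related_names, field_vectors):
    """Return (found, missing): vectors for known names, and the unknown names."""
    found = {}
    missing = []
    for name in related_names:
        if name in field_vectors:
            found[name] = field_vectors[name]
        else:
            missing.append(name)
    return found, missing
```

A caller would pass the related field names of one candidate sensitive category and then feed the values of `found` into the preset vector calculation of S203.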

S203: the computer device calculates a preset vector of candidate sensitive categories to be processed according to a preset algorithm such that a sum of absolute values of distance differences between the preset vector and each of the vectors of the plurality of related field names is minimized.

The preset vector of the candidate sensitive category to be processed can be regarded as a representation of the candidate sensitive category to be processed in a vector space formed by vectors of N field names.

Calculating the preset vector of the candidate sensitive category to be processed in step S203 is a simple optimization problem.

For example, assume that the vectors of the plurality of related field names of the candidate sensitive category to be processed are denoted {V1, V2, …, Vi, …, Vn}, where i is greater than or equal to 1 and less than or equal to n, and i and n are integers, and that the preset vector of the candidate sensitive category to be processed is denoted Vc. Calculating the preset vector in S203 then amounts to solving the following optimization problem:

Objective function: $\min f(V_C)$.

Constraint conditions:

$f(V_C) = \sum_{i=1}^{n} d(V_C, V_i)$, i.e., the sum of the distances d(Vc, Vi) from Vc to each Vi;

$V_C \in \mathbb{R}^{N}$, i.e., Vc is an N-dimensional vector.

Optionally, the algorithms commonly used for solving this optimization problem (i.e., the preset algorithm in S203) may include the K-means algorithm, the mean shift clustering algorithm, and the like.
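One possible way to approximate a vector that minimizes the sum of distances to a set of vectors can be sketched as follows. The sketch assumes Euclidean distance and uses Weiszfeld's iteration for the geometric median. This is a substitute chosen for brevity and is not the K-means or mean shift algorithm named above; the function name, the iteration count, and the tolerance are likewise assumptions.

```python
import math


def geometric_median(vectors, iterations=200, eps=1e-9):
    """Approximate the point minimizing the sum of Euclidean distances to `vectors`."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    # Start from the arithmetic mean of the input vectors.
    current = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    for _ in range(iterations):
        # Weight each vector by the inverse of its distance to the estimate;
        # the max() guards against division by zero on an exact hit.
        weights = [1.0 / max(math.dist(current, v), eps) for v in vectors]
        total = sum(weights)
        updated = [
            sum(w * v[k] for w, v in zip(weights, vectors)) / total
            for k in range(dim)
        ]
        if math.dist(updated, current) < eps:
            return updated
        current = updated
    return current
```

Calling `geometric_median` on the vectors of the related field names of one category would give a candidate for that category's preset vector Vc under these assumptions.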

By taking each candidate sensitive category in turn as the candidate sensitive category to be processed and executing S201 to S203, the preset vector of each candidate sensitive category is obtained. For example, based on Fig. 3, a preset vector of the personal category, a preset vector of the organization category, a preset vector of the bank card category, and a preset vector of the instant messaging software category may be obtained, respectively.

Fig. 9 is a schematic diagram of preset vectors of candidate sensitive categories. Fig. 9 is drawn based on Fig. 8 and illustrates the preset vectors of two candidate sensitive categories, namely the personal category and the organization category. Fig. 9 takes the following as an example: the related field names of the personal category are "name, gender, age, address, phone, mailbox", and the related field names of the organization category are "name, address, phone, creation time, and size".

It should be noted that the execution subjects of any two or more of the training stage, the preset vector acquisition stage, and the sensitive data identification stage may be the same or different. For example, one computer device executes the training stage and the preset vector acquisition stage, and the results of these two stages (i.e., the vectors of the N field names and the preset vector of each candidate sensitive category) are then input to another computer device, which executes the sensitive data identification stage. As another example, one computer device executes the training stage, and the result of the training stage (i.e., the vectors of the N field names) is then input to another computer device, which executes the preset vector acquisition stage and the sensitive data identification stage. Other examples are not listed one by one.

Sensitive data identification stage:

As shown in Fig. 10, the method for identifying sensitive data may include the following steps:

S301: The computer device determines, according to a retrieval request, the data table to be processed that satisfies the retrieval request.

The data table to be processed is the data table that the computer device retrieves from the database according to the user's retrieval request and that satisfies the retrieval condition in that request. The data table to be processed may be one of the plurality of data tables in S101, or may be a data table other than the plurality of data tables in S101. Before the data table to be processed is transmitted to the user, it is necessary to identify whether the data table contains sensitive data and to perform desensitization processing on the identified sensitive data.

In one example, as shown in Fig. 11, the computer device includes a software application, a database engine, and a sensitive data identification module. The software application, such as a data presentation platform, a data exchange platform, a data processing platform, or a data consumption platform, sends a data retrieval request to the database engine. The database engine retrieves, from the database, the data table to be processed that satisfies the data retrieval request, and then sends the retrieved data table to the sensitive data identification module. The sensitive data identification module identifies the sensitive data in the data table to be processed according to the information sent by the database engine (for example, by executing S302 to S309).

Optionally, the computer device may further include a desensitization module, configured to perform desensitization processing on the sensitive data according to an identification result of the sensitive data identification module, and return data obtained after the desensitization processing to the software application that sends the data retrieval request.

In one example, the software application in FIG. 11 may be a software application installed on other computer devices.

S302: The computer device determines the primary category of the content corresponding to the target field according to the content corresponding to the target field in the data table to be processed. For the related description and specific examples of the primary category, reference may be made to the description above, which is not repeated here.

In the embodiment of the present application, when desensitization processing is performed on a data table to be processed, identification of sensitive data may be performed by using the same method for each field.

For example, the computer device determines the primary category of the content corresponding to the target field based on regular expressions and the content corresponding to the target field. A regular expression is a computer science concept generally used to retrieve or replace text that conforms to a certain pattern (or rule). For the specific implementation of determining the primary category of the content corresponding to the target field based on regular expressions, reference may be made to the prior art, which is not described here again.
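A regular-expression check of this kind might look like the following sketch. Everything specific in it is hypothetical: the rule base `PRIMARY_RULES`, its two patterns, the category labels, and the `min_ratio` threshold are illustrative assumptions, since the application defers the actual rules to the prior art.

```python
import re

# Hypothetical rule base: primary category -> pattern that a cell value must match.
PRIMARY_RULES = {
    "account information": re.compile(r"\d{5,19}"),
    "mailbox": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def primary_category(values, rules=PRIMARY_RULES, min_ratio=0.8):
    """Return the first category whose pattern fully matches enough cell values."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return None
    for category, pattern in rules.items():
        hits = sum(1 for v in cleaned if pattern.fullmatch(v))
        if hits / len(cleaned) >= min_ratio:
            return category
    return None  # no rule matched; the content is not classified
```

The function looks only at the content of the target field, which is why a second, context-based step is needed to separate secondary categories that share the same content format.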

S303: the computer device determines whether the primary category of the content corresponding to the target field includes a secondary category.

If not, executing S304; if yes, S305 is performed.

S304: the computer device takes the primary category determined in S302 as the sensitive category of the content corresponding to the target field.

S305: the computer device obtains field names of a plurality of fields in the data table to be processed.

The plurality of fields may be any plurality of fields in the data table to be processed. Based on the example in S301, the plurality of fields may be the fields that the database engine sends to the sensitive data identification module. For example, assuming that the data table to be processed is Table 1 and the target field is the address, the field names acquired in S305 may be name, gender, and telephone.

Optionally, the multiple fields are fields in the to-be-processed data table, where the number of fields spaced from the target field is less than or equal to a threshold. The target field may be any field in the data table to be processed. The specific value of the threshold and the determination method of the specific value are not limited in the embodiment of the application. This is a technical solution proposed in consideration of "fields with dependencies are usually located in a relatively centralized location of the data table". This helps to improve the accuracy of the recognition result. For example, taking the plurality of fields as fields with the number of fields spaced from the target field being 1 in the to-be-processed data table as an example, assuming that the to-be-processed data table is table 1 and the target field is an address, the field names acquired in S305 may be gender and telephone.

In the following, by way of an example, it is illustrated that fields with dependencies are typically located in a relatively centralized location of a data table. For example, if a table contains more fields and includes fields representing various types of information, such as a sales table including both fields representing personal information such as name, address, telephone, etc. and fields representing sales information such as trade name, commodity price, time of sale, etc., the fields usually used for representing the same type of information (such as personal information or sales information) are often more concentrated, such as the fields recorded in the sales table are sequentially: name, address, telephone, trade name, price of goods, time of sale, etc.
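The selection of fields near the target field can be sketched as a sliding window over the ordered field names. The function name `neighbor_field_names`, the interpretation of the threshold as a maximum difference in column position, and the exclusion of the target field itself are assumptions made for this illustration.

```python
def neighbor_field_names(field_names, target, threshold):
    """Return names of fields at most `threshold` positions away from `target`."""
    idx = field_names.index(target)  # raises ValueError if the target is absent
    low = max(0, idx - threshold)
    high = min(len(field_names), idx + threshold + 1)
    # Keep the original column order and leave out the target field itself.
    return [
        name
        for pos, name in enumerate(field_names[low:high], start=low)
        if pos != idx
    ]


# Field order of the sales table used as an example in the text.
sales_fields = ["name", "address", "telephone",
                "trade name", "commodity price", "time of sale"]
context = neighbor_field_names(sales_fields, "address", 1)
```

With a threshold of 1, the context of "address" in the sales table consists only of the adjacent personal-information fields, so the sales fields further to the right do not influence the result.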

S306: the computer device calculates a vector of field names for the plurality of fields. Specifically, the computer device looks up a vector of field names for each of the plurality of fields from the vector of N field names obtained during the training phase.

For example, assuming that the data table to be processed is table 1, the target field is an address, and the field names acquired in S305 are name, gender, and telephone, in S306, the computer device may search vectors corresponding to the name, gender, and telephone from vectors of N field names obtained in the training phase, as shown in fig. 12.

S307: the computer device calculates a composite vector of vectors of field names of the plurality of fields according to a preset algorithm such that a sum of absolute values of distance differences between the composite vector and each of the vectors of field names of the plurality of fields is minimized.

The composite vector can be understood as a vector that characterizes the field names of the plurality of fields as a whole. Calculating the composite vector is a simple optimization problem; for an example, reference may be made to the example in S203 described above. The preset algorithm used in this process may include the K-means algorithm, the mean shift clustering algorithm, and the like.

Based on the example in S306, the composite vector of the fields "name, gender, and telephone" is shown in Fig. 12.
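When K-means is run with a single cluster, its centroid is simply the arithmetic mean of the input vectors, so a rough composite vector can be sketched as below. Note that the mean minimizes the sum of squared Euclidean distances, so it only approximates the sum-of-distances criterion stated in S307; the function name `centroid` and the sample values are assumptions for illustration.

```python
def centroid(vectors):
    """Arithmetic mean of equal-length vectors (single-cluster K-means centroid)."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Hypothetical vectors for the context fields "name", "gender", and "telephone".
context_vectors = [[0.5, 0.25], [1.0, 0.0], [0.0, 0.5]]
composite = centroid(context_vectors)
```

The resulting `composite` plays the role of the composite vector that is compared with the preset vectors in S308.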

S308: The computer device calculates the distance between the composite vector and the preset vector of each of at least two candidate sensitive categories. The candidate sensitive categories are candidate categories of the target field in the data table to be processed. Each candidate sensitive category has one preset vector; for the manner of obtaining the preset vector of a candidate sensitive category, reference may be made to the description above, which is not repeated here.

For example, assuming that the primary class of the target field obtained after performing S302 is address information based on fig. 3, the "at least two candidate sensitive classes" in S308 include a personal class and an organization class.

As another example, assuming that the primary category of the target field obtained after performing S302 is account information based on fig. 3, the "at least two candidate sensitive categories" in S308 include a bank card category and an instant messaging software category.

S309: The computer device determines, among the preset vectors of the at least two candidate sensitive categories, the preset vector with the shortest distance to the composite vector, and takes the candidate sensitive category corresponding to that preset vector as the sensitive category of the content corresponding to the target field.

That is, among the at least two candidate sensitive categories, the candidate sensitive category that is most similar to the plurality of fields related to the target field is taken as the sensitive category of the content corresponding to the target field. Each candidate sensitive category is characterized by its preset vector, and the plurality of fields related to the target field are characterized by the composite vector.

For example, based on Fig. 3, the sensitive category determined by performing S309 is referred to as a secondary category.

As shown in Fig. 12, based on the example in S307, the secondary category of the target field "address" in Table 1 is the personal category. That is, the content corresponding to the target field is personal address information.
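Steps S308 and S309 amount to a nearest-neighbor decision over the preset vectors. The following sketch assumes cosine distance (one minus the cosine of the included angle) as the distance measure; the function names and any vector values passed in are hypothetical.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v (both non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def nearest_category(composite, preset_vectors):
    """Return the category whose preset vector is closest to the composite vector."""
    if len(preset_vectors) < 2:
        raise ValueError("at least two candidate sensitive categories are expected")
    return min(
        preset_vectors,
        key=lambda category: cosine_distance(composite, preset_vectors[category]),
    )
```

For a target field whose primary category is address information, `preset_vectors` would hold the preset vectors of the personal category and the organization category, and the returned key is the secondary category.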

The sensitive data identification method provided by the embodiment of the application identifies the sensitive category of the content corresponding to the target field based on the distance between the vector of the field names of the fields in the data table to be processed and the vector of the candidate sensitive category of the target field in the data table to be processed, that is, identifies the sensitive category of the content corresponding to the target field based on the context information of the target field. Because the same field has different context information in different data tables, and the context information of the same field can influence the sensitive category of the content corresponding to the field, the sensitive category of the content corresponding to the target field is determined based on the field names of the fields in the data table to which the target field belongs, which is beneficial to improving the identification accuracy of the sensitive category.

In addition, the technical scheme maps the context information of the target field into the vector space, and identifies the sensitive category of the content corresponding to the target field by calculating the similarity between vectors, that is, the sensitive category of the content corresponding to the target field is determined by the distribution characteristics of the context information of the target field in the vector space. Therefore, compared with the technical scheme of identifying the sensitive category of the content corresponding to the target field based on the regular expression, on one hand, the technical scheme can avoid the cost problem caused by writing and maintaining the rule base because the rule base is required to be written and maintained for identifying the sensitive category based on the regular expression; on the other hand, if the rule for identifying the sensitive category of the target field (such as the above-mentioned secondary category) is not stored in the rule base, the sensitive category of the target field cannot be identified based on the rule base, but in the technical scheme, even if the target field is not used in the training stage, the sensitive category of the target field can be identified by means of the context information of the target field, that is, the technical scheme is helpful for improving the robustness and the universality of the algorithm for identifying the sensitive data.

The foregoing mainly describes the scheme provided by the embodiments of the present application from the perspective of the method. To implement the above functions, the computer device includes hardware structures and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented by hardware or by a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiment of the present application, the computer device may be divided into the functional modules according to the method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.

Fig. 13 is a schematic structural diagram of an apparatus 130 for identifying sensitive data according to an embodiment of the present application. In one example, the apparatus 130 may be the computer device described above and is configured to perform the steps performed by the computer device in the foregoing methods. The apparatus 130 may include an obtaining unit 1301, a calculating unit 1302, and a determining unit 1303. Optionally, the apparatus may further include a training unit 1304.

The obtaining unit 1301 is configured to obtain field names of a plurality of fields in the data table to be processed. The calculating unit 1302 is configured to calculate vectors of the field names of the plurality of fields; calculate a composite vector of the vectors of the field names of the plurality of fields; and calculate the distance between the composite vector and a preset vector of each of at least two candidate sensitive categories, where the candidate sensitive categories are candidate categories of a target field in the data table to be processed, and each candidate sensitive category has one preset vector. The determining unit 1303 is configured to determine, among the at least two candidate sensitive categories, the candidate sensitive category corresponding to the preset vector with the shortest distance to the composite vector as the sensitive category of the content corresponding to the target field. For example, in conjunction with Fig. 10, the obtaining unit 1301 may be configured to perform S305, the calculating unit 1302 may be configured to perform S306, S307, and S308, and the determining unit 1303 may be configured to perform S309.

Optionally, the multiple fields are fields in the to-be-processed data table, where the number of fields spaced from the target field is less than or equal to a threshold.

Optionally, the calculating unit 1302 is specifically configured to calculate the composite vector according to a first preset algorithm and the vectors of the field names of the plurality of fields, where the composite vector is a vector that minimizes the sum of absolute values of distance differences with each of the vectors of the field names of the plurality of fields. For example, in conjunction with Fig. 10, the calculating unit 1302 may be configured to perform S307.

Optionally, the at least two candidate sensitive categories include a first candidate sensitive category. The obtaining unit 1301 is further configured to obtain a plurality of related field names of the first candidate sensitive category, where the related field names are field names used to characterize the first candidate sensitive category. The calculating unit 1302 is further configured to calculate vectors of the plurality of related field names, and to calculate a preset vector of the first candidate sensitive category according to a first preset algorithm and the vectors of the plurality of related field names; the preset vector of the first candidate sensitive category is a vector that minimizes the sum of absolute values of distance differences with each of the vectors of the plurality of related field names. For example, in conjunction with Fig. 8, the obtaining unit 1301 may be configured to perform S201, and the calculating unit 1302 may be configured to perform S202 and S203.

Optionally, the obtaining unit 1301 is further configured to obtain field names of N fields in the multiple data tables; n is an integer greater than or equal to 2. The training unit 1304 is configured to train the field names of the N fields according to a second preset algorithm, so as to obtain a vector of the field names of the N fields; the higher the probability that different fields appear in the same data table, the shorter the distance between the vectors of field names of different fields. The calculating unit 1302 is specifically configured to obtain a vector of the field names of the plurality of fields from the vector of the field names of the N fields. For example, in conjunction with fig. 7, the obtaining unit 1301 may be configured to perform S101, and the training unit 1304 may be configured to perform S102.

The means 130 for identifying sensitive data may be a general purpose device or a special purpose device.

As an example, the above-mentioned apparatus 130 for identifying sensitive data may be implemented by the computer device 2 in Fig. 5. The functions performed by some or all of the obtaining unit 1301, the calculating unit 1302, the determining unit 1303, and the training unit 1304 may be implemented by the processor 201 calling a computer program stored in the memory 203.

For the explanation of the related content and the description of the beneficial effects of any of the above-mentioned devices 130 for identifying sensitive data, reference may be made to the above-mentioned method embodiments, and details are not repeated herein.

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented using a software program, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.

The foregoing is only illustrative of the present application. Those skilled in the art can conceive of changes or substitutions based on the specific embodiments provided in the present application, and all such changes or substitutions are intended to be included within the scope of the present application.

Claims

1. A method of identifying sensitive data, the method comprising:

acquiring field names of a plurality of fields in a data table to be processed;

calculating a vector of field names for the plurality of fields;

calculating a composite vector of vectors of field names of the plurality of fields;

calculating the distance between the composite vector and a preset vector of each of at least two candidate sensitive categories, wherein the candidate sensitive categories are candidate categories of a target field in the data table to be processed, and each candidate sensitive category has one preset vector;

and determining, among the at least two candidate sensitive categories, the candidate sensitive category corresponding to the preset vector with the shortest distance to the composite vector as the sensitive category of the content corresponding to the target field.

2. The method of claim 1, wherein the plurality of fields are fields in the pending data table that have a number of fields spaced apart from the target field that is less than or equal to a threshold.

3. The method of claim 1 or 2, wherein said computing a composite vector of vectors of field names of said plurality of fields comprises:

calculating the composite vector according to a first preset algorithm and the vectors of the field names of the plurality of fields; wherein the composite vector is a vector that minimizes the sum of absolute values of distance differences with each of the vectors of the field names of the plurality of fields.

4. The method of any of claims 1 to 3, wherein the at least two candidate sensitivity categories comprise a first candidate sensitivity category; the method further comprises the following steps:

acquiring a plurality of related field names of the first candidate sensitive category, wherein the related field names are used for representing the field names of the first candidate sensitive category;

calculating a vector of the plurality of related field names;

calculating a preset vector of the first candidate sensitive category according to a first preset algorithm and the vectors of the plurality of related field names; the predetermined vector of the first candidate sensitivity class is a vector that minimizes a sum of absolute values of distance differences with each of the vectors of the plurality of related field names.

5. The method according to any one of claims 1 to 4, further comprising:

acquiring field names of N fields in a plurality of data tables; n is an integer greater than or equal to 2;

training the field names of the N fields according to a second preset algorithm to obtain vectors of the field names of the N fields; the higher the probability that different fields appear in the same data table, the shorter the distance between the vectors of the field names of the different fields;

the computing a vector of field names for the plurality of fields includes:

and acquiring the vectors of the field names of the fields from the vectors of the field names of the N fields.

6. An apparatus for identifying sensitive data, the apparatus comprising:

an obtaining unit, configured to obtain field names of a plurality of fields in a data table to be processed;

a calculation unit, configured to calculate vectors of the field names of the plurality of fields; calculate a composite vector of the vectors of the field names of the plurality of fields; and calculate the distance between the composite vector and a preset vector of each of at least two candidate sensitive categories, wherein the candidate sensitive categories are candidate categories of a target field in the data table to be processed, and each candidate sensitive category has one preset vector;

and a determining unit, configured to determine, among the at least two candidate sensitive categories, the candidate sensitive category corresponding to the preset vector with the shortest distance to the composite vector as the sensitive category of the content corresponding to the target field.

7. The apparatus of claim 6, wherein the plurality of fields are fields in the pending data table that have a number of fields spaced apart from the target field that is less than or equal to a threshold.

8. The apparatus according to claim 6 or 7,

the calculation unit is specifically configured to: calculate the composite vector according to a first preset algorithm and the vectors of the field names of the plurality of fields; wherein the composite vector is a vector that minimizes the sum of absolute values of distance differences with each of the vectors of the field names of the plurality of fields.

9. The apparatus of any of claims 6 to 8, wherein the at least two candidate sensitivity categories comprise a first candidate sensitivity category;

the obtaining unit is further configured to obtain a plurality of related field names of the first candidate sensitive category, where the related field names are field names used for characterizing the first candidate sensitive category;

the calculation unit is further configured to calculate a vector of the plurality of related field names; and calculating a preset vector of the first candidate sensitive category according to a first preset algorithm and the vectors of the plurality of related field names; the predetermined vector of the first candidate sensitivity class is a vector that minimizes a sum of absolute values of distance differences with each of the vectors of the plurality of related field names.

10. The apparatus according to any one of claims 6 to 9, wherein the apparatus further comprises a training unit;

the obtaining unit is further configured to obtain field names of N fields in the plurality of data tables; n is an integer greater than or equal to 2;

the training unit is used for training the field names of the N fields according to a second preset algorithm to obtain vectors of the field names of the N fields; the higher the probability that different fields appear in the same data table, the shorter the distance between the vectors of the field names of the different fields;

the calculation unit is specifically configured to obtain, from the vectors of the field names of the N fields, the vectors of the field names of the multiple fields.

11. An apparatus for identifying sensitive data, the apparatus comprising a memory for storing a computer program and a processor for invoking the computer program to perform the method of any of claims 1 to 5.

12. A computer-readable storage medium containing instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1 to 5.