CN113672976A - Sensitive information detection method and device - Google Patents

Info

Publication number
CN113672976A
Authority
CN
China
Prior art keywords
field
sensitive
values
information
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110889223.8A
Other languages
Chinese (zh)
Inventor
张安蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110889223.8A
Publication of CN113672976A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioethics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present specification provides a sensitive information detection method, including: first, obtaining field information of a field to be detected, wherein the field information comprises field attributes and a plurality of field values; then, performing sensitivity judgment on the field information by using a preset rule; further, in a case where the field information is judged to be suspected of belonging to a certain sensitive category, determining a detection algorithm corresponding to the certain sensitive category based on a pre-established mapping relation between alternative detection algorithms and alternative sensitive categories; and processing the field information by using the detection algorithm to obtain a processing result, wherein the processing result indicates whether the field belongs to the certain sensitive category.

Description

Sensitive information detection method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of machine learning technologies, and in particular, to a method and an apparatus for detecting sensitive information.
Background
As industries shift to digitization, the volume of online data keeps growing. For example, users operating on a network platform generate various kinds of operation data, such as browsing data, click data, payment data, and registration information. This massive body of data inevitably contains some sensitive data; once such data is leaked, it threatens the privacy and security of users and enterprises, can cause property loss, and may even harm social stability. Data security has therefore become a focus of concern across society.
However, current approaches to detecting sensitive data rely on a single technique and struggle to meet the requirements of practical applications. A detection scheme is therefore needed that can effectively improve the efficiency and accuracy of sensitive information detection.
Disclosure of Invention
One or more embodiments of the present disclosure describe a sensitive information detection method and apparatus, where, for a field to be detected, a preset rule is first used to pre-discriminate the field, and then a detection algorithm corresponding to a suspected sensitive category indicated by a pre-discrimination result is called to perform detection, so as to obtain an accurate sensitive detection result efficiently and quickly.
According to a first aspect, there is provided a sensitive information detection method, comprising: acquiring field information of a field to be detected, wherein the field information comprises field attributes and a plurality of field values; carrying out sensitivity judgment on the field information by using a preset rule; under the condition that the field information is judged to be suspected of a certain sensitive category, determining a detection algorithm corresponding to the certain sensitive category based on a mapping relation between a pre-established alternative detection algorithm and the alternative sensitive category; and processing the field information by using the detection algorithm to obtain a processing result, wherein the processing result indicates whether the field belongs to the certain sensitive category.
In one embodiment, the field attribute comprises at least one of: field name, field comment, field type, table name of the table to which the field belongs, and table comment.
In one embodiment, the acquiring field information of the field to be detected includes: performing a first sampling and a second sampling on the field values of the field, respectively, to obtain a plurality of first field values and a plurality of second field values; evaluating a first sampling quality of the first sampling based on the plurality of first field values and the plurality of second field values; and determining the plurality of field values based on the plurality of first field values in a case where the first sampling quality meets a preset standard.
In a specific embodiment, evaluating the first sampling quality of the first sampling based on the plurality of first field values and the plurality of second field values includes: calculating a degree of difference between a first field value distribution corresponding to the plurality of first field values and a second field value distribution corresponding to the plurality of second field values; and, in a case where the degree of difference is smaller than a preset threshold, judging that the first sampling quality meets the preset standard.
In one embodiment, the acquiring field information of the field to be detected includes: carrying out first sampling on field values of the fields to obtain a plurality of first field values; determining a plurality of encoding vectors corresponding to the plurality of first field values, and calculating a plurality of similarities between a plurality of pairs of encoding vectors formed based on the plurality of encoding vectors; in a case where the plurality of similarities reflect that the similarity between the plurality of encoding vectors is less than a preset degree, the plurality of field values are determined based on the plurality of first field values.
In a specific embodiment, determining the plurality of field values based on the plurality of first field values includes: and carrying out de-duplication processing on the plurality of first field values to obtain the plurality of field values.
In one embodiment, the preset rules include sensitive class rules for several candidate sensitive classes and non-sensitive class rules for several non-sensitive classes; performing sensitivity judgment on the field information by using the preset rules comprises: performing feature extraction on the field information to obtain a plurality of feature values corresponding to a plurality of preset feature items; judging the plurality of feature values by using the non-sensitive class rules; and, in a case where the feature values do not belong to any of the several non-sensitive classes, judging the feature values by using the sensitive class rules to obtain the certain sensitive category.
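The two-stage judgment described above can be illustrated with a minimal sketch. It is not the implementation of the embodiment: the feature items, the rule tables `NON_SENSITIVE_RULES` and `SENSITIVE_RULES`, the category names, and the "unknown" fallback are all hypothetical and chosen only to show the ordering (non-sensitive class rules first, sensitive class rules second).

```python
from typing import Optional, Tuple

# Hypothetical rule tables: each rule is a predicate over extracted feature values.
NON_SENSITIVE_RULES = {
    "timestamp": lambda f: f["type"] in {"date", "datetime", "timestamp"},
    "boolean_flag": lambda f: f["distinct_values"] <= 2,
}
SENSITIVE_RULES = {
    "phone_number": lambda f: f["digit_ratio"] > 0.9 and f["mean_length"] == 11,
    "user_name": lambda f: "name" in f["name"].lower(),
}


def extract_features(field_info: dict) -> dict:
    """Compute a few illustrative feature values from the field information."""
    values = [str(v) for v in field_info["values"]]
    total_chars = sum(len(v) for v in values) or 1
    digits = sum(ch.isdigit() for v in values for ch in v)
    return {
        "name": field_info["name"],
        "type": field_info["type"],
        "distinct_values": len(set(values)),
        "digit_ratio": digits / total_chars,
        "mean_length": total_chars / max(len(values), 1),
    }


def pre_discriminate(field_info: dict) -> Tuple[str, Optional[str]]:
    """Apply non-sensitive class rules first, then sensitive class rules."""
    features = extract_features(field_info)
    for category, rule in NON_SENSITIVE_RULES.items():
        if rule(features):
            return ("non_sensitive", category)
    for category, rule in SENSITIVE_RULES.items():
        if rule(features):
            return ("suspected", category)
    # Assumption: the text does not say what happens when no rule matches.
    return ("unknown", None)
```

In this sketch a field that matches a non-sensitive class rule never reaches the sensitive class rules, which is what keeps the pre-judgment cheap.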
In one embodiment, after the field information is determined by using a preset rule, the method further includes: and under the condition that the field information is judged to belong to a certain non-sensitive category, discarding the field.
In one embodiment, in the mapping relationship, any candidate sensitive category corresponds to one or more candidate detection algorithms, and the one or more candidate detection algorithms relate to one or more of the following algorithm types: rules, regular expressions, machine learning models.
In one embodiment, the detection algorithm comprises a first machine learning model for the certain sensitive category, the model including an attribute characterization layer, a value characterization layer, a fusion layer, and a fully connected layer; wherein processing the field information by using the detection algorithm to obtain a processing result comprises: performing characterization processing on the field attribute by using the attribute characterization layer to obtain an attribute characterization vector; performing characterization processing on each of the plurality of field values by using the value characterization layer to obtain a plurality of value characterization vectors; fusing the attribute characterization vector with each of the plurality of value characterization vectors by using the fusion layer to obtain a plurality of fusion vectors; processing each of the plurality of fusion vectors by using the fully connected layer to obtain a plurality of detection results; and determining the processing result based on the plurality of detection results.
In a specific embodiment, the certain sensitive category is one of the following: user name, user address, company name.
In a specific embodiment, each detection result includes a probability indicating that the corresponding field attribute-field value pair is identified as the certain sensitive category; wherein determining the processing result based on the plurality of detection results comprises: calculating an average probability of a plurality of probabilities in the plurality of detection results; and under the condition that the average probability is greater than the preset probability, judging that the field belongs to the certain sensitive category, and taking the judgment result as the processing result.
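The averaging step described above can be sketched as follows. The function name `field_is_sensitive` and the default threshold of 0.5 are assumptions for illustration; the text only states that a preset probability is used.

```python
from statistics import mean
from typing import Sequence


def field_is_sensitive(probabilities: Sequence[float], preset_probability: float = 0.5) -> bool:
    """Return True when the average per-value probability exceeds the preset probability.

    Each entry is the probability that one (field attribute, field value) pair
    belongs to the sensitive category. The 0.5 default is an assumed value.
    """
    if not probabilities:
        raise ValueError("at least one detection result is required")
    return mean(probabilities) > preset_probability
```

For example, per-value probabilities of 0.9, 0.8, and 0.4 average to about 0.7, so the field would be judged sensitive under a preset probability of 0.6.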
In one embodiment, the detection algorithm comprises a second machine learning model for the certain sensitive category, together with a number of first rules and/or a number of first regular expressions for field attributes and a number of second rules and/or a number of second regular expressions for field values; wherein processing the field information by using the detection algorithm to obtain a processing result comprises: obtaining a plurality of attribute feature values based on the field attribute of the field by using the first rules and/or the first regular expressions; obtaining a plurality of statistical feature values based on the field values by using the second rules and/or the second regular expressions; and inputting the attribute feature values and the statistical feature values into the second machine learning model to obtain the processing result.
In one embodiment, the method further comprises: in a case where the processing result indicates that the field belongs to the certain sensitive category, determining the sensitivity level corresponding to the field based on a preset mapping relation between alternative fields and alternative sensitivity levels.
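As a rough illustration of how regular expressions could produce the inputs of such a model, the sketch below builds one attribute feature and one statistical feature. The patterns, the phone-number example, and the function names are hypothetical, and the second machine learning model itself is deliberately left out.

```python
import re
from typing import List, Sequence

# Hypothetical patterns for a "phone number" category.
ATTRIBUTE_PATTERNS = [re.compile(r"phone|mobile|tel", re.IGNORECASE)]
VALUE_PATTERNS = [re.compile(r"1\d{10}")]


def attribute_features(field_name: str, field_comment: str) -> List[float]:
    """One 0/1 feature per attribute pattern: does the name or comment match?"""
    text = f"{field_name} {field_comment}"
    return [1.0 if p.search(text) else 0.0 for p in ATTRIBUTE_PATTERNS]


def statistical_features(values: Sequence[object]) -> List[float]:
    """One feature per value pattern: the fraction of sampled values that fully match."""
    n = len(values) or 1
    return [sum(1 for v in values if p.fullmatch(str(v))) / n for p in VALUE_PATTERNS]


def build_model_input(field_name: str, field_comment: str, values: Sequence[object]) -> List[float]:
    """Concatenate attribute and statistical features into a single input vector."""
    return attribute_features(field_name, field_comment) + statistical_features(values)
```

The resulting vector would then be passed to whatever classifier plays the role of the second machine learning model.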
According to a second aspect, there is provided a sensitive information detection method, comprising: acquiring a plurality of pieces of field information of a plurality of fields to be detected, wherein each piece of field information comprises field attributes and a plurality of field values of a corresponding field; based on the plurality of field information, filtering the non-sensitive fields in the plurality of fields according to a first judgment rule set for the non-sensitive fields and a second judgment rule set for the sensitive fields, and obtaining the suspected sensitive type of each suspected sensitive field in the plurality of suspected sensitive fields; aiming at each suspected sensitive field, determining a detection algorithm corresponding to the suspected sensitive category of the suspected sensitive field based on a mapping relation between a pre-established alternative detection algorithm and the suspected sensitive category; and processing the field information of the suspected sensitive field by using the detection algorithm to obtain a processing result, wherein the processing result indicates whether the suspected sensitive field belongs to the suspected sensitive category.
According to a third aspect, there is provided a sensitive information detecting apparatus comprising: the field information acquisition unit is configured to acquire field information of a field to be detected, wherein the field information comprises field attributes and a plurality of field values; the pre-judging unit is configured to judge the sensitivity of the field information by using a preset rule; the algorithm determining unit is configured to determine a detection algorithm corresponding to a certain sensitive category based on a mapping relation between a pre-established alternative detection algorithm and the alternative sensitive category when the field information is judged to be suspected of the certain sensitive category; and the field information processing unit is configured to process the field information by using the detection algorithm to obtain a processing result, and the processing result indicates whether the field belongs to the certain sensitive category.
According to a fourth aspect, there is provided a sensitive information detecting apparatus comprising: the field information acquisition unit is configured to acquire a plurality of pieces of field information of a plurality of fields to be detected, wherein each piece of field information comprises a field attribute and a plurality of field values of a corresponding field; the pre-judging unit is configured to filter the non-sensitive fields in the fields according to a first judging rule set for the non-sensitive fields and a second judging rule set for the sensitive fields based on the information of the fields, and obtain suspected sensitive types of the suspected sensitive fields in the plurality of suspected sensitive fields; an algorithm determining unit, configured to determine, for each suspected sensitive field, a detection algorithm corresponding to a suspected sensitive category of the suspected sensitive field based on a mapping relationship between a pre-established candidate detection algorithm and the suspected sensitive category; and the field information processing unit is configured to process the field information of the suspected sensitive field by using the detection algorithm to obtain a processing result, and the processing result indicates whether the suspected sensitive field belongs to the suspected sensitive category.
According to a fifth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first or second aspect.
According to a sixth aspect, there is provided a computing device comprising a memory having stored therein executable code, and a processor which, when executing the executable code, implements the method of the first or second aspect.
By adopting the method and the device provided by the embodiment of the specification, firstly, field samples are collected based on a data table in a database, and the field samples can comprise field names of corresponding fields, a plurality of field values and other field information; then, pre-judging the field samples by using a preset simple rule; further, if the judgment result is that the field sample is suspected to be of a certain sensitive category, a high-confidence detection algorithm corresponding to the certain sensitive category is called to detect the field sample, so that a more accurate detection result with higher confidence level is obtained, and whether the field sample belongs to the certain sensitive category is indicated. Therefore, the detection efficiency and the accuracy of the detection result can be effectively improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 shows an implementation architecture diagram of a sensitive information detection method according to an embodiment;
FIG. 2 shows a flow diagram of a sensitive information detection method according to one embodiment;
FIG. 3 illustrates a model structure diagram of a first machine learning model according to one embodiment;
FIG. 4 shows a schematic flow diagram of a sensitive information detection method according to another embodiment;
FIG. 5 shows a schematic structural diagram of a sensitive information detection apparatus according to an embodiment;
FIG. 6 shows a schematic structural diagram of a sensitive information detection apparatus according to another embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
As previously mentioned, sensitive information in a database or data record table needs to be detected. Common detection approaches depend entirely on complex rules and regular expressions. They suffer from high maintenance cost, difficult algorithm upgrades, and high manual operation cost; they cannot identify sensitive data that follows no regular pattern; and the complexity of the rules and regular expressions hurts detection performance, so scanning efficiency is hard to guarantee, especially when massive data is scanned.
Based on the above observation and analysis, the inventors propose a sensitive information detection method. FIG. 1 is a schematic diagram of an implementation architecture of the method according to an embodiment. As shown in FIG. 1, field samples are first collected based on a data table in a database, where a field sample may include field information such as the field name of the corresponding field and a plurality of field values. The field sample is then pre-judged by using preset simple rules; the judgment result may be that the field is a non-sensitive field, or that the field is suspected of belonging to a certain sensitive category. If the result is a suspected sensitive category, the detection algorithm corresponding to that category is called to perform dedicated detection on the field sample, yielding a more accurate and more reliable detection result that indicates whether the field sample belongs to that sensitive category. In this way, both detection efficiency and the accuracy of detection results can be effectively improved.
For intuition, assume that 10,000 field samples are collected and that 10 detection algorithms are configured in advance for 10 sensitive categories. Without this scheme, detecting the 10,000 field samples requires 100,000 detection algorithm runs. With this scheme, the rules used in pre-discrimination are simple and computationally cheap; assuming that 100 of the 10,000 field samples are judged to be suspected sensitive fields, and since the suspected sensitive category of each is already determined, only 100 detection algorithm runs are needed, a reduction of three orders of magnitude compared with 100,000.
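The arithmetic in this example can be checked with a few lines. The function names are hypothetical, and the sketch counts only detection algorithm runs; the cost of the cheap pre-discrimination pass over all field samples is not modeled.

```python
def runs_without_prefilter(num_fields: int, num_algorithms: int) -> int:
    """Every field sample is processed by every configured detection algorithm."""
    return num_fields * num_algorithms


def runs_with_prefilter(num_suspected_fields: int) -> int:
    """Each suspected field is processed only by the algorithm of its suspected category."""
    return num_suspected_fields


baseline = runs_without_prefilter(10_000, 10)   # 100,000 runs
filtered = runs_with_prefilter(100)             # 100 runs
reduction_factor = baseline // filtered         # 1,000, i.e. three orders of magnitude
```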
The following describes the steps of the above method with reference to the examples.
Fig. 2 is a schematic flow chart of a sensitive information detection method according to an embodiment, and an execution subject of the method may be any server, device, or equipment cluster having computing and processing capabilities. As shown in fig. 2, the method comprises the steps of:
step S210, field information of a field to be detected is obtained, wherein the field information comprises field attributes and a plurality of field values;
step S220, carrying out sensitivity judgment on the field information by using a preset rule;
step S230, under the condition that the field information is judged to be suspected of a certain sensitive category, determining a detection algorithm corresponding to the certain sensitive category based on a mapping relation between a pre-established alternative detection algorithm and the alternative sensitive category;
step S240, processing the field information by using the detection algorithm to obtain a processing result, where the processing result indicates whether the field belongs to the certain sensitive category.
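The flow of steps S220 to S240 can be sketched as a lookup-and-dispatch routine. Everything concrete here is assumed for illustration: the `field_info` layout, the single "phone_number" category, the name-based pre-judgment rule, the regular-expression detector, and its 0.8 hit-ratio cut-off.

```python
import re
from typing import Callable, Dict, Optional


def cheap_rule_check(field_info: dict) -> Optional[str]:
    """S220 (sketch): a simple rule that looks only at the field name."""
    name = field_info["attributes"].get("field_name", "").lower()
    if "phone" in name or "mobile" in name:
        return "phone_number"
    return None


def detect_phone_number(field_info: dict) -> bool:
    """A hypothetical detection algorithm for the "phone_number" category."""
    values = [str(v) for v in field_info["values"]]
    if not values:
        return False
    hits = sum(1 for v in values if re.fullmatch(r"1\d{10}", v))
    return hits / len(values) >= 0.8  # assumed cut-off


# Pre-established mapping between sensitive categories and detection algorithms.
ALGORITHM_BY_CATEGORY: Dict[str, Callable[[dict], bool]] = {
    "phone_number": detect_phone_number,
}


def detect_field(field_info: dict) -> Optional[str]:
    """Return the confirmed sensitive category, or None if nothing is confirmed."""
    suspected = cheap_rule_check(field_info)            # S220: pre-judgment
    if suspected is None:
        return None
    detector = ALGORITHM_BY_CATEGORY[suspected]         # S230: look up the algorithm
    return suspected if detector(field_info) else None  # S240: run it
```

Only fields flagged by the pre-judgment reach a detector, and each reaches exactly one.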
The above steps are introduced as follows:
step S210, acquiring field information of a field to be detected, wherein the field information includes field attributes and a plurality of field values. It is to be understood that the data source of the field is not limited, for example, the field may be from a data table in a database, or may be from a data table in another storage device or storage unit, or may be from a data record in a network platform or a network server.
In one embodiment, the field attributes may include one or more of the following: field name, field comment, table name, table comment, and field type. In one specific embodiment, the field name and the field comment are the English name of the field and its Chinese translation, respectively, e.g., username and "user name". In another specific embodiment, the field name and the field comment are the Chinese name of the field and a Chinese description, respectively, e.g., the name is "user identifier" and the description is "identity card number". In yet another specific embodiment, the field name and the field comment are a unique number of the field and a description, respectively, e.g., the number is 89757 and the description is "the name of the song that the user likes to listen to".
In one specific embodiment, the table name and the table comment of the table to which the field belongs are the English name of the table and its Chinese translation, respectively, e.g., userinfo and "user information". In another specific embodiment, the table name and the table comment are the Chinese name of the table and a Chinese description, respectively, e.g., the name is "user information" and the description is "personal information such as user name, mobile phone number, and identity card number". In yet another specific embodiment, the table name and the table comment are a unique serial number of the table and a description, respectively, e.g., the serial number is a678g and the description is "associated with the user".
In a specific embodiment, the field types may include binary data types such as Binary, Varbinary, and Image; character data types such as Char, Varchar, and Text; and numeric data types such as positive and negative numbers, decimals, and integers.
As for the field values, a field usually holds a large number of them. For example, in a student list of a certain province with 20 thousand students, the field named "student number" contains 20 thousand field values; collecting all of these field values would consume a great deal of storage space and computing resources. The field values of a field therefore need to be sampled in an appropriate amount. In one embodiment, random sampling or stratified sampling may be performed based on a preset number of samples (e.g., 150 field values), and the sampled field values are used as the plurality of field values constituting the field information.
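A minimal sketch of the random sampling variant follows, using the 150-value sample size mentioned above as the default. The function name and the `seed` parameter are illustrative assumptions, and the other sampling variant mentioned in the text is not shown.

```python
import random
from typing import List, Optional, Sequence


def sample_field_values(values: Sequence[object], sample_size: int = 150,
                        seed: Optional[int] = None) -> List[object]:
    """Draw up to `sample_size` field values uniformly at random, without replacement."""
    rng = random.Random(seed)
    if len(values) <= sample_size:
        return list(values)  # fewer values than requested: keep them all
    return rng.sample(list(values), sample_size)
```

For a field with 20 thousand values, this keeps 150 of them for the later steps instead of the full column.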
In another embodiment, the quality evaluation may be performed on the sampled field value, so that the field value finally used for constructing the field information meets a preset standard. In one embodiment, an intra-batch evaluation may be performed based on field values sampled from a batch. In a specific embodiment, it is desirable to have as large a variance as possible between sampled field values, whereby diversity can be used as a quality assessment indicator.
Correspondingly, a first sampling is performed on the field values of the field to obtain a plurality of first field values; a plurality of encoding vectors corresponding to the plurality of first field values are then determined, and a plurality of similarities between pairs of encoding vectors formed from the plurality of encoding vectors are calculated; in a case where the plurality of similarities reflect that the similarity among the encoding vectors is less than a preset degree, the plurality of field values are determined based on the plurality of first field values.
For determining the encoding vector of a first field value, in one example, assuming that the first field value is text, its encoding vector may be obtained through a text encoding model, for example, a pre-trained BERT model. In another example, assuming that the first field value is a picture, its encoding vector may be obtained through a picture encoding model, for example, a convolutional neural network.
For judging whether the similarity among the plurality of encoding vectors is smaller than the preset degree, in one example, the encoding vectors are paired to obtain a plurality of pairs of encoding vectors, and a similarity is obtained for each pair by calculating the cosine similarity or the Euclidean distance between the two vectors. The average or the median of these similarities is then computed and compared with a preset threshold (such as 0.3 or 0.4): if it is smaller than the threshold, the similarity is judged to be smaller than the preset degree and the first sampling meets the standard; otherwise, the first sampling does not meet the standard.
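The diversity check can be sketched with plain Python lists standing in for encoding vectors. This is an assumption-laden illustration: real encoding vectors would come from a model such as BERT, only the cosine-similarity and average variant is shown, and the function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sample_is_diverse(vectors: Sequence[Sequence[float]], threshold: float = 0.3) -> bool:
    """Judge diversity by comparing the mean pairwise cosine similarity with a threshold."""
    similarities = [cosine_similarity(a, b) for a, b in combinations(vectors, 2)]
    if not similarities:
        return False  # assumption: fewer than two vectors cannot demonstrate diversity
    return mean(similarities) < threshold
```

Replacing `mean` with `statistics.median` gives the median variant mentioned in the text.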
In another specific embodiment, it is desirable that the sampled field values differ from one another, so uniqueness may be used as a quality evaluation index. Correspondingly, a first sampling is performed on the field values of the field to obtain a plurality of first field values, and the number of duplicate values among the plurality of first field values is then determined; if this number is less than a preset threshold (such as 2 or 5), the first sampling is judged to meet the standard; otherwise, it is judged not to meet the standard.
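A possible reading of the uniqueness index is sketched below. How duplicates are counted is an interpretation, not something the text fixes: here every occurrence of a value beyond its first counts as one duplicate. The function names and the default threshold of 5 are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def count_duplicates(values: Sequence[Hashable]) -> int:
    """Count occurrences beyond the first for every distinct value."""
    return sum(count - 1 for count in Counter(values).values())


def sample_is_unique_enough(values: Sequence[Hashable], max_duplicates: int = 5) -> bool:
    """The sampling passes when the duplicate count is below the preset threshold."""
    return count_duplicates(values) < max_duplicates
```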
It should be understood that the diversity index and the uniqueness index may be used alternatively or in combination. For example, when both the diversity index and the uniqueness index reach the standard, it is determined that the sampling quality reaches the preset standard.
In this way, sampling quality can be evaluated based on the field values sampled within one batch. Further, in a case where the quality of the first sampling meets the preset standard, sampling may be stopped, and the plurality of field values used to construct the field information are determined based on the plurality of first field values. In a specific embodiment, the plurality of first field values are de-duplicated to obtain the plurality of field values. In another specific embodiment, the plurality of first field values are directly used as the plurality of field values. Conversely, if the quality of the first sampling does not meet the standard, sampling and sampling quality evaluation continue until the quality of some sampling meets the standard, and the plurality of field values are then determined based on the field values of that sampling.
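The sample-evaluate-repeat loop with final de-duplication might look like the sketch below. The `is_acceptable` callback stands for whichever quality index is used; the `max_rounds` cap is an added safeguard that the text does not mention, and all names are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence


def sample_until_acceptable(values: Sequence[object], sample_size: int,
                            is_acceptable: Callable[[List[object]], bool],
                            max_rounds: int = 10,
                            seed: Optional[int] = None) -> Optional[List[object]]:
    """Resample until a batch passes the quality check, then de-duplicate it.

    Returns None if no batch passes within `max_rounds` (an assumed cap).
    """
    rng = random.Random(seed)
    pool = list(values)
    for _ in range(max_rounds):
        batch = rng.sample(pool, min(sample_size, len(pool)))
        if is_acceptable(batch):
            # dict.fromkeys removes duplicates while preserving first-seen order.
            return list(dict.fromkeys(batch))
    return None
```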
In another embodiment, the evaluation of the quality of the sampling may also be achieved based on the field values of different sampling batches. Considering that the sampled field value can be close to the real distribution of the full field value, the distribution closeness degree needs to be evaluated, however, if the evaluation is carried out based on the sampled field value and the full field value, great pressure is caused on storage space and calculation amount, therefore, the closeness degree between the field value distributions of different batches of samples is evaluated by adopting a multi-sampling mode, if the closeness degree is enough, the sampled field value is indirectly judged to be close to the real distribution, and the sampling quality reaches the standard.
Based on this, in one embodiment, the field values of the field to be detected may be sampled twice, so as to obtain a plurality of first field values and a plurality of second field values; the quality of the two samplings is then evaluated based on these field values. In the case that the quality of the two samplings reaches a preset standard, the plurality of field values for constructing the field information are determined based on the field values of either sampling. In a specific embodiment, a degree of difference between a first field value distribution corresponding to the plurality of first field values and a second field value distribution corresponding to the plurality of second field values is calculated, and when the degree of difference is smaller than a preset threshold, it is determined that the first sampling quality reaches the preset standard. In one example, the degree of difference may be calculated using KL divergence or cross entropy.
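As an illustration of the between-batch check, the following sketch estimates the KL divergence between the empirical distributions of two batches. The names `kl_divergence` and `sampling_quality_ok`, the additive smoothing constant, the default threshold, and the optional `key` function (which buckets raw values, for example by length, so that high-cardinality fields still share a support) are assumptions made here and are not taken from the source.

```python
import math
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable, Sequence


def _distribution(keys: Sequence[Hashable], support: Iterable[Hashable],
                  smoothing: float) -> Dict[Hashable, float]:
    """Empirical distribution over `support` with additive smoothing."""
    support = list(support)
    counts = Counter(keys)
    total = len(keys) + smoothing * len(support)
    return {k: (counts[k] + smoothing) / total for k in support}


def kl_divergence(first_values: Sequence[str], second_values: Sequence[str],
                  key: Callable[[str], Hashable] = lambda v: v,
                  smoothing: float = 1e-9) -> float:
    """KL(P || Q), where P and Q are the bucketed value distributions."""
    first_keys = [key(v) for v in first_values]
    second_keys = [key(v) for v in second_values]
    support = set(first_keys) | set(second_keys)
    p = _distribution(first_keys, support, smoothing)
    q = _distribution(second_keys, support, smoothing)
    return sum(p[k] * math.log(p[k] / q[k]) for k in support)


def sampling_quality_ok(first_values: Sequence[str], second_values: Sequence[str],
                        threshold: float = 0.1) -> bool:
    """True when the two batches are distributed closely enough (threshold is hypothetical)."""
    return kl_divergence(first_values, second_values) < threshold
```

Smoothing keeps the logarithm finite when a bucket appears in only one batch; cross entropy could be substituted by changing the summand.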
Further, in the case that the quality of the two samplings reaches the preset standard, the plurality of field values may be determined based on the field values of either sampling. For example, the field values of either batch are directly used as the plurality of field values, or the plurality of field values are obtained after deduplication. In the case that the quality of the two samplings does not reach the preset standard, it is understood that either one or both of the samplings may actually fall short. In this case, additional sampling is performed; for example, a third sampling is performed, the degree of distribution difference between any two of the three samplings is evaluated, and if the degree of difference is smaller than the preset threshold, the quality of the two corresponding samplings is determined to reach the standard.
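The pairwise check over three (or more) batches might look like the sketch below. The function `find_consistent_pair` and the toy `mean_length_gap` measure are hypothetical; the latter is only a stand-in for a real distribution difference such as KL divergence.

```python
from itertools import combinations
from typing import Callable, List, Optional, Sequence, Tuple


def find_consistent_pair(
    batches: Sequence[List[str]],
    difference: Callable[[List[str], List[str]], float],
    threshold: float,
) -> Optional[Tuple[int, int]]:
    """Return indices of the first pair of batches whose difference is below threshold."""
    for i, j in combinations(range(len(batches)), 2):
        if difference(batches[i], batches[j]) < threshold:
            return i, j
    return None  # no pair agrees; the caller would sample again


def mean_length_gap(a: List[str], b: List[str]) -> float:
    """Toy difference measure: gap between mean value lengths (illustrative only)."""
    return abs(sum(map(len, a)) / len(a) - sum(map(len, b)) / len(b))
```

Either batch of the returned pair can then be used to build the field values.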
In this way, the sampling quality can be evaluated based on the sampled field values of a plurality of batches. It should be understood that the evaluation indexes within a batch and the evaluation indexes between batches may be used alternatively or in combination. For example, for a first batch of field values and a second batch of field values obtained by two samplings, if the diversity index and the uniqueness index of the first batch both reach the standard and the degree of distribution difference between the two batches is smaller than a preset threshold, it is determined that the sampling quality reaches the preset standard; otherwise, it is determined that the sampling quality does not reach the preset standard, and sampling and sampling quality evaluation continue until the quality of a certain sampling reaches the standard, whereupon the plurality of field values are determined based on the field values of that sampling.
From the above, a plurality of field values may be determined and included in the field information. Next, in step S220, sensitivity determination is performed on the field information using a preset rule. In one embodiment, the preset rule includes sensitive category discrimination rules (or sensitive class rules) for several candidate sensitive categories, so that it can be determined either that the field information is suspected to relate to one or more of the candidate sensitive categories, or that it does not relate to any of them. Further, in a specific embodiment, the preset rule also includes non-sensitive category discrimination rules (or non-sensitive class rules) for several non-sensitive categories, so that it can be determined either that the field information is suspected to relate to one or more of the candidate sensitive categories, or that it relates to one or more of the non-sensitive categories. It should be understood that "several" above means one or more. Because the preset rule only performs a preliminary judgment on the field information, it can be kept simple; part of the non-sensitive information can thus be filtered out at low computational cost, and a preliminary judgment of the sensitive category to which a sensitive field relates can be obtained.
According to a specific embodiment, feature extraction is first performed on the field information to obtain a plurality of characteristic values corresponding to a plurality of preset feature items; the plurality of characteristic values are then judged using the non-sensitive class rules; and in the case that the field information does not relate to any of the non-sensitive categories, the plurality of characteristic values are judged using the sensitive class rules to obtain one or more sensitive categories to which the field information is suspected to relate. If a plurality of sensitive categories are obtained, each of them is respectively treated as the certain sensitive category.
Further, in an example, the plurality of preset feature items may include a feature item for a field attribute, for example, the field name, and may further include a feature item for a field value, for example, the length or number of digits of the field value. In one example, the non-sensitive class rules may include: if the field name is a timestamp, the corresponding field is a non-sensitive field. In one example, the sensitive class rules may include: if the field value has 11 digits, it is suspected of being a mobile phone number, and/or if the field value has 18 digits, it is suspected of being an identification number.
According to another specific embodiment, after the plurality of characteristic values corresponding to the field information are extracted, they are judged using a rule tree composed of the non-sensitive class rules and the sensitive class rules, so as to obtain a discrimination result.
Thus, the pre-discrimination of the sensitivity of the field information can be realized. Further, when the field information is judged to belong to one or more non-sensitive categories as a whole, and/or is not suspected of being any one of the alternative sensitive categories, the field information can be judged to belong to the non-sensitive information, so that the current detection process is terminated, or the field corresponding to the field information is discarded.
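The pre-discrimination described above can be sketched with the example rules from the text (a timestamp field name, 11-digit and 18-digit values). The function name `pre_discriminate`, the category labels, and the `min_ratio` hit ratio used to generalize a per-value rule to a whole batch of sampled values are assumptions made for illustration.

```python
import re
from typing import Dict, List, Pattern, Sequence

# Non-sensitive class rule from the text: a field named "timestamp" is non-sensitive.
NON_SENSITIVE_FIELD_NAMES = {"timestamp"}

# Sensitive class rules from the text, keyed by hypothetical category labels.
SENSITIVE_VALUE_PATTERNS: Dict[str, Pattern[str]] = {
    "mobile_phone_number": re.compile(r"[0-9]{11}"),
    "identification_number": re.compile(r"[0-9]{18}"),
}


def pre_discriminate(field_name: str, field_values: Sequence[str],
                     min_ratio: float = 0.5) -> List[str]:
    """Return the suspected sensitive categories; an empty list means non-sensitive."""
    # Non-sensitive rules run first so that cheap filtering happens early.
    if field_name.lower() in NON_SENSITIVE_FIELD_NAMES:
        return []
    if not field_values:
        return []
    suspected = []
    for category, pattern in SENSITIVE_VALUE_PATTERNS.items():
        hits = sum(1 for v in field_values if pattern.fullmatch(v))
        # Assumption: a category is suspected when enough sampled values match.
        if hits / len(field_values) >= min_ratio:
            suspected.append(category)
    return suspected
```

An empty result corresponds to terminating the detection flow or discarding the field; a non-empty result feeds step S230.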
And if the field information is judged to be suspected to be a certain sensitive category, executing step S230, and determining a detection algorithm corresponding to the certain sensitive category based on a mapping relationship between a pre-established alternative detection algorithm and the alternative sensitive category.
It should be understood that a candidate detection algorithm is used to determine, with high confidence, whether a field belongs to a certain candidate sensitive category. Because the sensitive features of different candidate sensitive categories differ, the types of algorithm used also differ; for example, an identification number is suited to judgment by rules, a mailbox to judgment by regular expressions, a name to judgment by a machine learning model, and so on.
For a single candidate sensitive category, one or more candidate detection algorithms may be designed for the candidate sensitive category (or, the candidate detection algorithms may also be referred to as a candidate detection scheme), and the type of algorithm involved in one candidate detection algorithm may be one or more, for example, a candidate detection algorithm is designed, in which a regular expression is sequentially used for feature extraction and a machine learning model is used for prediction, or a rule and a regular expression are used, or only a machine learning model is used. In addition, for the mapping relationship, the candidate sensitive categories, and the rules and regular expressions involved in the candidate detection algorithm may be set by a worker or an expert according to experience, and the involved machine learning model may be obtained by training according to the collected training samples.
Based on the mapping relationship, the detection algorithm corresponding to the certain sensitive category to which the field information is suspected to belong may be determined, and the number of such detection algorithms may be one or more. In one embodiment, assuming that the certain sensitive category is a mailbox, it may be determined that the corresponding detection algorithms include two detection algorithms: one is a mailbox discrimination rule, and the other includes a mailbox discrimination regular expression for determining statistical characteristics and a mailbox discrimination model. In another embodiment, assuming that the certain sensitive category is a user name, it can be determined that the corresponding detection algorithm is a name discrimination model. In addition, if the field information is suspected to relate to a plurality of sensitive categories, a corresponding detection algorithm can be determined for each suspected sensitive category.
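One simple way to hold such a mapping is a dictionary from category label to a list of detector callables, as sketched below. The detector functions, the simplified mailbox pattern, and the 0.5 hit ratios are hypothetical stand-ins; the source does not specify the actual rules, regular expressions, or models.

```python
import re
from typing import Callable, Dict, List, Sequence

Detector = Callable[[str, Sequence[str]], bool]

# Deliberately loose mailbox pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def mailbox_rule_detector(field_name: str, field_values: Sequence[str]) -> bool:
    """Hypothetical rule: most values contain exactly one '@'."""
    if not field_values:
        return False
    hits = sum(1 for v in field_values if v.count("@") == 1)
    return hits / len(field_values) > 0.5


def mailbox_regex_detector(field_name: str, field_values: Sequence[str]) -> bool:
    """Hypothetical regular-expression detector: most values match EMAIL_PATTERN."""
    if not field_values:
        return False
    hits = sum(1 for v in field_values if EMAIL_PATTERN.fullmatch(v))
    return hits / len(field_values) > 0.5


# Pre-established mapping: candidate sensitive category -> candidate detection algorithms.
DETECTORS_BY_CATEGORY: Dict[str, List[Detector]] = {
    "mailbox": [mailbox_rule_detector, mailbox_regex_detector],
}


def detectors_for(suspected_categories: Sequence[str]) -> Dict[str, List[Detector]]:
    """Look up the detection algorithms for each suspected category."""
    return {c: DETECTORS_BY_CATEGORY.get(c, []) for c in suspected_categories}
```

Keeping the mapping as data means new categories or detectors can be registered without touching the dispatch logic.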
After determining the detection algorithm corresponding to a certain sensitive category of the field, in step S240, the detection algorithm is used to process the field information to obtain a processing result, where the processing result indicates whether the field belongs to the certain sensitive category.
In one embodiment, the detection algorithm corresponding to the certain sensitive category includes a first machine learning model which, as shown in FIG. 3, includes an attribute characterization layer 310, a value characterization layer 320, a fusion layer 330, and a fully-connected layer 340. Correspondingly, this step includes: using the attribute characterization layer 310 to perform characterization processing on the field attribute in the field information to obtain an attribute characterization vector; using the value characterization layer 320 to perform characterization processing on each of the plurality of field values in the field information to obtain a plurality of value characterization vectors; using the fusion layer 330 to fuse the attribute characterization vector with each of the plurality of value characterization vectors to obtain a plurality of fusion vectors; and processing each of the plurality of fusion vectors with the fully-connected layer 340 to obtain a plurality of detection results.
In a specific embodiment, the sensitive category may be a user name, a user address, or a company name.
In a particular embodiment, the attribute characterization layer 310 may include a first word embedding sublayer and a first characterization sublayer. In one example, the first word embedding sublayer may be implemented as a pre-trained BERT model, a Word2vec model, or the like.
In one example, the first characterization sublayer may employ a deep neural network DNN or a recurrent neural network RNN, or the like.
In a particular embodiment, if the field value corresponds to a picture, the value characterization layer 320 may be implemented as a convolutional neural network CNN, a deep neural network DNN, or the like. In another particular embodiment, if the field value corresponds to a number, the value characterization layer 320 may be implemented as an encoding sublayer and a second characterization sublayer. In one example, the encoding sublayer may employ a table lookup or a one-hot encoding algorithm. In one example, the second characterization sublayer may employ a DNN network or the like. In yet another particular embodiment, if the field value corresponds to text, the value characterization layer 320 may include a second word embedding sublayer and a second characterization sublayer. In one example, the second word embedding sublayer may be implemented as a pre-trained BERT model, a Word2vec model, or the like. In one example, the second characterization sublayer may employ a DNN network, an RNN network, or the like.
In a specific embodiment, the manner of fusing the attribute characterization vector and the value characterization vector by the fusion layer 330 may include: splicing processing, weighted summation processing (including direct addition and averaging), bit-wise multiplication processing, and the like. In another specific embodiment, an attention mechanism can be introduced to perform fusion processing.
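The fusion options listed above can be illustrated on plain Python lists. This is a sketch only: the function names are introduced here, vectors are lists of floats rather than framework tensors, and the attention-based variant is omitted.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def fuse_concat(attr: Sequence[float], value: Sequence[float]) -> Vector:
    """Splicing: concatenate the attribute and value characterization vectors."""
    return list(attr) + list(value)


def fuse_weighted_sum(attr: Sequence[float], value: Sequence[float],
                      attr_weight: float = 0.5, value_weight: float = 0.5) -> Vector:
    """Weighted summation; weights (1, 1) give direct addition, (0.5, 0.5) averaging."""
    if len(attr) != len(value):
        raise ValueError("weighted summation needs vectors of equal length")
    return [attr_weight * a + value_weight * v for a, v in zip(attr, value)]


def fuse_elementwise_product(attr: Sequence[float], value: Sequence[float]) -> Vector:
    """Bit-wise (element-wise) multiplication."""
    if len(attr) != len(value):
        raise ValueError("element-wise product needs vectors of equal length")
    return [a * v for a, v in zip(attr, value)]


def fuse_all(attr: Sequence[float], values: Sequence[Sequence[float]],
             fuse: Callable[[Sequence[float], Sequence[float]], Vector] = fuse_concat
             ) -> List[Vector]:
    """Fuse one attribute vector with every value vector, one fusion vector per field value."""
    return [fuse(attr, v) for v in values]
```

Concatenation doubles the input width of the following fully-connected layer, while the other two keep it unchanged but require equal dimensions.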
In a particular embodiment, the fully connected layer 340 may be implemented as one or more layers of a fully connected network. According to one example, sigmoid function is used in the last fully-connected layer for binary classification.
In this way, a corresponding detection result can be obtained for each of a plurality of field attribute-field value pairs, each pair being composed of the field attribute and one of the field values, and the processing result can be determined based on the plurality of detection results. In a specific embodiment, each detection result indicates whether the corresponding field attribute-field value pair belongs to the certain sensitive category. Accordingly, the processing result may be determined by voting: for example, the number of detection results indicating the certain sensitive category is counted, and if its proportion exceeds a preset threshold (e.g., 0.5 or 0.6), the processing result is that the field belongs to the certain sensitive category; otherwise, the processing result is that the field does not belong to the certain sensitive category.
In another specific embodiment, each detection result includes a probability that the corresponding field attribute-field value pair is identified as the certain sensitive category. Accordingly, the average of the plurality of probabilities in the plurality of detection results may be calculated; in a case that the average probability is greater than a preset probability (e.g., 0.5 or 0.6), it is determined that the field belongs to the certain sensitive category, and this determination is taken as the processing result; otherwise, the processing result is that the field does not belong to the certain sensitive category.
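Both ways of turning per-pair detection results into a single processing result can be sketched in a few lines. The function names are assumptions; the default thresholds use the example value 0.5 mentioned above, and treating an empty input as "not sensitive" is a choice made here.

```python
from typing import Sequence


def aggregate_by_vote(detections: Sequence[bool], threshold: float = 0.5) -> bool:
    """True when the share of positive per-pair detections exceeds the threshold."""
    if not detections:
        return False
    return sum(detections) / len(detections) > threshold


def aggregate_by_mean_probability(probabilities: Sequence[float],
                                  threshold: float = 0.5) -> bool:
    """True when the average per-pair probability exceeds the threshold."""
    if not probabilities:
        return False
    return sum(probabilities) / len(probabilities) > threshold
```

Voting discards how confident each detection was, whereas averaging lets a few very confident pairs outweigh several borderline ones.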
In this way, the field information can be processed using the first machine learning model, thereby obtaining a processing result.
In another embodiment, the detection algorithm corresponding to the certain sensitive category includes a second machine learning model, and any one of the following: a number of first rules for field attributes, a number of first regular expressions for field attributes, a number of second rules for field values, a number of second regular expressions for field values. Based on this, the determination of the processing result may include: based on the field attribute of the field, obtaining a plurality of attribute characteristic values by utilizing the plurality of first rules and/or the plurality of first regular expressions; and obtaining a plurality of statistical characteristic values by utilizing the plurality of second rules and/or the plurality of second regular expressions based on the plurality of field values.
According to a specific embodiment, the sensitive category is a user name. The first rule set for the field attribute may include: if the field type is a character type, the characteristic value of the corresponding attribute feature item is 1; otherwise, it is 0. The plurality of first regular expressions respectively correspond to: the field name includes a name, and the field annotation includes a name; for each first regular expression, if it is hit, the characteristic value of the corresponding attribute feature item is 1; otherwise, it is 0. The second regular expression set for the field value corresponds to: the number of characters is more than 2 and less than 4, so that the number of field values that hit this regular expression can be counted as the characteristic value of the corresponding statistical feature item. In one example, the determined plurality of attribute feature values includes 1, 1, 0 and the determined statistical feature values include 80, 46.
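The user-name example above can be sketched as two small feature extractors. The set of type names treated as character types, the English keyword "name", and the function names are assumptions; the value pattern follows the literal reading of "more than 2 and less than 4" characters (that is, exactly 3) and should be widened if the bounds are meant inclusively.

```python
import re
from typing import List, Sequence

# Hypothetical set of database types treated as "character" types.
CHARACTER_TYPES = {"char", "varchar", "string", "text"}

NAME_IN_FIELD_NAME = re.compile(r"name", re.IGNORECASE)
NAME_IN_COMMENT = re.compile(r"name", re.IGNORECASE)
# Literal reading of "more than 2 and less than 4" characters.
VALUE_LENGTH = re.compile(r".{3}")


def attribute_features(field_type: str, field_name: str, field_comment: str) -> List[int]:
    """One 0/1 feature per rule or regular expression on the field attribute."""
    return [
        1 if field_type.lower() in CHARACTER_TYPES else 0,
        1 if NAME_IN_FIELD_NAME.search(field_name) else 0,
        1 if NAME_IN_COMMENT.search(field_comment) else 0,
    ]


def statistical_features(field_values: Sequence[str]) -> List[int]:
    """Count of sampled field values that hit the value-level regular expression."""
    return [sum(1 for v in field_values if VALUE_LENGTH.fullmatch(v))]
```

The two lists would be concatenated and passed to the second machine learning model as its input features.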
Further, the obtained attribute feature values and statistical feature values may be input into the second machine learning model to obtain the corresponding processing result. In a specific embodiment, in addition to the attribute feature values and the statistical feature values, the field attribute and the plurality of field values may also be input into the second machine learning model. In a specific embodiment, the second machine learning model may be implemented using a DNN network, a CNN network, an RNN network, or the like.
In this way, the field information of the field to be detected can be processed by using the second machine learning model together with the rules and/or regular expressions set for the field information, so as to obtain the processing result.
In a further embodiment, the detection algorithm corresponding to a certain sensitive class comprises a third machine learning model, so that the field information can be input into the third machine learning model to obtain the corresponding processing result. In a particular embodiment, the certain sensitive category may include user name, user mailbox, identification number, user annual income, home address, company department turnover, and the like. In a specific embodiment, the third machine learning model may be implemented using a DNN network, a CNN network, an RNN network, or the like.
On the other hand, in one implementation, the certain sensitive category corresponds to a plurality of detection algorithms. In this case, one of the detection algorithms may be selected for use, or the field information may be processed by each of the plurality of detection algorithms to obtain a plurality of processing results, from which an integrated processing result is then obtained. In one example, a voting mechanism may be used to obtain the integrated processing result.
Thus, the field information of the field may be processed by invoking a detection algorithm corresponding to the certain sensitive category, so as to obtain a processing result, which indicates whether the field belongs to the certain sensitive category. Further, in an embodiment, in a case that the processing result indicates that the corresponding field belongs to the certain sensitivity category, the sensitivity level corresponding to the field may be determined based on a preset mapping relationship between the alternative field and the alternative sensitivity level (see fig. 1). Further, a security policy corresponding to the sensitivity level may be invoked to perform security processing on the fields, for example, to prohibit bulk transmissions or perform desensitization processing before transmissions, etc.
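A possible shape for the level lookup and a level-dependent security policy is sketched below. The level numbers, the category labels, the `DESENSITIZE_FROM_LEVEL` cutoff, and the masking scheme are all hypothetical; the source only states that a preset mapping and a corresponding security policy exist.

```python
from typing import Dict, List

# Hypothetical mapping from sensitive category to sensitivity level.
SENSITIVITY_LEVEL_BY_CATEGORY: Dict[str, int] = {
    "identification_number": 3,
    "mobile_phone_number": 2,
    "mailbox": 1,
}

# Hypothetical policy: levels at or above this value are desensitized before transmission.
DESENSITIZE_FROM_LEVEL = 2


def mask_value(value: str, keep: int = 3) -> str:
    """Keep the first `keep` characters and replace the rest with '*'."""
    return value[:keep] + "*" * max(len(value) - keep, 0)


def prepare_for_transmission(category: str, values: List[str]) -> List[str]:
    """Apply the policy for the category's sensitivity level (0 if unmapped)."""
    level = SENSITIVITY_LEVEL_BY_CATEGORY.get(category, 0)
    if level >= DESENSITIZE_FROM_LEVEL:
        return [mask_value(v) for v in values]
    return list(values)
```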
To sum up, with the sensitive information detection method disclosed in the embodiments of the present specification, first, field samples are collected based on a data table in a database, where the field samples may include field information such as a field name of a corresponding field, a plurality of field values, and the like; then, pre-judging the field samples by using a preset simple rule; further, if the judgment result is that the field sample is suspected to be of a certain sensitive category, a high-confidence detection algorithm corresponding to the certain sensitive category is called to detect the field sample, so that a more accurate detection result with higher confidence level is obtained, and whether the field sample belongs to the certain sensitive category is indicated. Therefore, the detection efficiency and the accuracy of the detection result can be effectively improved.
According to another aspect of embodiments, another method of sensitive information detection is disclosed. Fig. 4 is a schematic flow chart of a sensitive information detection method according to another embodiment, and an execution subject of the method may be any server, device, or equipment cluster with computing and processing capabilities. As shown in fig. 4, the method comprises the steps of:
step S410, acquiring a plurality of pieces of field information of a plurality of fields to be detected, wherein each piece of field information comprises a field attribute and a plurality of field values of a corresponding field;
step S420, based on the information of the plurality of fields, filtering the non-sensitive fields in the plurality of fields according to a first judgment rule set for the non-sensitive fields and a second judgment rule set for the sensitive fields, and obtaining the suspected sensitive type of each suspected sensitive field in the plurality of suspected sensitive fields;
step S430, for each suspected sensitive field, determining a detection algorithm corresponding to the suspected sensitive field based on a mapping relationship between a pre-established candidate detection algorithm and a pre-established candidate sensitive category, so as to process the field information of the suspected sensitive field by using the detection algorithm, thereby obtaining a processing result, where the processing result indicates whether the suspected sensitive field belongs to the suspected sensitive category.
Regarding the above steps, in one embodiment, the step S420 may be implemented as: firstly, filtering the non-sensitive fields in the fields based on the information of the fields by using a first judgment rule set for the non-sensitive category to obtain a plurality of suspected sensitive fields; and determining the suspected sensitive category of each suspected sensitive field based on the field information of the suspected sensitive fields by utilizing a second judgment rule set for the sensitive category.
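Steps S410 to S430 can be tied together in a short end-to-end sketch over several fields. Everything concrete here is an assumption made for illustration: the `FieldInfo` container, the single non-sensitive rule, the single sensitive rule, and `detect_mobile`, which is a toy stand-in for a high-confidence detection algorithm.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class FieldInfo:
    name: str          # field attribute (only the name is used in this sketch)
    values: List[str]  # sampled field values


def is_non_sensitive(field: FieldInfo) -> bool:
    """First rule set: filter out fields that are clearly non-sensitive."""
    return field.name.lower() == "timestamp"


def suspected_category(field: FieldInfo) -> Optional[str]:
    """Second rule set: coarse guess at the sensitive category."""
    if field.values and all(re.fullmatch(r"[0-9]{11}", v) for v in field.values):
        return "mobile_phone_number"
    return None


def detect_mobile(field: FieldInfo) -> bool:
    """Toy stand-in for a high-confidence detector: every value starts with '1'."""
    return all(v.startswith("1") for v in field.values)


DETECTORS: Dict[str, Callable[[FieldInfo], bool]] = {
    "mobile_phone_number": detect_mobile,
}


def detect_sensitive_fields(fields: List[FieldInfo]) -> Dict[str, Tuple[str, bool]]:
    """Map field name to (suspected category, whether the detector confirmed it)."""
    results: Dict[str, Tuple[str, bool]] = {}
    for field in fields:
        if is_non_sensitive(field):            # S420: filter non-sensitive fields
            continue
        category = suspected_category(field)   # S420: suspected sensitive category
        if category is None:
            continue
        detector = DETECTORS[category]         # S430: look up the detection algorithm
        results[field.name] = (category, detector(field))
    return results
```

Fields removed by the cheap rules never reach the more expensive detectors, which is where the efficiency gain of the two-stage design comes from.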
It should be noted that, for the description of the above steps, reference may also be made to the relevant description in the foregoing embodiments.
To sum up, with the sensitive information detection method disclosed in the embodiments of the present specification, first, a plurality of field samples are collected based on a data table in a database, where each field sample may include field information such as a field name and a plurality of field values of a corresponding field; then, pre-judging a plurality of field samples by using a preset simple rule; further, for the field samples judged to be suspected to be sensitive, a high-confidence detection algorithm corresponding to a certain sensitive category suspected to be sensitive is called to detect the field samples, so that a more accurate detection result with higher confidence is obtained, and whether the corresponding field samples belong to the certain sensitive category is indicated. Therefore, the sensitive detection of the fields can be realized, the sensitive fields and the corresponding sensitive types in the fields can be detected, and the detection efficiency and the accuracy of the detection result can be effectively improved.
Corresponding to the above detection method, the embodiment of the present specification further discloses a detection device, which specifically includes:
Fig. 5 shows a schematic structural diagram of a sensitive information detection apparatus according to an embodiment, which may be implemented as any server or device cluster with computing and processing capabilities. As shown in fig. 5, the apparatus 500 includes the following units:
a field information obtaining unit 510 configured to obtain field information of a field to be detected, where the field information includes field attributes and a plurality of field values;
a pre-judging unit 520 configured to perform sensitivity judgment on the field information by using a preset rule;
an algorithm determining unit 530, configured to determine, when it is determined that the field information is suspected of a certain sensitive category, a detection algorithm corresponding to the certain sensitive category based on a mapping relationship between a pre-established candidate detection algorithm and the candidate sensitive category;
a field information processing unit 540, configured to process the field information by using the detection algorithm, to obtain a processing result, where the processing result indicates whether the field belongs to the certain sensitive category.
In one embodiment, the field attribute comprises at least one of: field name, field comment, field type, table name of the table to which the field belongs, and table comment.
In one embodiment, the field information obtaining unit 510 includes: a sampling module 511, configured to perform first sampling and second sampling on field values of the fields respectively to obtain a plurality of corresponding first field values and a plurality of corresponding second field values; an evaluation module 512 configured to evaluate a first sampling quality of the first sample based on the plurality of first field values and a plurality of second field values; a determining module 513 configured to determine the plurality of field values based on the plurality of first field values if the first sampling quality meets a preset criterion.
In a specific embodiment, the evaluation module 512 is specifically configured to: calculating a degree of difference between a first field value distribution corresponding to the plurality of first field values and a second field value distribution corresponding to the plurality of second field values; and under the condition that the difference degree is smaller than a preset threshold value, judging that the first sampling quality reaches a preset standard.
In a specific embodiment, the determining module 513 is specifically configured to: and carrying out de-duplication processing on the plurality of first field values to obtain the plurality of field values.
In an embodiment, the field information obtaining unit 510 is specifically configured to: carrying out first sampling on field values of the fields to obtain a plurality of first field values; determining a plurality of encoding vectors corresponding to the plurality of first field values, and calculating a plurality of similarities between a plurality of pairs of encoding vectors formed based on the plurality of encoding vectors; in a case where the plurality of similarities reflect that the similarity between the plurality of encoding vectors is less than a preset degree, the plurality of field values are determined based on the plurality of first field values.
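The similarity-based check on a single batch can be sketched as below. The character-count encoding, the use of cosine similarity, the mean as the summary of the pairwise similarities, and the 0.8 cutoff are all assumptions; the passage above does not fix the encoding or the similarity measure.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def encode(value: str) -> Counter:
    """Hypothetical encoding: sparse vector of character counts."""
    return Counter(value)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def is_diverse(first_field_values: Sequence[str], max_mean_similarity: float = 0.8) -> bool:
    """True when the sampled values are, on average, dissimilar enough to each other."""
    vectors = [encode(v) for v in first_field_values]
    similarities: List[float] = [
        cosine_similarity(a, b) for a, b in combinations(vectors, 2)
    ]
    if not similarities:
        return False  # fewer than two values: nothing to compare
    return sum(similarities) / len(similarities) < max_mean_similarity
```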
In one embodiment, the preset rules include sensitive class rules for several candidate sensitive classes and non-sensitive class rules for several non-sensitive classes; the pre-discrimination unit 520 is specifically configured to: extract features of the field information to obtain a plurality of characteristic values corresponding to a plurality of preset feature items; judge the plurality of characteristic values by using the non-sensitive class rules; and in the case that the characteristic values do not belong to any of the non-sensitive categories, judge the characteristic values by using the sensitive class rules to obtain the certain sensitive category.
In one embodiment, the apparatus 500 further comprises a field discarding unit 550 configured to discard the field if the field information is determined to belong to a non-sensitive category.
In one embodiment, in the mapping relationship, any candidate sensitive category corresponds to one or more candidate detection algorithms, and the one or more candidate detection algorithms relate to one or more of the following algorithm types: rules, regular expressions, machine learning models.
In one embodiment, the detection algorithm comprises a first machine learning model for the certain sensitive category, including an attribute characterization layer, a value characterization layer, a fusion layer, and a fully-connected layer; wherein the field information processing unit 540 is specifically configured to: perform characterization processing on the field attribute by using the attribute characterization layer to obtain an attribute characterization vector; perform characterization processing on each of the plurality of field values by using the value characterization layer to obtain a plurality of value characterization vectors; fuse the attribute characterization vector with each of the plurality of value characterization vectors by using the fusion layer to obtain a plurality of fusion vectors; process each of the plurality of fusion vectors by using the fully-connected layer to obtain a plurality of detection results; and determine the processing result based on the plurality of detection results.
In a specific embodiment, the certain sensitive category is one of the following: user name, user address, company name.
In a specific embodiment, each detection result includes a probability indicating that the corresponding field attribute-field value pair is identified as the certain sensitive category; the field information processing unit 540 determines the processing result based on the plurality of detection results, and specifically includes: calculating an average probability of a plurality of probabilities in the plurality of detection results; and under the condition that the average probability is greater than the preset probability, judging that the field belongs to the certain sensitive category, and taking the judgment result as the processing result.
In one embodiment, the detection algorithm comprises a second machine learning model for the certain sensitive category, and a number of first rules and/or a number of first regular expressions for field attributes, a number of second rules and/or a number of second regular expressions for field values; the field information processing unit 540 is specifically configured to: based on the field attribute of the field, obtaining a plurality of attribute characteristic values by utilizing the plurality of first rules and/or the plurality of first regular expressions; based on the field values, obtaining a plurality of statistical characteristic values by utilizing the plurality of second rules and/or the plurality of second regular expressions; and inputting the attribute characteristic values and the statistical characteristic values into the second machine learning model to obtain the processing result.
In one embodiment, the apparatus 500 further comprises a sensitivity level determination unit 560 configured to: and determining the sensitivity level corresponding to the field based on a preset mapping relation between the alternative field and the alternative sensitivity level under the condition that the processing result indicates that the corresponding field belongs to the certain sensitivity category.
Fig. 6 shows a schematic structural diagram of a sensitive information detection apparatus according to another embodiment, which may be implemented as any server or device cluster with computing and processing capabilities. As shown in fig. 6, the apparatus 600 includes the following units:
a field information obtaining unit 610 configured to obtain a plurality of pieces of field information of a plurality of fields to be detected, where each piece of field information includes a field attribute and a plurality of field values of a corresponding field;
a pre-discrimination unit 620 configured to filter, based on the information of the plurality of fields, non-sensitive fields in the plurality of fields according to a first discrimination rule set for a non-sensitive type and a second discrimination rule set for a sensitive type, and obtain a suspected sensitive type of each suspected sensitive field in the plurality of suspected sensitive fields;
an algorithm determining unit 630, configured to determine, for each suspected sensitive field, a detection algorithm corresponding to a suspected sensitive category of the suspected sensitive field based on a mapping relationship between a pre-established candidate detection algorithm and the suspected sensitive category;
a field information processing unit 640 configured to process the field information of the suspected sensitive field by using the detection algorithm to obtain a processing result, where the processing result indicates whether the suspected sensitive field belongs to the suspected sensitive category.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2 or fig. 4.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2 or fig. 4.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on, or transmitted as, one or more instructions or code on a computer-readable medium.
The embodiments above describe the objects, technical solutions, and advantages of the present invention in further detail. It should be understood that they are only exemplary embodiments of the present invention and are not intended to limit its scope; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (25)

1. A sensitive information detection method, comprising:
acquiring field information of a field to be detected, wherein the field information comprises field attributes and a plurality of field values;
carrying out sensitivity judgment on the field information by using a preset rule;
in a case where the field information is judged to be suspected of belonging to a certain sensitive category, determining a detection algorithm corresponding to the certain sensitive category based on a pre-established mapping relation between alternative detection algorithms and alternative sensitive categories;
and processing the field information by using the detection algorithm to obtain a processing result, wherein the processing result indicates whether the field belongs to the certain sensitive category.
2. The method of claim 1, wherein the field attribute comprises at least one of: field name, field comment, field type, table name of the table to which the field belongs, and table comment.
3. The method of claim 1, wherein the obtaining field information of the field to be detected comprises:
performing first sampling and second sampling, respectively, on field values of the field to obtain a corresponding plurality of first field values and a corresponding plurality of second field values;
evaluating a first sampling quality of the first sampling based on the plurality of first field values and the plurality of second field values; and
determining the plurality of field values based on the plurality of first field values in a case where the first sampling quality meets a preset criterion.
4. The method of claim 3, wherein evaluating the first sampling quality of the first sampling based on the plurality of first field values and the plurality of second field values comprises:
calculating a degree of difference between a first field value distribution corresponding to the plurality of first field values and a second field value distribution corresponding to the plurality of second field values;
and in a case where the degree of difference is smaller than a preset threshold, judging that the first sampling quality meets the preset criterion.
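As a non-limiting illustration, the comparison of two samples might be sketched as follows. Summarizing each sample by the distribution of value lengths, using total variation distance as the degree of difference, and the threshold of 0.2 are all assumptions made for the example; the claim does not fix any of them.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Sequence


def distribution(values: Sequence[str],
                 key: Callable[[str], Hashable] = len) -> Dict[Hashable, float]:
    """Empirical distribution of a summary feature (value length by default)."""
    if not values:
        raise ValueError("cannot build a distribution from an empty sample")
    counts = Counter(key(v) for v in values)
    return {k: c / len(values) for k, c in counts.items()}


def degree_of_difference(first: Sequence[str], second: Sequence[str]) -> float:
    """Total variation distance between the two empirical distributions."""
    p, q = distribution(first), distribution(second)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def first_sampling_is_acceptable(first: Sequence[str],
                                 second: Sequence[str],
                                 threshold: float = 0.2) -> bool:
    """The first sample is accepted when it agrees closely with the second."""
    return degree_of_difference(first, second) < threshold
```

The distance ranges from 0 for identical distributions to 1 for distributions with no shared outcome, so the threshold can be read as a tolerated fraction of disagreement.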
5. The method of claim 1, wherein the obtaining field information of the field to be detected comprises:
performing first sampling on field values of the field to obtain a plurality of first field values;
determining a plurality of encoding vectors corresponding to the plurality of first field values, and calculating a plurality of similarities between a plurality of pairs of encoding vectors formed from the plurality of encoding vectors; and
determining the plurality of field values based on the plurality of first field values in a case where the plurality of similarities indicate that the similarity among the plurality of encoding vectors is less than a preset degree.
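As a non-limiting illustration, the pairwise similarity check might be sketched as follows. The character-class count encoding, cosine similarity, the use of the mean over all pairs, and the limit of 0.9 are assumptions made for the example.

```python
import math
from itertools import combinations
from typing import List, Sequence


def encode(value: str) -> List[float]:
    """Toy encoding vector: counts of digits, letters, and other characters."""
    digits = sum(c.isdigit() for c in value)
    letters = sum(c.isalpha() for c in value)
    others = len(value) - digits - letters
    return [float(digits), float(letters), float(others)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_is_diverse(values: Sequence[str], max_mean_similarity: float = 0.9) -> bool:
    """Accept the sample when the mean pairwise similarity stays below the limit."""
    vectors = [encode(v) for v in values]
    similarities = [cosine(u, v) for u, v in combinations(vectors, 2)]
    if not similarities:
        return False  # fewer than two values: no pair to compare
    return sum(similarities) / len(similarities) < max_mean_similarity
```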
6. The method of claim 3 or 5, wherein determining the plurality of field values based on the plurality of first field values comprises:
performing de-duplication processing on the plurality of first field values to obtain the plurality of field values.
7. The method of claim 1, wherein the preset rules include a sensitive category rule for a number of alternative sensitive categories and a non-sensitive category rule for a number of non-sensitive categories; and wherein performing sensitivity judgment on the field information by using the preset rule comprises:
performing feature extraction on the field information to obtain a plurality of feature values corresponding to a plurality of preset feature items;
judging the plurality of feature values by using the non-sensitive category rule; and
in a case where the plurality of feature values are judged not to belong to any of the number of non-sensitive categories, judging the plurality of feature values by using the sensitive category rule to obtain the certain sensitive category.
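As a non-limiting illustration, the two-stage judgment might be sketched as follows. The feature items, category names, and thresholds are hypothetical placeholders chosen for the example.

```python
import re
from typing import Dict, List, Optional


def extract_features(field_name: str, values: List[str]) -> Dict[str, float]:
    """Compute feature values for a few hypothetical preset feature items."""
    n = max(len(values), 1)
    return {
        "name_mentions_phone": float(bool(re.search(r"phone|mobile", field_name))),
        "distinct_count": float(len(set(values))),
        "digit_only_ratio": sum(v.isdigit() for v in values) / n,
        "mean_length": sum(len(v) for v in values) / n,
    }


def non_sensitive_category(features: Dict[str, float]) -> Optional[str]:
    """First stage: hypothetical rules for non-sensitive categories."""
    if features["distinct_count"] <= 1:
        return "constant"
    if features["mean_length"] <= 1:
        return "flag"
    return None


def suspected_sensitive_category(features: Dict[str, float]) -> Optional[str]:
    """Second stage: hypothetical rules for alternative sensitive categories."""
    if features["name_mentions_phone"] and features["digit_only_ratio"] >= 0.8:
        return "phone"
    return None


def pre_discriminate(field_name: str, values: List[str]) -> Optional[str]:
    """Return the suspected sensitive category, or None when there is none."""
    features = extract_features(field_name, values)
    if non_sensitive_category(features) is not None:
        return None  # judged non-sensitive; the sensitive rules are skipped
    return suspected_sensitive_category(features)
```

Running the non-sensitive rules first means that the sensitive category rules are evaluated only for fields that were not already excluded.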
8. The method of claim 1, wherein after discriminating the field information using a preset rule, the method further comprises:
discarding the field in a case where the field information is judged to belong to a certain non-sensitive category.
9. The method of claim 1, wherein any alternative sensitive category corresponds to one or more alternative detection algorithms in the mapping, the one or more alternative detection algorithms involving one or more of the following algorithm types: rules, regular expressions, machine learning models.
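As a non-limiting illustration, a mapping in which each alternative sensitive category corresponds to one or more detection algorithms might be sketched as follows. The categories, keywords, patterns, and the choice to require every mapped algorithm to agree are assumptions made for the example; a machine learning model would be registered in the same table as another callable and is omitted here.

```python
import re
from typing import Callable, Dict, List

# A detector receives the field name and the sampled values.
Detector = Callable[[str, List[str]], bool]


def rule_detector(keywords: List[str]) -> Detector:
    """Rule type: the field name must contain one of the keywords."""
    def detect(field_name: str, values: List[str]) -> bool:
        return any(k in field_name.lower() for k in keywords)
    return detect


def regex_detector(pattern: str, min_ratio: float = 0.8) -> Detector:
    """Regular-expression type: enough values must fully match the pattern."""
    compiled = re.compile(pattern)

    def detect(field_name: str, values: List[str]) -> bool:
        if not values:
            return False
        hits = sum(1 for v in values if compiled.fullmatch(v))
        return hits / len(values) >= min_ratio
    return detect


# Hypothetical mapping: sensitive category -> one or more detection algorithms.
ALGORITHMS: Dict[str, List[Detector]] = {
    "phone": [rule_detector(["phone", "mobile"]), regex_detector(r"1\d{10}")],
    "email": [regex_detector(r"[^@\s]+@[^@\s]+\.[^@\s]+")],
}


def detect(category: str, field_name: str, values: List[str]) -> bool:
    """Run every algorithm mapped to the category; all of them must agree."""
    return all(d(field_name, values) for d in ALGORITHMS[category])
```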
10. The method of claim 1, wherein the detection algorithm comprises a first machine learning model for the certain sensitive category, the first machine learning model including an attribute characterization layer, a value characterization layer, a fusion layer, and a fully connected layer;
wherein processing the field information by using the detection algorithm to obtain a processing result comprises:
performing characterization processing on the field attribute by using the attribute characterization layer to obtain an attribute characterization vector;
performing characterization processing on the plurality of field values, respectively, by using the value characterization layer to obtain a plurality of value characterization vectors;
fusing, by using the fusion layer, the attribute characterization vector with each of the plurality of value characterization vectors to obtain a plurality of fusion vectors;
processing the plurality of fusion vectors, respectively, by using the fully connected layer to obtain a plurality of detection results; and
determining the processing result based on the plurality of detection results.
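As a non-limiting illustration, the data flow through the four layers might be sketched as follows. The model is untrained: the pseudo-random character embeddings, the fusion by concatenation, the single sigmoid output, and the dimension of 8 are all assumptions made for the example and say nothing about the layers actually used.

```python
import math
import random
from typing import List

DIM = 8  # hypothetical size of each characterization vector


def char_embedding(text: str, seed: int) -> List[float]:
    """Stand-in characterization: mean of fixed pseudo-random character vectors."""
    vec = [0.0] * DIM
    for ch in text:
        rng = random.Random(ord(ch) * 31 + seed)  # deterministic per character
        for i in range(DIM):
            vec[i] += rng.uniform(-1.0, 1.0)
    n = max(len(text), 1)
    return [x / n for x in vec]


class FieldModel:
    """Untrained toy model that mirrors the layer structure only."""

    def __init__(self, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.weights = [rng.uniform(-1.0, 1.0) for _ in range(2 * DIM)]
        self.bias = 0.0

    def attribute_layer(self, attribute: str) -> List[float]:
        return char_embedding(attribute, seed=1)

    def value_layer(self, value: str) -> List[float]:
        return char_embedding(value, seed=2)

    def fusion_layer(self, attr_vec: List[float], value_vec: List[float]) -> List[float]:
        return attr_vec + value_vec  # fusion by concatenation (an assumption)

    def fully_connected(self, fused: List[float]) -> float:
        z = sum(w * x for w, x in zip(self.weights, fused)) + self.bias
        return 1.0 / (1.0 + math.exp(-z))  # probability-like score

    def forward(self, attribute: str, values: List[str]) -> List[float]:
        """One detection result per (field attribute, field value) pair."""
        attr_vec = self.attribute_layer(attribute)  # computed once, reused
        return [
            self.fully_connected(self.fusion_layer(attr_vec, self.value_layer(v)))
            for v in values
        ]
```

The attribute characterization vector is computed once per field and fused with each value characterization vector in turn, which yields as many detection results as there are field values.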
11. The method of claim 10, wherein the certain sensitive category is one of: user name, user address, company name.
12. The method of claim 10, wherein each detection result includes a probability that the corresponding field attribute-field value pair is identified as the certain sensitive category; wherein determining the processing result based on the plurality of detection results comprises:
calculating an average probability of the plurality of probabilities in the plurality of detection results; and
in a case where the average probability is greater than a preset probability, judging that the field belongs to the certain sensitive category, and taking the judgment result as the processing result.
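As a non-limiting illustration, the aggregation of per-value probabilities might be sketched as follows. The preset probability of 0.5 is an assumption made for the example.

```python
from typing import Sequence


def field_belongs_to_category(probabilities: Sequence[float],
                              preset_probability: float = 0.5) -> bool:
    """Average the per-pair probabilities and compare with the preset value."""
    if not probabilities:
        raise ValueError("at least one detection result is required")
    average = sum(probabilities) / len(probabilities)
    return average > preset_probability  # strictly greater, as in the claim
```

Averaging lets a few atypical field values, such as placeholders or dirty data, be outweighed by the majority of values in the sample.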
13. The method of claim 1, wherein the detection algorithm comprises a second machine learning model for the certain sensitive category, and a number of first rules and/or a number of first regular expressions for field attributes, a number of second rules and/or a number of second regular expressions for field values;
wherein processing the field information by using the detection algorithm to obtain a processing result comprises:
based on the field attribute of the field, obtaining a plurality of attribute characteristic values by utilizing the plurality of first rules and/or the plurality of first regular expressions;
based on the field values, obtaining a plurality of statistical characteristic values by utilizing the plurality of second rules and/or the plurality of second regular expressions;
and inputting the attribute characteristic values and the statistical characteristic values into the second machine learning model to obtain the processing result.
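As a non-limiting illustration, the combination of regular-expression features with a second model might be sketched as follows. The patterns target an identity-number-like category purely as an example, and the logistic scorer with hand-set weights stands in for a trained model; all of these are assumptions.

```python
import math
import re
from typing import List

# Hypothetical first regular expressions, applied to the field attribute.
ATTRIBUTE_PATTERNS = [
    re.compile(r"id_?card|cert_?no", re.IGNORECASE),
    re.compile(r"identity", re.IGNORECASE),
]

# Hypothetical second regular expressions, applied to each field value.
VALUE_PATTERNS = [
    re.compile(r"\d{17}[\dXx]"),
    re.compile(r"\d{15}"),
]

# Hand-set weights standing in for a trained second model (not learned).
WEIGHTS = [2.0, 1.0, 4.0, 3.0]
BIAS = -3.0


def attribute_features(field_name: str) -> List[float]:
    """One 0/1 attribute feature value per attribute pattern."""
    return [1.0 if p.search(field_name) else 0.0 for p in ATTRIBUTE_PATTERNS]


def statistical_features(values: List[str]) -> List[float]:
    """Per value pattern, the fraction of field values that fully match."""
    n = max(len(values), 1)
    return [sum(1 for v in values if p.fullmatch(v)) / n for p in VALUE_PATTERNS]


def second_model(features: List[float]) -> float:
    """Logistic score over the concatenated feature values."""
    z = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def detect(field_name: str, values: List[str], threshold: float = 0.5) -> bool:
    features = attribute_features(field_name) + statistical_features(values)
    return second_model(features) > threshold
```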
14. The method of claim 1, further comprising:
determining, in a case where the processing result indicates that the field belongs to the certain sensitive category, the sensitivity level corresponding to the field based on a preset mapping relation between alternative fields and alternative sensitivity levels.
15. A sensitive information detection method, comprising:
acquiring a plurality of pieces of field information of a plurality of fields to be detected, wherein each piece of field information comprises field attributes and a plurality of field values of a corresponding field;
filtering out, based on the plurality of pieces of field information, non-sensitive fields among the plurality of fields according to a first judgment rule set for non-sensitive fields and a second judgment rule set for sensitive fields, and obtaining a suspected sensitive category of each of a plurality of suspected sensitive fields;
for each suspected sensitive field, determining a detection algorithm corresponding to the suspected sensitive category of the suspected sensitive field based on a pre-established mapping relation between alternative detection algorithms and suspected sensitive categories; and
processing the field information of the suspected sensitive field by using the detection algorithm to obtain a processing result, wherein the processing result indicates whether the suspected sensitive field belongs to the suspected sensitive category.
16. A sensitive information detection apparatus comprising:
a field information acquisition unit configured to acquire field information of a field to be detected, wherein the field information comprises field attributes and a plurality of field values;
a pre-discrimination unit configured to perform sensitivity judgment on the field information by using a preset rule;
an algorithm determining unit configured to determine, in a case where the field information is judged to be suspected of belonging to a certain sensitive category, a detection algorithm corresponding to the certain sensitive category based on a pre-established mapping relation between alternative detection algorithms and alternative sensitive categories; and
a field information processing unit configured to process the field information by using the detection algorithm to obtain a processing result, wherein the processing result indicates whether the field belongs to the certain sensitive category.
17. The apparatus of claim 16, wherein the field information acquiring unit comprises:
a sampling module configured to perform first sampling and second sampling, respectively, on field values of the field to obtain a corresponding plurality of first field values and a corresponding plurality of second field values;
an evaluation module configured to evaluate a first sampling quality of the first sampling based on the plurality of first field values and the plurality of second field values; and
a determining module configured to determine the plurality of field values based on the plurality of first field values in a case where the first sampling quality meets a preset criterion.
18. The apparatus of claim 17, wherein the evaluation module is specifically configured to:
calculate a degree of difference between a first field value distribution corresponding to the plurality of first field values and a second field value distribution corresponding to the plurality of second field values; and
in a case where the degree of difference is smaller than a preset threshold, judge that the first sampling quality meets the preset criterion.
19. The apparatus according to claim 16, wherein the field information obtaining unit is specifically configured to:
perform first sampling on field values of the field to obtain a plurality of first field values;
determine a plurality of encoding vectors corresponding to the plurality of first field values, and calculate a plurality of similarities between a plurality of pairs of encoding vectors formed from the plurality of encoding vectors; and
determine the plurality of field values based on the plurality of first field values in a case where the plurality of similarities indicate that the similarity among the plurality of encoding vectors is less than a preset degree.
20. The apparatus of claim 16, wherein the preset rules include a sensitive category rule for a number of alternative sensitive categories and a non-sensitive category rule for a number of non-sensitive categories; wherein the pre-discrimination unit is specifically configured to:
perform feature extraction on the field information to obtain a plurality of feature values corresponding to a plurality of preset feature items;
judge the plurality of feature values by using the non-sensitive category rule; and
in a case where the plurality of feature values are judged not to belong to any of the number of non-sensitive categories, judge the plurality of feature values by using the sensitive category rule to obtain the certain sensitive category.
21. The apparatus of claim 16, wherein the detection algorithm comprises a first machine learning model for the certain sensitive category, the first machine learning model including an attribute characterization layer, a value characterization layer, a fusion layer, and a fully connected layer;
wherein the field information processing unit is specifically configured to:
perform characterization processing on the field attribute by using the attribute characterization layer to obtain an attribute characterization vector;
perform characterization processing on the plurality of field values, respectively, by using the value characterization layer to obtain a plurality of value characterization vectors;
fuse, by using the fusion layer, the attribute characterization vector with each of the plurality of value characterization vectors to obtain a plurality of fusion vectors;
process the plurality of fusion vectors, respectively, by using the fully connected layer to obtain a plurality of detection results; and
determine the processing result based on the plurality of detection results.
22. The apparatus of claim 16, wherein the detection algorithm comprises a second machine learning model for the certain sensitive category, and a number of first rules and/or a number of first regular expressions for field attributes, a number of second rules and/or a number of second regular expressions for field values;
wherein the field information processing unit is specifically configured to:
obtain a plurality of attribute feature values based on the field attribute of the field by using the number of first rules and/or the number of first regular expressions;
obtain a plurality of statistical feature values based on the plurality of field values by using the number of second rules and/or the number of second regular expressions; and
input the plurality of attribute feature values and the plurality of statistical feature values into the second machine learning model to obtain the processing result.
23. A sensitive information detection apparatus comprising:
a field information acquisition unit configured to acquire a plurality of pieces of field information of a plurality of fields to be detected, wherein each piece of field information comprises a field attribute and a plurality of field values of a corresponding field;
a pre-discrimination unit configured to filter out, based on the plurality of pieces of field information, non-sensitive fields among the plurality of fields according to a first judgment rule set for non-sensitive fields and a second judgment rule set for sensitive fields, and to obtain a suspected sensitive category of each of a plurality of suspected sensitive fields;
an algorithm determining unit configured to determine, for each suspected sensitive field, a detection algorithm corresponding to the suspected sensitive category of the suspected sensitive field based on a pre-established mapping relation between alternative detection algorithms and suspected sensitive categories; and
a field information processing unit configured to process the field information of the suspected sensitive field by using the detection algorithm to obtain a processing result, wherein the processing result indicates whether the suspected sensitive field belongs to the suspected sensitive category.
24. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed in a computer, causes the computer to perform the method of any one of claims 1-15.
25. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that when executed by the processor implements the method of any of claims 1-15.
CN202110889223.8A 2021-08-04 2021-08-04 Sensitive information detection method and device Pending CN113672976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110889223.8A CN113672976A (en) 2021-08-04 2021-08-04 Sensitive information detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110889223.8A CN113672976A (en) 2021-08-04 2021-08-04 Sensitive information detection method and device

Publications (1)

Publication Number Publication Date
CN113672976A true CN113672976A (en) 2021-11-19

Family

ID=78541308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110889223.8A Pending CN113672976A (en) 2021-08-04 2021-08-04 Sensitive information detection method and device

Country Status (1)

Country Link
CN (1) CN113672976A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120072992A1 (en) * 2010-09-16 2012-03-22 International Business Machines Corporation Securing sensitive data for cloud computing
US20150324606A1 (en) * 2014-05-10 2015-11-12 Informatica Corporation Identifying and Securing Sensitive Data at its Source
US20190108432A1 (en) * 2017-10-05 2019-04-11 Salesforce.Com, Inc. Convolutional neural network (cnn)-based anomaly detection
CN110222170A (en) * 2019-04-25 2019-09-10 平安科技(深圳)有限公司 A kind of method, apparatus, storage medium and computer equipment identifying sensitive data
US20190354717A1 (en) * 2018-05-16 2019-11-21 Microsoft Technology Licensing, Llc. Rule-based document scrubbing of sensitive data
CN111274149A (en) * 2020-02-06 2020-06-12 中国建设银行股份有限公司 Test data processing method and device
CN112491816A (en) * 2020-11-12 2021-03-12 支付宝(杭州)信息技术有限公司 Service data processing method and device
CN112528638A (en) * 2019-08-29 2021-03-19 北京沃东天骏信息技术有限公司 Abnormal object identification method and device, electronic equipment and storage medium
CN112528315A (en) * 2019-09-19 2021-03-19 华为技术有限公司 Method and device for identifying sensitive data
CN113032834A (en) * 2021-04-20 2021-06-25 江苏保旺达软件技术有限公司 Database table processing method, device, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
何文竹;彭长根;王毛妮;丁兴;樊玫玫;丁红发;: "面向结构化数据集的敏感属性识别与分级算法", 计算机应用研究, no. 10, 31 December 2020 (2020-12-31) *
滕金芳;钟诚;: "基于聚类的敏感属性-多样性匿名化算法", 计算机工程与设计, no. 20, 28 October 2010 (2010-10-28) *
许大琴;曹美琴;: "数据库敏感字段的加密研究", 信息安全与技术, no. 02, 10 February 2014 (2014-02-10) *

Similar Documents

Publication Publication Date Title
CN110263538A (en) A kind of malicious code detecting method based on system action sequence
CN110991474A (en) Machine learning modeling platform
CN110619535A (en) Data processing method and device
CN113032525A (en) False news detection method and device, electronic equipment and storage medium
CN114417405A (en) Privacy service data analysis method based on artificial intelligence and server
CN111988327B (en) Threat behavior detection and model establishment method and device, electronic equipment and storage medium
CN113988226B (en) Data desensitization validity verification method and device, computer equipment and storage medium
CN116366312A (en) Web attack detection method, device and storage medium
CN113672976A (en) Sensitive information detection method and device
JP2004171316A (en) Ocr device, document retrieval system and document retrieval program
CN114528908A (en) Network request data classification model training method, classification method and storage medium
CN111931229B (en) Data identification method, device and storage medium
CN111209567B (en) Method and device for judging perceptibility of improving robustness of detection model
CN111611981A (en) Information identification method and device and information identification neural network training method and device
CN113987309B (en) Personal privacy data identification method and device, computer equipment and storage medium
Burrell et al. Testing conventional wisdom (of the crowd)
CN113986956B (en) Data exception query analysis method and device, computer equipment and storage medium
CN113254801A (en) Information processing and training method and device, electronic equipment and computer storage medium
CN115757791A (en) Public opinion big data-based appeal case information extraction clustering method and device
CN113868416A (en) Abnormal short message detection method and device, computer equipment and medium
CN115827866A (en) Log processing method, device, equipment and computer readable storage medium
CN114547606A (en) Third-party application risk analysis method and system for mobile internet operating system
CN115883232A (en) Website identification method, device, equipment and storage medium
CN112907306A (en) Customer satisfaction judging method and device
CN115828137A (en) Method for processing training data, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination