WO2020215571A1 - Method, apparatus, storage medium and computer device for identifying sensitive data - Google Patents

Method, apparatus, storage medium and computer device for identifying sensitive data

Info

Publication number
WO2020215571A1
WO2020215571A1 PCT/CN2019/103529 CN2019103529W WO2020215571A1 WO 2020215571 A1 WO2020215571 A1 WO 2020215571A1 CN 2019103529 W CN2019103529 W CN 2019103529W WO 2020215571 A1 WO2020215571 A1 WO 2020215571A1
Authority
WO
WIPO (PCT)
Prior art keywords
field
sample
data
model
sensitive
Prior art date
Application number
PCT/CN2019/103529
Inventor
许超俊
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020215571A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • This application relates to the field of data identification technology, and in particular to a method, device, storage medium and computer equipment for identifying sensitive data.
  • Sensitive information generally refers to information related to privacy, including property information, health and physiological information, biometric information, identity information, and network identification information, such as ID card numbers, bank card numbers, phone numbers, web browsing records, whereabouts, etc.
  • The labor cost is high, the subjectivity is strong, and the likelihood of missed or incorrect identification is high.
  • Self-defined fuzzy matching on field names is error-prone: it may match fields that should not be matched, or miss fields that should be matched because the fuzzy matching range is too narrow.
  • Self-defining fuzzy verification fields also requires reading and understanding a large amount of data and defining the fields by hand, which places high demands on the operational and data-comprehension abilities of operators.
  • this application provides a method, device, storage medium and computer equipment for identifying sensitive data.
  • A method for identifying sensitive data, including:
  • establishing a recognition model, which includes a recognition sub-model for identifying whether a field is a sensitive field and a classification sub-model for distinguishing between sensitive data and non-sensitive data; acquiring information to be tested, which includes a field to be tested and the data to be tested corresponding to that field; determining whether the field to be tested is a sensitive field according to the recognition sub-model, and determining whether the data to be tested is sensitive data according to the classification sub-model; and, when the field to be tested is a sensitive field and the data to be tested is sensitive data, determining that the information to be tested is sensitive information.
  • A device for identifying sensitive data, including:
  • a model module used to establish a recognition model, which includes a recognition sub-model for identifying whether a field is a sensitive field and a classification sub-model for distinguishing between sensitive data and non-sensitive data; an acquisition module used to obtain information to be tested, which includes a field to be tested and the data to be tested corresponding to that field; a judgment module used to judge whether the field to be tested is a sensitive field according to the recognition sub-model, and whether the data to be tested is sensitive data according to the classification sub-model; and an identification processing module used to determine that the information to be tested is sensitive information when the field to be tested is a sensitive field and the data to be tested is sensitive data.
  • A computer-readable storage medium having computer-readable instructions stored thereon; when the computer-readable instructions are executed by a processor, the steps of identifying sensitive data are implemented.
  • A computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • When executing the computer-readable instructions, the processor implements the steps of identifying sensitive data.
  • The method, device, storage medium, and computer equipment for identifying sensitive data make use of the fact that data in a database carries field attributes, and add a step of identifying the field of the data when identifying sensitive data. By establishing a recognition sub-model and a classification sub-model, the field to be tested and the data to be tested in the information to be tested are judged separately, and whether the information is sensitive is determined along the two dimensions of field and data, so that whether the data to be tested is sensitive can be judged more accurately and the recognition accuracy is higher.
  • At the same time, this method is suitable for identifying large amounts of data in a database; it avoids having to establish the connection between fields and data by manually viewing large amounts of actual data one by one, and improves identification efficiency.
  • The sample data corresponding to a sensitive field is set as sensitive data and the sample data corresponding to a non-sensitive field is set as non-sensitive data, so that whether a sample datum is sensitive can be determined quickly, which makes it convenient to quickly obtain a sample set containing a large amount of data.
  • Setting a weight value for each sample field increases the weight of sample fields that have multiple sample data. Introducing the weight value makes the word frequencies of the participles better match the characteristics of the sample set, so the established recognition sub-model is more accurate, which further improves the accuracy of sensitive-field recognition.
  • the mutual verification of the judgment results is realized through the two judgment results of the recognition sub-model and the classification sub-model, and the recognition accuracy is further improved.
  • the recognition sub-model can be revised and the recognition accuracy of the recognition sub-model can be improved.
  • the accuracy of the model can be gradually improved, and finally a more practical recognition model can be established.
  • FIG. 1 is a flowchart of a method for identifying sensitive data provided by an embodiment of the application;
  • FIG. 2 is a flowchart of a specific method for establishing a recognition model in the method for identifying sensitive data provided by an embodiment of the application;
  • FIG. 3 is a structural diagram of an apparatus for identifying sensitive data provided by an embodiment of the application;
  • FIG. 4 is a schematic structural diagram of a computer device for executing a method for identifying sensitive data provided by an embodiment of the application.
  • An embodiment of the present application provides a method for identifying sensitive data, which uses an identification model to identify sensitive data. Specifically, referring to FIG. 1, the method includes:
  • Step 101 Establish a recognition model.
  • the recognition model includes a recognition sub-model for identifying whether a field is a sensitive field and a classification sub-model for distinguishing between sensitive data and non-sensitive data.
  • the recognition model is divided into two dimensions: field and data, that is, the recognition model includes a recognition sub-model and a classification sub-model.
  • the recognition model can be trained and established by means of machine learning or the like.
  • The recognition sub-model is used to identify whether a field is a sensitive field, that is, a field containing sensitive information, such as the fields "ID card number" and "mobile phone number"; a field can be in text form, numeric form, etc., such as "ID card number", "2018", or "name"; the recognition sub-model can be a neural network model or a classification model.
  • The classification sub-model is used to identify whether data is sensitive data, that is, to distinguish between sensitive data and non-sensitive data; the data is in text or numeric form, such as "110105" (ID number) or "Zhang San" and "Li Si" (names); the classification sub-model can be XGBoost, random forest, or a similar model.
  • a field can correspond to one or more data, that is, the field and data can be stored in the form of a database.
  • the field "name” corresponds to multiple data, including data "Zhang San", data "Li Si” and so on.
  • Step 102 Obtain information to be tested.
  • the information to be tested includes a field to be tested and data to be tested corresponding to the field to be tested.
  • The information to be tested can be identified and verified based on the recognition model to determine whether it is sensitive information.
  • Each piece of information to be tested includes one field to be tested and one corresponding piece of data to be tested. If multiple data under the same field to be tested need to be identified, each of them is treated as a separate piece of information to be tested.
  • For example, if the field to be tested is "date of birth" and the data to be tested includes "01/12" and "11/06", the information can be split into two pieces of information to be tested, "date of birth-01/12" and "date of birth-11/06", and each is identified separately.
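  • The splitting described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_into_test_items` and the tuple representation of a piece of information to be tested are assumptions, not part of the application.

```python
from typing import Iterable, List, Tuple


def split_into_test_items(field: str, values: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair one field name with each of its values, one item per value."""
    return [(field, value) for value in values]


# The "date of birth" example from the text: two values give two items.
items = split_into_test_items("date of birth", ["01/12", "11/06"])
print(items)
```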
  • Step 103 Determine whether the field to be tested is a sensitive field according to the recognition sub-model, and determine whether the data to be tested is sensitive data according to the classification sub-model.
  • Step 104 When the field to be tested is a sensitive field and the data to be tested is sensitive data, it is determined that the information to be tested is sensitive information.
  • The recognition model can be used to determine whether the information to be tested is sensitive information.
  • Specifically, whether the field to be tested is a sensitive field is determined according to the recognition sub-model, and whether the data to be tested is sensitive data is determined according to the classification sub-model; judging whether the information is sensitive along the two dimensions of field and data makes the recognition more accurate.
  • When the field to be tested is a sensitive field and the data to be tested is sensitive data, the information to be tested is sensitive information.
  • When the information to be identified contains multiple data, it can be split into multiple pieces of information to be tested, so that sensitivity identification can be performed on a column of data, a data table, or the entire database.
  • A field to be tested can also correspond to multiple data to be tested. In that case the field does not need to be judged repeatedly: it is judged once, and the result is reused for each of its data to be tested, which reduces the number of field judgments and the amount of processing and improves processing efficiency.
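  • A minimal sketch of this decision rule follows, assuming the two sub-models are available as plain callables. The toy predicates below are hypothetical stand-ins for trained models and are not the sub-models of the application; the point is only that the field is judged once per column and that an item is sensitive when both judgments agree.

```python
from typing import Callable, Iterable, List, Tuple


def identify_column(
    field: str,
    values: Iterable[str],
    field_is_sensitive: Callable[[str], bool],
    data_is_sensitive: Callable[[str], bool],
) -> List[Tuple[str, bool]]:
    """Return (value, is_sensitive_information) for every value of one field."""
    # Judge the field once and reuse the result for every value in the column.
    field_flag = field_is_sensitive(field)
    return [(value, field_flag and data_is_sensitive(value)) for value in values]


calls: List[str] = []


def toy_field_model(field: str) -> bool:
    # Hypothetical stand-in for the recognition sub-model.
    calls.append(field)
    return "number" in field.lower()


def toy_data_model(value: str) -> bool:
    # Hypothetical stand-in for the classification sub-model.
    return value.isdigit() and len(value) >= 6


result = identify_column("ID number", ["110105", "n/a"], toy_field_model, toy_data_model)
```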
  • The method for identifying sensitive data provided by this embodiment makes use of the fact that data in a database carries field attributes, and adds a step of identifying the field of the data when identifying sensitive data. By establishing a recognition sub-model and a classification sub-model, the field to be tested and the data to be tested in the information to be tested are judged separately, and whether the information is sensitive is determined along the two dimensions of field and data, so that whether the data to be tested is sensitive can be judged more accurately and the recognition accuracy is higher.
  • At the same time, this method is suitable for identifying large amounts of data in a database; it avoids having to establish the connection between fields and data by manually viewing large amounts of actual data one by one, and improves identification efficiency.
  • step 101 "establishing a recognition model" includes:
  • Step 1011 Obtain a sample set.
  • The sample set includes sample fields and one or more sample data corresponding to each sample field; the sample fields include sensitive fields and non-sensitive fields, the sample data corresponding to a sensitive field is sensitive data, and the sample data corresponding to a non-sensitive field is non-sensitive data.
  • the sample set is a sample used to train the recognition model, which includes sample fields and corresponding sample data; the sample set can be stored in a database or a table in the database.
  • the sample library is used to store the sample set, each field in the sample library corresponds to a sample field, and a column of data corresponding to each field is the corresponding sample data.
  • some sample fields are sensitive fields, such as "ID number”, "geographic location”, etc.; some sample fields are not sensitive fields, such as "serial number", "weather”, etc.
  • In this embodiment, the sample data corresponding to a sensitive field is set as sensitive data and the sample data corresponding to a non-sensitive field is set as non-sensitive data, so that whether a sample datum is sensitive can be determined quickly, which makes it convenient to quickly obtain a sample set containing a large amount of data.
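  • The labelling rule above, in which each sample datum inherits the label of its field, might look like the following sketch. The dictionary layout of the sample library, the function name `label_samples`, and the example values are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

Labelled = List[Tuple[str, bool]]


def label_samples(
    sample_library: Dict[str, List[str]], sensitive_fields: Set[str]
) -> Tuple[Labelled, Labelled]:
    """Build labelled field samples and data samples from a column store."""
    field_samples: Labelled = []
    data_samples: Labelled = []
    for field, column in sample_library.items():
        is_sensitive = field in sensitive_fields
        field_samples.append((field, is_sensitive))
        # Every value in the column takes the label of its field.
        data_samples.extend((value, is_sensitive) for value in column)
    return field_samples, data_samples


# Illustrative library: one sensitive column and one non-sensitive column.
library = {"ID number": ["110105", "440301"], "weather": ["sunny"]}
field_samples, data_samples = label_samples(library, {"ID number"})
```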
  • Step 1012 Train the recognition sub-model according to all the sample fields in the sample set, determine the trained recognition sub-model, train the classification sub-model according to all the sample data, and determine the trained classification sub-model.
  • Step 1013 Test the trained recognition sub-model and classification sub-model according to the test set, and when the recognition sub-model and the classification sub-model pass the test, generate a recognition model based on the trained recognition sub-model and classification sub-model.
  • The recognition sub-model can be trained using the sample fields to determine its parameters, and the classification sub-model can be trained using the sample data to determine its parameters, which yields the trained recognition sub-model and classification sub-model. The test set can then be used to test the trained models and verify their effect.
  • The test set is a set of test samples used to test the model. Like the sample set, it includes test fields and one or more test data corresponding to each test field; whether each test field is a sensitive field and whether each test datum is sensitive data are known in advance. Testing the trained recognition sub-model and classification sub-model with the test set yields a more accurate recognition model.
  • If a sub-model does not pass the test, training of the recognition sub-model or the classification sub-model can be continued until both pass; the trained recognition sub-model and classification sub-model are then used as the models for subsequent use, that is, the sensitivity judgment in step 103 is made according to them.
  • In step 1012, "training the recognition sub-model according to all the sample fields in the sample set" trains the recognition sub-model based on the word frequencies of the sample fields.
  • the process of training the sample fields includes:
  • Step A1 Perform word segmentation processing on the sample fields in the sample set respectively, and determine the word segmentation of each sample field.
  • the word segmentation of each sample field can be determined; for example, after word segmentation of the sample field "mobile phone number", two word segmentation can be obtained: “mobile phone” and "number”.
  • the process of word segmentation processing may specifically perform word segmentation based on the word segmentation model, which is not limited in this embodiment.
  • Step A2 Use the word segmentation of all sample fields as a word segmentation set, and determine the word frequency of each word segmentation of the sample field in the word segmentation set.
  • step A2 "take the word segmentation of all sample fields as the word segmentation set, and determine the word frequency of each word segmentation of the sample field in the word segmentation set" includes:
  • Step A21 Determine the number of sample data $\alpha_i$ corresponding to each sample field in the sample set, where $\alpha_i$ is the number of sample data corresponding to the i-th sample field, $i \in [1, n]$, and n is the number of sample fields in the sample set.
  • Each sample field may correspond to one or more sample data, and the weight of the sample field is determined according to the number of sample data corresponding to it.
  • Table 1: The sample set contains three sample fields: "name", "ID number", and "mobile phone number".
  • The sample data corresponding to each sample field is shown in Table 1.
  • A blank cell in Table 1 indicates that there is no sample data. The sample field "name" corresponds to 4 sample data, "ID number" corresponds to 2 sample data, and "mobile phone number" corresponds to 3 sample data.
  • Step A22 Use $\alpha_i$ as the weight value of the number of each participle in the sample field, use all the participles as the word segmentation set, and determine the total number of participles in the set: $N = \sum_{i=1}^{n} \alpha_i m_i$, where N is the total number of participles and $m_i$ is the number of participles in the i-th sample field in the sample set.
  • Using $\alpha_i$ as the weight value and taking all the participles as the word segmentation set is equivalent to copying the participles of the i-th sample field $\alpha_i$ times.
  • Each participle of the i-th sample field is therefore counted $\alpha_i$ times, so the i-th sample field contributes $\alpha_i m_i$ participles, and the total number of participles over the n sample fields is $N = \sum_{i=1}^{n} \alpha_i m_i$.
  • Step A23 Determine the word frequency of each participle $a_{ij}$ of the sample field in the word segmentation set: $f_{ij} = \frac{\sum_{k} \alpha_k \beta_k}{N}$
  • Here $f_{ij}$ is the word frequency of the j-th participle $a_{ij}$ in the i-th sample field, $j \in [1, m_i]$; k ranges over the sample fields that contain the participle $a_{ij}$; $\alpha_k$ is the weight value of the k-th sample field;
  • and $\beta_k$ is the number of occurrences of the participle $a_{ij}$ in the k-th sample field.
  • When determining the word frequency, the weight still needs to be considered; that is, all the sample fields in the sample set that contain the participle $a_{ij}$ are used as the reference,
  • and the weight of each of those sample fields must be introduced.
  • If the k-th sample field contains the participle $a_{ij}$, the number of occurrences of $a_{ij}$ contributed by that field can be determined from its weight value $\alpha_k$.
  • A sample field may contain the same participle several times; if the k-th sample field contains $a_{ij}$ $\beta_k$ times, it contributes $\alpha_k \beta_k$ occurrences of $a_{ij}$ in total.
  • $\beta_k$ can default to 1 to simplify the calculation.
  • Because the i-th sample field must contain the participle $a_{ij}$, one value of k must be i; the other values of k are determined by the actual sample set.
  • This embodiment increases the weight of a sample field that has multiple sample data by setting a weight value for the sample field, and introduces the weight value when determining both the total number of participles and the count of each participle in the word segmentation set. The word frequencies of the participles therefore better match the characteristics of the sample set, and the established recognition sub-model is more accurate, which further improves the accuracy of sensitive-field recognition.
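  • Steps A21 to A23 can be illustrated with the sample counts given for Table 1 (4, 2, and 3 sample data). In this sketch each field's participles are counted once per sample datum of that field, and a participle's frequency is its weighted count divided by the weighted total. The segmentation of "name" and "ID number" is an assumption made for the example; only the split of "mobile phone number" into "mobile phone" and "number" comes from the text.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, List


def weighted_word_frequencies(
    segments: Dict[str, List[str]], weights: Dict[str, int]
) -> Dict[str, Fraction]:
    """Weighted frequency of each participle over all sample fields."""
    weighted_counts: Counter = Counter()
    for field, tokens in segments.items():
        # A participle that occurs c times in a field with weight w counts w * c times.
        for token, count in Counter(tokens).items():
            weighted_counts[token] += weights[field] * count
    total = sum(weighted_counts.values())  # weighted total number of participles
    return {token: Fraction(count, total) for token, count in weighted_counts.items()}


# Assumed segmentation; weights are the sample-data counts stated for Table 1.
segments = {
    "name": ["name"],
    "ID number": ["ID", "number"],
    "mobile phone number": ["mobile phone", "number"],
}
weights = {"name": 4, "ID number": 2, "mobile phone number": 3}
freq = weighted_word_frequencies(segments, weights)
print(freq["number"])  # 5/14: (2 + 3) weighted occurrences out of 4 + 4 + 6 = 14
```

  • Under these assumptions "number" receives the highest frequency because it appears in two fields, which is consistent with the later remark that "number" is a strong indicator of a sensitive field.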
  • Step A3 Generate the feature vector of the sample field according to the word frequency of the word segmentation, and train the recognition sub-model according to the feature vector of the sample field.
  • the corresponding feature vector can be generated after the word frequency of the word segmentation is determined, and then the model can be trained according to the feature vector as the input parameter of the model.
  • The sensitive feature vector of the i-th sample field is built from the word frequencies of its participles. Training the recognition sub-model with the semantics of the sample field itself (the result of word segmentation) and the word frequencies over the entire set of sample fields makes it easier to capture the characteristics of sensitive fields. For example, ID card numbers, mobile phone numbers, and the like are generally sensitive data, so the participle "number" has a higher probability of being identified as a participle of a sensitive field.
  • The feature vector of the field to be tested is generated after word segmentation is performed on that field and the word frequencies are determined, and the recognition sub-model then makes its judgment from this vector; the process is similar to determining the feature vector of a field during training and is not repeated here.
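  • The text does not spell out the layout of the feature vector, so the following is only one plausible encoding, stated as an assumption: fix an ordered vocabulary of participles and put each participle's word frequency in its slot when the field contains it, and 0 otherwise. The vocabulary and the frequency values are illustrative.

```python
from typing import Dict, List


def field_feature_vector(
    tokens: List[str], word_freq: Dict[str, float], vocabulary: List[str]
) -> List[float]:
    """Encode a segmented field as word frequencies over a fixed vocabulary."""
    present = set(tokens)
    return [word_freq.get(word, 0.0) if word in present else 0.0 for word in vocabulary]


# Illustrative vocabulary and word frequencies (assumed, not from the application).
vocabulary = ["name", "ID", "mobile phone", "number"]
word_freq = {"name": 4 / 14, "ID": 2 / 14, "mobile phone": 3 / 14, "number": 5 / 14}

vec = field_feature_vector(["mobile phone", "number"], word_freq, vocabulary)
```

  • The same function can be applied to a field to be tested at prediction time, so that training and prediction use the same encoding.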
  • After step 103, the method further includes:
  • Step B1 When the data to be tested is sensitive data but the field to be tested is not a sensitive field, obtain multiple other data corresponding to the field to be tested, and determine whether each of the other data is sensitive data according to the classification sub-model.
  • Step B2 When more than a preset number or a preset proportion of the other data is sensitive data, mark the field to be tested as a sensitive field, and use the marked information to be tested as a sample to train the recognition model.
  • the recognition accuracy of the recognition sub-model is lower than that of the classification sub-model. Therefore, when the data to be tested is sensitive data but the field to be tested is not a sensitive field, it is necessary to further determine whether the field to be tested is a sensitive field.
  • the field to be tested is a field in the database.
  • In addition to the data to be tested, the field to be tested also corresponds to other data; in this embodiment, determining whether that other data is sensitive data makes it possible to further determine whether the field to be tested is a sensitive field.
  • In step B2, when more than the preset number or preset proportion of the other data is sensitive data, the field to be tested contains a large amount of sensitive data, so the field to be tested should also be a sensitive field.
  • This embodiment then uses the information to be tested as a sample to continue training the recognition sub-model in the recognition model, which corrects the recognition sub-model and improves its recognition accuracy.
  • the accuracy of the model can be gradually improved, and finally a more practical recognition model can be established.
  • After step B1, when no more than the preset number or preset proportion of the other data is sensitive data, it can be determined that the field to be tested contains only a small amount of sensitive data.
  • In this case, the field to be tested is treated as a non-sensitive field, and the identification result of the information to be tested is output: the field to be tested is a non-sensitive field, and the data to be tested is sensitive data.
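  • The check in steps B1 and B2 can be sketched as a ratio test over the other data of the field. The threshold `min_ratio`, the toy classifier, and the example values are assumptions for illustration; the application leaves the preset number or proportion open.

```python
from typing import Callable, Iterable


def field_sensitive_by_data(
    other_values: Iterable[str],
    data_is_sensitive: Callable[[str], bool],
    min_ratio: float = 0.5,
) -> bool:
    """Return True when the share of sensitive values exceeds min_ratio."""
    values = list(other_values)
    if not values:
        return False
    hits = sum(1 for value in values if data_is_sensitive(value))
    return hits / len(values) > min_ratio


def toy_data_model(value: str) -> bool:
    # Hypothetical stand-in for the classification sub-model.
    return value.isdigit() and len(value) == 6


others = ["110105", "440301", "310104", "n/a"]
relabel_field = field_sensitive_by_data(others, toy_data_model)  # 3 of 4 are sensitive
```

  • When the function returns True, the field would be marked as sensitive and the item added to the training samples; otherwise the field stays non-sensitive, as described above.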
  • The sensitivity of the field to be tested can also be judged further through other fields related to it. Specifically, when no more than a preset number or a preset proportion of the other data is sensitive data, the method further includes:
  • Step B3 Query whether there is a target field such that part of the data to be tested is the same as data corresponding to the target field, and part of the other data in the field to be tested is also the same as other data corresponding to the target field; when the target field exists and is a sensitive field, mark the field to be tested as a sensitive field, and use the marked information to be tested as a sample to train the recognition model.
  • Whether the field to be tested is a sensitive field is judged by determining whether its data contains other sensitive data. Specifically, if there is a target field and part of the data in the field to be tested is the same as the data in the target field, the data of the field to be tested includes the data of the target field, that is, the field to be tested contains the target field; if the target field is a sensitive field, its data is also sensitive data. Correspondingly, since the field to be tested contains the target field, the field to be tested and its data are also sensitive.
  • the field to be tested is "ID number”
  • the recognition result of the recognition sub-model can be corrected, and the recognition accuracy of the test field can be further improved.
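  • One way to read the target-field query of step B3 is substring containment: a target column qualifies when the data to be tested contains one of its values and at least one other value of the tested field does too. That reading, the function names, and the example values below are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set


def contains_any(value: str, parts: List[str]) -> bool:
    """True when any non-empty part occurs inside value."""
    return any(bool(part) and part in value for part in parts)


def find_sensitive_target(
    test_value: str,
    other_values: List[str],
    columns: Dict[str, List[str]],
    sensitive_fields: Set[str],
) -> Optional[str]:
    """Return the name of a sensitive target field whose data is contained
    in the tested data, or None when no such field exists."""
    for name, target_values in columns.items():
        if name not in sensitive_fields:
            continue
        if contains_any(test_value, target_values) and any(
            contains_any(value, target_values) for value in other_values
        ):
            return name
    return None


# Made-up values: each tested value embeds a value from the "date of birth" column.
columns = {"date of birth": ["19900112", "19851106"], "weather": ["sunny"]}
target = find_sensitive_target(
    "11010519900112001X", ["44030119851106002X"], columns, {"date of birth"}
)
```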
  • After step 103, the method further includes:
  • Step C1 When the data to be tested is not sensitive data but the field to be tested is a sensitive field, obtain multiple other data corresponding to the field to be tested, and determine whether each of the other data is sensitive data according to the classification sub-model.
  • Step C2 When more than a preset number or a preset proportion of the other data is sensitive data, mark the data to be tested as sensitive data, and use the marked information to be tested as a sample to train the recognition model.
  • When the data to be tested is not sensitive data but the field to be tested is a sensitive field, the classification result of the classification sub-model can be corrected, as in steps B1-B2 above, by determining whether the other data in the field to be tested is sensitive data. Specifically, if more than the preset number or preset proportion of the other data is sensitive data, the data to be tested has a high probability of being sensitive data. The data to be tested is then marked as sensitive data, and the marked information to be tested is used as a sample to train the recognition model and correct the classification sub-model, which also improves the recognition accuracy of the classification sub-model. At the same time, the two judgment results of the recognition sub-model and the classification sub-model verify each other, which further improves the recognition accuracy.
  • When no more than a preset number or a preset proportion of the other data is sensitive data, similar to step B3 of the foregoing embodiment, after step C1 the method further includes:
  • Step C3 Query whether there is a target field such that part of the data to be tested is the same as data corresponding to the target field, and part of the other data in the field to be tested is also the same as other data corresponding to the target field; when the target field exists and is a sensitive field, mark the data to be tested as sensitive data, and use the marked information to be tested as a sample to train the recognition model.
  • In this embodiment, although the field to be tested has been determined to be a sensitive field, only a small amount of its data is sensitive data, so the data of the field needs to be further identified and verified, specifically by querying the target field. That is, if a target field exists and its data are all sensitive data, the data to be tested in the field to be tested is also regarded as sensitive data, so that the information to be tested is identified accurately.
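  • Taken together, the B and C branches amount to a reconciliation step for the cases where the two sub-models disagree. The sketch below takes already-computed judgments as plain booleans; the function name `reconcile`, the `min_ratio` threshold, and the `target_is_sensitive` flag are illustrative assumptions rather than terms from the application.

```python
from typing import List, Tuple


def reconcile(
    field_flag: bool,
    data_flag: bool,
    other_flags: List[bool],
    min_ratio: float = 0.5,
    target_is_sensitive: bool = False,
) -> Tuple[bool, bool]:
    """Return corrected (field_is_sensitive, data_is_sensitive) labels."""
    if field_flag == data_flag:
        return field_flag, data_flag  # the sub-models agree, nothing to correct
    ratio = sum(other_flags) / len(other_flags) if other_flags else 0.0
    # Steps B2/C2: enough sensitive data in the column. Steps B3/C3: a sensitive target field.
    if ratio > min_ratio or target_is_sensitive:
        return True, True
    return field_flag, data_flag


promoted = reconcile(True, False, [True, True, False])
unchanged = reconcile(False, True, [False, False, True])
by_target = reconcile(True, False, [False], target_is_sensitive=True)
```

  • Items whose labels were promoted would then be fed back as training samples, as steps B2, B3, C2, and C3 describe.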
  • the sensitivity of the information to be tested can also be further determined based on the process similar to steps B1-B3 and steps C1-C3 in the foregoing embodiment.
  • The process of the method for identifying sensitive data is described in detail above; the method can also be implemented by a corresponding apparatus.
  • The structure and function of the apparatus are described in detail below.
  • An embodiment of the present application also provides an apparatus for identifying sensitive data, as shown in FIG. 3, including:
  • The model module 31 is used to establish a recognition model.
  • The recognition model includes a recognition sub-model for recognizing whether a field is a sensitive field and a classification sub-model for distinguishing sensitive data from non-sensitive data.
  • The acquisition module 32 is used to obtain the information to be tested.
  • The information to be tested includes the field to be tested and the data to be tested corresponding to that field.
  • The judgment module 33 is configured to judge, according to the recognition sub-model, whether the field to be tested is a sensitive field, and to judge, according to the classification sub-model, whether the data to be tested is sensitive data.
  • The identification processing module 34 is configured to determine that the information to be tested is sensitive information when the field to be tested is a sensitive field and the data to be tested is sensitive data.
  • The model module includes a sample acquisition unit for acquiring a sample set, the sample set including sample fields and one or more sample data corresponding to each sample field; the sample fields include sensitive fields and non-sensitive fields, the sample data corresponding to a sensitive field is sensitive data, and the sample data corresponding to a non-sensitive field is non-sensitive data.
  • The model module further includes a training unit, which trains the recognition sub-model on all the sample fields in the sample set to obtain the trained recognition sub-model and trains the classification sub-model on all the sample data to obtain the trained classification sub-model, and a test unit, which tests the trained recognition sub-model and classification sub-model against a test set and, when both pass the test, generates the recognition model from them.
  • The training unit includes: a word segmentation subunit, which performs word segmentation on the sample fields in the sample set to determine the word segments of each sample field; a processing subunit, which takes the word segments of all the sample fields as a word-segment set and determines the word frequency of each word segment of a sample field in that set; and a training subunit, which generates the feature vector of a sample field from the word frequencies of its word segments and trains the recognition sub-model on the feature vectors of the sample fields.
  • The processing subunit is specifically configured to determine, for each sample field in the sample set, the number of corresponding sample data $\omega_i$, where $\omega_i$ is the number of sample data corresponding to the $i$-th sample field, $i \in [1, n]$, and $n$ is the number of sample fields in the sample set.
  • It uses $\omega_i$ as the weight value for the count of each word segment in the sample field, takes all word segments as the word-segment set, and determines the total number of word segments in the set: $N = \sum_{i=1}^{n} \omega_i m_i$, where $N$ is the total number of word segments and $m_i$ is the number of word segments of the $i$-th sample field in the sample set.
  • It then determines the word frequency of each word segment $a_{ij}$ of a sample field in the word-segment set: $f_{ij} = \frac{1}{N} \sum_{k} \omega_k \lambda_k$, where $f_{ij}$ is the word frequency of the $j$-th word segment $a_{ij}$ of the $i$-th sample field, $j \in [1, m_i]$; $k$ indexes the sample fields that contain the word segment $a_{ij}$; $\omega_k$ is the weight value for the word-segment count of the $k$-th sample field; and $\lambda_k$ is the number of times $a_{ij}$ occurs in the $k$-th sample field.
  • The identification processing module 34 is further configured to: when the data to be tested is sensitive data but the field to be tested is not a sensitive field, obtain multiple other data corresponding to the field to be tested and determine, according to the classification sub-model, whether each of the other data is sensitive data; and, when more than a preset number or preset proportion of the other data is sensitive data, mark the field to be tested as a sensitive field and train the recognition model using the marked information to be tested as a sample.
  • The identification processing module 34 is further configured to: when the data to be tested is not sensitive data but the field to be tested is a sensitive field, obtain multiple other data corresponding to the field to be tested and determine, according to the classification sub-model, whether each of the other data is sensitive data; and, when more than a preset number or preset proportion of the other data is sensitive data, mark the data to be tested as sensitive data and train the recognition model using the marked information to be tested as a sample.
  • The identification processing module 34 is also used to query whether a target field exists such that a part of the data to be tested is identical to the corresponding data of the target field and a part of the other data of the field to be tested is likewise identical to the corresponding other data of the target field; and, when the target field exists and is a sensitive field, to mark the field to be tested as a sensitive field and/or mark the data to be tested as sensitive data, and to train the recognition model using the marked information to be tested as a sample.
  • The apparatus for identifying sensitive data exploits the fact that data in a database carries field attributes, adds a field-recognition step to sensitive-data identification, and establishes a recognition sub-model and a classification sub-model.
  • The field to be tested and the data to be tested in the information to be tested are judged separately, and whether the information to be tested is sensitive information is determined along two dimensions, field and data, so that whether the data to be tested is sensitive data can be judged more accurately.
  • The approach is suited to identifying large amounts of data in a database; it removes the need to establish the relationship between fields and data by manually inspecting large amounts of actual data one record at a time, which improves identification efficiency.
  • The sample data corresponding to a sensitive field is set as sensitive data.
  • The sample data corresponding to a non-sensitive field is set as non-sensitive data, so that whether sample data is sensitive can be determined quickly and a sample set containing a large amount of data can be obtained quickly.
  • The weight value is introduced so that the word frequency of each word segment better reflects the characteristics of the sample set.
  • The established recognition sub-model is therefore more accurate, which further improves the accuracy of sensitive-field recognition.
  • The two judgments made by the recognition sub-model and the classification sub-model verify each other, which further improves recognition accuracy.
  • The recognition sub-model can be corrected and its recognition accuracy improved.
  • The accuracy of the model improves gradually, and a practical recognition model is eventually established.
  • An embodiment of the present application also provides a computer storage medium that stores computer-executable instructions containing a program for executing the above method for identifying sensitive data; the computer-executable instructions can execute the method in any of the above method embodiments.
  • The computer storage medium may be any available medium or data storage device that the computer can access, including but not limited to magnetic storage (such as floppy disk, hard disk, magnetic tape, and magneto-optical disk (MO)), optical storage (such as CD, DVD, BD, and HVD), and semiconductor memory (such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), and solid state drive (SSD)).
  • The computer device 1100 may be a host server with computing capability, a personal computer (PC), a portable computer, a terminal, or the like.
  • The specific embodiments of the present application do not limit the specific implementation of the computer device.
  • The computer device 1100 includes at least one processor 1110, a communications interface 1120, a memory (memory array) 1130, and a bus 1140.
  • The processor 1110, the communications interface 1120, and the memory 1130 communicate with each other through the bus 1140.
  • The communications interface 1120 is used to communicate with network elements, which include, for example, a virtual machine management center and shared storage.
  • The processor 1110 is used to execute programs.
  • The processor 1110 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • The memory 1130 is used to store executable instructions.
  • The memory 1130 may include high-speed RAM, and may also include non-volatile memory, for example at least one disk memory.
  • The memory 1130 may also be a memory array.
  • The memory 1130 may also be divided into blocks, and the blocks may be combined into a virtual volume according to certain rules.
  • The instructions stored in the memory 1130 may be executed by the processor 1110, so that the processor 1110 can execute the method in any of the foregoing method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Discrimination (AREA)

Abstract

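A method, apparatus, storage medium, and computer device for identifying sensitive data. The method includes: establishing a recognition model that includes a recognition sub-model for recognizing whether a field is a sensitive field and a classification sub-model for distinguishing sensitive data from non-sensitive data (101); obtaining information to be tested, which includes a field to be tested and data to be tested corresponding to that field (102); judging, according to the recognition sub-model, whether the field to be tested is a sensitive field, and judging, according to the classification sub-model, whether the data to be tested is sensitive data (103); and, when the field to be tested is a sensitive field and the data to be tested is sensitive data, determining that the information to be tested is sensitive information (104). The method adds a field-recognition step to sensitive-data identification: the recognition sub-model and the classification sub-model judge the field to be tested and the data to be tested separately, and whether the information to be tested is sensitive is determined along two dimensions, field and data, so that whether the data to be tested is sensitive data can be judged more accurately.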

Description

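Method, apparatus, storage medium, and computer device for identifying sensitive data
This application claims priority to Chinese patent application No. 2019103372668, filed with the China Patent Office on April 25, 2019 and entitled "Method, apparatus, storage medium, and computer device for identifying sensitive data", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of data identification, and in particular to a method, apparatus, storage medium, and computer device for identifying sensitive data.
Background
Sensitive information generally refers to information that involves the right to privacy, including property information, health and physiological information, biometric information, identity information, and online identity information, for example ID card numbers, bank card numbers, telephone numbers, web browsing history, and location tracks.
The inventor found that user-related sensitive information is currently obtained mainly through manual identification and user-defined fuzzy-match fields. Manually identifying a large number of table fields is labor-intensive and subjective, and is prone to missed and incorrect identifications. With user-defined fuzzy-match fields, fuzzy matching on the chosen field names easily goes wrong: fields that should not match are matched, or fields that should match are missed because the fuzzy-match scope is too narrow. The approach also requires extensive reading and understanding of the data followed by manual definition, which places high demands on the operator's business knowledge and ability to understand data.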
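Summary
To solve the problems in the prior art, this application provides a method, apparatus, storage medium, and computer device for identifying sensitive data.
According to a first aspect of this application, a method for identifying sensitive data is provided, including:
establishing a recognition model, the recognition model including a recognition sub-model for recognizing whether a field is a sensitive field and a classification sub-model for distinguishing sensitive data from non-sensitive data; obtaining information to be tested, the information to be tested including a field to be tested and data to be tested corresponding to the field to be tested; judging, according to the recognition sub-model, whether the field to be tested is a sensitive field, and judging, according to the classification sub-model, whether the data to be tested is sensitive data; and, when the field to be tested is a sensitive field and the data to be tested is sensitive data, determining that the information to be tested is sensitive information.
According to a second aspect of this application, an apparatus for identifying sensitive data is provided, including:
a model module for establishing a recognition model, the recognition model including a recognition sub-model for recognizing whether a field is a sensitive field and a classification sub-model for distinguishing sensitive data from non-sensitive data; an acquisition module for obtaining information to be tested, the information to be tested including a field to be tested and data to be tested corresponding to the field to be tested; a judgment module for judging, according to the recognition sub-model, whether the field to be tested is a sensitive field, and judging, according to the classification sub-model, whether the data to be tested is sensitive data; and an identification processing module for determining that the information to be tested is sensitive information when the field to be tested is a sensitive field and the data to be tested is sensitive data.
According to a third aspect of this application, a computer-readable storage medium is provided, on which computer-readable instructions are stored; when executed by a processor, the computer-readable instructions implement the steps of identifying sensitive data.
According to a fourth aspect of this application, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the steps of identifying sensitive data are implemented.
The method, apparatus, storage medium, and computer device for identifying sensitive data provided by the embodiments of this application exploit the fact that data in a database carries field attributes, and add a field-recognition step to sensitive-data identification. A recognition sub-model and a classification sub-model judge the field to be tested and the data to be tested separately, and whether the information to be tested is sensitive is determined along two dimensions, field and data, so that whether the data to be tested is sensitive data can be judged more accurately. The method is suited to identifying large amounts of data in a database; it removes the need to establish the relationship between fields and data by manually inspecting large amounts of actual data one record at a time, which improves identification efficiency. Sample data corresponding to a sensitive field is set as sensitive data and sample data corresponding to a non-sensitive field is set as non-sensitive data, so the sensitivity of sample data can be determined quickly and a sample set containing a large amount of data can be obtained quickly. Setting a weight value for each sample field raises the weight of sample fields that have many sample data; the weight is used both when determining the total number of word segments and when counting each word segment in the word-segment set, so that the word frequencies better reflect the characteristics of the sample set, the resulting recognition sub-model is more accurate, and sensitive-field recognition improves further. The two judgments made by the recognition sub-model and the classification sub-model verify each other, which further improves accuracy. Continuing to train the recognition sub-model of the recognition model with the information to be tested as a sample corrects the sub-model and improves its accuracy. Through continued learning and optimization, the accuracy of the recognition model improves gradually, and a practical recognition model is eventually established.
Other features and advantages of this application will be set forth in the description that follows, and in part will be apparent from the description or may be learned by practicing this application. The objectives and other advantages of this application may be realized and attained by the structures particularly pointed out in the written description, the claims, and the drawings.
The technical solutions of this application are described in further detail below with reference to the drawings and embodiments.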
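Brief Description of the Drawings
The drawings provide a further understanding of this application and form part of the specification. Together with the embodiments of this application, they serve to explain this application and do not limit it. In the drawings:
FIG. 1 is a flowchart of the method for identifying sensitive data provided by an embodiment of this application;
FIG. 2 is a flowchart of the specific method for establishing the recognition model in the method for identifying sensitive data provided by an embodiment of this application;
FIG. 3 is a structural diagram of the apparatus for identifying sensitive data provided by an embodiment of this application;
FIG. 4 is a schematic structural diagram of a computer device for executing the method for identifying sensitive data provided by an embodiment of this application.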
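Detailed Description
Preferred embodiments of this application are described below with reference to the drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain this application and do not limit it.
An embodiment of this application provides a method for identifying sensitive data by means of a recognition model. As shown in FIG. 1, the method includes:
Step 101: Establish a recognition model, which includes a recognition sub-model for recognizing whether a field is a sensitive field and a classification sub-model for distinguishing sensitive data from non-sensitive data.
In this embodiment, the recognition model is split along two dimensions, field and data: it includes a recognition sub-model and a classification sub-model, and can be trained and established by machine learning or similar means. The recognition sub-model recognizes whether a given field is a sensitive field, that is, a field that contains sensitive information, such as the fields "ID card number" and "mobile phone number". A field may be textual or numeric, for example "ID number", "2018", or "name". The recognition sub-model may be a neural network model or a classification model. The classification sub-model recognizes whether data is sensitive data, that is, it distinguishes sensitive data from non-sensitive data. The data is textual or numeric, for example "110105……" (an ID card number) or "Zhang San" and "Li Si" (names). The classification sub-model may use a model such as XGBoost or random forest. In this embodiment, one field may correspond to one or more data items, so fields and data can be stored in the form of a database. For example, the field "name" corresponds to several data items, including "Zhang San" and "Li Si".
Step 102: Obtain information to be tested, which includes a field to be tested and data to be tested corresponding to the field to be tested.
After the recognition model is established, the information to be identified, that is, the information to be tested, can be checked against the model to determine whether it is sensitive. Accordingly, the information to be tested also contains a field to be tested and the corresponding data to be tested. In general, a piece of information to be tested contains one field to be tested and one corresponding data item; when several data items of the same field need to be identified, each data item is split into its own piece of information to be tested. For example, if the field to be tested is "date of birth" and the data to be tested includes "01/12" and "11/06", the information is split into two pieces, "date of birth-01/12" and "date of birth-11/06", and each piece is identified separately.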
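The splitting described in step 102 can be sketched as follows. This is a minimal illustration only; the function name `split_into_items` and the tuple representation of a piece of information to be tested are assumptions, not part of the application.

```python
from typing import Iterable, List, Tuple


def split_into_items(field: str, values: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair one field name with each of its data items.

    Each (field, value) pair stands for one piece of information to be tested.
    """
    return [(field, value) for value in values]


# The "date of birth" example from the text: two data items, two pieces of information.
items = split_into_items("date of birth", ["01/12", "11/06"])
print(items)
```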
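Step 103: Judge, according to the recognition sub-model, whether the field to be tested is a sensitive field, and judge, according to the classification sub-model, whether the data to be tested is sensitive data.
Step 104: When the field to be tested is a sensitive field and the data to be tested is sensitive data, determine that the information to be tested is sensitive information.
Once the information to be tested has been determined, the recognition model can be used to judge whether it is sensitive information. The recognition sub-model judges whether the field to be tested is a sensitive field, and the classification sub-model judges whether the data to be tested is sensitive data. Determining sensitivity along both dimensions, field and data, yields higher accuracy. Specifically, when the field to be tested is a sensitive field and the data to be tested is sensitive data, the information to be tested is sensitive information.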
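The decision rule of steps 103 and 104 can be sketched as a conjunction of two independent judgments. The two toy predicates below are hypothetical stand-ins for the trained sub-models and are assumptions made only so the example runs; the application does not specify them.

```python
from typing import Callable


def is_sensitive_info(field: str, value: str,
                      field_model: Callable[[str], bool],
                      data_model: Callable[[str], bool]) -> bool:
    """Step 104: sensitive only if BOTH the field and the data are judged sensitive."""
    return field_model(field) and data_model(value)


# Hypothetical stand-ins for the trained recognition and classification sub-models.
def toy_field_model(field: str) -> bool:
    return "number" in field.lower()


def toy_data_model(value: str) -> bool:
    return value.isdigit() and len(value) >= 6


print(is_sensitive_info("ID card number", "110105", toy_field_model, toy_data_model))
```

Keeping the two judgments as separate callables mirrors the two-dimension design: either sub-model can be retrained or replaced without touching the combining rule.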
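As described above, even when a large amount of data must be identified, it can be split into many pieces of information to be tested, so that a column of data, a table, or an entire database can be checked for sensitivity. Optionally, a field to be tested may also correspond to several data items to be tested; the field then does not need to be judged repeatedly during identification, because a single judgment determines the field sensitivity for every data item under it. This reduces the number of field judgments and the processing load, and so improves processing efficiency.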
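A minimal sketch of the optional optimization above, in which the field is judged once per column rather than once per data item. The function name `scan_column` and the callable sub-models are assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def scan_column(field: str, values: Sequence[str],
                field_model: Callable[[str], bool],
                data_model: Callable[[str], bool]) -> List[Tuple[str, bool]]:
    """Judge the field once, then judge each data item under it."""
    field_is_sensitive = field_model(field)  # evaluated a single time for the column
    return [(value, field_is_sensitive and data_model(value)) for value in values]


# Count how often the (hypothetical) field model is invoked.
calls = []


def counting_field_model(field: str) -> bool:
    calls.append(field)
    return True


result = scan_column("mobile phone number", ["135", "abc"], counting_field_model, str.isdigit)
print(result, len(calls))
```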
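The method for identifying sensitive data provided by this embodiment exploits the fact that data in a database carries field attributes, and adds a field-recognition step to sensitive-data identification. A recognition sub-model and a classification sub-model judge the field to be tested and the data to be tested separately, and whether the information to be tested is sensitive is determined along two dimensions, field and data, so that whether the data to be tested is sensitive data can be judged more accurately. The method is suited to identifying large amounts of data in a database; it removes the need to establish the relationship between fields and data by manually inspecting large amounts of actual data one record at a time, which improves identification efficiency.
Another embodiment of this application provides a method for identifying sensitive data that includes steps 101-104 of the above embodiment; for its implementation principle and technical effect, see the embodiment corresponding to FIG. 1. In this embodiment, step 101, "establish a recognition model", includes: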
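Step 1011: Obtain a sample set. The sample set includes sample fields and one or more sample data corresponding to each sample field. The sample fields include sensitive fields and non-sensitive fields; the sample data corresponding to a sensitive field is sensitive data, and the sample data corresponding to a non-sensitive field is non-sensitive data.
In this embodiment, the sample set is the set of samples used to train the recognition model, and contains sample fields and the corresponding sample data. It may be stored as a database or as a table in a database. For example, a sample library stores the sample set: each field in the library corresponds to one sample field, and the column of data under each field is the corresponding sample data. Some sample fields are sensitive fields, such as "ID card number" and "geographic location"; others are not, such as "serial number" and "weather". Because the volume of sample data is large, and to make it easy to determine whether sample data is sensitive, this embodiment sets the sample data corresponding to a sensitive field as sensitive data and the sample data corresponding to a non-sensitive field as non-sensitive data. The sensitivity of sample data can thus be determined quickly, and a sample set containing a large amount of data can be obtained quickly.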
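The labelling shortcut of step 1011, in which each sample data item inherits the label of its field, can be sketched as below. The dictionary layout and the name `label_samples` are assumptions; the sample values are made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def label_samples(columns: Dict[str, List[str]],
                  sensitive_fields: Set[str]) -> List[Tuple[str, str, bool]]:
    """Give every sample data item the sensitivity label of its sample field."""
    labelled = []
    for field, values in columns.items():
        flag = field in sensitive_fields
        for value in values:
            labelled.append((field, value, flag))
    return labelled


sample_library = {
    "ID card number": ["110105xxxx", "310000xxxx"],
    "weather": ["sunny"],
}
labelled = label_samples(sample_library, {"ID card number"})
print(labelled)
```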
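Step 1012: Train the recognition sub-model on all the sample fields in the sample set to obtain the trained recognition sub-model, and train the classification sub-model on all the sample data to obtain the trained classification sub-model.
Step 1013: Test the trained recognition sub-model and classification sub-model against a test set; when both pass the test, generate the recognition model from the trained recognition sub-model and classification sub-model.
After the sample set is obtained, the sample fields are used to train the recognition sub-model and determine its parameters; likewise, the sample data is used to train the classification sub-model and determine its parameters. This yields the trained recognition sub-model and classification sub-model. The trained models are then tested against the test set to verify their effectiveness.
The test set is a collection of test samples for testing the models. Like the sample set, it includes test fields and one or more test data corresponding to each test field; whether each test field is a sensitive field and whether each test data item is sensitive data are known. Testing the trained sub-models against the test set yields a more accurate recognition model. If the recognition sub-model or the classification sub-model fails the test, training of that sub-model simply continues until both trained sub-models pass. The trained sub-models then serve as the models used subsequently; that is, the sensitivity judgments in step 103 are made with the trained recognition sub-model and classification sub-model.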
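The train-then-test loop of steps 1012 and 1013 might be organized as in the sketch below. The application does not state the pass criterion or a round limit, so `passes_test`, `max_rounds`, and the toy scoring are assumptions introduced only for illustration.

```python
from typing import Callable, Tuple, TypeVar

M = TypeVar("M")


def train_until_pass(train_round: Callable[[int], M],
                     passes_test: Callable[[M], bool],
                     max_rounds: int = 50) -> Tuple[M, int]:
    """Keep training a sub-model until it passes the test set (or give up)."""
    for round_no in range(1, max_rounds + 1):
        model = train_round(round_no)
        if passes_test(model):
            return model, round_no
    raise RuntimeError("sub-model did not pass the test set within max_rounds")


# Toy stand-in: the "model" is just its test score in percent, improving each round.
model, rounds = train_until_pass(lambda r: r * 30, lambda score: score >= 85)
print(model, rounds)
```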
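On the basis of the above embodiment, the process of "training the classification sub-model on all the sample data" may use an existing training method for classifying sensitive data. In this embodiment, however, "training the recognition sub-model on all the sample fields in the sample set" in step 1012 trains the recognition sub-model on the word frequencies of the sample fields, and includes:
Step A1: Perform word segmentation on each sample field in the sample set and determine the word segments of each sample field.
After a sample field is segmented, its word segments are known. For example, segmenting the sample field "手机号码" (mobile phone number) yields two word segments, "手机" (mobile phone) and "号码" (number). Segmentation may be performed with a word segmentation model; this embodiment places no limit on the choice.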
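Step A1 leaves the segmenter open. As one possible stand-in, the sketch below uses dictionary-based forward maximum matching; this technique, the tiny vocabulary, and the name `segment` are assumptions and are not the segmentation model of the application.

```python
from typing import List, Set


def segment(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: take the longest vocabulary word at each position.

    Characters that start no vocabulary word are emitted as single-character segments.
    """
    segments = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                segments.append(piece)
                i += size
                break
    return segments


toy_vocab = {"手机", "号码", "身份证", "姓名"}  # hypothetical dictionary
print(segment("手机号码", toy_vocab))
```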
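Step A2: Take the word segments of all sample fields as a word-segment set, and determine the word frequency of each word segment of a sample field in that set.
After the word segments of every sample field in the sample set have been determined, the overall word-segment set can be generated and the frequency of each word segment in it determined. Because a sample field may correspond to several sample data, the more sensitive data a sample field contains, the more strongly that field is associated with sensitivity; when other fields are judged against it, the field therefore carries a higher weight. Specifically, step A2 includes:
Step A21: Determine, for each sample field in the sample set, the number of corresponding sample data $\omega_i$, where $\omega_i$ is the number of sample data corresponding to the $i$-th sample field, $i \in [1, n]$, and $n$ is the number of sample fields in the sample set.
Each sample field may correspond to one or more sample data, and the weight of a sample field is determined by the number of its sample data. For example, as shown in Table 1 below, the sample set contains three sample fields, "name", "ID card number", and "mobile phone number", with the sample data shown in the table; blank cells indicate that no sample data exists. The field "name" therefore has 4 sample data, "ID card number" has 2, and "mobile phone number" has 3.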
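Table 1
Name               ID card number    Mobile phone number
(value omitted)    110105xxxx        135xxx
(value omitted)    (blank)           (blank)
(value omitted)    310000xxxx        134xxx
(value omitted)    (blank)           186xxx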
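Step A22: Use $\omega_i$ as the weight value for the count of each word segment in the sample field, take all word segments as the word-segment set, and determine the total number of word segments in the set:
$$N = \sum_{i=1}^{n} \omega_i m_i$$
where $N$ is the total number of word segments and $m_i$ is the number of word segments of the $i$-th sample field in the sample set.
Using $\omega_i$ as the weight value for the count of each word segment and taking all word segments as the word-segment set is equivalent to replicating the word segments of the $i$-th sample field so that each of its word segments appears $\omega_i$ times. The $i$-th sample field therefore contributes $\omega_i m_i$ word segments, and the total number of word segments for the $n$ sample fields is
$$\sum_{i=1}^{n} \omega_i m_i$$
For example, as shown in Table 1, the third sample field "手机号码" (mobile phone number) is segmented into two word segments, "手机" (mobile phone) and "号码" (number), so $m_3 = 2$; because this sample field has 3 corresponding sample data, $\omega_3 = 3$. When the word segments of "手机号码" are added to the word-segment set, the sample field is in effect repeated $\omega_3 = 3$ times, so 6 word segments are added in total: {"手机", "号码", "手机", "号码", "手机", "号码"}. Setting the weight value $\omega_i$ for each sample field raises the weight of sample fields that have many sample data and further improves the accuracy of sensitive-field recognition.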
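The weighted word-segment set of step A22 can be sketched as below, using the counts from Table 1. The segmentation of "姓名" into one segment and of "身份证号码" into "身份证" and "号码" is an assumption made for illustration; the text only states the segmentation of "手机号码".

```python
from typing import List, Tuple

# (word segments of the sample field, number of sample data, i.e. the weight)
SampleField = Tuple[List[str], int]


def expand_segments(fields: List[SampleField]) -> List[str]:
    """Build the word-segment set by repeating each field's segments `weight` times."""
    bag: List[str] = []
    for segments, weight in fields:
        bag.extend(segments * weight)
    return bag


def total_segments(fields: List[SampleField]) -> int:
    """N = sum over fields of weight * number of segments."""
    return sum(weight * len(segments) for segments, weight in fields)


table1 = [
    (["姓名"], 4),            # name (assumed to be a single segment)
    (["身份证", "号码"], 2),   # ID card number (assumed segmentation)
    (["手机", "号码"], 3),     # mobile phone number (segmentation given in the text)
]
print(expand_segments([table1[2]]))
print(total_segments(table1))
```

Under these assumptions the two functions agree: the length of the expanded set equals the closed-form total.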
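Step A23: Determine the word frequency of each word segment $a_{ij}$ of a sample field in the word-segment set:
$$f_{ij} = \frac{\sum_{k} \omega_k \lambda_k}{N}$$
where $f_{ij}$ is the word frequency of the $j$-th word segment $a_{ij}$ of the $i$-th sample field, $j \in [1, m_i]$; $k$ indexes the sample fields that contain the word segment $a_{ij}$; $\omega_k$ is the weight value for the word-segment count of the $k$-th sample field; and $\lambda_k$ is the number of times $a_{ij}$ occurs in the $k$-th sample field.
When computing the word frequency of the $j$-th word segment $a_{ij}$ of the $i$-th sample field, the weights must again be taken into account: all sample fields in the sample set that contain $a_{ij}$ are counted, each with its own field weight. Specifically, if the $k$-th sample field contains $a_{ij}$ (that is, $k$ indexes the sample fields that contain $a_{ij}$), the weight value $\omega_k$ of the $k$-th sample field determines how many copies of $a_{ij}$ that field contributes. Because one sample field may contain the same word segment more than once, say $\lambda_k$ times, the $k$-th sample field contributes $\omega_k \lambda_k$ copies of $a_{ij}$ in total. Sample fields are usually short and rarely contain repeated word segments, so $\lambda_k$ can default to 1 to simplify the computation. Since the $i$-th sample field necessarily contains $a_{ij}$, one value of $k$ is always $i$; the other values of $k$ depend on the actual data.
For example, as shown in Table 1, to compute the word frequency of the second word segment "号码" (number) of the third sample field "手机号码" (mobile phone number), that is, of $a_{32}$, one value of $k$ is 3. The second sample field also contains the word segment "号码", so the other value of $k$ is 2. Each of the two sample fields contains "号码" only once, so $\lambda_2$ and $\lambda_3$ are both 1. The word frequency of $a_{32}$ is therefore
$$f_{32} = \frac{\omega_2 \lambda_2 + \omega_3 \lambda_3}{N} = \frac{2 \times 1 + 3 \times 1}{N} = \frac{5}{N}$$
The word frequencies of the other word segments of the sample fields are computed in the same way.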
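A minimal sketch of the weighted word frequency of step A23, applied to Table 1. As before, the segmentations of "姓名" and "身份证号码" are assumptions, so the resulting value of $N$ (14) and the frequency 5/14 hold only under those assumptions.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, List, Tuple


def word_frequencies(fields: List[Tuple[List[str], int]]) -> Dict[str, Fraction]:
    """Weighted frequency of each segment: sum_k(weight_k * occurrences_k) / N."""
    total = sum(weight * len(segments) for segments, weight in fields)
    counts: Counter = Counter()
    for segments, weight in fields:
        for seg in segments:
            counts[seg] += weight  # each occurrence in a field counts `weight` times
    return {seg: Fraction(count, total) for seg, count in counts.items()}


table1 = [
    (["姓名"], 4),            # name (assumed to be a single segment)
    (["身份证", "号码"], 2),   # ID card number (assumed segmentation)
    (["手机", "号码"], 3),     # mobile phone number
]
freqs = word_frequencies(table1)
print(freqs["号码"])
```

Exact fractions are used so that the frequencies can be checked to sum to one without floating-point error.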
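This embodiment sets a weight value for each sample field to raise the weight of sample fields that have many sample data. The weight is used both when determining the total number of word segments and when counting each word segment in the word-segment set, so that the word frequencies better reflect the characteristics of the sample set, the resulting recognition sub-model is more accurate, and the accuracy of sensitive-field recognition improves further.
Step A3: Generate the feature vector of each sample field from the word frequencies of its word segments, and train the recognition sub-model on the feature vectors of the sample fields.
After the word frequencies of the word segments are determined, the corresponding feature vector can be generated and used as the input for training the model. For example, the sensitivity feature vector of the $i$-th sample field is
[Figure PCTCN2019103529-appb-000005: expression for the feature vector; not recoverable from the extracted text]
Training the recognition sub-model on the semantics of the sample field itself (its segmentation result) together with the word frequencies over the whole set of sample fields makes it easier to capture the characteristics of sensitive-data fields. For example, ID card numbers and mobile phone numbers are generally sensitive data, so the word segment "number" has a higher probability of being regarded as a word segment of a sensitive field.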
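The exact form of the feature vector is not recoverable from the extracted text. One plausible construction, shown here purely as an assumption, is a fixed-length vector over the vocabulary of all word segments whose entry is the segment's word frequency when the field contains that segment and 0 otherwise. The frequencies reuse the Table 1 values under the assumed segmentation.

```python
from typing import Dict, List


def feature_vector(segments: List[str], freqs: Dict[str, float],
                   vocabulary: List[str]) -> List[float]:
    """Assumed layout: one slot per vocabulary segment, filled with its word frequency."""
    present = set(segments)
    return [freqs.get(seg, 0.0) if seg in present else 0.0 for seg in vocabulary]


vocabulary = ["姓名", "身份证", "号码", "手机"]
freqs = {"姓名": 4 / 14, "身份证": 2 / 14, "号码": 5 / 14, "手机": 3 / 14}

vec = feature_vector(["手机", "号码"], freqs, vocabulary)  # field "手机号码"
print(vec)
```

A field to be tested can be encoded with the same function in step 103; segments never seen in training simply contribute zeros.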
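Those skilled in the art will understand that when the field to be tested is judged in step 103, the procedure is similar to steps A1-A3: the field to be tested is segmented, the word frequencies are determined, the feature vector of the field is generated, and the recognition sub-model then makes its judgment. This is similar to determining the feature vector of a field during training and is not repeated here.
On the basis of the above embodiment, after step 103 the method further includes:
Step B1: When the data to be tested is sensitive data but the field to be tested is not a sensitive field, obtain multiple other data corresponding to the field to be tested, and judge, according to the classification sub-model, whether each of the other data is sensitive data.
Step B2: When more than a preset number or preset proportion of the other data is sensitive data, mark the field to be tested as a sensitive field, and train the recognition model with the marked information to be tested as a sample.
Because the number of sample fields in the sample set is far smaller than the number of sample data, the recognition sub-model is generally less accurate than the classification sub-model. When the data to be tested is sensitive data but the field to be tested is not judged to be a sensitive field, the field therefore needs to be examined further. The field to be tested is a field in a database and, besides the data to be tested, corresponds to other data as well; this embodiment further determines whether the field is a sensitive field by judging whether that other data is sensitive data. As in step B2, when more than a preset number or preset proportion of the other data is sensitive data, the field contains a large amount of sensitive data and should itself be a sensitive field. Because the existing recognition sub-model fails to recognize the field correctly (it considers the field non-sensitive), this embodiment continues to train the recognition sub-model of the recognition model with the information to be tested as a sample, which corrects the sub-model and improves its accuracy. Through continued learning and optimization, the accuracy of the recognition model improves gradually, and a practical recognition model is eventually established.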
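The threshold check of steps B1 and B2 can be sketched as follows. The parameter names `min_count` and `min_ratio` stand for the preset number and preset proportion and are assumptions, as is the use of `str.isdigit` as a stand-in classification sub-model.

```python
from typing import Callable, Optional, Sequence


def exceeds_threshold(other_values: Sequence[str],
                      data_model: Callable[[str], bool],
                      min_count: Optional[int] = None,
                      min_ratio: Optional[float] = None) -> bool:
    """True if MORE than a preset number or preset proportion of values are sensitive."""
    hits = sum(1 for value in other_values if data_model(value))
    if min_count is not None and hits > min_count:
        return True
    if min_ratio is not None and other_values and hits / len(other_values) > min_ratio:
        return True
    return False


column = ["135000", "134000", "n/a", "186000"]  # made-up other data of the field
relabel_field = exceeds_threshold(column, str.isdigit, min_ratio=0.5)
print(relabel_field)
```

When the check succeeds, the field would be marked sensitive and the marked pair fed back as a training sample; the retraining itself is outside this sketch.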
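In addition, if after step B1 no more than a preset number or preset proportion of the other data is sensitive data, the field to be tested contains only a small amount of sensitive data. The field can then be treated as a non-sensitive field, and the identification result for the information to be tested is output: the field to be tested is a non-sensitive field and the data to be tested is sensitive data. Optionally, the sensitivity of the field to be tested can be judged once more through other fields related to it. Specifically, when no more than a preset number or preset proportion of the other data is sensitive data, the method further includes:
Step B3: Query whether a target field exists such that a part of the data to be tested is identical to the corresponding data of the target field, and a part of the other data of the field to be tested is likewise identical to the corresponding other data of the target field. When the target field exists and is a sensitive field, mark the field to be tested as a sensitive field, and train the recognition model with the marked information to be tested as a sample.
This embodiment determines whether the field to be tested is a sensitive field by judging whether its data contains other sensitive data. If a target field exists and a part of the data in the field to be tested is identical to the data of the target field, the data of the field to be tested contains the data of the target field; that is, the field to be tested contains the target field. If the target field is a sensitive field, its data is sensitive data as well, and because the field to be tested contains the target field, the field to be tested and its data are also sensitive. For example, suppose the field to be tested is "ID card number" and a target field "date of birth" exists. Part of an ID card number encodes the date of birth, so part of the data in the field to be tested is identical to the "date of birth" data; if "date of birth" is a sensitive field, "ID card number" should also be set as a sensitive field. Querying for a target field contained in the field to be tested corrects the result of the recognition sub-model and further improves the accuracy with which the field to be tested is recognized.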
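The target-field query of step B3 can be sketched as a row-wise containment check. Requiring every row to match, the function name `find_target_field`, and the made-up values below are assumptions; the application only requires that a part of the data matches.

```python
from typing import Dict, List, Optional, Set


def find_target_field(tested_values: List[str],
                      candidate_columns: Dict[str, List[str]],
                      sensitive_fields: Set[str]) -> Optional[str]:
    """Return a sensitive field whose value in each row is contained in the tested value."""
    for name, values in candidate_columns.items():
        if name not in sensitive_fields:
            continue
        pairs = list(zip(tested_values, values))
        if pairs and all(value and value in tested for tested, value in pairs):
            return name
    return None


# Made-up rows: each ID card number embeds the date of birth of the same row.
id_numbers = ["110105199001011234", "310000198512250011"]
other_columns = {
    "weather": ["sunny", "rain"],
    "date of birth": ["19900101", "19851225"],
}
target = find_target_field(id_numbers, other_columns, {"date of birth"})
print(target)
```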
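On the basis of the above embodiment, after step 103 the method further includes:
Step C1: When the data to be tested is not sensitive data but the field to be tested is a sensitive field, obtain multiple other data corresponding to the field to be tested, and judge, according to the classification sub-model, whether each of the other data is sensitive data.
Step C2: When more than a preset number or preset proportion of the other data is sensitive data, mark the data to be tested as sensitive data, and train the recognition model with the marked information to be tested as a sample.
If the data to be tested is not sensitive data but the field to be tested is a sensitive field, then, as in steps B1-B2 of the above embodiment, the result of the classification sub-model can be corrected by judging whether the other data in the field is sensitive data. If more than a preset number or preset proportion of the other data is sensitive data, the data to be tested is very likely sensitive data; it is marked as sensitive data, and the marked information to be tested is used as a sample to train the recognition model. This corrects the classification sub-model and improves its accuracy. The two judgments made by the recognition sub-model and the classification sub-model also verify each other, which further improves recognition accuracy.
Optionally, when no more than a preset number or preset proportion of the other data is sensitive data, then, similarly to step B3 of the above embodiment, after step C1 the method further includes:
Step C3: Query whether a target field exists such that a part of the data to be tested is identical to the corresponding data of the target field, and a part of the other data of the field to be tested is likewise identical to the corresponding other data of the target field. When the target field exists and is a sensitive field, mark the data to be tested as sensitive data, and train the recognition model with the marked information to be tested as a sample. Here too, although the field to be tested has been determined to be a sensitive field, only a small amount of its data is sensitive data, so the data of the field needs further identification and verification, which is done by querying for a target field. That is, if a target field exists and all the data in the target field is sensitive data, the data to be tested of the field to be tested is also treated as sensitive data, so that the information to be tested is identified accurately.
On the basis of the above embodiment, if the data to be tested is not sensitive data and the field to be tested is not a sensitive field, the sensitivity of the information to be tested can also be judged further by a process similar to steps B1-B3 and steps C1-C3 of the above embodiment. For example: obtain multiple other data corresponding to the field to be tested, and judge, according to the classification sub-model, whether each of the other data is sensitive data; when more than a preset number or preset proportion of the other data is sensitive data, mark the data to be tested as sensitive data and the field to be tested as a sensitive field, and train the recognition model with the marked information to be tested as a sample; when no more than a preset number or preset proportion of the other data is sensitive data, determine whether the field to be tested is a sensitive field and whether the data to be tested is sensitive data by querying for a target field.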
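The four combinations of the two judgments, and the follow-up procedure each one triggers, can be summarized in a small dispatcher. The returned labels are hypothetical names introduced only for this sketch.

```python
def next_action(field_sensitive: bool, data_sensitive: bool) -> str:
    """Map the two sub-model judgments to the follow-up described in the text."""
    if field_sensitive and data_sensitive:
        return "report_sensitive"        # step 104
    if data_sensitive:
        return "recheck_field"           # steps B1-B3
    if field_sensitive:
        return "recheck_data"            # steps C1-C3
    return "recheck_field_and_data"      # optional process similar to B1-B3 and C1-C3


for flags in [(True, True), (False, True), (True, False), (False, False)]:
    print(flags, next_action(*flags))
```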
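The method for identifying sensitive data provided by the embodiments of this application exploits the fact that data in a database carries field attributes, and adds a field-recognition step to sensitive-data identification. A recognition sub-model and a classification sub-model judge the field to be tested and the data to be tested separately, and whether the information to be tested is sensitive is determined along two dimensions, field and data, so that whether the data to be tested is sensitive data can be judged more accurately. The method is suited to identifying large amounts of data in a database; it removes the need to establish the relationship between fields and data by manually inspecting large amounts of actual data one record at a time, which improves identification efficiency. Sample data corresponding to a sensitive field is set as sensitive data and sample data corresponding to a non-sensitive field is set as non-sensitive data, so the sensitivity of sample data can be determined quickly and a sample set containing a large amount of data can be obtained quickly. Setting a weight value for each sample field raises the weight of sample fields that have many sample data; the weight is used both when determining the total number of word segments and when counting each word segment in the word-segment set, so that the word frequencies better reflect the characteristics of the sample set, the resulting recognition sub-model is more accurate, and sensitive-field recognition improves further. The two judgments made by the recognition sub-model and the classification sub-model verify each other, which further improves accuracy. Continuing to train the recognition sub-model of the recognition model with the information to be tested as a sample corrects the sub-model and improves its accuracy. Through continued learning and optimization, the accuracy of the recognition model improves gradually, and a practical recognition model is eventually established.
The method for identifying sensitive data has been described in detail above. The method can also be implemented by a corresponding apparatus, whose structure and function are described in detail below. An embodiment of this application also provides an apparatus for identifying sensitive data which, as shown in FIG. 3, includes: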
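a model module 31 for establishing a recognition model, the recognition model including a recognition sub-model for recognizing whether a field is a sensitive field and a classification sub-model for distinguishing sensitive data from non-sensitive data; an acquisition module 32 for obtaining information to be tested, the information to be tested including a field to be tested and data to be tested corresponding to the field to be tested; a judgment module 33 for judging, according to the recognition sub-model, whether the field to be tested is a sensitive field, and judging, according to the classification sub-model, whether the data to be tested is sensitive data; and an identification processing module 34 for determining that the information to be tested is sensitive information when the field to be tested is a sensitive field and the data to be tested is sensitive data.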
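On the basis of the above embodiment, the model module includes: a sample acquisition unit for obtaining a sample set, the sample set including sample fields and one or more sample data corresponding to each sample field, where the sample fields include sensitive fields and non-sensitive fields, the sample data corresponding to a sensitive field is sensitive data, and the sample data corresponding to a non-sensitive field is non-sensitive data; a training unit for training the recognition sub-model on all the sample fields in the sample set to obtain the trained recognition sub-model, and training the classification sub-model on all the sample data to obtain the trained classification sub-model; and a test unit for testing the trained recognition sub-model and classification sub-model against a test set and, when both pass the test, generating the recognition model from the trained recognition sub-model and classification sub-model.
On the basis of the above embodiment, the training unit includes: a word segmentation subunit for performing word segmentation on each sample field in the sample set and determining the word segments of each sample field; a processing subunit for taking the word segments of all sample fields as a word-segment set and determining the word frequency of each word segment of a sample field in that set; and a training subunit for generating the feature vector of each sample field from the word frequencies of its word segments and training the recognition sub-model on the feature vectors of the sample fields.
On the basis of the above embodiment, the processing subunit is specifically configured to: determine, for each sample field in the sample set, the number of corresponding sample data $\omega_i$, where $\omega_i$ is the number of sample data corresponding to the $i$-th sample field, $i \in [1, n]$, and $n$ is the number of sample fields in the sample set; use $\omega_i$ as the weight value for the count of each word segment in the sample field, take all word segments as the word-segment set, and determine the total number of word segments in the set:
$$N = \sum_{i=1}^{n} \omega_i m_i$$
where $N$ is the total number of word segments and $m_i$ is the number of word segments of the $i$-th sample field in the sample set; and determine the word frequency of each word segment $a_{ij}$ of a sample field in the word-segment set:
$$f_{ij} = \frac{\sum_{k} \omega_k \lambda_k}{N}$$
where $f_{ij}$ is the word frequency of the $j$-th word segment $a_{ij}$ of the $i$-th sample field, $j \in [1, m_i]$; $k$ indexes the sample fields that contain the word segment $a_{ij}$; $\omega_k$ is the weight value for the word-segment count of the $k$-th sample field; and $\lambda_k$ is the number of times $a_{ij}$ occurs in the $k$-th sample field.
On the basis of the above embodiment, the identification processing module 34 is further configured to: when the data to be tested is sensitive data but the field to be tested is not a sensitive field, obtain multiple other data corresponding to the field to be tested, and judge, according to the classification sub-model, whether each of the other data is sensitive data; and, when more than a preset number or preset proportion of the other data is sensitive data, mark the field to be tested as a sensitive field and train the recognition model with the marked information to be tested as a sample. On the basis of the above embodiment, the identification processing module 34 is further configured to: when the data to be tested is not sensitive data but the field to be tested is a sensitive field, obtain multiple other data corresponding to the field to be tested, and judge, according to the classification sub-model, whether each of the other data is sensitive data; and, when more than a preset number or preset proportion of the other data is sensitive data, mark the data to be tested as sensitive data and train the recognition model with the marked information to be tested as a sample. On the basis of the above embodiment, when no more than a preset number or preset proportion of the other data is sensitive data, the identification processing module 34 is further configured to: query whether a target field exists such that a part of the data to be tested is identical to the corresponding data of the target field, and a part of the other data of the field to be tested is likewise identical to the corresponding other data of the target field; and, when the target field exists and is a sensitive field, mark the field to be tested as a sensitive field and/or mark the data to be tested as sensitive data, and train the recognition model with the marked information to be tested as a sample.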
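The apparatus for identifying sensitive data provided by the embodiments of this application exploits the fact that data in a database carries field attributes, and adds a field-recognition step to sensitive-data identification. A recognition sub-model and a classification sub-model judge the field to be tested and the data to be tested separately, and whether the information to be tested is sensitive is determined along two dimensions, field and data, so that whether the data to be tested is sensitive data can be judged more accurately. The approach is suited to identifying large amounts of data in a database; it removes the need to establish the relationship between fields and data by manually inspecting large amounts of actual data one record at a time, which improves identification efficiency. Sample data corresponding to a sensitive field is set as sensitive data and sample data corresponding to a non-sensitive field is set as non-sensitive data, so the sensitivity of sample data can be determined quickly and a sample set containing a large amount of data can be obtained quickly. Setting a weight value for each sample field raises the weight of sample fields that have many sample data; the weight is used both when determining the total number of word segments and when counting each word segment in the word-segment set, so that the word frequencies better reflect the characteristics of the sample set, the resulting recognition sub-model is more accurate, and sensitive-field recognition improves further. The two judgments made by the recognition sub-model and the classification sub-model verify each other, which further improves accuracy. Continuing to train the recognition sub-model of the recognition model with the information to be tested as a sample corrects the sub-model and improves its accuracy. Through continued learning and optimization, the accuracy of the recognition model improves gradually, and a practical recognition model is eventually established.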
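An embodiment of this application also provides a computer storage medium that stores computer-executable instructions containing a program for executing the above method for identifying sensitive data; the computer-executable instructions can execute the method in any of the above method embodiments. The computer storage medium may be any available medium or data storage device that a computer can access, including but not limited to magnetic storage (such as floppy disk, hard disk, magnetic tape, and magneto-optical disk (MO)), optical storage (such as CD, DVD, BD, and HVD), and semiconductor memory (such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), and solid state drive (SSD)). FIG. 4 shows a structural block diagram of a computer device according to another embodiment of this application. The computer device 1100 may be a host server with computing capability, a personal computer (PC), a portable computer, a terminal, or the like. The specific embodiments of this application do not limit the specific implementation of the computer device. The computer device 1100 includes at least one processor 1110, a communications interface 1120, a memory (memory array) 1130, and a bus 1140. The processor 1110, the communications interface 1120, and the memory 1130 communicate with each other through the bus 1140. The communications interface 1120 is used to communicate with network elements, which include, for example, a virtual machine management center and shared storage. The processor 1110 is used to execute programs. The processor 1110 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application. The memory 1130 is used to store executable instructions. The memory 1130 may include high-speed RAM, and may also include non-volatile memory, for example at least one disk memory. The memory 1130 may also be a memory array. The memory 1130 may also be divided into blocks, and the blocks may be combined into a virtual volume according to certain rules. The instructions stored in the memory 1130 can be executed by the processor 1110, so that the processor 1110 can execute the method in any of the above method embodiments. Obviously, those skilled in the art can make various changes and variations to this application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of this application and their technical equivalents, this application is intended to include them as well.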

Claims (20)

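  1. A method for identifying sensitive data, comprising:
    establishing a recognition model, the recognition model comprising a recognition sub-model for recognizing whether a field is a sensitive field and a classification sub-model for distinguishing sensitive data from non-sensitive data;
    obtaining information to be tested, the information to be tested comprising a field to be tested and data to be tested corresponding to the field to be tested;
    judging, according to the recognition sub-model, whether the field to be tested is a sensitive field, and judging, according to the classification sub-model, whether the data to be tested is sensitive data; and
    when the field to be tested is a sensitive field and the data to be tested is sensitive data, determining that the information to be tested is sensitive information.
  2. The method according to claim 1, wherein establishing the recognition model comprises:
    obtaining a sample set, the sample set comprising sample fields and one or more sample data corresponding to each sample field, wherein the sample fields comprise sensitive fields and non-sensitive fields, the sample data corresponding to a sensitive field is sensitive data, and the sample data corresponding to a non-sensitive field is non-sensitive data;
    training the recognition sub-model on all the sample fields in the sample set to determine the trained recognition sub-model, and training the classification sub-model on all the sample data to determine the trained classification sub-model; and
    testing the trained recognition sub-model and classification sub-model against a test set, and, when the recognition sub-model and the classification sub-model pass the test, generating the recognition model from the trained recognition sub-model and classification sub-model.
  3. The method according to claim 2, wherein training the recognition sub-model on all the sample fields in the sample set comprises:
    performing word segmentation on each sample field in the sample set to determine the word segments of each sample field;
    taking the word segments of all the sample fields as a word-segment set, and determining the word frequency of each word segment of a sample field in the word-segment set; and
    generating the feature vector of each sample field from the word frequencies of its word segments, and training the recognition sub-model on the feature vectors of the sample fields.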
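  4. The method according to claim 3, wherein taking the word segments of all the sample fields as a word-segment set and determining the word frequency of each word segment of a sample field in the word-segment set comprises:
    determining, for each sample field in the sample set, the number of corresponding sample data $\omega_i$, where $\omega_i$ is the number of sample data corresponding to the $i$-th sample field, $i \in [1, n]$, and $n$ is the number of sample fields in the sample set;
    using $\omega_i$ as the weight value for the count of each word segment in the sample field, taking all word segments as the word-segment set, and determining the total number of word segments in the word-segment set:
    $$N = \sum_{i=1}^{n} \omega_i m_i$$
    where $N$ is the total number of word segments and $m_i$ is the number of word segments of the $i$-th sample field in the sample set; and
    determining the word frequency of each word segment $a_{ij}$ of a sample field in the word-segment set:
    $$f_{ij} = \frac{\sum_{k} \omega_k \lambda_k}{N}$$
    where $f_{ij}$ is the word frequency of the $j$-th word segment $a_{ij}$ of the $i$-th sample field, $j \in [1, m_i]$; $k$ indexes the sample fields that contain the word segment $a_{ij}$; $\omega_k$ is the weight value for the word-segment count of the $k$-th sample field; and $\lambda_k$ is the number of times $a_{ij}$ occurs in the $k$-th sample field.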
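  5. The method according to claim 1, further comprising:
    when the data to be tested is sensitive data but the field to be tested is not a sensitive field, obtaining multiple other data corresponding to the field to be tested, and judging, according to the classification sub-model, whether each of the other data is sensitive data; and
    when more than a preset number or preset proportion of the other data is sensitive data, marking the field to be tested as a sensitive field, and training the recognition model with the marked information to be tested as a sample.
  6. The method according to claim 1, further comprising:
    when the data to be tested is not sensitive data but the field to be tested is a sensitive field, obtaining multiple other data corresponding to the field to be tested, and judging, according to the classification sub-model, whether each of the other data is sensitive data; and
    when more than a preset number or preset proportion of the other data is sensitive data, marking the data to be tested as sensitive data, and training the recognition model with the marked information to be tested as a sample.
  7. The method according to claim 5, further comprising, when no more than a preset number or preset proportion of the other data is sensitive data:
    querying whether a target field exists such that a part of the data to be tested is identical to the corresponding data of the target field, and a part of the other data of the field to be tested is likewise identical to the corresponding other data of the target field; and
    when the target field exists and is a sensitive field, marking the field to be tested as a sensitive field and/or marking the data to be tested as sensitive data, and training the recognition model with the marked information to be tested as a sample.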
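  8. An apparatus for identifying sensitive data, comprising:
    a model module for establishing a recognition model, the recognition model comprising a recognition sub-model for recognizing whether a field is a sensitive field and a classification sub-model for distinguishing sensitive data from non-sensitive data;
    an acquisition module for obtaining information to be tested, the information to be tested comprising a field to be tested and data to be tested corresponding to the field to be tested;
    a judgment module for judging, according to the recognition sub-model, whether the field to be tested is a sensitive field, and judging, according to the classification sub-model, whether the data to be tested is sensitive data; and
    an identification processing module for determining that the information to be tested is sensitive information when the field to be tested is a sensitive field and the data to be tested is sensitive data.
  9. The apparatus according to claim 8, wherein the model module comprises:
    a sample acquisition unit for obtaining a sample set, the sample set comprising sample fields and one or more sample data corresponding to each sample field, wherein the sample fields comprise sensitive fields and non-sensitive fields, the sample data corresponding to a sensitive field is sensitive data, and the sample data corresponding to a non-sensitive field is non-sensitive data;
    a training unit for training the recognition sub-model on all the sample fields in the sample set to determine the trained recognition sub-model, and training the classification sub-model on all the sample data to determine the trained classification sub-model; and
    a test unit for testing the trained recognition sub-model and classification sub-model against a test set, and, when the recognition sub-model and the classification sub-model pass the test, generating the recognition model from the trained recognition sub-model and classification sub-model.
  10. The apparatus according to claim 9, wherein the training unit comprises:
    a word segmentation subunit for performing word segmentation on each sample field in the sample set to determine the word segments of each sample field;
    a processing subunit for taking the word segments of all the sample fields as a word-segment set and determining the word frequency of each word segment of a sample field in the word-segment set; and
    a training subunit for generating the feature vector of each sample field from the word frequencies of its word segments, and training the recognition sub-model on the feature vectors of the sample fields.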
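  11. The apparatus according to claim 10, wherein the processing subunit is specifically configured to determine, for each sample field in the sample set, the number of corresponding sample data $\omega_i$, where $\omega_i$ is the number of sample data corresponding to the $i$-th sample field, $i \in [1, n]$, and $n$ is the number of sample fields in the sample set; to use $\omega_i$ as the weight value for the count of each word segment in the sample field, take all word segments as the word-segment set, and determine the total number of word segments in the word-segment set:
    $$N = \sum_{i=1}^{n} \omega_i m_i$$
    where $N$ is the total number of word segments and $m_i$ is the number of word segments of the $i$-th sample field in the sample set; and to determine the word frequency of each word segment $a_{ij}$ of a sample field in the word-segment set:
    $$f_{ij} = \frac{\sum_{k} \omega_k \lambda_k}{N}$$
    where $f_{ij}$ is the word frequency of the $j$-th word segment $a_{ij}$ of the $i$-th sample field, $j \in [1, m_i]$; $k$ indexes the sample fields that contain the word segment $a_{ij}$; $\omega_k$ is the weight value for the word-segment count of the $k$-th sample field; and $\lambda_k$ is the number of times $a_{ij}$ occurs in the $k$-th sample field.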
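  12. The apparatus according to claim 8, wherein the identification processing module is further configured to: when the data to be tested is sensitive data but the field to be tested is not a sensitive field, obtain multiple other data corresponding to the field to be tested, and judge, according to the classification sub-model, whether each of the other data is sensitive data; and, when more than a preset number or preset proportion of the other data is sensitive data, mark the field to be tested as a sensitive field, and train the recognition model with the marked information to be tested as a sample.
  13. The apparatus according to claim 8, wherein the identification processing module is further configured to: when the data to be tested is not sensitive data but the field to be tested is a sensitive field, obtain multiple other data corresponding to the field to be tested, and judge, according to the classification sub-model, whether each of the other data is sensitive data; and, when more than a preset number or preset proportion of the other data is sensitive data, mark the data to be tested as sensitive data, and train the recognition model with the marked information to be tested as a sample.
  14. The apparatus according to claim 12, wherein, when no more than a preset number or preset proportion of the other data is sensitive data, the identification processing module is further configured to: query whether a target field exists such that a part of the data to be tested is identical to the corresponding data of the target field, and a part of the other data of the field to be tested is likewise identical to the corresponding other data of the target field; and, when the target field exists and is a sensitive field, mark the field to be tested as a sensitive field and/or mark the data to be tested as sensitive data, and train the recognition model with the marked information to be tested as a sample.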
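  15. A non-volatile readable storage medium storing computer-readable instructions which, when executed by one or more processors, implement a method for identifying sensitive data, comprising: establishing a recognition model, the recognition model comprising a recognition sub-model for recognizing whether a field is a sensitive field and a classification sub-model for distinguishing sensitive data from non-sensitive data; obtaining information to be tested, the information to be tested comprising a field to be tested and data to be tested corresponding to the field to be tested; judging, according to the recognition sub-model, whether the field to be tested is a sensitive field, and judging, according to the classification sub-model, whether the data to be tested is sensitive data; and, when the field to be tested is a sensitive field and the data to be tested is sensitive data, determining that the information to be tested is sensitive information.
  16. The non-volatile readable storage medium according to claim 15, wherein, when the computer-readable instructions are executed by the processor, establishing the recognition model comprises: obtaining a sample set, the sample set comprising sample fields and one or more sample data corresponding to each sample field, wherein the sample fields comprise sensitive fields and non-sensitive fields, the sample data corresponding to a sensitive field is sensitive data, and the sample data corresponding to a non-sensitive field is non-sensitive data; training the recognition sub-model on all the sample fields in the sample set to determine the trained recognition sub-model, and training the classification sub-model on all the sample data to determine the trained classification sub-model; and testing the trained recognition sub-model and classification sub-model against a test set, and, when the recognition sub-model and the classification sub-model pass the test, generating the recognition model from the trained recognition sub-model and classification sub-model.
  17. The non-volatile readable storage medium according to claim 16, wherein, when the computer-readable instructions are executed by the processor, training the recognition sub-model on all the sample fields in the sample set comprises: performing word segmentation on each sample field in the sample set to determine the word segments of each sample field; taking the word segments of all the sample fields as a word-segment set, and determining the word frequency of each word segment of a sample field in the word-segment set; and generating the feature vector of each sample field from the word frequencies of its word segments, and training the recognition sub-model on the feature vectors of the sample fields.
  18. A computer device comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements a method for identifying sensitive data, comprising: establishing a recognition model, the recognition model comprising a recognition sub-model for recognizing whether a field is a sensitive field and a classification sub-model for distinguishing sensitive data from non-sensitive data; obtaining information to be tested, the information to be tested comprising a field to be tested and data to be tested corresponding to the field to be tested; judging, according to the recognition sub-model, whether the field to be tested is a sensitive field, and judging, according to the classification sub-model, whether the data to be tested is sensitive data; and, when the field to be tested is a sensitive field and the data to be tested is sensitive data, determining that the information to be tested is sensitive information.
  19. The computer device according to claim 18, wherein, when the computer-readable instructions are executed by the processor, establishing the recognition model comprises: obtaining a sample set, the sample set comprising sample fields and one or more sample data corresponding to each sample field, wherein the sample fields comprise sensitive fields and non-sensitive fields, the sample data corresponding to a sensitive field is sensitive data, and the sample data corresponding to a non-sensitive field is non-sensitive data; training the recognition sub-model on all the sample fields in the sample set to determine the trained recognition sub-model, and training the classification sub-model on all the sample data to determine the trained classification sub-model; and testing the trained recognition sub-model and classification sub-model against a test set, and, when the recognition sub-model and the classification sub-model pass the test, generating the recognition model from the trained recognition sub-model and classification sub-model.
  20. The computer device according to claim 19, wherein, when the computer-readable instructions are executed by the processor, training the recognition sub-model on all the sample fields in the sample set comprises: performing word segmentation on each sample field in the sample set to determine the word segments of each sample field; taking the word segments of all the sample fields as a word-segment set, and determining the word frequency of each word segment of a sample field in the word-segment set; and generating the feature vector of each sample field from the word frequencies of its word segments, and training the recognition sub-model on the feature vectors of the sample fields.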
PCT/CN2019/103529 2019-04-25 2019-08-30 Method, apparatus, storage medium, and computer device for identifying sensitive data WO2020215571A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910337266.8A CN110222170B (zh) 2019-04-25 2019-04-25 Method, apparatus, storage medium, and computer device for identifying sensitive data
CN201910337266.8 2019-04-25

Publications (1)

Publication Number Publication Date
WO2020215571A1 true WO2020215571A1 (zh) 2020-10-29

Family

ID=67819891

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103529 WO2020215571A1 (zh) 2019-04-25 2019-08-30 Method, apparatus, storage medium, and computer device for identifying sensitive data

Country Status (2)

Country Link
CN (1) CN110222170B (zh)
WO (1) WO2020215571A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580094A (zh) * 2020-12-14 2021-03-30 京东数字科技控股股份有限公司 数据处理方法、电子设备以及存储介质
CN113157854A (zh) * 2021-01-22 2021-07-23 奇安信科技集团股份有限公司 Api的敏感数据泄露检测方法及系统
CN113486392A (zh) * 2021-06-07 2021-10-08 四川新网银行股份有限公司 一种基于大数据平台的敏感数据识别与脱敏方法
CN113688837A (zh) * 2021-09-29 2021-11-23 平安科技(深圳)有限公司 图像脱敏方法、装置、电子设备及计算机可读存储介质
CN116090006A (zh) * 2023-02-01 2023-05-09 北京三维天地科技股份有限公司 一种基于深度学习的敏感识别方法及系统
CN117009596A (zh) * 2023-06-28 2023-11-07 国网冀北电力有限公司信息通信分公司 一种电网敏感数据的识别方法及装置
CN117391076A (zh) * 2023-12-11 2024-01-12 东亚银行(中国)有限公司 敏感数据的识别模型的获取方法、装置、电子设备及介质
CN117421730A (zh) * 2023-09-11 2024-01-19 暨南大学 一种基于集成学习的代码片段敏感信息检测方法

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
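CN110598115A (zh) * 2019-09-18 2019-12-20 北京市博汇科技股份有限公司 Sensitive web page identification method and system based on multiple artificial intelligence engines
CN112528315A (zh) * 2019-09-19 2021-03-19 华为技术有限公司 Method and apparatus for identifying sensitive data
CN110674414A (zh) * 2019-09-20 2020-01-10 北京字节跳动网络技术有限公司 Target information identification method, apparatus, device, and storage medium
CN110750981A (zh) * 2019-10-16 2020-02-04 杭州安恒信息技术股份有限公司 High-accuracy website sensitive word detection method based on machine learning
CN111079185B (zh) * 2019-12-20 2022-12-30 医渡云(北京)技术有限公司 Database information processing method, apparatus, storage medium, and electronic device
CN111291044A (zh) * 2020-01-14 2020-06-16 中移(杭州)信息技术有限公司 Sensitive data identification method, apparatus, electronic device, and storage medium
CN111475651B (zh) * 2020-04-08 2023-04-07 掌阅科技股份有限公司 Text classification method, computing device, and computer storage medium
CN111613107A (zh) * 2020-05-19 2020-09-01 富邦教育科技(深圳)有限公司 Artificial intelligence homework system
CN111611312A (zh) * 2020-05-19 2020-09-01 四川万网鑫成信息科技有限公司 Data desensitization method based on a rule engine and blockchain technology
CN111914130A (zh) * 2020-08-03 2020-11-10 支付宝(杭州)信息技术有限公司 Sensitive data detection method and apparatus
CN112069540A (zh) * 2020-09-04 2020-12-11 中国平安人寿保险股份有限公司 Sensitive information processing method, apparatus, and medium
CN112417887B (zh) * 2020-11-20 2023-12-05 小沃科技有限公司 Method for processing a sensitive word and sentence recognition model, and related device
CN112507376B (zh) * 2020-12-01 2024-01-05 浙商银行股份有限公司 Sensitive data detection method and apparatus based on machine learning
CN113220801B (zh) * 2021-05-17 2022-07-29 支付宝(杭州)信息技术有限公司 Structured data classification method, apparatus, device, and medium
CN113343699B (zh) * 2021-06-22 2023-10-20 湖北华中电力科技开发有限责任公司 Method and apparatus for monitoring log security risks, electronic device, and medium
CN113472686B (zh) * 2021-07-06 2024-03-08 深圳乐信软件技术有限公司 Information identification method, apparatus, device, and storage medium
CN113672976A (zh) * 2021-08-04 2021-11-19 支付宝(杭州)信息技术有限公司 Sensitive information detection method and apparatus
WO2023077815A1 (zh) * 2021-11-03 2023-05-11 深圳前海微众银行股份有限公司 Method and apparatus for processing sensitive data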

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102012985A (zh) * 2010-11-19 2011-04-13 国网电力科学研究院 Dynamic sensitive data identification method based on data mining
US20170061149A1 (en) * 2015-08-24 2017-03-02 Alibaba Group Holding Limited System, method, and apparatus for data access in a cloud computing environment
CN107862214A (zh) * 2017-06-16 2018-03-30 平安科技(深圳)有限公司 Method, apparatus, and storage medium for preventing sensitive information leakage
CN108537056A (zh) * 2018-03-07 2018-09-14 新博卓畅技术(北京)有限公司 Two-layer filtering data desensitization method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7380281B2 (en) * 2004-05-06 2008-05-27 International Business Machines Corporation System and method for automatically hiding sensitive information obtainable from a process table
US8738604B2 (en) * 2012-03-30 2014-05-27 Go Daddy Operating Company, LLC Methods for discovering sensitive information on computer networks
US9785795B2 (en) * 2014-05-10 2017-10-10 Informatica, LLC Identifying and securing sensitive data at its source
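CN104506545B (zh) * 2014-12-30 2017-12-22 北京奇安信科技有限公司 Data leakage protection method and apparatus
CN105825138B (zh) * 2015-01-04 2019-02-15 北京神州泰岳软件股份有限公司 Method and apparatus for identifying sensitive data
CN108268785B (zh) * 2016-12-30 2020-05-22 广东精点数据科技股份有限公司 Apparatus and method for sensitive data identification and desensitization
CN108763952B (zh) * 2018-05-03 2022-04-05 创新先进技术有限公司 Data classification method and apparatus, and electronic device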

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
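CN112580094A (zh) * 2020-12-14 2021-03-30 京东数字科技控股股份有限公司 Data processing method, electronic device, and storage medium
CN112580094B (zh) * 2020-12-14 2024-05-17 京东科技控股股份有限公司 Data processing method, electronic device, and storage medium
CN113157854B (zh) * 2021-01-22 2023-08-04 奇安信科技集团股份有限公司 Method and system for detecting sensitive data leakage in APIs
CN113157854A (zh) * 2021-01-22 2021-07-23 奇安信科技集团股份有限公司 Method and system for detecting sensitive data leakage in APIs
CN113486392A (zh) * 2021-06-07 2021-10-08 四川新网银行股份有限公司 Sensitive data identification and desensitization method based on a big data platform
CN113486392B (zh) * 2021-06-07 2023-06-06 四川新网银行股份有限公司 Sensitive data identification and desensitization method based on a big data platform
CN113688837A (zh) * 2021-09-29 2021-11-23 平安科技(深圳)有限公司 Image desensitization method and apparatus, electronic device, and computer-readable storage medium
CN116090006B (zh) * 2023-02-01 2023-09-08 北京三维天地科技股份有限公司 Sensitive identification method and system based on deep learning
CN116090006A (zh) * 2023-02-01 2023-05-09 北京三维天地科技股份有限公司 Sensitive identification method and system based on deep learning
CN117009596A (zh) * 2023-06-28 2023-11-07 国网冀北电力有限公司信息通信分公司 Method and apparatus for identifying sensitive power grid data
CN117421730A (zh) * 2023-09-11 2024-01-19 暨南大学 Ensemble-learning-based method for detecting sensitive information in code snippets
CN117421730B (zh) * 2023-09-11 2024-06-04 暨南大学 Ensemble-learning-based method for detecting sensitive information in code snippets
CN117391076A (zh) * 2023-12-11 2024-01-12 东亚银行(中国)有限公司 Method and apparatus for obtaining a sensitive data identification model, electronic device, and medium
CN117391076B (zh) * 2023-12-11 2024-02-27 东亚银行(中国)有限公司 Method and apparatus for obtaining a sensitive data identification model, electronic device, and medium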

Also Published As

Publication number Publication date
CN110222170B (zh) 2024-05-24
CN110222170A (zh) 2019-09-10

Similar Documents

Publication Publication Date Title
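WO2020215571A1 (zh) Method, apparatus, storage medium, and computer device for identifying sensitive data
TWI689871B (zh) Feature interpretation method and apparatus for gradient boosting decision tree (GBDT) models
WO2019051941A1 (zh) Vehicle model recognition method, apparatus, device, and computer-readable storage medium
WO2015135452A1 (en) Text information processing method and apparatus
WO2018112783A1 (zh) Image recognition method and apparatus
TW201909112A (zh) Image feature acquisition
CN111368024A (zh) Text semantic similarity analysis method and apparatus, and computer device
CN107341220B (zh) Multi-source data fusion method and apparatus
WO2020238229A1 (zh) Training of a transaction feature generation model, and method and apparatus for generating transaction features
CN108550065B (zh) Review data processing method, apparatus, and device
WO2020082734A1 (zh) Text emotion recognition method and apparatus, electronic device, and non-volatile computer-readable storage medium
CN108959474B (zh) Entity relationship extraction method
WO2017157165A1 (zh) Credit score model training method, credit score calculation method, apparatus, and server
WO2022199185A1 (zh) User operation detection method and program product
WO2022042297A1 (zh) Text clustering method and apparatus, electronic device, and storage medium
WO2019061664A1 (zh) Electronic apparatus, product recommendation method based on user internet browsing data, and storage medium
CN111723870A (zh) Artificial-intelligence-based dataset acquisition method, apparatus, device, and medium
CN111062440B (zh) Sample selection method, apparatus, device, and storage medium
US11367311B2 (en) Face recognition method and apparatus, server, and storage medium
CN114239805A (zh) Cross-modal retrieval neural network and training method, apparatus, electronic device, and medium
CN113762303B (zh) Image classification method and apparatus, electronic device, and storage medium
CN108734393A (zh) Housing listing information matching method, user equipment, storage medium, and apparatus
CN111489262A (zh) Insurance policy information detection method and apparatus, computer device, and storage medium
CN107665443B (zh) Method and apparatus for acquiring target users
CN110717817A (zh) Pre-loan review method and apparatus, electronic device, and computer-readable storage medium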

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19925982; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 19925982; Country of ref document: EP; Kind code of ref document: A1)