CN113868497A - Data classification method and device and storage medium - Google Patents

Data classification method and device and storage medium

Info

Publication number
CN113868497A
CN113868497A (application CN202111140419.3A)
Authority
CN
China
Prior art keywords
classification
information
data
label
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111140419.3A
Other languages
Chinese (zh)
Inventor
张正欣
王豪
肖春亮
何坤
牟黎明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhou Lvmeng Chengdu Technology Co ltd
Nsfocus Technologies Inc
Nsfocus Technologies Group Co Ltd
Original Assignee
Shenzhou Lvmeng Chengdu Technology Co ltd
Nsfocus Technologies Inc
Nsfocus Technologies Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhou Lvmeng Chengdu Technology Co ltd, Nsfocus Technologies Inc, Nsfocus Technologies Group Co Ltd filed Critical Shenzhou Lvmeng Chengdu Technology Co ltd
Priority to CN202111140419.3A priority Critical patent/CN113868497A/en
Publication of CN113868497A publication Critical patent/CN113868497A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The application relates to the technical field of network information security, and in particular to a data classification method, device and storage medium for improving the accuracy of sensitive-data classification. The method comprises: acquiring data information to be detected; matching the data information to be detected against each sensitive information label in a preset sensitive information base to obtain a first classification label matching the data information to be detected, and classifying the sensitive information of the data information to be detected with a trained classification model to obtain a corresponding second classification label; if the first classification label is inconsistent with the second classification label and the confidence of the second classification label is not greater than a preset threshold, taking the first classification label as the target classification label corresponding to the data information to be detected; and if the confidence of the second classification label is greater than the preset threshold, taking the second classification label as the target classification label. By fusing the result matched against the sensitive information base with the result of the neural network, the method and device improve the security protection of sensitive data.

Description

Data classification method and device and storage medium
Technical Field
The application relates to the technical field of network information security, and provides a data classification method, a data classification device and a storage medium.
Background
The big-data era is an era in which everything is interconnected, and data classification and grading, as the bridgehead of data security, is of great importance in data security governance. At present, with the development of big data, artificial intelligence and the Internet of Things, how to effectively protect personal privacy and prevent enterprise-sensitive data from being leaked has become the focus of more and more security vendors. Sensitive data therefore needs careful governance, with different security protection applied at different data levels, to avoid the major losses caused by sensitive-data leakage.
In the related art, sensitive data is classified and graded mainly by systems based on preset rules or on neural network models. Rule-based systems, however, generalize poorly and lack knowledge reusability and shareability; neural-network-based systems solve these problems but lack interpretability.
Disclosure of Invention
The embodiment of the application provides a data classification method, a data classification device and a storage medium, which are used for improving the classification accuracy and interpretability of sensitive data.
The first data classification method provided by the embodiment of the application comprises the following steps:
acquiring data information to be detected, wherein the data information to be detected is database data information containing sensitive information;
matching the data information to be detected with each sensitive information label in a preset sensitive information base respectively to obtain a first classification label matched with the data information to be detected, and classifying the sensitive information of the data information to be detected based on a trained classification model to obtain a second classification label corresponding to the data information to be detected;
if the first classification label is inconsistent with the second classification label and the confidence of the second classification label is not greater than a preset threshold, taking the first classification label as the target classification label corresponding to the data information to be detected;
and if the first classification label is inconsistent with the second classification label and the confidence of the second classification label is greater than the preset threshold, taking the second classification label as the target classification label corresponding to the data information to be detected.
In the above embodiment, the data information to be detected is classified according to its sensitive information on the basis of both the preset sensitive information base and the classification model, yielding the first and second classification labels. The two labels are compared, and if they are inconsistent, the result matched against the sensitive information base is fused with the result of the neural-network classification model.
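As an illustration only, the fusion rule described above can be sketched as follows; the function name and the threshold default are hypothetical choices, since the claims speak only of a "preset threshold":

```python
def fuse_labels(first_label, second_label, second_confidence, threshold=0.8):
    """Fuse the sensitive-information-base label (first) with the
    neural-network label (second) per the decision rule in the claims.

    `threshold` is an illustrative default, not a value from the patent."""
    if first_label is None:
        # No label in the sensitive information base matched: fall back to the model.
        return second_label
    if first_label == second_label:
        return first_label
    # Labels disagree: trust the model only when its confidence exceeds the threshold.
    return second_label if second_confidence > threshold else first_label
```

With the default threshold, a disagreeing model prediction at confidence 0.6 is overridden by the rule-based label, while one at 0.95 wins.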
An optional implementation manner is that the matching of the data information to be detected and each sensitive information tag in a preset sensitive information base is performed respectively, and the method further includes:
and if all the sensitive information labels of the preset sensitive information base are not matched with the data information to be detected, using the second classification label as a target classification label corresponding to the data information to be detected.
In the above embodiment, if no sensitive information label in the preset sensitive information base matches the data information to be detected, the second classification label is used as the target classification label. Thus, when a new sensitive-information category appears that cannot be matched against the sensitive information base, the result of the neural-network classification model is used directly as the target classification label, improving the security protection of sensitive data.
An optional implementation manner is that, based on the trained classification model, the sensitive information classification is performed on the data information to be detected, so as to obtain a second classification label corresponding to the data information to be detected, which specifically includes:
splitting the data information to be detected into at least two dimensions of characteristic information, and respectively encoding the at least two dimensions of characteristic information to obtain first encoding vectors corresponding to the at least two dimensions of characteristic information;
splicing the first coding vectors to obtain a first coding matrix corresponding to the data information to be detected;
performing weighting processing based on a weight matrix and the first coding matrix to obtain a first classification vector corresponding to the data information to be detected, wherein the weight matrix is used for representing the importance of the characteristic information of the at least two dimensions to the classification of the sensitive information;
classifying the sensitive information in the data information to be detected based on the first classification vector, determining the prediction score of the sensitive information for each sensitive information label, and determining the second classification label corresponding to the data information to be detected based on each prediction score.
In the above embodiment, the database data information is split into feature information of at least two dimensions, each encoded separately, and the multi-dimensional information of the database is fused dynamically to construct a sensitive-data classification and grading system. Sensitive data is identified from multiple angles and classified with a neural network, overcoming the insufficiency of any single feature, enhancing the accuracy and interpretability of the whole system, and improving the accuracy of sensitive-data classification.
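A minimal numeric sketch of this pipeline (split into dimensions, encode, concatenate, weight, score) is given below; the toy character-hash encoder, the vector sizes, and all names are illustrative assumptions standing in for the real encoder and learned parameters:

```python
import numpy as np

EMBED_DIM, NUM_LABELS = 8, 4

def encode(text):
    # Toy stand-in for a learned encoder such as BERT: a deterministic
    # pseudo-embedding derived from the text's characters.
    seed = sum(ord(c) for c in text)
    return np.random.default_rng(seed).standard_normal(EMBED_DIM)

def classify(record, weight_matrix, label_projection):
    # Split into feature dimensions and encode each one (first encoding vectors),
    # then stack them into the first coding matrix.
    coding_matrix = np.stack([encode(v) for v in record.values()])
    # Weight the dimensions by importance and pool into one classification vector.
    classification_vector = (weight_matrix @ coding_matrix).mean(axis=0)
    scores = label_projection @ classification_vector
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()  # prediction score for each sensitive information label

record = {"field_name": "mobile", "field_value": "138****0000",
          "table_name": "customer", "data_type": "varchar"}
weight_matrix = np.eye(len(record))  # placeholder importance weights
label_projection = np.random.default_rng(0).standard_normal((NUM_LABELS, EMBED_DIM))
probs = classify(record, weight_matrix, label_projection)
```

The label with the highest score in `probs` would then be taken as the second classification label.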
In an alternative embodiment, the weight matrix is obtained by:
and performing attention characteristic extraction according to a first self-attention mechanism matrix and the first encoding matrix to obtain the weight matrix, wherein the first self-attention mechanism matrix is used for representing the context relationship between the characteristic information of the at least two dimensions.
In the above embodiment, attention feature extraction is performed according to the first self-attention mechanism matrix and the first encoding matrix to obtain a weight matrix, which represents the importance of feature information of different dimensions to the current sensitive information tag.
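One possible concrete reading of this step, assuming a standard scaled dot-product self-attention form (the patent does not fix the exact formulation), is:

```python
import numpy as np

def attention_weight_matrix(self_attention_matrix, coding_matrix):
    """Derive a per-dimension importance weight matrix from the first coding
    matrix via a self-attention projection (assumed scaled dot-product form)."""
    projected = coding_matrix @ self_attention_matrix         # project each dimension
    scores = projected @ projected.T                          # pairwise context affinities
    scores = scores / np.sqrt(coding_matrix.shape[1])         # scale by sqrt(d)
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))  # numerically stable softmax
    return exp / exp.sum(axis=1, keepdims=True)               # each row sums to 1

coding = np.random.default_rng(1).standard_normal((4, 8))     # 4 dimensions, d = 8
attn = np.random.default_rng(2).standard_normal((8, 8))
W = attention_weight_matrix(attn, coding)
```

Each row of `W` gives the relative importance of every feature dimension in the context of one dimension, capturing the context relationship between dimensions.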
In an alternative embodiment, the trained classification model is obtained by training in the following way:
acquiring a training sample data set, iteratively training an initial classification model on the training samples in the training sample data set, and outputting the trained classification model when training finishes, wherein each training sample comprises sample data information and a real score obtained by classifying the sensitive information of the sample data information; wherein each training iteration performs the following operations:
selecting at least one training sample from the training sample data set, inputting the selected training sample into the classification model, and obtaining the prediction score of the training sample output by the classification model for each sensitive information label;
constructing a focus loss function based on the predicted scores of the training samples for each sensitive information label and the difference between the corresponding real scores;
adjusting parameters of the classification model based on the focus loss function.
In the above embodiment, the classification model is trained iteratively: a focus loss function is constructed from the difference between the training samples' predicted scores for each sensitive information label and the corresponding real scores, and the model parameters are adjusted on that loss. This increases the model's attention to relevant features, reduces its attention to irrelevant features, and improves its accuracy in classifying sensitive information.
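The translated term "focus loss function" most likely denotes the focal loss; under that assumption, a minimal per-sample sketch is:

```python
import numpy as np

def focus_loss(pred_scores, true_scores, gamma=2.0, eps=1e-9):
    """Focal-style loss over per-label scores: well-classified labels are
    down-weighted by a (1 - p)^gamma factor so that training concentrates
    on hard, misclassified labels. `gamma` is an illustrative default."""
    pred = np.clip(np.asarray(pred_scores, float), eps, 1.0 - eps)
    true = np.asarray(true_scores, float)
    pos = -true * (1.0 - pred) ** gamma * np.log(pred)        # true labels
    neg = -(1.0 - true) * pred ** gamma * np.log(1.0 - pred)  # other labels
    return float(np.sum(pos + neg))

# A confident correct prediction incurs a much smaller loss than an uncertain one,
# which is what shifts the model's attention toward hard examples.
confident = focus_loss([0.95, 0.02, 0.03], [1.0, 0.0, 0.0])
uncertain = focus_loss([0.40, 0.30, 0.30], [1.0, 0.0, 0.0])
```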
An optional implementation manner is that, the inputting the selected training sample into the classification model, and obtaining the second classification label output by the classification model, includes:
splitting the training sample into at least two-dimensional feature information, respectively inputting the at least two-dimensional feature information into the classification model, and respectively encoding the at least two-dimensional feature information based on an embedded layer in the classification model to obtain second encoding vectors corresponding to the at least two-dimensional feature information;
splicing the second coding vectors to obtain second coding features corresponding to the training samples;
inputting the second coding features into a full-connection layer of the classification model, and performing attention feature extraction on the second coding features based on a second self-attention mechanism matrix corresponding to the full-connection layer to obtain corresponding second classification vectors of the training samples;
and classifying the sensitive information in the training samples based on the second classification vector, and obtaining the prediction score of the training samples output by the classification model for each sensitive information label.
In the above embodiment, the training sample is split into feature information of at least two dimensions for encoding, attention feature extraction is performed on the second encoding features through the self-attention mechanism to obtain the second classification vector, the sensitive information in the training sample is classified on that vector, and the classification model outputs the training sample's prediction score for each sensitive information label, enhancing the accuracy and interpretability of sensitive-information classification.
In an optional implementation manner, the method further includes:
acquiring a test sample data set, and executing a loop iteration test on the trained classification model according to the test sample in the test sample data set; wherein the following operations are executed in a loop iteration test process:
inputting the test sample into the trained classification model to obtain a corresponding second classification vector;
carrying out similarity comparison according to the second classification vector of the test sample and the second classification vectors of all the training samples to obtain a training sample with the highest similarity to the test sample;
and determining the sensitive information label corresponding to the test sample based on the sensitive information label corresponding to the training sample with the highest similarity, wherein the sensitive information label corresponding to the training sample is determined based on the prediction score of the training sample for each sensitive information label.
In the above embodiment, the training sample most similar to the test sample is found by semantic similarity measurement between the test sample and the training samples, and that training sample's sensitive information label is used as the test sample's label. The model therefore need not be retrained when new sensitive-information categories are added, avoiding the retraining a traditional neural-network classifier requires for every new category and improving data classification efficiency.
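Assuming cosine similarity as the semantic similarity measure (the patent leaves the measure open), the test-time lookup can be sketched as:

```python
import numpy as np

def nearest_label(test_vector, train_vectors, train_labels):
    """Return the sensitive information label of the training sample whose
    second classification vector is most similar to the test sample's."""
    t = np.asarray(test_vector, float)
    best_i, best_sim = 0, -np.inf
    for i, v in enumerate(train_vectors):
        v = np.asarray(v, float)
        sim = float(t @ v) / (np.linalg.norm(t) * np.linalg.norm(v))  # cosine similarity
        if sim > best_sim:
            best_i, best_sim = i, sim
    return train_labels[best_i]

# A new category ("passport") can be recognised without retraining, provided
# one labelled classification vector for it exists among the training samples.
label = nearest_label([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]], ["passport", "phone"])
```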
In an optional embodiment, the dimensions of the feature information include at least two of the following: field name, field value, database table name, data type.
In the above embodiment, the training sample is divided into at least two dimensions of feature information for encoding, sensitive information identification is performed from multiple dimensions, the importance of different dimensions of feature information on the current sensitive information label is fused, the defect of insufficient single feature information is overcome, and the accuracy and interpretability of sensitive information classification are enhanced.
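To make the dimension splitting concrete, a hypothetical database record could be decomposed as follows; the record layout and key names are invented for illustration:

```python
def split_dimensions(record):
    """Split one database record into the feature dimensions named above."""
    keys = ("field_name", "field_value", "table_name", "data_type")
    return [record[k] for k in keys if k in record]

sample = {"field_name": "id_number", "field_value": "110***********1234",
          "table_name": "employee", "data_type": "varchar"}
features = split_dimensions(sample)  # feature information of four dimensions
```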
The data classification device provided by the embodiment of the application comprises:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring data information to be detected, and the data information to be detected is database data information containing sensitive information;
the matching unit is used for respectively matching the data information to be detected with each sensitive information label in a preset sensitive information base to obtain a first classification label matched with the data information to be detected, and classifying the sensitive information of the data information to be detected based on a trained classification model to obtain a second classification label corresponding to the data information to be detected;
a first determining unit, configured to, if the first classification tag and the second classification tag are inconsistent and a confidence of the second classification tag is not greater than a preset threshold, use the first classification tag as a target classification tag corresponding to the to-be-detected data information;
and the second determining unit is configured to take the second classification label as the target classification label corresponding to the data information to be detected if the first classification label is inconsistent with the second classification label and the confidence of the second classification label is greater than a preset threshold.
Optionally, the matching unit is specifically configured to:
and if all the sensitive information labels of the preset sensitive information base are not matched with the data information to be detected, using the second classification label as a target classification label corresponding to the data information to be detected.
Optionally, the matching unit is specifically configured to:
splitting the data information to be detected into at least two dimensions of characteristic information, and respectively encoding the at least two dimensions of characteristic information to obtain first encoding vectors corresponding to the at least two dimensions of characteristic information;
splicing the first coding vectors to obtain a first coding matrix corresponding to the data information to be detected;
performing weighting processing based on a weight matrix and the first coding matrix to obtain a first classification vector corresponding to the data information to be detected, wherein the weight matrix is used for representing the importance of the characteristic information of the at least two dimensions to the classification of the sensitive information;
classifying the sensitive information in the data information to be detected based on the first classification vector, determining the prediction score of the sensitive information for each sensitive information label, and determining the second classification label corresponding to the data information to be detected based on each prediction score.
Optionally, the matching unit is further configured to determine the weight matrix by:
and performing attention characteristic extraction according to a first self-attention mechanism matrix and the first encoding matrix to obtain the weight matrix, wherein the first self-attention mechanism matrix is used for representing the context relationship between the characteristic information of the at least two dimensions.
Optionally, the matching unit is specifically configured to:
acquiring a training sample data set, iteratively training an initial classification model on the training samples in the training sample data set, and outputting the trained classification model when training finishes, wherein each training sample comprises sample data information and a real score obtained by classifying the sensitive information of the sample data information; wherein each training iteration performs the following operations:
selecting at least one training sample from the training sample data set, inputting the selected training sample into the classification model, and obtaining the prediction score of the training sample output by the classification model for each sensitive information label;
constructing a focus loss function based on the predicted scores of the training samples for each sensitive information label and the difference between the corresponding real scores;
adjusting parameters of the classification model based on the focus loss function.
Optionally, the matching unit is specifically configured to:
splitting the training sample into at least two-dimensional feature information, respectively inputting the at least two-dimensional feature information into the classification model, and respectively encoding the at least two-dimensional feature information based on an embedded layer in the classification model to obtain second encoding vectors corresponding to the at least two-dimensional feature information;
splicing the second coding vectors to obtain second coding features corresponding to the training samples;
inputting the second coding features into a full-connection layer of the classification model, and performing attention feature extraction on the second coding features based on a second self-attention mechanism matrix corresponding to the full-connection layer to obtain corresponding second classification vectors of the training samples;
and classifying the sensitive information in the training samples based on the second classification vector, and obtaining the prediction score of the training samples output by the classification model for each sensitive information label.
Optionally, the apparatus further comprises a test unit, configured to:
acquiring a test sample data set, and executing a loop iteration test on the trained classification model according to the test sample in the test sample data set; wherein the following operations are executed in a loop iteration test process:
inputting the test sample into the trained classification model to obtain a corresponding second classification vector;
carrying out similarity comparison according to the second classification vector of the test sample and the second classification vectors of all the training samples to obtain a training sample with the highest similarity to the test sample;
and determining the sensitive information label corresponding to the test sample based on the sensitive information label corresponding to the training sample with the highest similarity, wherein the sensitive information label corresponding to the training sample is determined based on the prediction score of the training sample for each sensitive information label.
Optionally, the dimensions of the feature information include at least two of the following: field name, field value, database table name, data type.
An electronic device provided by an embodiment of the present application includes a processor and a memory, where the memory stores program codes, and when the program codes are executed by the processor, the processor is caused to execute any one of the steps of the data classification method.
An embodiment of the present application provides a computer-readable storage medium, which includes program code, when the storage medium is run on an electronic device, the program code is configured to enable the electronic device to execute any one of the steps of the data classification method described above.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is an alternative schematic diagram of an application scenario in an embodiment of the present application;
FIG. 2 is a flow chart illustrating a data classification method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a data classification model provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a self-attention model in an embodiment of the present application;
FIG. 5 is a diagram illustrating a structure of data information encoding according to an embodiment of the present application;
FIG. 6 is a diagram illustrating an embodiment of a data information encoding structure;
FIG. 7 is a schematic structural diagram of a classification model test in an embodiment of the present application;
FIG. 8 is a flowchart illustrating an overall data classification method according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a data classification apparatus in an embodiment of the present application;
fig. 10 is a schematic diagram of a hardware component of an electronic device to which an embodiment of the present application is applied;
fig. 11 is a schematic diagram of a hardware component structure of another electronic device to which the embodiment of the present application is applied.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
Some concepts related to the embodiments of the present application are described below.
1. In the embodiment of the present application, the term "and/or" describes an association relationship of associated objects, and means that there may be three relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
2. The term "neural network" in the embodiments of the present application denotes an algorithmic mathematical model that simulates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. Such a network processes information by adjusting the interconnections among a large number of internal nodes, depending on the complexity of the system. Its main task is to construct a practical artificial neural network model according to the principles of biological neural networks and the requirements of the application, design a corresponding learning algorithm, simulate certain intelligent activities of the human brain, and then solve practical problems technically. The neural network in the embodiments of the present application is used to classify sensitive information.
3. In the embodiments of the present application, the term "BERT" (Bidirectional Encoder Representations from Transformers) denotes a pre-training model whose architecture is based on multi-layer bidirectional Transformer encoding, giving the model the ability to understand long-range context. BERT in the embodiments of the present application is used to encode data information.
The embodiments of the present application relate to Artificial Intelligence (AI) and Machine Learning technologies, and are designed based on a computer vision technology and Machine Learning (ML) in the AI.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence.
Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like. The classification model in the embodiment of the application is obtained by training through a machine learning or deep learning technology. The sensitive information can be identified based on the training method of the classification model in the embodiment of the application.
The following briefly introduces the design concept of the embodiments of the present application:
the big data era is an age of everything interconnection, and data classification and classification are used as a bridgehead castle of data security and are of great importance in the data security control process. At present, with the development of big data, artificial intelligence and internet of things, how to effectively protect personal privacy and protect sensitive data of enterprises from being revealed becomes the focus of attention of more and more security manufacturers. Therefore, the control of sensitive data needs to be focused, different safety protection is realized according to different data levels, and the major loss caused by sensitive data leakage is avoided.
The classification and grading of sensitive data is an important application field of data security, and aims to provide a solution for protecting personal privacy and enterprise sensitive data by using professional knowledge, so that the sensitivity and sensitive content of the data can be effectively identified. In the process of identifying and classifying the sensitive data, the sensitive data classification grading system can identify possible sensitive contents and sensitive grades by using the established sensitive criteria and provide interpretable grading description, thereby providing suggestions for individuals and enterprises and helping the individuals and the enterprises to protect related privacy.
Meanwhile, with technological progress, machine learning and deep learning have developed rapidly, and neural networks in particular have achieved good results in many fields. Accordingly, in the security field, sensitive data classification and grading systems have evolved from the initial rule-based systems to neural-network-based systems. Although traditional rule-based systems do not perform as well as neural-network-based systems, neural networks are black boxes that lack interpretability. Studying how to combine neural networks with rule logic to better serve the security domain is therefore of great practical significance.
In the related art, sensitive data is classified and graded mainly by a classification and grading system based on preset rules or one based on a neural network model. A rule-based system suffers from poor knowledge reusability and sharing because rules generalize weakly; a system based on a neural network model can solve this problem, but neural networks lack interpretability.
In view of this, embodiments of the present application provide a data classification method, apparatus, and storage medium. By fusing the result matched against the sensitive information base with the result of the neural-network classification model, the accuracy and interpretability of sensitive data classification are improved, the low generalization capability of traditional rule-based sensitive data systems is overcome, and the security protection capability for sensitive data is enhanced.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it should be understood that the preferred embodiments described herein are merely for illustrating and explaining the present application, and are not intended to limit the present application, and that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Fig. 1 is a schematic diagram of an application scenario in the embodiment of the present application. The diagram includes two terminal devices 110 and a server 120, which may communicate with each other via a communication network.
In an alternative embodiment, the communication network is a wired network or a wireless network. The terminal device 110 and the server 120 may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
In this embodiment, the terminal device 110 is an electronic device used by a user, such as a personal computer, mobile phone, tablet computer, notebook, e-book reader, or other computer device with a certain computing capability that runs instant messaging or social software and websites. Each terminal device 110 is connected to the server 120 through a wireless network. The server 120 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
In the embodiment of the present application, the classification model may be deployed on the terminal device 110 for training, and may also be deployed on the server 120 for training. The server 120 may store a plurality of training samples, including at least one set of training samples, for training the classification model. Optionally, after the classification model is obtained by training based on the training method in the embodiment of the present application, the trained classification model may be directly deployed on the server 120 or the terminal device 110. The classification model is typically deployed directly on the server 120, and in the embodiment of the present application, the classification model is often used to classify data.
It should be noted that the classification model and the data classification method provided by the embodiment of the present application may be applied to various application scenarios including data classification tasks, for example, a session classification module in a session auditing platform corresponding to some chat software, a comment classification module in a content sharing platform, and the like, and may be used to classify some public information, public conversations, and the like. Correspondingly, the training samples used in different scenes are different, and are not listed here.
The data classification method provided by the exemplary embodiment of the present application is described below with reference to the accompanying drawings in conjunction with the application scenarios described above, it should be noted that the application scenarios described above are only shown for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect.
Referring to fig. 2, it is a flowchart illustrating an implementation of a data classification method according to an embodiment of the present application, and the specific implementation flow of the method is as follows:
S21: acquiring data information to be detected;
the data information to be detected is database data information containing sensitive information.
S22: respectively matching the data information to be detected with each sensitive information label in a preset sensitive information base to obtain a first classification label matched with the data information to be detected, and classifying the sensitive information of the data information to be detected based on a trained classification model to obtain a second classification label corresponding to the data information to be detected;
the sensitive information base comprises a large number of sensitive information identification rules which are manually sorted, the sensitive information identification rules comprise regular expressions, various AND/OR logic combinations and the like, the sensitive information can be a mobile phone number, a home address and the like, and the sensitive information identification rules are not specifically limited herein. The rules can be maintained by continuously modifying information manually, and the identification of partial sensitive information can be completed through the priori knowledge of the rules.
For example, when the sensitive information is a mobile phone number, a regular expression including the mobile phone number is as follows:
^1[35678]\d{9}$
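As an illustration of how such a rule from the sensitive information base might be applied, the following sketch matches a field value against the mobile-number regular expression; the helper name is an assumption, not part of the application:

```python
import re

# Hypothetical rule from a sensitive-information base: a Chinese mobile
# number starts with 1, has a second digit in 3/5/6/7/8, then 9 more digits.
MOBILE_RULE = re.compile(r"^1[35678]\d{9}$")

def match_mobile(value: str) -> bool:
    """Return True if the field value matches the mobile-number rule."""
    return MOBILE_RULE.match(value) is not None
```

In a full system, each rule in the base would be tried in turn and a match would yield the first classification label.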
S23: if the first classification label is inconsistent with the second classification label and the confidence of the second classification label is not greater than a preset threshold, taking the first classification label as the target classification label corresponding to the data information to be detected;
S24: if the first classification label is inconsistent with the second classification label and the confidence of the second classification label is greater than the preset threshold, taking the second classification label as the target classification label corresponding to the data information to be detected.
In this embodiment, the result matched against the sensitive information base is fused with the result of the neural-network classification model, which improves the accuracy and interpretability of sensitive data classification, solves the low generalization capability of traditional rule-based sensitive data systems, and enhances the security protection capability for sensitive data.
In an optional implementation manner, if all the sensitive information tags of the preset sensitive information library are not matched with the data information to be detected, the second classification tag is used as a target classification tag corresponding to the data information to be detected.
Specifically, when the data information to be detected is not matched with each sensitive information label in the sensitive information base, that is, the first classification label cannot be matched, the second classification label is directly used as the target classification label.
In an alternative embodiment, the sensitive information classification may be performed on the data information to be detected, and the second classification tag may be obtained based on the following steps:
Step 1: splitting the data information to be detected into feature information of at least two dimensions, and respectively encoding the feature information of the at least two dimensions to obtain first encoding vectors corresponding to the feature information of the at least two dimensions;
Step 2: splicing the first encoding vectors to obtain a first encoding matrix corresponding to the data information to be detected;
Step 3: performing weighting processing based on a weight matrix and the first encoding matrix to obtain a first classification vector corresponding to the data information to be detected, wherein the weight matrix is used for representing the importance of the feature information of the at least two dimensions to the classification of sensitive information;
Step 4: classifying the sensitive information in the data information to be detected based on the first classification vector, determining the prediction score of the sensitive information for each sensitive information label, and determining a second classification label corresponding to the data information to be detected based on each prediction score.
Specifically, the sensitive information label with the highest prediction score, i.e. the label with Top-1 confidence, is generally selected as the second classification label corresponding to the data information to be detected; alternatively, the N (N being a positive integer) sensitive information labels whose scores exceed a score threshold may all be selected as second classification labels. This is not specifically limited herein and depends on the actual situation.
For example, when there are 20 kinds of sensitive information labels and the top three by prediction score are the name, the mobile phone number, and the home address, with prediction scores of 0.7, 0.8, and 0.9 respectively, the home address with the highest score may be selected as the second classification label; or, with a preset score threshold of 0.7, every sensitive information label scoring above 0.7 may be selected, i.e. both the mobile phone number and the home address become second classification labels.
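The two selection strategies described above (Top-1, or all labels above a score threshold) can be sketched as follows; the function name and the label names are illustrative only:

```python
def select_labels(scores, threshold=None):
    """Pick second classification label(s) from per-label prediction scores.

    With no threshold, return the single Top-1 label; otherwise return
    every label whose score exceeds the threshold.
    """
    if threshold is None:
        return [max(scores, key=scores.get)]
    return [label for label, s in scores.items() if s > threshold]

# The example from the text: top three labels and their prediction scores.
scores = {"name": 0.7, "mobile phone number": 0.8, "home address": 0.9}
```

Calling `select_labels(scores)` yields only the home address; `select_labels(scores, 0.7)` yields both the mobile phone number and the home address.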
In an alternative embodiment, the dimensions of the feature information may include field names, field values, database table names, and data types.
A piece of information to be detected can be split into a field name, a field value, a database table name and a data type.
For example, splitting a piece of to-be-detected data information whose sensitive label category is address may yield:
Field name: customer address;
Field value: home is in Kunming City, Yunnan Province;
Database table name: customer information;
Data type: text.
Specifically, the data information to be detected is split into feature information of at least two dimensions. After a piece of data information to be detected is split into multi-dimensional feature information, each dimension is encoded separately to obtain the first encoding vectors, and the first encoding vectors are spliced into a first encoding matrix, so that the first encoding matrix fuses the multi-dimensional feature information. The first encoding matrix is then weighted by the weight matrix to obtain the first classification vector. Finally, the sensitive information in the data information to be detected is classified based on the first classification vector to obtain the prediction score for each sensitive information label, and the second classification label corresponding to the data information to be detected is determined from these prediction scores.
In an alternative embodiment, the weight matrix may be derived based on the following:
and performing attention characteristic extraction according to a first self-attention mechanism matrix and the first encoding matrix to obtain the weight matrix, wherein the first self-attention mechanism matrix is used for representing the context relationship between the characteristic information of the at least two dimensions.
Specifically, attention feature extraction is carried out on the first encoding matrix through the first self-attention mechanism matrix, the importance degree of feature information of different dimensions in the first encoding matrix to the current sensitive label is obtained, and the importance degree is represented through the weight matrix.
In an alternative embodiment, the classification model is trained by:
acquiring a training sample data set, executing cycle iterative training on an initial classification model according to training samples in the training sample data set, and outputting the trained classification model when the training is finished, wherein each training sample comprises sample data information and a real score obtained by classifying sensitive information of the sample data information; wherein the following operations are executed in a loop iteration training process:
selecting at least one training sample from a training sample data set, inputting the selected training sample into a classification model, and obtaining the prediction score of the training sample output by the classification model for each sensitive information label;
constructing a focus loss function based on the prediction scores of the training samples for each sensitive information label and the difference between the corresponding real scores;
parameters of the classification model are adjusted based on the focus loss function.
Specifically, before model training starts, a training sample data set needs to be acquired. A training sample is input into the classification model, which outputs the sample's prediction score for each sensitive information label. A focal loss function is then constructed from the difference between the prediction scores and the real scores, and the parameters of the classification model are adjusted based on it: for training samples with poor prediction scores, most of the cross-entropy loss is retained, while for training samples with good prediction scores, the cross-entropy loss is greatly reduced. This process is iterated in a loop, and the trained classification model is output when training ends.
The focal loss function in the embodiment of the present application may be the Focal Loss variant of the cross-entropy loss function, given by the following piecewise formula:
FL = -α · (1 - y_pred)^γ · log(y_pred),        if y_true = 1
FL = -(1 - α) · (y_pred)^γ · log(1 - y_pred),  if y_true = 0
Here, y_pred represents the prediction score output by the classification model for the training sample on each sensitive information label, with the prediction score fitting the manual label; y_true = 1 indicates that the training sample matches the sensitive information label, and y_true = 0 indicates that it does not. γ and α are hyper-parameters obtained empirically and may be set to 2 and 0.25, respectively. Finally, the focal loss function is optimized by the Adam (adaptive moment estimation) optimizer to maximize the relevance between the training sample and the current sensitive data classification label.
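A minimal sketch of such a focal loss for a single label, assuming the standard Focal Loss form with the usual defaults γ = 2 and α = 0.25 (the application's exact formulation may differ in detail):

```python
import math

def focal_loss(y_pred, y_true, alpha=0.25, gamma=2.0):
    """Focal Loss for one binary label: down-weights well-classified
    samples so training focuses on the hard ones."""
    # Clip to avoid log(0).
    y_pred = min(max(y_pred, 1e-7), 1 - 1e-7)
    if y_true == 1:
        # Confident correct prediction (y_pred near 1) -> tiny loss.
        return -alpha * (1 - y_pred) ** gamma * math.log(y_pred)
    # Negative label: confident correct prediction has y_pred near 0.
    return -(1 - alpha) * y_pred ** gamma * math.log(1 - y_pred)
```

The modulating factors (1 - y_pred)^γ and y_pred^γ are what retain most of the cross-entropy loss for poorly predicted samples while shrinking it for well-predicted ones, as described above.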
In an alternative embodiment, the selected training sample may be input into the classification model to obtain the second classification label based on the following steps:
s1: splitting a training sample into at least two-dimensional feature information, respectively inputting the at least two-dimensional feature information into a classification model, and respectively encoding the at least two-dimensional feature information based on an embedded layer in the classification model to obtain second encoding vectors corresponding to the at least two-dimensional feature information;
s2: splicing the second coding vectors to obtain second coding features corresponding to the training samples;
s3: inputting the second coding features into a full-link layer of the classification model, and extracting attention features of the second coding features based on a second self-attention mechanism matrix corresponding to the full-link layer to obtain corresponding second classification vectors of the training samples;
s4: and classifying the sensitive information in the training samples based on the second classification vector, and obtaining the prediction score of the training samples output by the classification model for each sensitive information label.
Specifically, a training sample is split into feature information of at least two dimensions and input into the classification model. The embedding layer of the classification model encodes each dimension to obtain the second encoding vectors corresponding to the feature information of the different dimensions, and the second encoding vectors are spliced into the second encoding feature. The second encoding feature is input into the fully connected layer of the classification model for attention feature extraction, yielding the second classification vector corresponding to the training sample. Finally, the sensitive information in the training sample is classified based on the second classification vector to obtain the prediction score, output by the classification model, of the training sample for each sensitive information label.
Referring to fig. 3, which is a schematic structural diagram of the data classification model in this embodiment, the classification model in this embodiment is described below with reference to fig. 3.
A piece of training data information may contain: the field, comprising a field value and a field name, where the field value refers to the content of the training data information and the field name is the name of the field corresponding to the training data information; the database table, comprising the database table name; and the data type, such as text, numeric value, time, and the like, which refers to the category of the training data information content. The sensitive information category comprises the sensitive field label.
First, a piece of training data information is split into feature information of four dimensions: field name, field value, database table name, and data type; the feature information of the four dimensions is then encoded separately. For example, the field value "home is in Kunming City, Yunnan Province" is split into its individual characters (char-type variables) and encoded by an embedding-layer encoder (for example, a BERT model) to obtain an embedded representation; the field name and the database table name are encoded in the same way. Since the data type takes only a small number of values, random encoding is adopted for it. The encoder is not specifically limited here. Finally, each piece of data is encoded as:
e_i = [BERT(e_char1), BERT(e_char2), BERT(e_char3), e_type]
where e_char1, e_char2, and e_char3 are pre-trained character embeddings from the BERT model, and e_type uses a random initialization representation. After the feature information of the four dimensions of each piece of training data information is encoded, E = Concat(e_i) can be obtained by splicing at the fully connected layer, where E denotes the multi-dimensional encoding set of each piece of training data information, i.e. the second encoding feature. For example, if the feature information of each dimension is encoded into a vector of dimension [1, 256], the spliced second encoding feature of the training data information is a matrix of dimension [4, 256]. After this matrix representation of the encoded training data information is obtained, the importance of the feature encodings of the different dimensions to sensitive-label classification is computed using a Self-Attention model.
Referring to fig. 4, which is a schematic structural diagram of a self-attention model in the embodiment of the present application, the working process of the self-attention model in the embodiment of the present application is described below with reference to fig. 4.
Step 1: input rwAnd rdataWherein r iswIs the second self-note in this applicationThe mean mechanism matrix represents the context relationship between the characteristic information in the training sample information and adopts random initialization representation with the dimension of [1,256]],rdataI.e. the second coding characteristic in the present application, i.e. the coding matrix of the training sample information after splicing, with dimension [4,256%];
Step 2: will r iswAnd rdataRespectively inputting the full connection layer to obtain hwAnd hdataAfter passing through the full connection layer, the data are in the same semantic space, and the specific calculation formula is as follows:
hw=σ(wrw+b)
hdata=σ(wrdata+b)
where σ is the activation function and w and b are the fully-connected layer parameters.
Step 3: take the dot product of the two matrices h_w and h_data transformed by the fully connected layer to obtain the matrix M(i), which measures the similarity between the two matrices. The calculation formula is as follows:
M(i) = h_w · h_data(i)
Step 4: apply a softmax operation to the matrix M(i), mapping it into the interval (0, 1) to obtain the weight matrix, and then weight r_data with the weight matrix to obtain the vector r_o of dimension [1, 256]. The calculation formula is as follows:
r_o = Σ_i softmax(M(i)) · r_data(i)
Finally, the self-attention model outputs the vector r_o, which is the weighted fusion of the four kinds of encoded information according to their different weights, i.e. it fuses the importance of the information of the different dimensions to the current sensitive information label. The r_w representing the context relationship between the feature information can then be re-encoded, which increases attention to relevant features and reduces attention to irrelevant ones. When there are multiple pieces of training sample information, the final encoding is (batch, 256), where batch is the number of samples.
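Steps 3 and 4 of the self-attention fusion above (dot-product similarity, softmax weights, weighted sum) can be sketched with toy 2-dimensional vectors standing in for the [1, 256] and [4, 256] matrices; names and sizes are illustrative:

```python
import math

def softmax(xs):
    """Map raw similarities into (0, 1) weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(h_w, h_data):
    """Fuse per-dimension encodings into one vector r_o.

    h_w: query vector (from r_w); h_data: one encoding vector per
    feature dimension (from r_data).
    """
    # Step 3: similarity M(i) = h_w . h_data(i)
    m = [sum(a * b for a, b in zip(h_w, h)) for h in h_data]
    # Step 4: softmax weights, then the weighted sum of the encodings
    w = softmax(m)
    dim = len(h_data[0])
    return [sum(w[i] * h_data[i][j] for i in range(len(h_data)))
            for j in range(dim)]
```

With `h_w = [1.0, 0.0]` and two encodings `[[1.0, 0.0], [0.0, 1.0]]`, the first encoding is more similar to the query and therefore receives the larger weight in the fused output.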
Referring to fig. 5, which is a schematic structural diagram of data information encoding in an embodiment of the present application: a piece of data information is split into feature information of four dimensions; for example, the field value "home is in Kunming City, Yunnan Province" is split into its individual characters (char-type variables). Embedded representations of the field name, database table name, and field value feature information are then obtained through BERT, the data type is randomly initialized and encoded, the encodings of the feature information are spliced and input into the Self-Attention model, and finally the vector fusing the four kinds of encoded information weighted by their different weights is obtained.
Referring to fig. 6, which is a schematic diagram of a specific structure of data information encoding in the embodiment of the present application: first, a piece of data information is split into four-dimensional feature information of field name, database table name, data type, and field value (it may also be split into feature information of more dimensions, which is not limited here). Embedded representations of the field name, database table name, and field value feature information are then obtained through pre-trained BERT, and the data type is randomly initialized and encoded; each of the four resulting feature encodings is a vector of dimension [1, 256]. The encoded vectors of the feature information are spliced and input into the Self-Attention model, which weights and fuses the four kinds of encoded information based on their different weights, and finally outputs the weighted fused vector of the data information, of dimension [1, 256].
After the weighted fused vector r_o of the data information is obtained and input into the fully connected layer shown in fig. 3, the prediction score score_CE of the training sample data for each sensitive information label can be obtained, as shown in the formula:
score_CE = σ(w·r_o + b)
where σ is the activation function and w and b are the fully-connected layer parameters.
Finally, after the prediction score score_CE of the training sample data for each sensitive information label is obtained, it can be input into the focal loss function of the embodiment of the present application, through which the relevance between the training sample data and the current sensitive information label is maximized.
In an alternative embodiment, after the training of the classification model is completed, the classification model may be tested based on the following ways:
acquiring a test sample data set, and executing a loop iteration test on the trained classification model according to the test sample in the test sample data set; wherein the following operations are executed in a loop iteration test process:
inputting the test sample into the trained classification model to obtain a corresponding second classification vector;
carrying out similarity comparison according to the second classification vector of the test sample and the second classification vectors of all the training samples to obtain a training sample with the highest similarity to the test sample;
and determining the sensitive information label corresponding to the test sample based on the sensitive information label corresponding to the training sample with the highest similarity, wherein the sensitive information label corresponding to the training sample is determined based on the prediction score of the training sample for each sensitive information label.
Specifically, firstly, a test sample is input into a trained classification model to obtain a second classification vector after coding, then the second classification vector of the test sample is compared with the second classification vector of each training sample in similarity to obtain a training sample with the highest similarity to the test sample, and a sensitive information label corresponding to the training sample is used as a sensitive information label of the test sample.
Fig. 7 is a schematic structural diagram of a classification model test in the embodiment of the present application, and the specific implementation process is as follows:
First, the encoded database information of the training sample data is used as a resource library. A test sample is input and encoded to obtain its encoding vector, i.e. the second classification vector in the present application. The encoding vector of the test sample is then compared with the encoding vectors of the training samples by semantic similarity to obtain a ranking of similarity scores, from which the training sample most similar to the test sample is found, and the sensitive information label corresponding to that training sample is taken as the sensitive information label of the test sample. When a new category is added, the similarity scores between the test sample and each training sample in the resource library are all low and no sufficiently similar training sample can be found; the new sample and its corresponding sensitive information label can then be added to the data resource library manually, so the model does not need to be retrained, which improves training efficiency and saves time.
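The similarity lookup described above can be sketched with cosine similarity over toy 2-dimensional encodings; the library contents, the min_score cutoff, and all names are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two encoding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_label(test_vec, library, min_score=0.5):
    """Look up the most similar training encoding in the resource library.

    library maps a sensitive-information label to its encoded vector.
    Returns (label, score); returns (None, score) when even the best match
    falls below min_score -- the "new category" case, where the sample
    would be added to the library manually instead of retraining.
    """
    best_label, best = None, -1.0
    for label, vec in library.items():
        s = cosine(test_vec, vec)
        if s > best:
            best_label, best = label, s
    if best < min_score:
        return None, best
    return best_label, best
```

In practice the vectors would be the [1, 256] second classification vectors produced by the trained model rather than hand-written pairs.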
Referring to fig. 8, it is a general flowchart of a data classification method in the embodiment of the present application, and the specific steps are as follows:
firstly, acquiring data information to be classified and graded;
further, classifying and grading the data based on the sensitive rule information base to obtain a first classification label (namely a rule result); and carrying out data classification and grading based on the classification model of the neural network to obtain a second classification label (namely the neural network result).
If the first classification label is matched based on the sensitive rule information base, firstly, whether the first classification label is consistent with the second classification label is judged:
if the two classification tags are consistent, the second classification tag is used as the target classification tag.
If the two are inconsistent, it is judged whether the confidence of the sensitive information label matched by the neural network is high enough, i.e. whether the score of the highest-scoring sensitive information label is greater than the threshold. If so, the neural network result is trusted and the second classification label is taken as the target classification label; if not, the matching result of the sensitive information rules is trusted and the first classification label is taken as the target classification label.
And if the first classification label is not matched based on the sensitive rule information base, directly believing the result matched based on the neural network, and taking the second classification label as a target classification label.
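The fusion flow of fig. 8 can be summarized in a few lines; the 0.7 threshold and the function name are illustrative assumptions:

```python
def fuse_labels(first_label, second_label, second_confidence, threshold=0.7):
    """Fuse the rule-based result with the neural-network result.

    first_label is None when no rule in the sensitive information base
    matched; second_label is the model's label and second_confidence its
    top score.
    """
    if first_label is None:
        return second_label          # no rule matched: trust the model
    if first_label == second_label:
        return second_label          # both agree
    # Disagreement: trust the model only when it is confident enough,
    # otherwise fall back to the interpretable rule result.
    return second_label if second_confidence > threshold else first_label
```

This is exactly the branch structure of steps S23/S24 above: on disagreement, a confident model wins; an unconfident model defers to the rules.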
Based on the same inventive concept, an embodiment of the present application further provides a data classification apparatus. As shown in fig. 9, which is a schematic structural diagram of the data classification apparatus 900, the apparatus may include:
an acquisition unit 901: the method comprises the steps of acquiring data information to be detected;
the matching unit 902: configured to match the data information to be detected with each sensitive information label in a preset sensitive information base to obtain a first classification label matched with the data information to be detected, and to classify the sensitive information of the data information to be detected based on the trained classification model to obtain a second classification label corresponding to the data information to be detected;
a first determination unit 903, configured to, if the first classification label is inconsistent with the second classification label and the confidence of the second classification label is not greater than a preset threshold, take the first classification label as the target classification label corresponding to the data information to be detected;
a second determination unit 904, configured to, if the first classification label is inconsistent with the second classification label and the confidence of the second classification label is greater than the preset threshold, take the second classification label as the target classification label corresponding to the data information to be detected.
Optionally, the matching unit 902 is specifically configured to:
if none of the sensitive information labels in the preset sensitive information base matches the data information to be detected, take the second classification label as the target classification label corresponding to the data information to be detected.
Optionally, the matching unit 902 is specifically configured to:
splitting data information to be detected into at least two dimensions of characteristic information, and respectively encoding the at least two dimensions of characteristic information to obtain first encoding vectors corresponding to the at least two dimensions of characteristic information;
splicing the first coding vectors to obtain a first coding matrix corresponding to the data information to be detected;
performing weighting processing based on a weight matrix and a first coding matrix to obtain a first classification vector corresponding to data information to be detected, wherein the weight matrix is used for representing the importance of characteristic information of at least two dimensions to classification of sensitive information;
classifying sensitive information in the data information to be detected based on the first classification vector, determining a prediction score of the sensitive information for each sensitive information label, and determining a second classification label corresponding to the data information to be detected based on each prediction score.
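The four steps above (split, encode, concatenate, weight, classify) can be sketched as follows. The feature dimensions, vector sizes, stand-in encoder, and softmax classifier head are illustrative assumptions; the embodiment does not fix these details:

```python
import numpy as np

def encode(feature, dim=8):
    """Hypothetical stand-in encoder: map one feature string to a
    fixed-length first encoding vector (deterministic per feature)."""
    seed = abs(hash(feature)) % (2 ** 32)
    return np.random.default_rng(seed).standard_normal(dim)

def classify(record, weight_matrix, class_weights):
    # 1. Split the data information into feature information of several
    #    dimensions and encode each dimension to a first encoding vector.
    features = [record["field_name"], record["field_value"], record["table_name"]]
    encoding_matrix = np.stack([encode(f) for f in features])   # (3, dim)
    # 2. Weight the first encoding matrix by the importance of each
    #    dimension to the sensitive-information classification.
    weighted = weight_matrix @ encoding_matrix                  # (3, dim)
    classification_vector = weighted.reshape(-1)                # first classification vector
    # 3. Score every sensitive information label; the highest score
    #    determines the second classification label.
    scores = class_weights @ classification_vector
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs
```

A usage example: with a three-dimension record (field name, field value, table name), an identity weight matrix, and a random label head, `classify` returns a label index and a normalized score per sensitive information label.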
Optionally, the matching unit 902 is further configured to determine the weight matrix by:
and performing attention characteristic extraction according to the first self-attention mechanism matrix and the first encoding matrix to obtain a weight matrix, wherein the first self-attention mechanism matrix is used for representing the context relationship between the characteristic information of at least two dimensions.
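The attention feature extraction that yields the weight matrix can be sketched with a scaled dot-product form. The exact parameterisation of the first self-attention mechanism matrix is not specified in the embodiment; here a single learned matrix plays that role, which is an assumption:

```python
import numpy as np

def attention_weights(encoding_matrix, attn_matrix):
    """encoding_matrix: (n_dims, d) first encoding matrix;
    attn_matrix: (d, d) first self-attention mechanism matrix (learned).
    Returns an (n_dims, n_dims) weight matrix capturing the context
    relationship between the feature dimensions."""
    d = encoding_matrix.shape[1]
    # Scaled dot-product scores between every pair of feature dimensions.
    scores = (encoding_matrix @ attn_matrix) @ encoding_matrix.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
```

Each row of the result sums to one, so weighting the encoding matrix by it mixes each dimension's vector with context from the other dimensions.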
Optionally, the matching unit 902 is specifically configured to:
acquiring a training sample data set, executing cycle iterative training on an initial classification model according to training samples in the training sample data set, and outputting the trained classification model when the training is finished, wherein each training sample comprises sample data information and a real score obtained by classifying sensitive information of the sample data information; wherein the following operations are executed in a loop iteration training process:
selecting at least one training sample from a training sample data set, inputting the selected training sample into a classification model, and obtaining the prediction score of the training sample output by the classification model for each sensitive information label;
constructing a focus loss function based on the prediction scores of the training samples for each sensitive information label and the difference between the corresponding real scores;
parameters of the classification model are adjusted based on the focus loss function.
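The "focus loss function" built from the prediction and real scores can be sketched with the standard focal-loss formulation (assumed here, since the embodiment only names the loss): easy examples are down-weighted by a factor that shrinks as the prediction approaches the real score, focusing training on hard or rare sensitive information labels.

```python
import numpy as np

def focal_loss(pred_scores, true_scores, gamma=2.0, eps=1e-7):
    """pred_scores / true_scores: (n_labels,) predicted scores in [0, 1]
    and 0/1 real scores per sensitive information label.
    gamma controls how strongly easy examples are down-weighted."""
    p = np.clip(pred_scores, eps, 1.0 - eps)
    # Per-label binary focal term, summed over all sensitive information labels.
    loss = -(true_scores * (1 - p) ** gamma * np.log(p)
             + (1 - true_scores) * p ** gamma * np.log(1 - p))
    return float(loss.sum())
```

With `gamma = 0` this reduces to plain binary cross-entropy; larger `gamma` makes confident correct predictions contribute almost nothing to the gradient.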
Optionally, the matching unit 902 is specifically configured to:
splitting the training sample into feature information of at least two dimensions, inputting the feature information of each dimension into the classification model respectively, and encoding the feature information of each dimension based on an embedding layer in the classification model to obtain second encoding vectors corresponding to the feature information of the at least two dimensions;
splicing the second coding vectors to obtain second coding features corresponding to the training samples;
inputting the second coding features into a full-link layer of the classification model, and extracting attention features of the second coding features based on a second self-attention mechanism matrix corresponding to the full-link layer to obtain corresponding second classification vectors of the training samples;
and classifying the sensitive information in the training samples based on the second classification vector, and obtaining the prediction score of the training samples output by the classification model for each sensitive information label.
Optionally, the apparatus further comprises a test unit 905 for:
acquiring a test sample data set, and executing a loop iteration test on the trained classification model according to the test samples in the test sample data set; wherein the following operations are executed in a loop iteration test process:
inputting the test sample into the trained classification model to obtain a corresponding second classification vector;
carrying out similarity comparison according to the second classification vector of the test sample and the second classification vectors of all the training samples to obtain a training sample with the highest similarity to the test sample;
and determining the sensitive information label corresponding to the test sample based on the sensitive information label corresponding to the training sample with the highest similarity, wherein the sensitive information label corresponding to the training sample is determined based on the prediction score of the training sample for each sensitive information label.
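The similarity comparison in the test loop can be sketched as a nearest-neighbour lookup over second classification vectors. Cosine similarity is an assumption; the embodiment only says "similarity comparison":

```python
import numpy as np

def nearest_label(test_vec, train_vecs, train_labels):
    """test_vec: second classification vector of the test sample;
    train_vecs: (n_train, d) second classification vectors of all training
    samples; train_labels: their sensitive information labels.
    Returns the label of the most similar training sample."""
    test_vec = test_vec / np.linalg.norm(test_vec)
    norms = np.linalg.norm(train_vecs, axis=1)
    sims = (train_vecs @ test_vec) / norms   # cosine similarity per training sample
    return train_labels[int(np.argmax(sims))]
```

A test sample is thus assigned the sensitive information label of whichever training sample its classification vector points closest to.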
For convenience of description, the above parts are described separately as modules (or units) divided by function. Of course, when implementing the present application, the functionality of the various modules (or units) may be implemented in one or more pieces of software or hardware.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
In some possible embodiments, a data classification apparatus according to the present application may include at least a processor and a memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the data classification method according to various exemplary embodiments of the present application described in the specification. For example, the processor may perform the steps as shown in fig. 2.
Based on the same inventive concept as the method embodiments, an embodiment of the present application further provides an electronic device. In one embodiment, the electronic device may be a server, such as server 120 shown in FIG. 1. In this embodiment, the electronic device may be configured as shown in fig. 10, and include a memory 1001, a communication module 1003, and one or more processors 1002.
A memory 1001 for storing computer programs executed by the processor 1002. The memory 1001 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
Memory 1001 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 1001 may also be a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory (flash memory), a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1001 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1001 may be a combination of the above memories.
The processor 1002 may include one or more Central Processing Units (CPUs), a digital processing unit, and the like. The processor 1002 is configured to implement the data classification method when calling the computer program stored in the memory 1001.
The communication module 1003 is used for communicating with the terminal device and other servers.
In the embodiment of the present application, the specific connection medium among the memory 1001, the communication module 1003, and the processor 1002 is not limited. In the embodiment of the present application, the memory 1001 and the processor 1002 are connected by a bus 1004 in fig. 10, the bus 1004 is represented by a thick line in fig. 10, and the connection manner between other components is merely illustrative and is not limited thereto. The bus 1004 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
The memory 1001 stores therein a computer storage medium, and the computer storage medium stores therein computer-executable instructions for implementing the data classification method according to the embodiment of the present application. The processor 1002 is configured to perform the data classification method described above, such as the steps shown in fig. 2.
In another embodiment, the electronic device may also be other electronic devices, such as the terminal device 110 shown in fig. 1. In this embodiment, the structure of the electronic device may be as shown in fig. 11, including: communications component 1110, memory 1120, display unit 1130, camera 1140, sensor 1150, audio circuit 1160, bluetooth module 1170, processor 1180, and the like.
The communication component 1110 is configured to communicate with a server. In some embodiments, a Wireless Fidelity (WiFi) module may be included; WiFi is a short-range wireless transmission technology, through which the electronic device may help the user send and receive information.
The memory 1120 may be used to store software programs and data. The processor 1180 performs various functions of the terminal device 110 and data processing by executing software programs or data stored in the memory 1120. The memory 1120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. The memory 1120 stores an operating system that enables the terminal device 110 to operate. The memory 1120 may store an operating system and various application programs, and may also store codes for performing the data classification method according to the embodiment of the present application.
The display unit 1130 may also be used to display information input by the user or information provided to the user and a Graphical User Interface (GUI) of various menus of the terminal apparatus 110. Specifically, the display unit 1130 may include a display screen 1132 disposed on the front surface of the terminal device 110. The display screen 1132 may be configured in the form of a liquid crystal display, a light emitting diode, or the like. The display unit 1130 may be used to display an interface related to multimedia information recommendation in the embodiment of the present application, and the like.
The display unit 1130 may also be used to receive input numeric or character information and generate signal input related to user settings and function control of the terminal apparatus 110, and specifically, the display unit 1130 may include a touch screen 1131 disposed on the front surface of the terminal apparatus 110 and may collect touch operations of a user thereon or nearby, such as clicking a button, dragging a scroll box, and the like.
The touch screen 1131 may be covered on the display screen 1132, or the touch screen 1131 and the display screen 1132 may be integrated to implement the input and output functions of the terminal device 110, and after the integration, the touch screen may be referred to as a touch display screen for short. The display unit 1130 in the present application may display the application programs and the corresponding operation steps.
Camera 1140 may be used to capture still images and a user may upload comments from an image captured by camera 1140 via a video client. The number of the cameras 1140 may be one or more. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing elements convert the light signals into electrical signals, which are then passed to the processor 1180 for conversion into digital image signals.
The terminal device may further comprise at least one sensor 1150, such as an acceleration sensor 1151, a distance sensor 1152, a fingerprint sensor 1153, a temperature sensor 1154. The terminal device may also be configured with other sensors such as a gyroscope, barometer, hygrometer, thermometer, infrared sensor, light sensor, motion sensor, and the like.
Audio circuitry 1160, speakers 1161, and microphone 1162 may provide an audio interface between a user and terminal device 110. The audio circuit 1160 may transmit the electrical signal converted from the received audio data to the speaker 1161, and convert the electrical signal into a sound signal for output by the speaker 1161. Terminal device 110 may also be configured with a volume button for adjusting the volume of the sound signal. On the other hand, the microphone 1162 converts the collected sound signals into electrical signals, which are received by the audio circuit 1160 and converted into audio data, which is then output to the communication assembly 1110 for transmission to, for example, another terminal device 110, or to the memory 1120 for further processing.
The bluetooth module 1170 is used for performing information interaction with other bluetooth devices having bluetooth modules through a bluetooth protocol. For example, the terminal device may establish a bluetooth connection with a wearable electronic device (e.g., a smart watch) that is also equipped with a bluetooth module via the bluetooth module 1170, so as to perform data interaction.
The processor 1180 is a control center of the terminal device, connects various parts of the entire terminal device using various interfaces and lines, and performs various functions of the terminal device and processes data by running or executing software programs stored in the memory 1120 and calling data stored in the memory 1120. In some embodiments, processor 1180 may include one or more processing units; the processor 1180 may also integrate an application processor, which primarily handles operating systems, user interfaces, application programs, and the like, and a baseband processor, which primarily handles wireless communications. It will be appreciated that the baseband processor described above may not be integrated into the processor 1180. In the present application, the processor 1180 may run an operating system, an application program, a user interface display, a touch response, and the data classification method according to the embodiment of the present application. Additionally, the processor 1180 is coupled to the display unit 1130.
In some possible embodiments, the various aspects of the data classification method provided herein may also be implemented in the form of a program product comprising program code for causing a computer device to perform the steps of the data classification method according to various exemplary embodiments of the present application described above in this specification when the program product is run on a computer device, for example the computer device may perform the steps as shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the data classification method of the embodiments of the present application may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with a command execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functions of two or more units described above may be embodied in one unit, according to embodiments of the application. Conversely, the features and functions of one unit described above may be further divided into embodiments by a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of data classification, the method comprising:
acquiring data information to be detected, wherein the data information to be detected is database data information containing sensitive information;
matching the data information to be detected with each sensitive information label in a preset sensitive information base respectively to obtain a first classification label matched with the data information to be detected, and classifying the sensitive information of the data information to be detected based on a trained classification model to obtain a second classification label corresponding to the data information to be detected;
if the first classification label is inconsistent with the second classification label and the confidence coefficient of the second classification label is not greater than a preset threshold value, taking the first classification label as a target classification label corresponding to the data information to be detected;
and if the first classification label is inconsistent with the second classification label and the confidence coefficient of the second classification label is greater than a preset threshold value, taking the second classification label as a target classification label corresponding to the data information to be detected.
2. The method of claim 1, wherein the matching of the data information to be detected with each sensitive information tag in a preset sensitive information base respectively further comprises:
and if all the sensitive information labels of the preset sensitive information base are not matched with the data information to be detected, using the second classification label as a target classification label corresponding to the data information to be detected.
3. The method according to claim 1, wherein the classifying sensitive information of the data information to be detected based on the trained classification model to obtain a second classification label corresponding to the data information to be detected specifically comprises:
splitting the data information to be detected into at least two dimensions of characteristic information, and respectively encoding the at least two dimensions of characteristic information to obtain first encoding vectors corresponding to the at least two dimensions of characteristic information;
splicing the first coding vectors to obtain a first coding matrix corresponding to the data information to be detected;
performing weighting processing based on a weight matrix and the first coding matrix to obtain a first classification vector corresponding to the data information to be detected, wherein the weight matrix is used for representing the importance of the characteristic information of the at least two dimensions to the classification of the sensitive information;
classifying the sensitive information in the data information to be detected based on the first classification vector, determining the prediction score of the sensitive information for each sensitive information label, and determining the second classification label corresponding to the data information to be detected based on each prediction score.
4. The method of claim 3, wherein the weight matrix is obtained by:
and performing attention characteristic extraction according to a first self-attention mechanism matrix and the first encoding matrix to obtain the weight matrix, wherein the first self-attention mechanism matrix is used for representing the context relationship between the characteristic information of the at least two dimensions.
5. The method of any of claims 1 to 4, wherein the trained classification model is trained by:
acquiring a training sample data set, executing cycle iterative training on an initial classification model according to training samples in the training sample data set, and outputting the trained classification model when the training is finished, wherein each training sample comprises sample data information and a real score obtained by classifying sensitive information of the sample data information; wherein the following operations are executed in a loop iteration training process:
selecting at least one training sample from the training sample data set, inputting the selected training sample into the classification model, and obtaining the prediction score of the training sample output by the classification model for each sensitive information label;
constructing a focus loss function based on the predicted scores of the training samples for each sensitive information label and the difference between the corresponding real scores;
adjusting parameters of the classification model based on the focus loss function.
6. The method of claim 5, wherein the inputting the selected training sample into the classification model and obtaining the prediction score of the training sample output by the classification model for each sensitive information label specifically comprises:
splitting the training sample into feature information of at least two dimensions, inputting the feature information of each dimension into the classification model respectively, and encoding the feature information of each dimension based on an embedding layer in the classification model to obtain second encoding vectors corresponding to the feature information of the at least two dimensions;
splicing the second coding vectors to obtain second coding features corresponding to the training samples;
inputting the second coding features into a full-connection layer of the classification model, and performing attention feature extraction on the second coding features based on a second self-attention mechanism matrix corresponding to the full-connection layer to obtain corresponding second classification vectors of the training samples;
and classifying the sensitive information in the training samples based on the second classification vector, and obtaining the prediction score of the training samples output by the classification model for each sensitive information label.
7. The method of claim 5, wherein the method further comprises:
acquiring a test sample data set, and executing a loop iteration test on the trained classification model according to the test sample in the test sample data set; wherein the following operations are executed in a loop iteration test process:
inputting the test sample into the trained classification model to obtain a corresponding second classification vector;
carrying out similarity comparison according to the second classification vector of the test sample and the second classification vectors of all the training samples to obtain a training sample with the highest similarity to the test sample;
and determining the sensitive information label corresponding to the test sample based on the sensitive information label corresponding to the training sample with the highest similarity, wherein the sensitive information label corresponding to the training sample is determined based on the prediction score of the training sample for each sensitive information label.
8. The method of claim 3 or 4, wherein the dimensions of the feature information include at least two of: field name, field value, database table name, data type.
9. A data classification apparatus, characterized in that the apparatus comprises:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring data information to be detected, and the data information to be detected is database data information containing sensitive information;
the matching unit is used for respectively matching the data information to be detected with each sensitive information label in a preset sensitive information base to obtain a first classification label matched with the data information to be detected, and classifying the sensitive information of the data information to be detected based on a trained classification model to obtain a second classification label corresponding to the data information to be detected;
a first determining unit, configured to, if the first classification label and the second classification label are inconsistent and the confidence of the second classification label is not greater than a preset threshold, take the first classification label as the target classification label corresponding to the data information to be detected;
and a second determining unit, configured to, if the first classification label and the second classification label are inconsistent and the confidence of the second classification label is greater than the preset threshold, take the second classification label as the target classification label corresponding to the data information to be detected.
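The arbitration between the rule-matched label and the model-predicted label can be sketched as below. The threshold value and the handling of the consistent case are assumptions: the claims only say "preset threshold", and they do not recite the case where the two labels agree (here the shared label is simply returned).

```python
def select_target_label(first_label, second_label, second_confidence,
                        threshold=0.8):
    """Combine the rule-matched label (first) with the model label (second):
    on disagreement, trust the model only if its confidence exceeds the
    preset threshold; otherwise fall back to the rule match.
    threshold=0.8 is illustrative only."""
    if first_label == second_label:
        # Agreement case (not covered by these claims): either label works.
        return first_label
    # Disagreement: high-confidence model prediction wins, else the rule match.
    return second_label if second_confidence > threshold else first_label
```

The design intent appears to be that the dictionary-style sensitive-information base acts as a safe default, with the learned model overriding it only when it is sufficiently confident.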
10. A computer-readable storage medium, wherein the storage medium comprises program code which, when the program code is run on an electronic device, causes the electronic device to perform the steps of the method according to any one of claims 1 to 8.
CN202111140419.3A 2021-09-28 2021-09-28 Data classification method and device and storage medium Pending CN113868497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111140419.3A CN113868497A (en) 2021-09-28 2021-09-28 Data classification method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111140419.3A CN113868497A (en) 2021-09-28 2021-09-28 Data classification method and device and storage medium

Publications (1)

Publication Number Publication Date
CN113868497A true CN113868497A (en) 2021-12-31

Family

ID=78991612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111140419.3A Pending CN113868497A (en) 2021-09-28 2021-09-28 Data classification method and device and storage medium

Country Status (1)

Country Link
CN (1) CN113868497A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565030A (en) * 2022-02-17 2022-05-31 北京百度网讯科技有限公司 Feature screening method and device, electronic equipment and storage medium
CN114565030B (en) * 2022-02-17 2022-12-20 北京百度网讯科技有限公司 Feature screening method and device, electronic equipment and storage medium
CN114465823A (en) * 2022-04-08 2022-05-10 杭州海康威视数字技术股份有限公司 Industrial Internet terminal encrypted flow data security detection method, device and equipment
CN114465823B (en) * 2022-04-08 2022-08-19 杭州海康威视数字技术股份有限公司 Industrial Internet terminal encrypted flow data security detection method, device and equipment
CN115470198A (en) * 2022-08-11 2022-12-13 北京百度网讯科技有限公司 Database information processing method and device, electronic equipment and storage medium
CN115470198B (en) * 2022-08-11 2023-09-22 北京百度网讯科技有限公司 Information processing method and device of database, electronic equipment and storage medium
CN115081629A (en) * 2022-08-16 2022-09-20 杭州比智科技有限公司 Deep learning method and system for sensitive data discovery and identification
CN116108393A (en) * 2023-04-12 2023-05-12 国网智能电网研究院有限公司 Power sensitive data classification and classification method and device, storage medium and electronic equipment
CN116415103A (en) * 2023-06-09 2023-07-11 之江实验室 Data processing method, device, storage medium and electronic equipment
CN116415103B (en) * 2023-06-09 2023-09-05 之江实验室 Data processing method, device, storage medium and electronic equipment
CN117033889A (en) * 2023-08-02 2023-11-10 瀚能科技有限公司 Smart park production data statistics method and related device
CN117033889B (en) * 2023-08-02 2024-04-05 瀚能科技有限公司 Smart park production data statistics method and related device

Similar Documents

Publication Publication Date Title
CN113868497A (en) Data classification method and device and storage medium
US20210012198A1 (en) Method for training deep neural network and apparatus
WO2022016556A1 (en) Neural network distillation method and apparatus
CN111783903B (en) Text processing method, text model processing method and device and computer equipment
CN114283316A (en) Image identification method and device, electronic equipment and storage medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN114238690A (en) Video classification method, device and storage medium
CN114707513A (en) Text semantic recognition method and device, electronic equipment and storage medium
WO2019116352A1 (en) Scalable parameter encoding of artificial neural networks obtained via an evolutionary process
CN115269786B (en) Interpretable false text detection method and device, storage medium and terminal
CN113379045B (en) Data enhancement method and device
Archilles et al. Vision: a web service for face recognition using convolutional network
CN113919361A (en) Text classification method and device
CN113297525A (en) Webpage classification method and device, electronic equipment and storage medium
WO2023231753A1 (en) Neural network training method, data processing method, and device
CN114238968A (en) Application program detection method and device, storage medium and electronic equipment
CN114282094A (en) Resource ordering method and device, electronic equipment and storage medium
CN113569081A (en) Image recognition method, device, equipment and storage medium
CN114970494A (en) Comment generation method and device, electronic equipment and storage medium
CN113704460B (en) Text classification method and device, electronic equipment and storage medium
CN117392379B (en) Method and device for detecting target
CN117216373A (en) Article recommendation method and device, electronic equipment and storage medium
Sunitha et al. Identification of Bird Species Using Deep Learning
CN116909911A (en) Code similarity detection method, device, equipment and storage medium
CN114492750A (en) Training method, device, equipment and medium of feedback information prediction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination