CN117009509A - Data security classification method, apparatus, device, storage medium and program product - Google Patents



Publication number
CN117009509A
Authority
CN
China
Prior art keywords
field
data
subclass
identification
security
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211479544.1A
Other languages
Chinese (zh)
Inventor
张龙
王雁鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211479544.1A priority Critical patent/CN117009509A/en
Publication of CN117009509A publication Critical patent/CN117009509A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to a data security classification method, apparatus, computer device, storage medium and computer program product. The method relates to natural language processing technology in artificial intelligence and comprises the following steps: determining a target field to be security classified, wherein the target field is a field in a data table, and collecting identification data required for classifying the target field; traversing preset field subclasses and, on reaching each field subclass, applying the identification rule configured for that field subclass to the identification data to obtain an identification result indicating whether the target field matches that field subclass; determining the field subclass to which the target field belongs according to the identification results of the field subclasses for the target field; and determining the security class and the security level of the target field according to the security class corresponding to that field subclass and the security level corresponding to that security class. The method can improve the efficiency and accuracy of data security classification.

Description

Data security classification method, apparatus, device, storage medium and program product
Technical Field
The present application relates to the field of computer technology, and in particular, to a data security classification method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of computer technology, many kinds of data are used in production to generate social benefits; such data may be referred to as data assets. Data assets are data resources that are legally owned or controlled by organizations (government agencies, enterprises, public institutions, etc.) and recorded electronically or in other ways, such as text, images, voice, video, web pages, databases, and sensor signals. Whether structured or unstructured, such data can be measured or traded and can directly or indirectly bring economic and social benefits.
Data asset management refers to a set of functions for planning, controlling, and supplying data assets. It includes data security classification and grading, which builds a framework for data security risk protection and thereby supports the formulation of security policies for data opening and sharing.
At present, security classes and levels are labeled manually after historical data has been sorted out. This is inefficient, and because different people understand the classification standards differently, the accuracy of the classification and grading cannot be guaranteed.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data security classification method, apparatus, computer device, computer readable storage medium, and computer program product that can improve the efficiency and accuracy of data security classification.
In a first aspect, the present application provides a data security classification method. The method comprises the following steps:
determining a target field to be subjected to security grading, wherein the target field is a field in a data table;
collecting identification data required for classifying the target field, wherein the identification data comprises at least one of metadata of the data table or sample data of the target field in the data table;
traversing preset field subclasses, and identifying the identification data through an identification rule configured for the field subclasses to obtain an identification result of whether the target field is matched with the field subclasses or not when traversing one field subclass;
determining a field subclass to which the target field belongs according to the identification result of each field subclass corresponding to the target field;
and determining the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class.
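The steps above can be illustrated with a minimal Python sketch. It is not the implementation of the application: the two field subclasses, their identification rules, the security class name, the level name, and the "first match wins" tie-break are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IdentificationData:
    """Identification data for one target field (attribute names are illustrative)."""
    field_name: str
    field_comment: str
    samples: list


def looks_like_mobile(data: IdentificationData) -> bool:
    # Hypothetical rule: the name or description hints at a phone number
    # and every sample value is "1" followed by ten digits.
    hinted = "mobile" in data.field_name.lower() or "phone" in data.field_comment.lower()
    pattern = re.compile(r"1\d{10}")
    return hinted and bool(data.samples) and all(pattern.fullmatch(s) for s in data.samples)


def looks_like_email(data: IdentificationData) -> bool:
    # Hypothetical rule: every sample value has the shape local@domain.tld.
    pattern = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
    return bool(data.samples) and all(pattern.fullmatch(s) for s in data.samples)


@dataclass
class FieldSubclass:
    name: str
    security_class: str  # the security class this subclass is assigned to
    rule: Callable[[IdentificationData], bool]


# Assumed configuration: subclass -> security class -> security level.
FIELD_SUBCLASSES = [
    FieldSubclass("mobile phone number", "personal contact information", looks_like_mobile),
    FieldSubclass("email address", "personal contact information", looks_like_email),
]
SECURITY_LEVEL_OF_CLASS = {"personal contact information": "high-sensitivity"}


def classify_field(data: IdentificationData) -> Optional[tuple]:
    # Traverse every preset subclass and record whether its rule matches.
    results = {sub.name: sub.rule(data) for sub in FIELD_SUBCLASSES}
    matched = [sub for sub in FIELD_SUBCLASSES if results[sub.name]]
    if not matched:
        return None
    chosen = matched[0]  # assumption: the first matching subclass wins
    level = SECURITY_LEVEL_OF_CLASS[chosen.security_class]
    return chosen.name, chosen.security_class, level
```

Because the security class and level are looked up from the subclass rather than decided by the rule itself, changing a grading standard in this sketch only requires editing the mapping tables, not the rules.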
In a second aspect, the application further provides a data security grading device. The device comprises:
the determining module is used for determining a target field to be subjected to security classification, wherein the target field is a field in a data table;
the collection module is used for collecting identification data required for classifying the target field, wherein the identification data comprises at least one of metadata of the data table or sample data of the target field in the data table;
the identification module is used for traversing a preset field subclass, and identifying the identification data through an identification rule configured for the field subclass to obtain an identification result of whether the target field is matched with the field subclass or not when traversing to one field subclass;
the decision module is used for determining the field subclass to which the target field belongs according to the identification result of each field subclass corresponding to the target field; and determining the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
determining a target field to be subjected to security grading, wherein the target field is a field in a data table;
collecting identification data required for classifying the target field, wherein the identification data comprises at least one of metadata of the data table or sample data of the target field in the data table;
traversing preset field subclasses, and identifying the identification data through an identification rule configured for the field subclasses to obtain an identification result of whether the target field is matched with the field subclasses or not when traversing one field subclass;
determining a field subclass to which the target field belongs according to the identification result of each field subclass corresponding to the target field;
and determining the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
determining a target field to be subjected to security grading, wherein the target field is a field in a data table;
collecting identification data required for classifying the target field, wherein the identification data comprises at least one of metadata of the data table or sample data of the target field in the data table;
traversing preset field subclasses, and identifying the identification data through an identification rule configured for the field subclasses to obtain an identification result of whether the target field is matched with the field subclasses or not when traversing one field subclass;
determining a field subclass to which the target field belongs according to the identification result of each field subclass corresponding to the target field;
and determining the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
determining a target field to be subjected to security grading, wherein the target field is a field in a data table;
collecting identification data required for classifying the target field, wherein the identification data comprises at least one of metadata of the data table or sample data of the target field in the data table;
traversing preset field subclasses, and identifying the identification data through an identification rule configured for the field subclasses to obtain an identification result of whether the target field is matched with the field subclasses or not when traversing one field subclass;
determining a field subclass to which the target field belongs according to the identification result of each field subclass corresponding to the target field;
and determining the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class.
According to the data security classification method, apparatus, computer device, storage medium and computer program product described above, a corresponding identification rule is configured for each preset field subclass. When the classification and grading standards of an industry change, the corresponding identification rules can be adjusted and take effect immediately, so that when a target field in a data table needs to be security classified and graded, the efficiency of classifying and grading the field is improved by promptly extending the identification rules. Moreover, because an identification rule is configured for each field subclass and can be adjusted according to its identification effect, the identification accuracy can be significantly improved. In addition, there is no need to configure rules that determine the security class and level directly; such rules cover limited scenarios and rely on limited knowledge of the data, so their classification and grading of some data is inaccurate. Instead, field subclasses are divided according to service scenarios, and each field subclass is assigned to a corresponding security class and security level. The field subclass to which the target field belongs can then be determined from the identification results of the field subclasses for the target field, and the security class and security level of the target field are determined from the security class to which that field subclass belongs and the security level corresponding to that security class, which improves the accuracy of data security classification and grading.
Drawings
FIG. 1 is an application environment diagram of a data security classification method in one embodiment;
FIG. 2 is a flow diagram of data security classification in one embodiment;
FIG. 3 is a flow diagram of a data security classification method in one embodiment;
FIG. 4 is a schematic diagram of a classification management interface in one embodiment;
FIG. 5 is a schematic diagram of a grading management interface in one embodiment;
FIG. 6 is a schematic diagram of a field management interface in one embodiment;
FIG. 7 is a schematic diagram of an identification rule configuration interface in one embodiment;
FIG. 8 is a schematic diagram of an interface for configuring identification rules in one embodiment;
FIG. 9 is a timing diagram of data security classification and grading in one embodiment;
FIG. 10 is a schematic diagram of an architecture of a data security classification system in one embodiment;
FIG. 11 is a schematic diagram of the logic and data flow of a data security classification method in one embodiment;
FIG. 12 is a schematic diagram of a classification confirmation interface in one embodiment;
FIG. 13 is a block diagram of a data security classification apparatus in one embodiment;
FIG. 14 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The application relates to the following concepts:
Data asset: a data resource that is owned or controlled by an individual or enterprise, is recorded physically or electronically, and is expected to bring future economic benefits to the enterprise.
Data security: the ability to ensure, by taking necessary measures, that data is effectively protected and lawfully used, and to keep it continuously in a secure state. Security must be ensured throughout the whole process of data processing, which includes data collection, storage, use, processing, transmission, provision, disclosure and the like.
Data security classification: by quantifying the security grading results of data tables, data tables that are top-secret, confidential, high-sensitivity, medium-sensitivity or low-sensitivity are identified, in preparation for issuing targeted data protection policies.
Text classification: text is classified into one or more of a plurality of categories based on given text content. The process is roughly divided into text preprocessing, text feature extraction, classification model construction and the like.
The data security classification method provided by the embodiments of the present application may involve natural language processing (NLP) in artificial intelligence. Natural language processing is an important direction in computer science and artificial intelligence; it studies theories and methods that enable effective communication between humans and computers in natural language. It is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Deep learning-based natural language processing technology: deep learning is a major branch of machine learning. In natural language processing, deep learning models such as convolutional neural networks and recurrent neural networks are applied to learn from the generated word vectors in order to complete natural language classification and understanding.
Data security has become a hot topic in the security field. Protecting the security of user data is a very important responsibility for enterprises and institutions, but it also presents difficulties: what the core data is, where it is stored, how complete and compliant data reporting is, whether data transfer, use, and sharing are compliant, and so on. Data security classification and grading is the foundation for building security capabilities across the data life cycle: it allows the location of sensitive data to be identified accurately so that relevant protection policies can be made.
In the related art, for data security classification and classification, there are mainly the following ways:
1. Manual labeling: historical data is sorted out manually and its security class and level are judged subjectively. This is clearly inefficient, and because each person understands the classification standards differently, the accuracy of the classification and grading cannot be guaranteed.
2. Direct rules: a number of rules, each covering a narrow scenario, are set to identify the security class and level directly, for example by recognizing highly distinctive features such as identity card numbers, mobile phone numbers, and email addresses. These rules cover limited scenarios and make decisions based on limited knowledge of the data (for example, whether a detected identity card number is confidential), so they are prone to omissions or false alarms, and the maintenance workload is high.
3. Classification model: the classification information of part of the tables is labeled, and a classification model is used to perform classification and grading. Because only part of the tables/fields are labeled to construct the training samples of the classification model, and the model is then used to predict the classes and levels of other tables/fields, interpretability is poor, and when the security objectives change, updating and migration cannot be completed in a lightweight manner.
According to the data security classification method provided by the embodiments of the present application, a corresponding identification rule is configured for each preset field subclass. When the classification and grading standards of an industry change, the corresponding identification rules can be adjusted and take effect immediately, so that when a target field in a data table needs to be security classified and graded, the efficiency of classifying and grading the field is improved by promptly extending the identification rules. Moreover, because an identification rule is configured for each field subclass and can be adjusted according to its identification effect, the identification accuracy can be significantly improved. In addition, there is no need to configure rules that determine the security class and level directly; such rules cover limited scenarios and rely on limited knowledge of the data, so their classification and grading of some data is inaccurate. Instead, field subclasses are divided according to service scenarios, and each field subclass is assigned to a corresponding security class and security level. The field subclass to which the target field belongs can then be determined from the identification results of the field subclasses for the target field, and the security class and security level of the target field are determined from the security class to which that field subclass belongs and the security level corresponding to that security class, which improves the accuracy of data security classification and grading. In this way, the efficiency, accuracy, and extensibility of data security classification and grading can be improved.
The data security classification method provided by the embodiments of the present application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be located on the cloud or on other servers. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, Internet of Things device, or portable wearable device. The Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, or the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, the server 104 may determine a target field to be security classified, the target field being a field in a data table; collecting identification data required for classifying the target field, wherein the identification data comprises at least one of metadata of a data table or sample data of the target field in the data table; traversing preset field subclasses, and identifying the identification data through an identification rule configured for the field subclasses to obtain an identification result of whether the target field is matched with the field subclasses when traversing to one field subclass; determining the field subclass to which the target field belongs according to the identification result of each field subclass corresponding to the target field; and determining the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class. The server 104 may be a background server of the data security classification system, and the server 104 may feed the security classification and classification result back to the terminal 102, and display the security classification and classification result through the terminal 102.
Optionally, the terminal 102 may further display a classification confirmation interface. Based on the classification and grading result fed back by the server 104, the interface displays the target field and its classification and grading result for confirmation by the user. The interface further includes controls with which the user can edit the classification and grading result of a field or confirm it.
FIG. 2 is a schematic flow diagram of data security classification in one embodiment, and referring to FIG. 2, the flow includes the following steps:
First, metadata is collected: in the metadata collection stage, the metadata information of a data table is pulled. Next, sample data is collected: in the sample data collection stage, corresponding query statements are generated from the metadata of the previous stage and sent to the corresponding cluster to pull sample data. Then, rule scanning is performed: in the rule scanning stage, the metadata and the sample data are classified and graded by using the configured identification rules (identification based on a pattern matching algorithm or identification based on a natural language processing model). Then, manual correction is performed: in the manual correction stage, a user can correct and update the obtained classification and grading result, and these corrections can in turn be used to optimize the pattern matching algorithm and the natural language processing model so that the resulting classification and grading become more accurate. Finally, the classification and grading results are managed: in the result management stage, the final classification and grading results are displayed so that the user can query them conveniently, and an interface is provided through which third-party systems can pull the results.
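As a rough illustration of how the five stages fit together, the following Python sketch chains them as plain functions. The function name, the callable parameters, and the choice to let manual corrections simply override scanned results are assumptions made for this sketch; the feedback of corrections into the matching algorithm or model is not modeled.

```python
from typing import Callable, Dict


def run_classification_flow(
    table: str,
    collect_metadata: Callable[[str], dict],
    collect_samples: Callable[[dict], dict],
    scan_rules: Callable[[dict, dict], Dict[str, str]],
    manual_corrections: Dict[str, str],
) -> Dict[str, str]:
    """Return a mapping of field name -> security level for one data table."""
    metadata = collect_metadata(table)        # stage 1: metadata collection
    samples = collect_samples(metadata)       # stage 2: sample data collection
    scanned = scan_rules(metadata, samples)   # stage 3: rule scanning
    # Stage 4: manual correction. Assumption: a user correction replaces
    # the scanned result for the same field.
    final = {**scanned, **manual_corrections}
    # Stage 5: result management. Here the final results are only returned;
    # a real system would store them and expose a query interface.
    return final
```

Passing the stages in as callables keeps the sketch independent of any particular metadata store, cluster client, or rule engine.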
Fig. 3 is a flow chart of a data security classification method provided by the application. The execution body in the embodiment of the application can be one computer device or a computer device cluster formed by a plurality of computer devices. The computer device may be a server or a terminal. Therefore, the execution body in the embodiment of the application can be a server, a terminal or a server and a terminal. In one embodiment, as shown in fig. 3, a data security classification method is provided, which is illustrated by using the method applied to the computer device (the terminal 102 or the server 104) in fig. 1 as an example, and includes the following steps:
Step 302, determining a target field to be security classified, wherein the target field is a field in a data table.
A data table is one of the most important components of a database in which one or more data tables are present. A data table is an object used to store specific data, and is a collection of structured data. Each data table contains a plurality of fields representing variables associated with an object or class. In databases, the "columns" of a table are often referred to as "fields," each of which contains information for a particular topic.
The target field is a field in the data table that is to be security classified. In one embodiment, the target field may be a field of a data table whose metadata or data content has changed. For example, when the table description of the data table changes, the fields contained in the data table may be target fields; a newly added field in the data table may be a target field; and when data content is added to or removed from a field in the data table, that field may be a target field.
In one embodiment, determining a target field to be security classified includes: determining a data table to be subjected to security classification; acquiring metadata of a data table, and obtaining fields contained in the data table from the metadata; and taking the field to be subjected to security classification in the fields contained in the data table as a target field.
Optionally, the data security classification method provided by the embodiments of the present application may be applied to a data classification and grading scanning task, which may be executed periodically by a computer device. Before the scanning task is executed, it is necessary to determine which data tables have changed metadata or data content and to treat them as the data tables requiring security classification; the target fields requiring security classification, or re-classification, are then determined from these data tables.
Metadata of the data table is data describing attribute information of the data table, and includes at least one of a table name of the data table, a library name of a database where the data table is located, a cluster name of a cluster where the database is located, table description information of the data table, a field contained in the data table, or field description information of the contained field. Optionally, the metadata of data tables and the records of changes to that metadata may be kept by a metadata management system, so that the computer device can pull from the metadata management system the data tables, and the target fields in those data tables, that need security classification.
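One way to pick target fields before a periodic scan is sketched below in Python. The `ChangeRecord` structure, the convention that a record without a field name denotes a table-level change, and the use of a last-scan timestamp are assumptions for illustration and are not specified by the application.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class ChangeRecord:
    """A hypothetical change record as a metadata system might keep it."""
    table: str
    field: Optional[str]  # None: table-level change, e.g. the table description
    changed_at: datetime


def select_targets(
    changes: List[ChangeRecord],
    table_fields: Dict[str, List[str]],
    last_scan: datetime,
) -> Dict[str, Set[str]]:
    """Map each changed table to the fields that need (re-)classification."""
    targets: Dict[str, Set[str]] = {}
    for change in changes:
        if change.changed_at <= last_scan:
            continue  # already covered by the previous scan
        fields = targets.setdefault(change.table, set())
        if change.field is None:
            # Table-level change: every field of the table becomes a target.
            fields.update(table_fields.get(change.table, []))
        else:
            # Field-level change: a new field, or added/removed data content.
            fields.add(change.field)
    return targets
```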
Step 304, collecting identification data required for classifying the target field, wherein the identification data comprises at least one of metadata of a data table or sample data of the target field in the data table.
After determining the target fields that need to be classified and graded, the computer device may collect the identification data required for classifying the target fields. The identification data required for classification may include at least one of metadata of the data table in which the target field is located or sample data of the target field in the data table.
In one embodiment, the computer device may read metadata of the data table from a metadata management system, where the metadata includes a table name of the data table, a library name of a database in which the data table is located, a cluster name of a cluster in which the database is located, table description information of the data table, fields contained in the data table or field description information of the contained fields, and so forth.
An example of metadata is as follows:
fcluster (cluster name): tdw_fin
fdb_name (library name): dcl_db
ftb_name (table name): user_info
ftb_comment (table description information): user information table
fcol_name (field name): fmobile
fcol_comment (field description information): user mobile phone number
Identification data required for classifying the target field may also include sample data of the field, such as: fcol_samples (field sample data): 13412345678.
In one embodiment, collecting identification data required to classify the target field includes: reading metadata of a data table, wherein the metadata comprises at least one of a table name of the data table, a library name of the database where the data table is located, a cluster name of the cluster where the database is located, table description information of the data table, the fields contained in the data table or field description information of the contained fields; acquiring field description information of the target field from the metadata; generating a query statement according to the cluster name, the library name, the table name and the target field indicated by the metadata, and extracting sample data of the target field from the corresponding data table of the corresponding database of the corresponding cluster according to the query statement; and using the field description information of the target field and the sample data as the identification data.
That is, after collecting the metadata of the data table, the computer device may generate a corresponding query statement according to the metadata and send the query statement to the corresponding cluster to pull sample data of the target field. The sample data is a portion of the data content of the target field obtained by sampling, and may be used as one of the bases for security classification and grading. For example:
The computer device obtains the metadata information of data table A in cluster fc with:
DESC fc::A;
From the acquired metadata information, the computer device can obtain the field S contained in data table A and the description information of field S. The computer device then obtains sample data of field S with:
SELECT S FROM fc::A WHERE RAND() <= 0.01 LIMIT 100;
The computer device may use at least one of the metadata of the data table in which the target field is located or the sample data of the target field as the identification data required for classifying and grading the target field.
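Generating the sampling query from metadata can be sketched as below. The `cluster::table` form, the `RAND()` filter and the limit of 100 follow the example statement in the text; the function name, its parameters and the identifier check are assumptions added for illustration, and the library name is omitted because the example statement does not show where it goes.

```python
import re

# Assumed rule: only plain identifiers are accepted, so that metadata values
# cannot inject arbitrary SQL into the generated statement.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_sample_query(cluster, table, field, ratio=0.01, limit=100):
    """Build a sampling query for one target field (hypothetical helper)."""
    for name in (cluster, table, field):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"SELECT {field} FROM {cluster}::{table} "
        f"WHERE RAND() <= {ratio} LIMIT {limit}"
    )


query = build_sample_query("fc", "A", "S")
print(query)
```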
Step 306, traversing the preset field subclasses, and each time a field subclass is traversed, identifying the identification data through the recognition rule configured for that field subclass to obtain a recognition result indicating whether the target field matches the field subclass.
In the embodiment of the application, the following multi-level security classification is configured according to industry standards or related specifications and in combination with actual service data: primary classifications, secondary classifications, and field subclasses. A field subclass belongs to a certain secondary classification, which in turn belongs to a primary classification. The goal of data security classification is to assign a field or data table to its field subclass, so that the corresponding security classification is determined from the secondary and primary classification to which that field subclass belongs.
Optionally, fig. 4 is a schematic diagram of a classification management interface in one embodiment. As can be seen from fig. 4, the primary classifications configured for data classification may include:
PI: general personal information
SI: sensitive personal information
CI: unit information
BI: service information
MI: management information
GI: supervision information
Referring to fig. 4, general personal information includes a variety of secondary classifications, such as personal basic information, personal educational material information, personal network identification information, personal general device information, personal tag information, etc. Both the primary and secondary classifications may support user editing, modification, promotion, demotion, and the like.
Optionally, fig. 5 is a schematic diagram of a grading management interface in one embodiment. As can be seen from fig. 5, five security levels (i.e. sensitivity levels) are configured for data classification: L0-absolute, L1-secret, L2-secret, L3-internal, L4-public.
Optionally, fig. 6 is a schematic diagram of a field management interface in one embodiment. As can be seen from fig. 6, a plurality of field subclasses are configured for data security classification; each preset field subclass is assigned to a secondary classification and a primary classification, and each field subclass is associated with a corresponding security level according to the primary classification to which it belongs. Referring to fig. 6, in combination with service-related data scenarios (such as financial scenarios), national specifications and industry standards, the field information at data grading decision points is sorted out first; in total, 200+ field subclasses (i.e. classification targets) have been sorted out and mapped to their secondary classification, primary classification and security level.
In the related art, manual security classification and grading is extremely inefficient, while approaches that directly identify the security classification and security level of a field through specific rules cover few scenarios, lack extensibility and generality, leave a large number of fields unidentified, and produce classification results of insufficient accuracy. In the embodiment of the present application, a series of field subclasses is set in combination with service data, and corresponding recognition rules only need to be configured for each field subclass. The recognition rules are used to identify whether the target field belongs to a given field subclass, rather than to identify in one step the major class to which the target field belongs; the final security classification and security level are then determined according to the recognition result for each field subclass, which can significantly improve recognition accuracy. As service scenarios expand, it is only necessary to add or remove field subclasses according to actual service requirements and to add or remove the corresponding recognition rules, so the method is highly extensible and general.
Specifically, the computer device traverses all preset field subclasses, and each time a field subclass is traversed, the recognition rules configured for that field subclass are used to identify the identification data corresponding to the target field, so as to obtain a recognition result indicating whether the target field matches the field subclass. A plurality of recognition rules may be configured for each field subclass, and each recognition rule identifies the identification data of the target field to obtain a corresponding recognition result.
The recognition rule defines recognition objects, recognition modes, preset weights and preset thresholds required for recognizing the target field. The identification object is used for indicating the type of the identification data, the identification object comprises at least one of a table name of a data table, table description information of the data table, a field name, field description information or sample data of a field, that is, some identification rules are used for identifying the table description information of the data table where the target field is located to obtain a corresponding identification result, some identification rules are used for identifying the sample data of the target field to obtain a corresponding identification result, and some identification rules are used for identifying the field description information to obtain a corresponding identification result. The different field subclasses may set respective recognition rules, and the same field subclass may have a plurality of recognition rules set, for example, a "cell phone number" field subclass may set a plurality of recognition rules for "field name", "field description information", "sample data of field", and the like.
The recognition mode is used for indicating through which algorithm to recognize the recognition data, the recognition mode comprises at least one of recognition through a pattern matching algorithm or recognition through a natural language processing model, the pattern matching algorithm is built based on rule experience, the rule experience comprises regular expressions, keyword matching, verification algorithm matching and the like, and the natural language processing model is obtained through training samples by training a text classification model based on deep learning. That is, some recognition rules adopt regular expressions to perform character string matching recognition, some recognition rules adopt text classification models based on deep learning to perform classification recognition, and some recognition rules adopt algorithms based on keyword matching to perform recognition.
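The three pattern-matching styles named above (regular expression, keyword matching, verification algorithm) can be sketched as small matcher functions that each return a matching probability. The function names, the 11-digit mobile pattern and the choice of the Luhn checksum as the verification algorithm are illustrative assumptions; the text does not specify which patterns or checks are used.

```python
import re


def regex_match(pattern, value):
    """Regular-expression mode: 1.0 if the whole value matches, else 0.0."""
    return 1.0 if re.fullmatch(pattern, value) else 0.0


def keyword_match(keywords, text):
    """Keyword mode: 1.0 if any keyword occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return 1.0 if any(k.lower() in lowered for k in keywords) else 0.0


def luhn_match(number):
    """Verification-algorithm mode, using the Luhn checksum as an example."""
    if not number.isdigit():
        return 0.0
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return 1.0 if total % 10 == 0 else 0.0


print(regex_match(r"1\d{10}", "13412345678"))
print(keyword_match(["mobile", "phone"], "user mobile phone number"))
print(luhn_match("79927398713"))
```

A deep-learning text classifier could be plugged in behind the same interface by returning the model's class probability instead of 0.0 or 1.0.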
The preset weight indicates, when a field subclass has a plurality of recognition rules, how important each recognition rule is for deciding whether the target field matches the field subclass. For example, if two recognition rules are configured for a certain field subclass, with a weight of 0.8 for recognition rule 1 and 0.4 for recognition rule 2, then recognition rule 1, and therefore the recognition result it produces, has the greater influence on whether the target field matches the field subclass.
The preset threshold indicates the minimum score required to hit the field subclass. For example, if the preset threshold set for a certain field subclass is 0.8, the recognition rule is hit when the recognition result is greater than 0.8.
Each recognition rule can be defined by the four aspects above.
Fig. 7 is a schematic diagram of a recognition rule configuration interface in one embodiment. As can be seen from fig. 7, at least one recognition rule is configured for each field subclass. In the recognition rule list shown in fig. 7, each recognition rule has a corresponding rule code and includes a recognition object, a recognition mode, a preset weight and a preset threshold. Through the "edit" control, each recognition rule can be edited, and recognition rules can be added to or removed from a given field subclass.
In one embodiment, the data security ranking method may further include: receiving a trigger operation of adding an identification rule for the target field subclass in the rule configuration interface; and responding to the triggering operation, and adding a corresponding recognition rule for the target field subclass according to the recognition object, the recognition mode and the preset weight selected for the recognition rule.
FIG. 8 is a schematic diagram of an interface for configuring recognition rules in one embodiment. Referring to fig. 8, the field code of the field subclass "tax registration validity period" is 189, and its currently identified security classification result is L3CI33, that is, internal-level unit information. In the interface, the "new rule" control displays a new recognition rule 1 to be edited, for which the corresponding recognition object, recognition mode, weight, threshold and so on can be configured; after recognition rule 1 is submitted and confirmed, it becomes one of the recognition rules of the field subclass. If a field subclass has a plurality of recognition rules, editing the rule logic among them may be supported, for example whether the recognition rules must all be satisfied together or whether satisfying any one of them is sufficient.
In one embodiment, to reduce the time spent traversing all field subclasses and scanning with the corresponding recognition rules, the computer device may further set exclusion rules for some field subclasses and first identify the identification data of the target field through the exclusion rules. If an exclusion rule is matched, the target field does not match the field subclass and there is no need to continue with the subsequent recognition rules; if no exclusion rule is matched, identification continues with the subsequent recognition rules. It will be appreciated that exclusion rules should be simple, strongly deterministic rules.
In one embodiment, the computer device may determine the recognition object, recognition mode, preset weight and preset recognition threshold corresponding to a recognition rule configured for the field subclass; identify the data corresponding to the recognition object in the identification data using the recognition mode of the recognition rule, to obtain the matching probability that the target field matches the field subclass, and obtain a recognition probability based on the preset weight and the matching probability; and compare the recognition probability with the preset recognition threshold. When the recognition probability is greater than or equal to the preset recognition threshold, the recognition result is that the target field matches the field subclass; when the recognition probability is smaller than the preset recognition threshold, the recognition result is that the target field does not match the field subclass.
In one embodiment, when a plurality of recognition rules are configured for the field subclass, identifying the identification data through the recognition rules configured for the field subclass to obtain a recognition result indicating whether the target field matches the field subclass includes: identifying, through each recognition rule configured for the field subclass, the recognition object designated by that rule in the identification data, to obtain the recognition probability that the target field matches the field subclass under each recognition rule; weighting the recognition probabilities according to the weights configured for the respective recognition rules to obtain the recognition probability of the target field for the field subclass; and comparing that recognition probability with the preset recognition threshold configured for the field subclass. When the recognition probability is greater than or equal to the preset recognition threshold, the recognition result is that the target field matches the field subclass; when it is smaller, the recognition result is that the target field does not match the field subclass.
Taking the configuration of two recognition rules for a certain field subclass as an example: the computer device may identify first-type data in the identification data through a first recognition rule configured for the field subclass, to obtain a first recognition probability that the target field matches the field subclass; identify second-type data in the identification data through a second recognition rule configured for the field subclass, to obtain a second recognition probability that the target field matches the field subclass; weight the first and second recognition probabilities according to the first weight configured for the first recognition rule and the second weight configured for the second recognition rule, to obtain the recognition probability of the target field for the field subclass; and compare the recognition probability with the preset recognition threshold configured for the field subclass. When the recognition probability is greater than or equal to the preset recognition threshold, the recognition result is that the target field matches the field subclass; when it is smaller, the recognition result is that the target field does not match the field subclass.
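The short-circuit effect of an exclusion rule can be sketched as follows. The helper name, the letter-based exclusion pattern and the 11-digit recognizer are hypothetical, and combining recognizers with `any` is a simplification; the point is only that the recognizers are never called once an exclusion rule matches.

```python
import re


def matches_subclass(value, exclusion_patterns, recognizers):
    """Return False as soon as an exclusion pattern matches; otherwise
    fall through to the (more expensive) recognizers."""
    for pattern in exclusion_patterns:
        if re.fullmatch(pattern, value):
            return False
    return any(recognizer(value) for recognizer in recognizers)


calls = []  # records which values actually reached the recognizer


def mobile_recognizer(value):
    calls.append(value)
    return re.fullmatch(r"1\d{10}", value) is not None


# Assumed exclusion rule for "mobile phone number": any value containing a letter.
exclusions = [r".*[A-Za-z].*"]

print(matches_subclass("abc@example.com", exclusions, [mobile_recognizer]))
print(matches_subclass("13412345678", exclusions, [mobile_recognizer]))
print(calls)
```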
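Applying a single recognition rule can be sketched as below. The text says only that the recognition probability is obtained "based on" the preset weight and the matching probability; multiplying the two is an assumption made here, as are the `RecognitionRule` type, the `field_comment` key and the keyword matcher.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RecognitionRule:
    target: str                      # recognition object, e.g. "field_comment"
    matcher: Callable[[str], float]  # recognition mode: returns a probability in [0, 1]
    weight: float                    # preset weight
    threshold: float                 # preset recognition threshold


def apply_rule(rule, identification_data):
    """Return (recognition probability, whether the rule is hit)."""
    match_probability = rule.matcher(identification_data[rule.target])
    # Assumption: recognition probability = weight * matching probability.
    recognition_probability = rule.weight * match_probability
    return recognition_probability, recognition_probability >= rule.threshold


rule = RecognitionRule(
    target="field_comment",
    matcher=lambda text: 1.0 if "mobile" in text.lower() else 0.0,
    weight=0.8,
    threshold=0.5,
)
data = {"field_comment": "user mobile phone number", "samples": "13412345678"}
print(apply_rule(rule, data))
```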
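The two-rule case can be sketched as a weighted sum compared against the subclass threshold. The weights 0.8 and 0.4 reuse the earlier example values in the text; the per-rule probabilities and the threshold of 0.8 are made-up inputs for illustration.

```python
def subclass_score(probabilities, weights):
    """Weighted sum of the per-rule recognition probabilities."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per recognition rule")
    return sum(w * p for w, p in zip(weights, probabilities))


def matches(probabilities, weights, threshold):
    """The target field matches when the weighted score reaches the threshold."""
    return subclass_score(probabilities, weights) >= threshold


# First rule (e.g. on the field name) fully matches; the second rule
# (e.g. on sample data) matches with probability 0.5.
score = subclass_score([1.0, 0.5], weights=[0.8, 0.4])
print(score, matches([1.0, 0.5], [0.8, 0.4], threshold=0.8))
```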
For example, suppose the preset threshold of the field subclass "mobile phone number" is 2 and three recognition rules are configured for it, with preset weights a, b and c in order. Field A is identified by the three recognition rules in turn; if the recognition results of the three rules, weighted by a, b and c, give a weighted result greater than 2, field A can be identified as "mobile phone number". The weighted results for the other field subclasses are computed as well. For example, if the two recognition rules of the field subclass "bank card number" have preset weights d and e in order and the preset threshold is 1, then when the recognition results of those two rules for field A, weighted by d and e, give a weighted result greater than 1, field A may also be identified as "bank card number". When the weighted result for "mobile phone number" (denoted M) is greater than 2 and the weighted result for "bank card number" (denoted N) is greater than 1, M and N can be compared and the final classification takes the field subclass with the larger weighted result: if M > N, field A is identified as "mobile phone number".
Step 308, determining the field subclass to which the target field belongs according to the recognition result of the target field corresponding to each field subclass.
Specifically, after all the field subclasses are traversed, the computer device may determine, according to the identification result of each field subclass corresponding to the target field, the field subclass to which the target field belongs. In one embodiment, when there is only one field subclass matched with the target field, the matched field subclass is used as the field subclass to which the target field belongs, and when there is a plurality of field subclasses matched with the target field, the computer device can determine one field subclass from the field subclasses according to the identification result, and the field subclass is used as the field subclass to which the target field belongs. Optionally, the recognition result is determined according to recognition probabilities representing the matching degree of the target field and the field subclasses, if the recognition results of the target field corresponding to the plurality of field subclasses indicate that the target field matches the corresponding field subclass, the computer device may determine the field subclass with the largest corresponding recognition probability in the plurality of field subclasses as the field subclass to which the target field belongs.
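Choosing among several matching field subclasses can be sketched as below. The thresholds 2 and 1 come from the "mobile phone number" and "bank card number" example in the text; the scores 2.6 and 1.3 are made-up values for M and N, and the function name is hypothetical.

```python
def pick_subclass(scores, thresholds):
    """Among subclasses whose score reaches its threshold, pick the highest
    score. Returns None when no subclass matches."""
    hits = {name: s for name, s in scores.items() if s >= thresholds[name]}
    if not hits:
        return None
    return max(hits, key=hits.get)


thresholds = {"mobile phone number": 2, "bank card number": 1}
scores = {"mobile phone number": 2.6, "bank card number": 1.3}  # M and N
print(pick_subclass(scores, thresholds))
```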
Step 310, determining the security class and the security level to which the target field belongs according to the security class corresponding to the field subclass to which it belongs and the security level corresponding to that security class.
In the field management, the corresponding secondary classification, primary classification and security level are associated with each preset field subclass, so after determining the field subclass to which the target field belongs, the computer device may determine the security class corresponding to the field subclass to which the target field belongs and the security level corresponding to the security class according to the association, thereby determining the security class and the security level to which the target field belongs.
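The lookup from field subclass to security class and level can be sketched as a simple association table. The single entry uses the "tax registration validity period" example, whose result is given in the text as L3CI33; reading that code as level L3, primary classification CI and secondary code CI33 is an assumption, as are the type and function names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubclassInfo:
    secondary: str  # secondary classification code (assumed format)
    primary: str    # primary classification, e.g. "CI" for unit information
    level: str      # security level, e.g. "L3"


# Hypothetical extract of the field management associations.
FIELD_MANAGEMENT = {
    "tax registration validity period": SubclassInfo("CI33", "CI", "L3"),
}


def security_of(subclass: str) -> Optional[SubclassInfo]:
    """Resolve the security class and level associated with a field subclass."""
    return FIELD_MANAGEMENT.get(subclass)


info = security_of("tax registration validity period")
print(info.primary, info.secondary, info.level)
```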
In the above data security classification method, corresponding recognition rules are configured for each preset field subclass. When the classification and grading standards of an industry change, the corresponding recognition rules can be adjusted and take effect immediately, so that when the target field in a data table needs security classification and grading, the recognition rules can be extended promptly, improving the efficiency of classifying and grading fields. Moreover, because recognition rules are configured per field subclass and can be adjusted according to their recognition effect, recognition accuracy can be significantly improved. In addition, there is no need to configure rules that directly determine the security classification and level; such rules cover limited scenarios and reflect limited knowledge of the data, which makes the classification and grading of some data inaccurate. Instead, field subclasses are divided according to service scenarios and each field subclass is assigned to a corresponding security classification and security level. The field subclass to which the target field belongs can then be determined from the recognition results for each field subclass, and the security classification of the target field and the corresponding security level are determined from that field subclass, which improves the accuracy of data security classification and grading.
Fig. 9 is a timing diagram of data security classification and grading in one embodiment. Referring to fig. 9, the data security classification and grading system includes a basic data management module, a data collection engine, a rule configuration module, a rule scanning engine and a decision engine. The basic data management module triggers the metadata collection interface of the data collection engine to pull the metadata information of the data table. The basic data management module then triggers the sample data collection interface of the data collection engine to pull the sample data required for classification and grading. Next, the interface of the rule configuration module is invoked to pull the configured recognition rules (built on pattern matching algorithms or NLP models). The rule scanning engine is then triggered to scan the previously collected metadata and sample data with the recognition rules and to output the hit results of the target field under the recognition rules of the different field subclasses. Finally, the decision engine decides the final security classification and security level of the target field according to the preset weights and preset thresholds of the recognition rules, combined with the hit results output in the previous step.
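The call order in the timing diagram can be sketched as one orchestration function that takes each module as a callable. All function and parameter names here are hypothetical stand-ins for the five modules, and the stub stages simply return fixed values so that the order of invocation is visible.

```python
def run_classification(table, collect_metadata, collect_samples,
                       load_rules, scan, decide):
    """Invoke the stages in the order described for the timing diagram."""
    metadata = collect_metadata(table)          # data collection engine: metadata
    samples = collect_samples(table, metadata)  # data collection engine: samples
    rules = load_rules()                        # rule configuration module
    hits = scan(rules, metadata, samples)       # rule scanning engine
    return decide(rules, hits)                  # decision engine


order = []


def _stage(name, result):
    """Build a stub stage that records its name and returns a fixed result."""
    def stage(*args):
        order.append(name)
        return result
    return stage


result = run_classification(
    "user_info",
    collect_metadata=_stage("metadata", {"fcol_name": "fmobile"}),
    collect_samples=_stage("samples", ["13412345678"]),
    load_rules=_stage("rules", ["rule-1"]),
    scan=_stage("scan", {"mobile phone number": 1.0}),
    decide=_stage("decide", ("mobile phone number", 1.0)),
)
print(order, result)
```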
FIG. 10 is a schematic diagram of an architecture of a data security classification hierarchy system in one embodiment. Referring to fig. 10, the system includes modules of basic data management, engine algorithm configuration management, table and field classification hierarchical management, classification hierarchical result management, and the like. Wherein:
the basic data management module comprises metadata management, sample data management and classification configuration data management sub-modules which are respectively used for managing metadata, sample data and classification configuration data (such as configured identification rules, primary classification of data security, secondary classification and security level and the like).
The engine algorithm configuration management module is used for decision modeling, pattern matching modeling and NLP modeling, wherein the pattern matching modeling and the NLP modeling are used for constructing an identification mode in an identification rule, namely, identifying fields according to identification data, and can output classification results of various models (identification modes); and the decision modeling is to execute weighted summation according to the configured weights to obtain a weighted result, and compare the weighted result with a preset threshold value to obtain a final classification grading result.
And the table and field classification and classification management module is used for executing classification and classification related logic aiming at the field related information and executing classification and classification related logic of the table related information.
The classification and grading result management module is used for executing the confirmation and sharing logic for classification and grading results, including manual confirmation logic for high-level fields, experience-based deviation correction logic, accuracy judgment logic and result sharing logic. Result confirmation is the process of displaying at the front end those field classification and grading results in the system's automatic results that have a high sensitivity level (such as L4/L5) and confirming them manually a second time. Experience-based deviation correction analyzes the classification results, together with the misclassified and unclassified cases found during result confirmation, so as to optimize the classification rules and algorithms and improve accuracy and coverage. Accuracy judgment supports evaluating the classification effect through statistical reports on classification results and day-to-day employee feedback. Result sharing means that the classification and grading results are returned to the metadata management system for recording, so that other systems can query and call them through an interface and apply security controls such as encryption, desensitization and access control.
In one embodiment, the data security ranking method may further include: determining the security level of each field contained in the data table; and determining the security level of the data table according to the security level with the highest level in the security levels to which the fields belong.
For example, the computer device may use the highest of the security levels to which the fields belong as the security level of the data table. As another example, when the data content corresponding to a field is encrypted, the computer device may lower the security level of the data table slightly. Suppose a data table has 10 fields and the highest security level among them is L4; then the security level of the data table is L4. If the highest level among the 10 fields is L4 but the fields are encrypted, the security level of the data table may be L3.
In one embodiment, fig. 11 is a schematic diagram of the logic and data flow of a data security classification method. A specific example is described in connection with fig. 11. The classification and grading process for a certain field (fcon, whose field description information is "amount") of the "account use intermediate table" in the cluster is as follows:
1. Basic data management, namely pulling the metadata of the "account use intermediate table" of the relevant cluster, such as the table name, the table description information, the field names it contains and their field description information, together with sample data obtained by sampling the data content of the field (fcon).
2. Classification and grading data management, namely configuring the security classifications, field subclasses and security levels, and configuring corresponding recognition rules for each field subclass under each security classification and security level.
3. Engine algorithm management, namely pulling all recognition rules to match the data: whether the amount field (fcon) in the "account use intermediate table" matches each field subclass is identified through a pattern matching algorithm or an NLP algorithm to obtain recognition results, and all recognition results are fed into the decision engine for a computed decision. The decision logic of the decision engine is: using the preset weights and preset thresholds configured in the recognition rules of each field subclass, calculate the weighted result of the field (fcon) for each field subclass, and take the field subclass with the largest weighted result as the field subclass to which the field (fcon) belongs. For example, if the decision engine outputs that the amount field (fcon) of the table belongs to the field subclass "account amount", then the security classification associated with that field subclass is sensitive personal information and the security level is L1.
4. Classification and grading result correction, namely manually confirming the security classification and grading result. Because the security level identified for the amount field (fcon) is high, its security classification and grading result can be displayed through the grading confirmation interface for manual confirmation by operators; if the result is confirmed to be correct, no correction is needed.
5. Classification and grading result sharing, namely that a third-party system can query, through an interface, the classification and grading result of the amount field (fcon) of the "account use intermediate table".
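Deriving a table's level from its fields can be sketched as below. The sketch follows the worked example in the text, where L4 is the highest level among the fields and encryption lowers the table to L3, so it treats a larger number as the higher level; that ordering, the one-step reduction and the `top_fields_encrypted` flag are assumptions.

```python
def table_level(field_levels, top_fields_encrypted=False):
    """Take the highest field level as the table level, lowered by one step
    when the fields at that level are stored encrypted (assumed policy)."""
    ranks = [int(level[1:]) for level in field_levels]  # "L4" -> 4
    top = max(ranks)
    if top_fields_encrypted and top > 0:
        top -= 1
    return f"L{top}"


levels = ["L1", "L4", "L2", "L3"]
print(table_level(levels))
print(table_level(levels, top_fields_encrypted=True))
```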
In one embodiment, the data security ranking method may further include: in response to a validation operation of a security level of a field in the hierarchical validation interface, modifying a state of the security level of the field to a validated state; and in response to a modification operation of the security level of the field in the classification confirmation interface, modifying the security level state of the field to a confirmed state after modifying the security level of the field.
In this embodiment, through the classification confirmation interface, confirmation of the automatically recognized security classification result, especially manual secondary confirmation and marking of some core fields, may be used for automatic statistics of indexes such as recognition accuracy, and may also be used for constructing training samples of NLP models, so as to realize self-optimization of the models.
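The two operations on the confirmation interface can be sketched as state changes on a per-field record. The `LevelRecord` type, the state names and the `modified` flag are hypothetical; the flag is kept so that modified results can later be told apart from results confirmed without change.

```python
from dataclasses import dataclass


@dataclass
class LevelRecord:
    field: str
    level: str
    state: str = "unconfirmed"
    modified: bool = False  # True when a reviewer changed the identified level


def confirm(record):
    """Confirmation operation: accept the identified level as-is."""
    record.state = "confirmed"


def modify(record, new_level):
    """Modification operation: change the level, then mark it confirmed."""
    record.level = new_level
    record.modified = True
    record.state = "confirmed"


a = LevelRecord("fcon", "L1")
b = LevelRecord("fmobile", "L3")
confirm(a)
modify(b, "L2")
print(a, b)
```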
In one embodiment, the data security ranking method may further include: for a field whose security level was not modified, constructing a training positive sample of the field subclass according to the attribute data of the field and the identified security level; for a field whose security level was modified, constructing a training negative sample of the field subclass according to the attribute data of the field and the identified security level; and training a natural language processing model based on deep learning according to the training positive samples and the training negative samples, wherein the trained natural language processing model is used for constructing the recognition rules configured for field subclasses.
The attribute data of the field may include at least one of metadata of a data table where the field is located, field description information of the field, or sample data of the field. In this embodiment, a training sample for training a natural language processing model is constructed based on a manually confirmed field and attribute data of the field. Obviously, the field which is not manually modified, the attribute data of the field and the corresponding grading result can be used as a training positive sample for training the natural language processing model, and the field which is required to be manually modified, the attribute data of the field and the corresponding grading result can be used as a training negative sample for training the natural language processing model, so that a large number of training samples can be obtained along with classification grading, the NLP model is trained, the accuracy of model classification is improved, manual secondary confirmation is not required later, and the recognition efficiency is improved while the labor cost is reduced.
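The split into positive and negative samples can be sketched as below. The `ConfirmedField` record and the tuple layout of a sample are assumptions made for illustration; the application does not specify how samples are encoded or which training framework is used.

```python
from dataclasses import dataclass
from typing import List, Tuple

Sample = Tuple[str, str, int]  # (attribute text, identified subclass, identified level)


@dataclass
class ConfirmedField:
    """Hypothetical manually confirmed field with its attribute data."""
    attributes: str            # e.g. table metadata, field description, sample values as text
    identified_subclass: str   # subclass produced by automatic recognition
    identified_level: int      # security level produced by automatic recognition
    modified: bool             # True if the operator changed the level during confirmation


def build_samples(records: List[ConfirmedField]) -> Tuple[List[Sample], List[Sample]]:
    """Unmodified fields become positive samples, modified fields become negative samples."""
    positives: List[Sample] = []
    negatives: List[Sample] = []
    for r in records:
        sample = (r.attributes, r.identified_subclass, r.identified_level)
        if r.modified:
            negatives.append(sample)
        else:
            positives.append(sample)
    return positives, negatives
```

In both cases the sample carries the level that was identified automatically, not the corrected one, which matches the wording of the embodiment.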
In one embodiment, the data security ranking method may further include: in response to a confirmation operation on the security level of a data table in the grading confirmation interface, modifying the state of the security level of the data table to a confirmed state; and in response to a modification operation on the security level of the data table in the grading confirmation interface, modifying the state of the security level of the data table to the confirmed state after modifying the security level of the data table.
FIG. 12 is a schematic diagram of a grading confirmation interface in one embodiment. Referring to FIG. 12, the grading result of a data table identified by the security grading method provided by the embodiment of the present application may be displayed through the grading confirmation interface for manual secondary confirmation.
In one embodiment, the data security ranking method may further include: after determining the security level of the data table according to the security level of each field included in the data table, recording the security level of the data table into metadata of the data table; and configuring access rights corresponding to the security level for the data table according to the security level indicated in the metadata of the data table.
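A minimal sketch of recording the table level in the metadata and deriving access rights from it follows. It assumes integer levels where a larger number means more sensitive, and the policy names in `ACCESS_BY_LEVEL` are invented for illustration; neither is stated in the application.

```python
from typing import Dict

# Hypothetical mapping from security level to an access policy name.
ACCESS_BY_LEVEL: Dict[int, str] = {
    1: "all_employees",
    2: "department",
    3: "approval_required",
    4: "restricted",
}


def record_table_level(metadata: dict, field_levels: Dict[str, int]) -> dict:
    """Return a copy of the table metadata with the table's security level recorded.

    The table level is taken as the highest level among its fields.
    """
    if not field_levels:
        raise ValueError("at least one graded field is required")
    updated = dict(metadata)
    updated["security_level"] = max(field_levels.values())
    return updated


def access_policy_for(metadata: dict) -> str:
    """Look up the access policy that corresponds to the level recorded in the metadata."""
    return ACCESS_BY_LEVEL[metadata["security_level"]]
```

Because the policy is looked up from the level stored in the metadata, a later change to the recorded level changes the applicable access rights without touching the table itself.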
In a specific embodiment, the data security ranking method may include the steps of:
1. configuring security classification of data in enterprises, including primary classification, secondary classification, field subclasses and security levels, dividing each field subclass into corresponding security classifications and associating each field subclass to a corresponding security level;
2. configuring a corresponding identification rule for each field subclass, namely: receiving, in a rule configuration interface, a trigger operation for adding an identification rule for a target field subclass; and, in response to the trigger operation, adding a corresponding identification rule for the target field subclass according to the identification object, identification mode and preset weight selected for the identification rule, wherein the identification rule is constructed based on a pattern matching algorithm or a deep-learning-based NLP model;
3. Determining a data table to be subjected to security classification, acquiring metadata of the data table, obtaining fields contained in the data table from the metadata, and taking the fields to be subjected to security classification in the fields contained in the data table as target fields;
4. acquiring field description information of a target field from metadata;
5. generating a query statement according to the cluster name, the library name, the table name and the target field indicated by the metadata, and extracting sample data of the target field from a corresponding data table of a corresponding database of a corresponding cluster according to the query statement;
6. taking metadata of a data table, field description information of a target field and sample data as identification data;
7. traversing the preset field subclasses and, when traversing to a field subclass, determining the identification object, identification mode, preset weight and preset identification threshold corresponding to the identification rule configured for that field subclass; identifying, in the identification mode corresponding to the identification rule, the data in the identification data that corresponds to the identification object, to obtain a matching probability that the target field matches the field subclass; and obtaining an identification probability based on the preset weight and the matching probability;
8. when a plurality of recognition rules are configured for the field subclass, weighting the recognition probabilities obtained under the respective recognition rules according to the weights configured for the recognition rules, to obtain the recognition probability of the target field for the field subclass;
9. comparing the recognition probability with the preset recognition threshold; when the recognition probability is greater than or equal to the preset recognition threshold, obtaining a recognition result that the target field matches the field subclass, and when the recognition probability is smaller than the preset recognition threshold, obtaining a recognition result that the target field does not match the field subclass;
10. if the recognition results corresponding to a plurality of field subclasses each indicate that the target field matches the corresponding field subclass, determining the field subclass with the highest recognition probability among the plurality of field subclasses as the field subclass to which the target field belongs;
11. and determining the security level of each field included in the data table, and taking the security level with the highest level in the security levels of the fields as the security level of the data table.
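Steps 7 to 10 above can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: the `Rule` and `Subclass` types are hypothetical, the weighting in step 8 is assumed to be a weighted sum, and the keyword matchers merely stand in for a pattern matching algorithm or an NLP model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Rule:
    target: str                      # identification object, e.g. "field_name"
    matcher: Callable[[str], float]  # returns a matching probability in [0, 1]
    weight: float                    # preset weight of this rule


@dataclass
class Subclass:
    name: str
    threshold: float                 # preset recognition threshold of the subclass
    rules: List[Rule]


def recognition_probability(sub: Subclass, data: Dict[str, str]) -> float:
    """Steps 7-8: weight each rule's matching probability and sum (assumed weighted sum)."""
    return sum(r.weight * r.matcher(data.get(r.target, "")) for r in sub.rules)


def classify_field(subclasses: List[Subclass], data: Dict[str, str]) -> Optional[str]:
    """Steps 9-10: keep subclasses at or above threshold, return the most probable one."""
    best_name, best_prob = None, -1.0
    for sub in subclasses:
        prob = recognition_probability(sub, data)
        if prob >= sub.threshold and prob > best_prob:
            best_name, best_prob = sub.name, prob
    return best_name


# Hypothetical configuration and identification data for the amount field (fcon).
subclasses = [
    Subclass("amount", 0.5, [
        Rule("field_description", lambda s: 1.0 if "amount" in s else 0.0, 0.6),
        Rule("field_name", lambda s: 1.0 if s.endswith("con") else 0.0, 0.4),
    ]),
    Subclass("phone_number", 0.5, [
        Rule("field_description", lambda s: 1.0 if "phone" in s else 0.0, 1.0),
    ]),
]
data = {"field_name": "fcon", "field_description": "consumption amount"}
result = classify_field(subclasses, data)  # "amount" under this configuration
```

A field for which no subclass reaches its threshold yields `None` here; how such unmatched fields are handled is not specified in the steps above.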
In combination with the specific embodiment, the provided data security classification grading system is divided into five modules, namely classification management, grading management, field management, rule management and grading confirmation, and can support the following functions:
1. In the grading management interface of the data security classification grading system, operators are supported to configure various security levels according to international/industry standards, specifications, enterprise reality and the like for the system to call.
2. In the classification management interface of the data security classification grading system, operators are supported to configure various security classifications (such as primary classification, secondary classification and field subclass) according to international/industry standards, specifications, enterprise reality and the like for the system to call.
3. In a field management interface of the data security classification grading system, an operator is supported to associate a field subclass to a primary class, a secondary class and a security level.
4. In the recognition rule management interface of the data security classification grading system, operators are supported to further edit each recognition rule, for example to configure a plurality of recognition rules for a field subclass, or to configure the recognition object, recognition mode and weight of each recognition rule and the hit threshold of the field subclass.
5. In the grading confirmation interface of the data safety classification grading system, operators are supported to secondarily confirm grading results of some data tables or fields, and the fields and attribute data thereof can be used for constructing training samples required for optimizing an NLP model.
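The configuration maintained through these interfaces can be pictured as a lookup from field subclass to its primary classification, secondary classification and security level. The sketch below is illustrative only; the `SubclassConfig` type and all classification names and level values are invented examples rather than values taken from the application.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SubclassConfig:
    """Hypothetical association of one field subclass with its classifications and level."""
    name: str
    primary_class: str
    secondary_class: str
    security_level: int


# Example entries as an operator might configure them in the field management interface.
CONFIG: Dict[str, SubclassConfig] = {
    c.name: c
    for c in (
        SubclassConfig("amount", "business data", "transaction data", 4),
        SubclassConfig("phone_number", "user data", "contact information", 3),
    )
}


def grade_for(subclass_name: str) -> Tuple[str, str, int]:
    """Resolve a recognized field subclass to (primary class, secondary class, level)."""
    c = CONFIG[subclass_name]
    return c.primary_class, c.secondary_class, c.security_level
```

Because the level is attached to the subclass rather than to individual fields, changing one configuration entry regrades every field recognized as that subclass.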
The implementation and application of the data classification and grading method provided by the embodiment of the application comprise the following steps: formulating the specification, through which the classification and grading standards and the mapping between classifications and field subclasses in field management are determined; confirming the classification targets, where the field subclasses to be classified are determined through field management; and classification and grading implementation, result confirmation, result maintenance, and result application.
First, a classification engine identifies the field subclass of each field of the data table; the configuration used for field classification and grading is maintained in accordance with the specification requirements; finally, the field grading result is output, and the grading result of the table is calculated from the field grading results, for example by taking the field level with the highest sensitivity in the table as the sensitivity level of the table. The method has the following advantages:
First, the method is highly configurable and general: classification standards, grading standards and field subclasses can be configured according to the laws, regulations and industry standards of each industry, and can be adjusted and take effect immediately when those standards change.
Second, the recognition rules are dynamically adjustable and extensible: a recognition rule, or its recognition object and recognition mode, can be flexibly added, adjusted, deactivated or deleted, and different weights can be configured according to the effect of each recognition rule, which improves recognition accuracy. When a better recognition rule appears, it can be added through configuration.
Third, the maintenance cost is low, because the experience-based pattern matching algorithm is combined with the deep-learning-based NLP model. For example, recognition rules can be configured manually in the initial stage; the recognition results are then analyzed, labeled and corrected and submitted for training the NLP model by machine learning; the recognition effects of the manual rules and of machine learning are evaluated; and the NLP recognition model is subsequently optimized automatically and continuously, which saves labor cost.
It should be understood that, although the steps in the flowcharts related to the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with at least some of the other steps or with the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a data security grading device for implementing the above data security grading method. The solution provided by the device is implemented in a manner similar to that described for the method, so for the specific limitations in the one or more embodiments of the data security grading device provided below, reference may be made to the limitations of the data security grading method above, which are not repeated here.
In one embodiment, as shown in FIG. 13, there is provided a data security ranking apparatus 1300 comprising: a determination module 1302, a collection module 1304, an identification module 1306, and a decision module 1308, wherein:
a determining module 1302, configured to determine a target field to be classified, where the target field is a field in the data table;
a collection module 1304 for collecting identification data required to classify the target field, the identification data including at least one of metadata of the data table or sample data of the target field in the data table;
the identifying module 1306 is configured to traverse the preset field subclasses, and identify the identifying data by using an identifying rule configured for the field subclass to obtain an identifying result of whether the target field matches the field subclass;
a decision module 1308, configured to determine, according to the identification result of each field subclass corresponding to the target field, a field subclass to which the target field belongs; and determining the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class.
In one embodiment, the determining module 1302 is further configured to determine a data table to be security classified; acquiring metadata of a data table, and obtaining fields contained in the data table from the metadata; and taking the field to be subjected to security classification in the fields contained in the data table as a target field.
In one embodiment, the collecting module 1304 is further configured to read metadata of the data table, where the metadata includes at least one of a table name of the data table, a library name of a database where the data table is located, a cluster name of a cluster where the database is located, table description information of the data table, a field included in the data table, or field description information of the included field; acquiring field description information of a target field from metadata; generating a query statement according to the cluster name, the library name, the table name and the target field indicated by the metadata, and extracting sample data of the target field from a corresponding data table of a corresponding database of a corresponding cluster according to the query statement; metadata of the data table, field description information of the target field and sample data are used as identification data.
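Generating the query statement from the metadata can be sketched as follows. The metadata keys (`cluster_name`, `db_name`, `table_name`), the row limit, and the `SELECT ... LIMIT` form are assumptions for illustration; the application does not name a SQL dialect or state how many sample rows are extracted.

```python
import re
from typing import Tuple

# Only plain identifiers are accepted, since names cannot be bound as query parameters.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_sample_query(meta: dict, field: str, limit: int = 100) -> Tuple[str, str]:
    """Return (cluster name, query statement) for sampling one target field.

    The cluster name selects where to run the statement; the library (database)
    name, table name and field name are placed into the statement itself.
    """
    for part in (meta["db_name"], meta["table_name"], field):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    sql = (
        f"SELECT {field} FROM {meta['db_name']}.{meta['table_name']} "
        f"WHERE {field} IS NOT NULL LIMIT {int(limit)}"
    )
    return meta["cluster_name"], sql
```

Validating the identifiers before building the statement is a design choice made here because the names come from metadata rather than from trusted code.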
In one embodiment, the identifying module 1306 is further configured to determine an identification object, an identification mode, a preset weight, and a preset identification threshold corresponding to the identification rule configured for the field subclass; the identification object is used for indicating the type of the identification data, and comprises at least one of a table name of a data table, table description information of the data table, a field name, field description information or sample data of a field; the identification mode comprises at least one of identification through a pattern matching algorithm or identification through a natural language processing model; identify, in the identification mode corresponding to the identification rule, the data in the identification data that corresponds to the identification object, to obtain a matching probability that the target field matches the field subclass, and obtain a recognition probability based on the preset weight and the matching probability; and compare the recognition probability with the preset recognition threshold, obtaining, when the recognition probability is greater than or equal to the preset recognition threshold, a recognition result that the target field matches the field subclass, and obtaining, when the recognition probability is smaller than the preset recognition threshold, a recognition result that the target field does not match the field subclass.
In one embodiment, when a plurality of recognition rules are configured for the field subclass, the recognition module 1306 is further configured to recognize, by using each recognition rule configured for the field subclass, the recognition object specified by that recognition rule in the recognition data, so as to obtain, for each recognition rule, a recognition probability that the target field matches the field subclass; weight the recognition probabilities according to the weights configured for the recognition rules, to obtain the recognition probability of the target field for the field subclass; and compare the recognition probability with the preset recognition threshold configured for the field subclass, obtaining, when the recognition probability is greater than or equal to the preset recognition threshold, a recognition result that the target field matches the field subclass, and obtaining, when the recognition probability is smaller than the preset recognition threshold, a recognition result that the target field does not match the field subclass.
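One way a pattern-matching rule could produce a matching probability from sample data is to take the share of sample values that match a regular expression. This is an assumption made for illustration; the application does not state how the matching probability is computed, and the amount pattern below is a made-up example.

```python
import re
from typing import List, Tuple


def sample_match_probability(pattern: str, samples: List[str]) -> float:
    """Fraction of sample values that fully match the pattern (0.0 if there are none)."""
    if not samples:
        return 0.0
    compiled = re.compile(pattern)
    hits = sum(1 for value in samples if compiled.fullmatch(value))
    return hits / len(samples)


def apply_rule(pattern: str, samples: List[str], weight: float,
               threshold: float) -> Tuple[bool, float]:
    """Scale the matching probability by the preset weight and compare with the threshold."""
    recognition_probability = weight * sample_match_probability(pattern, samples)
    return recognition_probability >= threshold, recognition_probability


# Hypothetical pattern for amounts with up to two decimal places.
AMOUNT_PATTERN = r"\d+(\.\d{1,2})?"
matched, probability = apply_rule(AMOUNT_PATTERN, ["12.50", "300", "n/a", "7.1"], 1.0, 0.7)
```

Using a ratio rather than a yes/no match lets dirty sample values such as `"n/a"` lower the probability without ruling the subclass out entirely.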
In one embodiment, the recognition result is determined according to a recognition probability characterizing how well the target field matches the field subclass; the decision module 1308 is further configured to determine, if the recognition results corresponding to the plurality of field subclasses each indicate that the target field matches the corresponding field subclass, a field subclass with the highest recognition probability among the plurality of field subclasses as a field subclass to which the target field belongs.
In one embodiment, the decision module 1308 is further configured to determine a security level to which each field included in the data table belongs; and taking the security level with the highest level among the security levels to which the fields belong as the security level of the data table.
In one embodiment, the data security ranking apparatus 1300 further comprises: the confirming module is used for responding to the confirming operation of the security level of the field in the grading confirming interface and modifying the state of the security level of the field into a confirmed state; and in response to a modification operation of the security level of the field in the classification confirmation interface, modifying the security level state of the field to a confirmed state after modifying the security level of the field.
In one embodiment, the data security ranking apparatus 1300 further comprises: a training module, configured to construct, for a field whose security level has not been modified, a training positive sample of the field subclass according to the attribute data of the field and the identified security level; construct, for a field whose security level has been modified, a training negative sample of the field subclass according to the attribute data of the field and the identified security level; and train a deep-learning-based natural language processing model according to the training positive samples and the training negative samples, wherein the trained natural language processing model is used for constructing the recognition rules configured for field subclasses.
In one embodiment, the data security ranking apparatus 1300 further comprises: the identification rule configuration module is used for receiving a trigger operation of adding an identification rule for the target field subclass in the rule configuration interface; and responding to the triggering operation, and adding a corresponding recognition rule for the target field subclass according to the recognition object, the recognition mode and the preset weight selected for the recognition rule.
In one embodiment, the data security ranking apparatus 1300 further comprises: the security level recording module is used for recording the security level of the data table into metadata of the data table after determining the security level of the data table according to the security level of each field contained in the data table; and configuring access rights corresponding to the security level for the data table according to the security level indicated in the metadata of the data table.
With the data security grading apparatus 1300, a corresponding recognition rule is configured for each preset field subclass, so that when the classification and grading standards of an industry change, the corresponding recognition rules can be adjusted and take effect immediately. When the target field in a data table needs security classification and grading, the recognition rules can therefore be extended on the spot, which improves the efficiency of classifying and grading fields. Moreover, because recognition rules are configured for each field subclass and can be adjusted according to their recognition effect, recognition accuracy can be improved. In addition, there is no need to configure rules that determine the security classification and grade directly; such rules cover limited scenarios and reflect a limited understanding of the data, so the classification and grading of some data is inaccurate. Instead, field subclasses are divided according to service scenarios, and each field subclass is assigned to a corresponding security classification and security level. The field subclass to which the target field belongs is determined from the recognition result for each field subclass, and the security classification and security level of the target field are then determined from the security classification to which the field subclass belongs and the security level corresponding to that security classification, which improves the accuracy of security classification and grading of data.
The modules in the data security grading apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server or a terminal, and the internal structure of which may be as shown in fig. 14. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store data tables or metadata for data tables. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external device through a network connection. The computer program is executed by a processor to implement a data security ranking method.
It will be appreciated by those skilled in the art that the structure shown in fig. 14 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the data security classification method provided by the embodiment of the present application when the processor executes the computer program, where:
determining a target field to be subjected to security classification, wherein the target field is a field in a data table;
collecting identification data required for classifying the target field, wherein the identification data comprises at least one of metadata of a data table or sample data of the target field in the data table;
traversing preset field subclasses, and identifying the identification data through an identification rule configured for the field subclasses to obtain an identification result of whether the target field is matched with the field subclasses when traversing to one field subclass;
determining the field subclass to which the target field belongs according to the identification result of each field subclass corresponding to the target field;
and determining the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the data security classification method provided by the embodiment of the application, such as:
determining a target field to be subjected to security classification, wherein the target field is a field in a data table;
collecting identification data required for classifying the target field, wherein the identification data comprises at least one of metadata of a data table or sample data of the target field in the data table;
traversing preset field subclasses, and identifying the identification data through an identification rule configured for the field subclasses to obtain an identification result of whether the target field is matched with the field subclasses when traversing to one field subclass;
determining the field subclass to which the target field belongs according to the identification result of each field subclass corresponding to the target field;
and determining the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class.
In one embodiment, a computer program product is provided, which includes a computer program, where the computer program when executed by a processor implements the steps of the data security classification method provided by the embodiment of the application, such as:
determining a target field to be subjected to security classification, wherein the target field is a field in a data table;
collecting identification data required for classifying the target field, wherein the identification data comprises at least one of metadata of a data table or sample data of the target field in the data table;
traversing preset field subclasses, and identifying the identification data through an identification rule configured for the field subclasses to obtain an identification result of whether the target field is matched with the field subclasses when traversing to one field subclass;
determining the field subclass to which the target field belongs according to the identification result of each field subclass corresponding to the target field;
and determining the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (15)

1. A method of data security classification, the method comprising:
determining a target field to be subjected to security grading, wherein the target field is a field in a data table;
collecting identification data required for classifying the target field, wherein the identification data comprises at least one of metadata of the data table or sample data of the target field in the data table;
traversing preset field subclasses, and identifying the identification data through an identification rule configured for the field subclasses to obtain an identification result of whether the target field is matched with the field subclasses or not when traversing one field subclass;
determining a field subclass to which the target field belongs according to the identification result of each field subclass corresponding to the target field;
and determining the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class.
2. The method of claim 1, wherein the determining a target field to be subjected to security grading comprises:
determining a data table to be subjected to security classification;
acquiring metadata of the data table, and acquiring fields contained in the data table from the metadata;
and taking the field to be subjected to security classification in the fields contained in the data table as the target field.
3. The method of claim 1, wherein the collecting identification data required to categorize the target field comprises:
reading metadata of the data table, wherein the metadata comprises at least one of a table name of the data table, a library name of a database where the data table is located, a cluster name of a cluster where the database is located, table description information of the data table, a field contained in the data table or field description information of the contained field;
acquiring field description information of the target field from the metadata;
generating a query statement according to the cluster name, the library name, the table name and the target field indicated by the metadata, and extracting sample data of the target field from a corresponding data table of a corresponding database of a corresponding cluster according to the query statement;
and taking metadata of the data table, field description information of the target field and the sample data as the identification data.
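The query-generation step of claim 3 might look like the sketch below. The function name, the MySQL-style backtick quoting, the `IS NOT NULL` filter, and the row limit are assumptions for illustration; the claim only states that a query statement is generated from the cluster name, library name, table name and target field.

```python
def build_sample_query(cluster, library, table, field, limit=100):
    """Return (cluster, sql) for sampling one field; the SQL dialect is assumed."""

    def quote(identifier):
        # Escape embedded backticks so metadata values cannot break out of the identifier.
        return "`" + identifier.replace("`", "``") + "`"

    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")

    sql = (
        f"SELECT {quote(field)} FROM {quote(library)}.{quote(table)} "
        f"WHERE {quote(field)} IS NOT NULL LIMIT {limit}"
    )
    # The cluster name is returned alongside the SQL so the caller can
    # route the statement to the corresponding cluster.
    return cluster, sql
```

Quoting the identifiers matters here because the names come from metadata rather than from trusted code.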
4. The method according to claim 1, wherein the identifying the identification data through an identification rule configured for the field subclass to obtain an identification result of whether the target field matches the field subclass comprises:
determining an identification object, an identification mode, preset weight and a preset identification threshold corresponding to an identification rule configured for the field subclass; the identification object is used for indicating the type of the identification data, and comprises at least one of a table name of a data table, table description information of the data table, a field name, field description information or sample data of a field; the identification mode comprises at least one of identification through a pattern matching algorithm or identification through a natural language processing model;
identifying, in the identification mode corresponding to the identification rule, the data in the identification data that corresponds to the identification object, to obtain a matching probability that the target field matches the field subclass, and obtaining a recognition probability based on the preset weight and the matching probability;
comparing the recognition probability with the preset recognition threshold, when the recognition probability is larger than or equal to the preset recognition threshold, obtaining a recognition result that the target field is matched with the field subclass, and when the recognition probability is smaller than the preset recognition threshold, obtaining a recognition result that the target field is not matched with the field subclass.
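A single rule as described in claim 4 can be sketched as below, using pattern matching as the identification mode. The claim does not say how the preset weight and the matching probability are combined; multiplying them is an assumption, as are the class and function names.

```python
import re
from dataclasses import dataclass


@dataclass
class RecognitionRule:
    target: str       # identification object, e.g. "field_name" or "sample_data"
    pattern: str      # regex used for the pattern-matching identification mode
    weight: float     # preset weight
    threshold: float  # preset identification threshold


def matching_probability(rule, identification_data):
    """Fraction of the targeted values that fully match the rule's pattern."""
    values = identification_data.get(rule.target, [])
    if not values:
        return 0.0
    hits = sum(1 for v in values if re.fullmatch(rule.pattern, v))
    return hits / len(values)


def rule_matches(rule, identification_data):
    # Assumption: recognition probability = preset weight * matching probability.
    recognition_probability = rule.weight * matching_probability(rule, identification_data)
    return recognition_probability >= rule.threshold
```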
5. The method according to claim 1, wherein when a plurality of recognition rules are configured for the field subclass, the recognizing the recognition data by the recognition rules configured for the field subclass to obtain a recognition result of whether the target field matches the field subclass includes:
identifying the identification objects designated by the corresponding identification rules in the identification data through the identification rules configured for the field subclasses, so as to obtain the identification probability of the target field matching each field subclass;
weighting each recognition probability according to the weight configured for each recognition rule to obtain the recognition probability of the target field corresponding to the field subclass;
comparing the recognition probability with a preset recognition threshold configured for the field subclass, obtaining a recognition result of the target field matching the field subclass when the recognition probability is greater than or equal to the preset recognition threshold, and obtaining a recognition result of the target field not matching the field subclass when the recognition probability is smaller than the preset recognition threshold.
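When several rules are configured for one field subclass, the weighting in claim 5 can be illustrated as a weighted average. Normalizing by the total weight is an assumption made here so the result stays in the range 0 to 1; the claim only states that the probabilities are weighted.

```python
def combined_probability(rule_results):
    """rule_results: list of (recognition_probability, weight) pairs, one per rule."""
    total_weight = sum(weight for _, weight in rule_results)
    if total_weight <= 0:
        return 0.0
    return sum(prob * weight for prob, weight in rule_results) / total_weight


def subclass_matches(rule_results, threshold):
    """Compare the combined probability with the subclass's preset threshold."""
    return combined_probability(rule_results) >= threshold
```

For example, a field-name rule scoring 1.0 with weight 3 and a sample-data rule scoring 0.0 with weight 1 combine to 0.75.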
6. The method of claim 1, wherein the recognition result is determined based on a recognition probability characterizing how well the target field matches a field sub-class; the determining the field subclass to which the target field belongs according to the identification result of each field subclass traversed by the target field includes:
if the recognition results corresponding to the plurality of field subclasses indicate that the target field matches the corresponding field subclass, determining the field subclass with the highest recognition probability in the plurality of field subclasses as the field subclass to which the target field belongs.
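The tie-breaking in claim 6 amounts to taking the maximum over the matched subclasses. A short sketch, with a hypothetical result structure:

```python
def pick_subclass(results):
    """results maps subclass name -> (matched, recognition_probability)."""
    matched = {name: prob for name, (ok, prob) in results.items() if ok}
    if not matched:
        return None  # the target field matched no subclass
    # Among the matched subclasses, keep the one with the highest probability.
    return max(matched, key=matched.get)
```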
7. The method according to claim 1, wherein the method further comprises:
determining the security level of each field contained in the data table;
and taking the security level with the highest level among the security levels to which the fields belong as the security level of the data table.
8. The method according to claim 1, wherein the method further comprises:
in response to a confirmation operation on the security level of a field in a classification confirmation interface, modifying the state of the security level of the field to a confirmed state;
and in response to a modification operation on the security level of the field in the classification confirmation interface, modifying the security level of the field and then modifying the state of the security level of the field to the confirmed state.
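The two review operations of claim 8 can be modeled as simple state transitions. The state names and the `modified` flag are hypothetical; the flag is kept only so that later steps can tell confirmed-as-is fields from corrected ones.

```python
from dataclasses import dataclass


@dataclass
class FieldGrading:
    field: str
    level: int
    state: str = "pending"  # hypothetical state before review
    modified: bool = False  # whether a reviewer changed the identified level


def confirm(grading):
    """Reviewer accepts the identified security level as-is."""
    grading.state = "confirmed"
    return grading


def modify(grading, new_level):
    """Reviewer corrects the level; the grading is then treated as confirmed."""
    grading.level = new_level
    grading.modified = True
    grading.state = "confirmed"
    return grading
```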
9. The method of claim 8, wherein the method further comprises:
for a field with the security level not modified, constructing a training positive sample of the field subclass according to attribute data of the field and the identified security level;
for the data table with the modified security level, constructing a training negative sample of the field subclass according to the attribute data of the field and the identified security level;
and training a natural language processing model based on deep learning according to the training positive sample and the training negative sample, wherein the trained natural language processing model is used for constructing recognition rules configured for the field subclasses.
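The sample construction in claim 9 can be sketched as a split on whether a reviewer changed the identified level. The dictionary keys and the way attribute data is flattened into text are assumptions; model training itself is out of scope for this sketch.

```python
def build_training_samples(reviewed_fields):
    """Split reviewed fields into (positives, negatives) for their subclass."""
    positives, negatives = [], []
    for item in reviewed_fields:
        sample = {
            "subclass": item["subclass"],
            # Attribute data flattened to text for a language model (assumed format).
            "text": f"{item['field_name']} {item['description']}".strip(),
            "identified_level": item["identified_level"],
        }
        if item["level_modified"]:
            negatives.append(sample)  # reviewer corrected the identified level
        else:
            positives.append(sample)  # identified level was accepted unchanged
    return positives, negatives
```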
10. The method according to claim 1, wherein the method further comprises:
receiving a trigger operation of adding an identification rule for the target field subclass in the rule configuration interface;
and responding to the triggering operation, and adding a corresponding recognition rule for the target field subclass according to the recognition object, the recognition mode and the preset weight selected for the recognition rule.
11. The method according to any one of claims 1 to 10, further comprising:
after determining the security level of the data table according to the security level of each field included in the data table, recording the security level of the data table into metadata of the data table;
and configuring access rights corresponding to the security level for the data table according to the security level indicated in the metadata of the data table.
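Claims 7 and 11 together describe deriving a table's level from its fields, recording it in the metadata, and configuring access from it. The sketch below assumes that a larger number means a more sensitive level and uses a hypothetical role-based policy.

```python
# Hypothetical policy: roles allowed to read tables at each security level.
ACCESS_POLICY = {
    1: {"analyst", "engineer", "security_officer"},
    2: {"engineer", "security_officer"},
    3: {"security_officer"},
}


def grade_table(metadata, field_levels):
    """Record the highest field level in a copy of the table metadata."""
    if not field_levels:
        raise ValueError("at least one graded field is required")
    updated = dict(metadata)
    updated["security_level"] = max(field_levels.values())
    return updated


def allowed_roles(metadata):
    """Look up the access rights that correspond to the recorded level."""
    return ACCESS_POLICY[metadata["security_level"]]
```

Taking the maximum means a single highly sensitive field is enough to restrict the whole table.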
12. A data security classification apparatus, the apparatus comprising:
a determining module, configured to determine a target field to be subjected to security classification, wherein the target field is a field in a data table;
a collection module, configured to collect identification data required for classifying the target field, where the identification data includes at least one of metadata of the data table or sample data of the target field in the data table;
an identification module, configured to traverse preset field subclasses and, each time a field subclass is traversed, identify the identification data through an identification rule configured for that field subclass, to obtain an identification result indicating whether the target field matches the field subclass;
a decision module, configured to determine the field subclass to which the target field belongs according to the identification result of each field subclass corresponding to the target field, and to determine the security class and the security level of the target field according to the security class corresponding to the field subclass and the security level corresponding to the security class.
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when the computer program is executed.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202211479544.1A 2022-11-24 2022-11-24 Data security classification method, apparatus, device, storage medium and program product Pending CN117009509A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211479544.1A CN117009509A (en) 2022-11-24 2022-11-24 Data security classification method, apparatus, device, storage medium and program product

Publications (1)

Publication Number Publication Date
CN117009509A true CN117009509A (en) 2023-11-07

Family

ID=88566157

Country Status (1)

Country Link
CN (1) CN117009509A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117435144A (en) * 2023-12-20 2024-01-23 山东云天安全技术有限公司 Intelligent data hierarchical security management method and system for data center
CN117435144B (en) * 2023-12-20 2024-03-22 山东云天安全技术有限公司 Intelligent data hierarchical security management method and system for data center

Legal Events

Date Code Title Description
PB01 Publication