CN113590698B

CN113590698B - Artificial intelligence technology-based data asset classification modeling and hierarchical protection method

Info

Publication number: CN113590698B
Application number: CN202110725975.0A
Authority: CN
Inventors: 石凯; 张锋军; 牛作元; 许杰; 李庆华
Original assignee: CETC 30 Research Institute
Current assignee: CETC 30 Research Institute
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2023-01-31
Anticipated expiration: 2041-06-29
Also published as: CN113590698A

Abstract

The invention discloses a data asset classification modeling and hierarchical protection method based on artificial intelligence technology, comprising the following steps. Determining a data source: the data sources are the databases and big data platforms that store the data assets. Data sampling: confirming the connection information of the data source, establishing a connection to it through MDBC, ODBC, or a database driver, and extracting data according to a configured sampling strategy. Data category modeling: establishing a classification model of the data assets. Data security classification: dividing the data into levels according to its sensitivity and its category in the course of data use. Formulating and issuing the data security policy: automatically adapting the corresponding data security policy to the security level of the data. The invention uses the data category model to determine the security level of data assets, achieves intelligent grading of data under visual operation, and allows data security policies to be issued promptly, thereby laying a foundation for differentiated, graded security protection of data.

Description

Artificial intelligence technology-based data asset classification modeling and hierarchical protection method

Technical Field

The invention relates to the technical field of electric digital data processing, in particular to a data asset classification modeling and hierarchical protection method based on an artificial intelligence technology.

Background

With the rapid development of the internet, the rapid build-out of information systems, and the adoption of cloud and big data technologies, the big data systems of every industry have accumulated large volumes of data that form corresponding data assets, and both the variety and the volume of this data have reached a large scale. Applying one uniform, undifferentiated security protection policy to all of this data either lowers the level of protection or lowers the efficiency of data use, so a good balance between business use and security protection cannot be achieved. Data assets therefore have differing requirements, both in their business attributes and in their level of security protection.

Data security management follows the so-called '1990 principle' (1-9-90): 1% of data is core data, 9% is important data, and 90% is general data. Once data has been assigned security categories and levels, a high-level protection policy can be set for the 10% made up of core and important data and an ordinary-level policy for the 90% of general data. This secures the highly sensitive data while preserving the overall efficiency and convenience of data use, achieving a dynamic balance between data security protection and business use of the data.

At present, the usual approach to effective graded and classified protection of the data assets of a big data system starts from the characteristics of data in big data scenarios: the data is regarded as sensitive as a whole, but different types of data assets are assigned to different security categories, and sensitivity is graded accordingly. Different protection standards and protection policies are then formulated for data of different categories and levels.

Following this approach, most data classification and grading in the industry today is a feature of a data asset management system. Most implementations discover sensitive data automatically and then grade it with manual assistance. Although this helps the staff concerned find sensitive data quickly, the grading relies heavily on subjective judgment, the grading scheme is inflexible, and it cannot adapt to the data security grading requirements of different organizations. Current technical results focus on identifying, classifying, and grading sensitive data, chiefly personal and financial information, and the data handled is mainly structured data such as text. Classification and grading of unstructured data such as video, audio, and pictures, and of semi-structured data such as XML and JSON, is lacking, so these results do not suit the data assets of a big data system. Security category models therefore need to be built specifically for different types of data assets, and corresponding graded protection measures need to be taken.

The prior art has the following technical problems:

(1) Sensitive data asset discovery is weak. Sensitive data discovery is the basis of data classification and a precondition for objective judgment: for example, sensitive data in an organization is found promptly by recognizing items such as telephone numbers, identity card numbers, social security card numbers, and bank account numbers. Current sensitive data discovery, however, supports few data types and has low accuracy.

(2) In a typical big data application scenario, the presence of massive heterogeneous data makes it difficult to dynamically adjust security and confidentiality policies at different granularities for different data types.

(3) The traditional way of matching grading labels to data categories has a high mismatch rate, and correcting it manually consumes a great deal of labor and time.

Disclosure of Invention

In order to solve the problems, the invention provides a data asset classification modeling and hierarchical protection method based on an artificial intelligence technology, which comprises the following steps:

S1, determining a data source: the data sources are the databases and big data platforms that store the data assets, including traditional relational databases and big data platforms typified by Hadoop;

S2, data sampling: confirming the connection information of the data source, establishing a connection to the data source in one of three ways (MDBC, ODBC, or a database driver), and extracting data according to a configured sampling strategy, wherein the sampling strategy information comprises whether to sample in full, the sampling quantity, the sampling interval, and the sampling concurrency; the connection information of the data source comprises an IP address, a port number, an account name and/or an access mode;

S3, data category modeling: establishing a classification model of the data assets to classify the data assets, the subsequent data security classification being based on the classification result; for data with obvious and specific characteristics, the data category features are described, and pattern recognition is performed on data item content, by means of regular expressions; for most data, a machine-learning-based knowledge base learning engine automatically learns the features of the data categories, and the data types obtained from data classification and intelligent identification are associated automatically;

S4, data security classification: dividing the data into levels according to its sensitivity and its category in the course of data use, so as to achieve differentiated data protection;

S5, formulating and issuing the data security policy: automatically adapting the corresponding data security policy to the security level of the data, wherein the adaptation scheme can be adjusted or modified, and the data security policy is preset according to the security level of the data and/or the level of the data consumer.
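By way of illustration only, the sampling strategy of step S2 could be represented and applied as in the following Python sketch. The names SamplingStrategy and sample_rows and all field names are hypothetical and are not part of the method; the rows are assumed to have already been fetched over the chosen connection, and the concurrency field is carried but not exercised here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SamplingStrategy:
    """Hypothetical container for the four sampling strategy fields named in S2."""
    full: bool = False     # whether to sample in full
    quantity: int = 100    # maximum number of rows to draw
    interval: int = 1      # take every `interval`-th row
    concurrency: int = 1   # number of parallel workers (not exercised in this sketch)


def sample_rows(rows: Sequence[dict], strategy: SamplingStrategy) -> List[dict]:
    """Return the rows selected by the strategy from an already-fetched result set."""
    if strategy.full:
        return list(rows)
    if strategy.interval < 1:
        raise ValueError("interval must be >= 1")
    picked = rows[::strategy.interval]       # apply the sampling interval
    return list(picked[:strategy.quantity])  # cap at the sampling quantity


rows = [{"id": i} for i in range(10)]
sampled = sample_rows(rows, SamplingStrategy(quantity=3, interval=4))
print(sampled)
```

With ten rows, an interval of 4, and a quantity of 3, the sketch keeps the rows with ids 0, 4, and 8.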

Further, the data category modeling in step S3 includes four sub-steps of feature definition, feature learning, automatic association, and pattern recognition, where:

the feature definition includes: for identified data with specific characteristics, describing and defining the features of the data by means including regular expressions; for data items that carry a check algorithm, defining the specific check algorithm and using it to strengthen the definition of the data features; for data categories with a limited value set, using a feature library to assist the definition of the data features.

The feature learning includes: automatically scanning each column of structured data in the databases and big data platforms by machine learning and generating its features, so that the features of every data item in a massive data set are generated and classified, thereby automating category modeling.

The automatic association includes: automatically aggregating data fields with similar features based on the results of feature definition and feature learning, identifying sets of data of the same type from the aggregation result, and automatically assigning subsequent levels to those sets.

The pattern recognition identifies the sensitive data present in the data, and the data type to which it belongs, by screening different types of data, and comprises the following sub-steps:

establishing a feature library: preprocessing a training data set using techniques including word segmentation to obtain a word set from it; removing meaningless words from the word set to obtain a feature set of practical significance; and processing the feature set, wherein the more frequently a feature occurs across all training data sets, the more important the feature is and the higher its vector weight; the vector weight of each feature is calculated, completing the feature library;

after the feature library is obtained, identifying and classifying its contents and selecting the features that are representative and can mark sensitive data to form a sensitive feature library; extracting features from the target to be classified by applying word segmentation to the target data; then matching the extracted features against the sensitive feature library and, on each hit, recording the category of the sensitive word and its weight; the higher the accumulated sensitive-word weight of a category, the more the target data tends toward that category, and the categories are ranked by accumulated weight from high to low.
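The two pattern recognition sub-steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: whitespace splitting stands in for a real word segmenter, the stop-word list stands in for the "meaningless words", relative frequency is used as the vector weight (the text only states that weight rises with frequency), and every function name and sample string is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

STOP_WORDS = {"the", "of", "a", "is", "and"}  # assumed set of meaningless words


def segment(text: str) -> List[str]:
    """Stand-in word segmentation: lowercase whitespace split minus stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_feature_library(training_texts: List[str]) -> Dict[str, float]:
    """Weight each feature by its relative frequency across the training set."""
    counts = Counter(w for text in training_texts for w in segment(text))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def rank_categories(target: str,
                    sensitive_library: Dict[str, Tuple[str, float]]
                    ) -> List[Tuple[str, float]]:
    """Accumulate the weights of matched sensitive words per category, highest first."""
    scores: Dict[str, float] = {}
    for word in segment(target):
        if word in sensitive_library:
            category, weight = sensitive_library[word]
            scores[category] = scores.get(category, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


library = build_feature_library(
    ["salary of the account", "account balance", "patient diagnosis"])
# Hand-picked representative features, each labelled with an assumed category.
sensitive = {
    "account": ("financial", library["account"]),
    "balance": ("financial", library["balance"]),
    "diagnosis": ("medical", library["diagnosis"]),
}
ranking = rank_categories("account balance and diagnosis", sensitive)
print(ranking)
```

In this toy run the target text matches two financial features and one medical feature, so the financial category accumulates the larger weight and is ranked first.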

Further, the data security classification described in step S4 includes four sub-steps of level definition, category correspondence, hierarchical modification, and hierarchical association, where:

the level definition includes: defining the security levels involved in the data assets, each with a level name and a level description.

The category correspondence includes: providing a visual data resource catalog and, on the basis of that catalog, associating data categories with data security levels, the correspondence being either associated automatically through machine learning or selected manually.

The hierarchical modification includes: learning and accumulating, through machine learning with feedback, the correspondence between categories and levels; learning the association automates data level association, and learning from manual corrections continuously refines and optimizes the overall association.

The hierarchical association includes: providing a machine-learning-based automatic data level association engine, which maps the features of data items to levels automatically through a machine-learning-based level association mapping function, so as to improve the efficiency of data grading; the machine learning techniques adopted by the engine include base classifiers and ensemble classifiers, wherein the base classifiers include the K-nearest-neighbor method, support vector machines, decision trees, naive Bayes, neural networks from deep learning, and/or logistic regression; the ensemble classifiers include bagging, boosting, and/or stacking.
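As one possible illustration of a base classifier in the level association engine, the following Python sketch applies the K-nearest-neighbor method to map a data item's feature vector to a security level. The three-component feature vector, the level names, and the labelled examples are assumptions made for the sketch; the method does not prescribe them.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple

Labelled = Tuple[Sequence[float], str]


def knn_level(sample: Sequence[float], labelled: List[Labelled], k: int = 3) -> str:
    """Assign the level held by the majority of the k closest labelled items."""
    nearest = sorted(labelled, key=lambda item: dist(sample, item[0]))[:k]
    votes = Counter(level for _, level in nearest)
    return votes.most_common(1)[0][0]


# Assumed features: (share of digit characters, normalised length, regex hit flag).
labelled_items: List[Labelled] = [
    ((1.00, 0.90, 1.0), "core"),
    ((0.95, 0.85, 1.0), "core"),
    ((0.90, 0.55, 1.0), "important"),
    ((0.10, 0.30, 0.0), "general"),
    ((0.05, 0.25, 0.0), "general"),
    ((0.20, 0.40, 0.0), "general"),
]

print(knn_level((0.97, 0.88, 1.0), labelled_items))
print(knn_level((0.10, 0.30, 0.0), labelled_items))
```

The first query lies next to the two "core" examples and the second next to the "general" ones, so the majority vote returns those levels. An ensemble such as bagging would combine several such classifiers trained on resampled data.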

Further, the data security policy of step S5 is formulated and issued through a presentation layer, a policy management layer, and an analysis layer:

the presentation layer performs the earlier-stage category association and grading of data through a visual data grading operation interface, and allows matches from automatic matching that need correction to be corrected;

the policy management layer comprises data sampling policy management, classification policy management, and grading policy management, wherein data sampling policy management manages the policies for sampling data assets, data security adjustment suggestions, and data security events; classification policy management manages the policies for dividing data into security categories according to the data security category model; and grading policy management manages the policies for grading according to the management of data categories and security levels;

and the analysis layer performs the classification and grading of data through data category modeling and data security classification.
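The automatic adaptation of a preset policy to the data security level and the data consumer level, as described for step S5, might look like the following Python sketch. The level names, the policy fields, and the function adapt_policy are all hypothetical; the overrides argument illustrates that the adaptation scheme can be adjusted or modified.

```python
from typing import Any, Dict, Optional

# Hypothetical preset policies keyed by data security level.
PRESET_POLICIES: Dict[str, Dict[str, Any]] = {
    "core":      {"encrypt": True,  "mask": True,  "min_consumer_level": 3},
    "important": {"encrypt": True,  "mask": False, "min_consumer_level": 2},
    "general":   {"encrypt": False, "mask": False, "min_consumer_level": 1},
}


def adapt_policy(data_level: str,
                 consumer_level: int,
                 overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Pick the preset for the data level, apply manual adjustments, check the consumer."""
    policy = dict(PRESET_POLICIES[data_level])  # copy so the preset stays untouched
    if overrides:
        policy.update(overrides)                # manual adjustment of the adaptation
    policy["access_granted"] = consumer_level >= policy["min_consumer_level"]
    return policy


print(adapt_policy("core", consumer_level=2))
print(adapt_policy("important", consumer_level=2, overrides={"mask": True}))
```

Keeping the presets in a table separate from the adaptation function means a policy change is a data edit rather than a code change, which suits policies that must be issued quickly.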

The invention has the beneficial effects that:

(1) Models of data in different security categories are built in two ways, feature definition and knowledge base learning. The system connects automatically to big data platforms and databases to detect data assets, takes a comprehensive inventory of them, and forms a data asset map, providing a solid basis for the security analysis and application of data.

(2) Data grading is carried out on the basis of data asset security category modeling, with a tool assisting the administrator in assigning security levels to different data types. Centralized display of data assets, definition of data sensitivity levels, and automatic and manual grading of data are realized; rapid issuing of data security policies is supported, and the efficiency of comprehensive security management is improved.

(3) Differentiated data security protection policies are set for data of different levels. This achieves large-scale automatic adjustment of data security policies based on data security categories and efficient, accurate security control of data assets, shortens the time for a data security policy to take effect, and improves the efficiency of data security management decisions.

In summary, the invention addresses typical big data application scenarios. It studies techniques such as analytical modeling of data asset categories and security grading based on data assets, targeting problems found across industries: large data asset scale, high data update frequency, difficulty in discovering data security attributes, inability to issue data security policies in time, and insufficient data grading capability. The data category model is used to determine the security level of data assets, intelligent grading of data under visual operation is achieved, and data security policies can be issued promptly, laying a foundation for differentiated, graded security protection of data. The invention provides intelligently assisted graded and classified protection of data assets and achieves comprehensive protection of data assets with multiple sources, multiple types, multiple security levels, and multiple protection requirements across different network environments.

Drawings

FIG. 1 is a flow chart of a data asset classification modeling and hierarchical protection method based on artificial intelligence technology according to embodiment 1 of the present invention;

FIG. 2 is a flow chart of data category modeling according to embodiment 1 of the present invention;

FIG. 3 is a flow chart of data security classification of embodiment 1 of the present invention;

FIG. 4 is a flow chart of data security policy formulation and issuing according to embodiment 1 of the present invention.

Detailed Description

In order to more clearly understand the technical features, objects, and effects of the present invention, specific embodiments of the present invention will now be described. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

As shown in FIG. 1, the embodiment provides a data asset classification modeling and hierarchical protection method based on artificial intelligence technology, which includes the following steps:

S2, data sampling: confirming the connection information of the data source, establishing a connection to the data source in one of three ways (MDBC, ODBC, or a database driver), and extracting data according to a configured sampling strategy, wherein the sampling strategy information comprises whether to sample in full, the sampling quantity, the sampling interval, and the sampling concurrency; the connection information of the data source comprises an IP address, a port number, an account name and/or an access mode;

S3, data category modeling: establishing a classification model of the data assets to classify the data assets, the subsequent data security classification being based on the classification result; for data with obvious and specific characteristics, the data category features are described, and pattern recognition is performed on data item content, by means of regular expressions; for most data, a machine-learning-based knowledge base learning engine automatically learns the features of the data categories, and the data types obtained from data classification and intelligent identification are associated automatically, which reduces both the workload of defining features manually and the mismatch rate of identification;

S4, data security classification: dividing the data into levels according to its sensitivity and its category in the course of data use, so as to achieve differentiated data protection;

S5, formulating and issuing the data security policy: automatically adapting the corresponding data security policy to the security level of the data, wherein the adaptation scheme can be adjusted or modified, and the data security policy is preset according to the security level of the data and/or the level of the data consumer.

Data category modeling, data security classification, and the formulation and issuing of data security policies are the key points of the invention.

In step S3, data category modeling builds a data category model from a small amount of data, and the model can then be used to classify subsequent unclassified data quickly. Preferably, as shown in FIG. 2, data category modeling includes four sub-steps: feature definition, feature learning, automatic association, and pattern recognition, wherein:

Feature definition: for identified data with specific characteristics, the features of the data are described and defined by means including regular expressions; for data items that carry a check algorithm, the specific check algorithm is defined and used to strengthen the definition of the data features; for data categories with a limited value set, a feature library is used to assist the definition of the data features.

Feature learning: each column of structured data in the databases and big data platforms is scanned automatically by machine learning and its features are generated, so that the features of every data item in a massive data set are generated and classified. This automates category modeling, reduces labor and time costs, and improves the efficiency of category modeling.

Automatic association: used for the automatic discovery and association of similar data fields in databases and big data platforms. Data fields with similar features are aggregated automatically based on the results of feature definition and feature learning, sets of data of the same type are identified from the aggregation result, and subsequent levels are assigned to those sets automatically.

Pattern recognition: a key part of category modeling, which identifies sensitive data and the data type to which it belongs by screening different types of data. To improve the accuracy of pattern recognition, the method is optimized on the basis of the original dictionary matching analysis method and comprises the following sub-steps:

(1) Establishing a feature library: a training data set is preprocessed using techniques including word segmentation to obtain a word set from it; meaningless words are removed from the word set to obtain a feature set of practical significance; and the feature set is processed, wherein the more frequently a feature occurs across all training data sets, the more important the feature is and the higher its vector weight. The vector weight of each feature is calculated, completing the feature library. The feature library can be defined manually in the feature definition step, or extracted automatically from the data set by a machine learning algorithm in the feature learning step.

(2) After the feature library is obtained, its contents are identified and classified, and the features that are representative and can mark sensitive data are selected to form a sensitive feature library. Features are extracted from the target to be classified by applying word segmentation to the target data. The extracted features are then matched against the sensitive feature library and, on each hit, the category and weight of the sensitive word are recorded. The higher the accumulated sensitive-word weight of a category, the more the target data tends toward that category, and the categories are ranked by accumulated weight from high to low.
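Returning to the feature definition sub-step of this embodiment, the combination of a regular expression, a check algorithm, and a limited-set feature library can be illustrated with the Python sketch below. The patterns, the category names, and the blood-type set are assumptions chosen for the example; the Luhn checksum is used because payment card numbers carry a Luhn check digit, not because the method names it.

```python
import re
from typing import Optional

CARD_PATTERN = re.compile(r"\d{13,19}")   # assumed shape of a bank card number
PHONE_PATTERN = re.compile(r"1\d{10}")    # assumed shape: 11 digits starting with 1
BLOOD_TYPES = {"A", "B", "AB", "O"}       # example feature library for a limited set


def luhn_valid(number: str) -> bool:
    """Check algorithm: Luhn checksum over a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:        # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def classify_item(value: str) -> Optional[str]:
    """Return an assumed category name for a data item, or None if nothing matches."""
    value = value.strip()
    # The regex narrows the candidates; the check algorithm strengthens the definition.
    if CARD_PATTERN.fullmatch(value) and luhn_valid(value):
        return "bank_card"
    if PHONE_PATTERN.fullmatch(value):
        return "phone_number"
    if value.upper() in BLOOD_TYPES:
        return "blood_type"
    return None


for item in ["4111111111111111", "4111111111111112", "13800138000", "ab"]:
    print(item, classify_item(item))
```

The second sample matches the card pattern but fails the checksum, so it is not labelled as a bank card; this is the sense in which a check algorithm reduces false matches compared with a regular expression alone.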

In step S4, data security classification automatically matches data categories to data security levels according to the earlier category association and grading, and corrects the matches that need correction. Preferably, as shown in FIG. 3, data security classification includes four sub-steps: level definition, category correspondence, hierarchical modification, and hierarchical association, wherein:

Level definition: the security levels involved in the data assets are defined, each with a level name and a level description.

Category correspondence: a visual data resource catalog is provided and, on the basis of that catalog, data categories are associated with data security levels; the correspondence is either associated automatically through machine learning or selected manually.

Hierarchical modification: the correspondence between categories and levels is learned and accumulated through machine learning with feedback; learning the association automates data level association, and learning from manual corrections continuously refines and optimizes the overall association.

Hierarchical association: a machine-learning-based automatic data level association engine is provided, which maps the features of data items to levels automatically through a machine-learning-based level association mapping function, improving the efficiency of data grading and reducing the investment of labor and time. The machine learning techniques adopted by the engine include base classifiers and ensemble classifiers, wherein the base classifiers include the K-nearest-neighbor method, support vector machines, decision trees, naive Bayes, neural networks from deep learning, and/or logistic regression; the ensemble classifiers include bagging, boosting, and/or stacking.
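The feedback learning described under hierarchical modification is not specified in algorithmic detail. As a deliberately simple stand-in, the Python sketch below accumulates category-to-level observations as weighted votes and gives manual corrections a larger weight, so that a correction shifts later suggestions. The class LevelAssociation, its methods, and the weight of 5 are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


class LevelAssociation:
    """Accumulates category -> level votes; manual corrections count for more."""

    def __init__(self, correction_weight: int = 5) -> None:
        self._votes: Dict[str, Counter] = defaultdict(Counter)
        self._correction_weight = correction_weight

    def observe(self, category: str, level: str) -> None:
        """Record one automatically produced association."""
        self._votes[category][level] += 1

    def correct(self, category: str, level: str) -> None:
        """Record a manual correction, weighted more heavily than an observation."""
        self._votes[category][level] += self._correction_weight

    def suggest(self, category: str) -> Optional[str]:
        """Return the level with the most accumulated votes, or None if unseen."""
        votes = self._votes.get(category)
        if not votes:
            return None
        return votes.most_common(1)[0][0]


assoc = LevelAssociation()
for _ in range(3):
    assoc.observe("phone_number", "general")
before = assoc.suggest("phone_number")
assoc.correct("phone_number", "important")
after = assoc.suggest("phone_number")
print(before, after)
```

Here three automatic observations suggest "general", and a single manual correction outweighs them, so the suggestion becomes "important". A production engine would more plausibly retrain one of the classifiers listed above on the corrected labels.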

In step S5, after the data assets have been classified and graded, the corresponding data security policies are issued, so that data can be managed and protected conveniently, quickly, and in a targeted way under different policies. This is also an important part of the data security management life cycle and ensures that an organization can access and share its data assets quickly and securely. Preferably, as shown in FIG. 4, the data security policy is formulated and issued through a presentation layer, a policy management layer, and an analysis layer:

The presentation layer comprises visual views of metadata, data security categories, data security levels, data attributes, data security policies, and the like. Through a visual data grading operation interface it performs the earlier-stage category association and grading of data and allows matches from automatic matching that need correction to be corrected, thereby presenting data assets visually across multiple scenarios and ensuring that data asset information is conveyed properly and managed efficiently.

The policy management layer comprises data sampling policy management, classification policy management, and grading policy management. Data sampling policy management manages the policies for sampling data assets, data security adjustment suggestions, and data security events; classification policy management manages the policies for dividing data into security categories according to the data security category model; and grading policy management manages the policies for grading according to the management of data categories and security levels.

Example 2

This example is based on example 1:

the embodiment provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the artificial intelligence technology-based data asset classification modeling and hierarchical protection method of embodiment 1 when executing the computer program.

The computer program may be in the form of source code, object code, an executable file or some intermediate form, among others.

Example 3

This example is based on example 1:

the present embodiment provides a computer readable storage medium storing a computer program, which when executed by a processor, implements the steps of the artificial intelligence technology-based data asset classification modeling and hierarchical protection method of embodiment 1.

The computer program may be in the form of source code, object code, an executable file or some intermediate form, among others. The storage medium includes: any entity or device capable of carrying computer program code, a recording medium, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier signals, telecommunications signals, a software distribution medium, and the like. It should be noted that the content of the storage medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions the storage medium does not include electrical carrier signals and telecommunications signals.

The foregoing is illustrative of the preferred embodiments of this invention. It is to be understood that the invention is not limited to the precise form disclosed herein and that various other combinations, modifications, and environments may be resorted to within the scope of the concept disclosed herein, whether as described above or as apparent to those skilled in the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of combinations of acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

Claims

1. A data asset classification modeling and hierarchical protection method based on artificial intelligence technology is characterized by comprising the following steps:

S3, data category modeling: establishing a classification model of the data assets to classify the data assets, the subsequent data security classification being based on the classification result; for data with obvious and specific characteristics, the data category features are described, and pattern recognition is performed on data item content, by means of regular expressions; for most data, a machine-learning-based knowledge base learning engine automatically learns the features of the data categories, and the data types obtained from data classification and intelligent identification are associated automatically;

2. The artificial intelligence technology-based data asset classification modeling and hierarchical protection method according to claim 1, wherein the data category modeling of step S3 comprises four sub-steps of feature definition, feature learning, automatic association and pattern recognition;

3. The artificial intelligence technology-based data asset classification modeling and hierarchical protection method according to claim 2, wherein the feature learning comprises: automatically scanning each column of structured data in the database and the big data platform by machine learning and generating its features, so that features are generated and classified for each item of data in a massive data set, thereby automating the data category modeling.

4. The artificial intelligence technology-based data asset classification modeling and hierarchical protection method according to claim 3, wherein the automatic association comprises: automatically aggregating data fields with similar features based on the results of the feature definition and the feature learning, identifying a set of data of the same type from the aggregation result, and automatically assigning the subsequent grade to that set.
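
The aggregation of data fields with similar features can be sketched as follows. This is only an illustrative stand-in: cosine similarity, the greedy grouping rule, the `threshold` value and the feature names used in the usage note are assumptions, and the claim does not name a particular similarity measure or clustering method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse feature dicts."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def aggregate_fields(field_features, threshold=0.9):
    """Greedy grouping: each field joins the first group whose
    representative (its first member) is similar enough, otherwise
    it starts a new group."""
    groups = []
    for name, feats in field_features.items():
        for group in groups:
            if cosine(field_features[group[0]], feats) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

For example, two hypothetical columns `users.phone` and `orders.contact_tel` with near-identical digit ratio and length features would land in one group, and that group as a whole could then receive a single grade.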

5. The artificial intelligence technology-based data asset classification modeling and hierarchical protection method according to claim 4, wherein the pattern recognition identifies, by screening different types of data, the sensitive data present in the data assets and the data types to which the sensitive data belong, and comprises the following sub-steps:

establishing a feature library: preprocessing a training data set using techniques including word segmentation to obtain a vocabulary set from the training data set; removing meaningless words from the vocabulary set to obtain a feature set with practical significance; and processing the feature set by calculating the vector weight of each feature, wherein the more frequently a feature occurs across all training data sets, the more important the feature is and the higher its vector weight, thereby completing the establishment of the feature library;

after obtaining the feature library, identifying and classifying the feature library, and selecting representative features that can mark sensitive data to form a sensitive feature library; extracting features from the target of classification and identification by applying word segmentation to the target data; then matching the extracted features against the sensitive feature library and, on each hit, recording the category and the weight of the sensitive word, wherein the higher the accumulated sensitive-word weight of a category, the more the target data tends toward that category, and the categories are sorted from high to low by accumulated sensitive-word weight.
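
The two sub-steps of claim 5 can be sketched as below, under stated assumptions. The regular-expression tokenizer stands in for a real word segmenter, the stop-word list is hypothetical, and relative frequency is used as one simple reading of "the more frequent, the higher the weight"; the claim does not fix an exact weighting formula.

```python
import re
from collections import Counter, defaultdict

# Hypothetical list of "meaningless" words to discard.
STOP_WORDS = {"the", "of", "and", "a", "is", "to", "for"}


def segment(text):
    """Simple tokenizer standing in for a word segmentation engine."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_feature_library(training_texts):
    """Weight each feature by its relative frequency in the training set."""
    counts = Counter(tok for text in training_texts for tok in segment(text))
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def rank_categories(target_text, sensitive_library):
    """sensitive_library maps feature -> (category, weight).
    Accumulates the weight of every hit per category and returns
    (category, accumulated weight) pairs sorted from high to low."""
    scores = defaultdict(float)
    for tok in segment(target_text):
        if tok in sensitive_library:
            category, weight = sensitive_library[tok]
            scores[category] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The sensitive feature library would be built by picking, from the output of `build_feature_library`, the features judged to mark sensitive data and labeling each with a category; that selection step is left manual in this sketch.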

6. The artificial intelligence technology-based data asset classification modeling and hierarchical protection method according to any one of claims 1-5, wherein the data security grading of step S4 comprises four sub-steps: hierarchical definition, generic correspondence, hierarchical modification and hierarchical association, and the hierarchical definition comprises: defining the security levels to which the data assets relate, each including a level name and a level description.

7. The artificial intelligence technology-based data asset classification modeling and hierarchical protection method according to claim 6, wherein the generic correspondence comprises: providing a visual data resource directory and, on the basis of the data resource directory, associating data attributes with data grading levels, the association being made either automatically through machine learning or by manual selection.

8. The artificial intelligence technology-based data asset classification modeling and hierarchical protection method according to claim 6, wherein the hierarchical modification comprises: learning and accumulating, through machine learning with feedback, the correspondence between categories and grading levels; automating data grading association by learning the association relations; and continuously refining and optimizing the overall association by learning from manual corrections.
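
The interplay of level definition, category-to-level correspondence and correction feedback described in claims 6 to 8 can be sketched as follows. This is a deliberately simple stand-in: a vote count replaces the machine learning component, and the `SecurityLevel` and `GradingTable` names, along with the level names in the usage note, are assumptions introduced for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityLevel:
    """A security level with a level name and a level description."""
    name: str
    description: str


class GradingTable:
    """Maps data categories to security levels and accumulates feedback."""

    def __init__(self, levels, default_level):
        self.levels = {lvl.name: lvl for lvl in levels}
        if default_level not in self.levels:
            raise KeyError(f"undefined level: {default_level}")
        self.default_level = default_level
        self._votes = defaultdict(Counter)  # category -> level name counts

    def record(self, category, level_name):
        """Store one automatic association or manual correction."""
        if level_name not in self.levels:
            raise KeyError(f"undefined level: {level_name}")
        self._votes[category][level_name] += 1

    def level_for(self, category):
        """Return the level most often recorded for the category,
        falling back to the default level when nothing is recorded."""
        votes = self._votes.get(category)
        if not votes:
            return self.levels[self.default_level]
        return self.levels[votes.most_common(1)[0][0]]
```

With hypothetical levels `L1` to `L3`, a category first associated with `L2` and later corrected twice by hand to `L3` would resolve to `L3`, which mirrors the idea that accumulated corrections gradually reshape the association.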

9. The artificial intelligence technology-based data asset classification modeling and hierarchical protection method according to claim 6, wherein the hierarchical association comprises: providing an automatic data grading association engine based on machine learning, and automatically matching the features of data items to grades through a machine-learning-based grading association mapping function, so as to improve the efficiency of data grading; the machine learning techniques adopted by the data grading association engine comprise base classifiers and ensemble classifiers, wherein the base classifiers comprise K-nearest neighbors, support vector machines, decision trees, naive Bayes, neural networks from deep learning and/or logistic regression; and the ensemble classifiers comprise bagging, boosting and/or stacking.
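
To illustrate how a base classifier and an ensemble classifier from this list can be combined, the following is a minimal pure-Python sketch of K-nearest neighbors wrapped in bagging. The feature vectors, grade labels and parameter values are assumptions chosen for the example; a production engine would more likely rely on an established machine learning library.

```python
import random
from collections import Counter


def knn_predict(train, x, k=3):
    """train is a list of (feature_vector, grade) pairs. Returns the
    majority grade among the k nearest neighbors of x, using squared
    Euclidean distance."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
    )[:k]
    return Counter(grade for _, grade in nearest).most_common(1)[0][0]


def bagging_predict(train, x, n_estimators=15, k=3, seed=0):
    """Bagging: train each KNN estimator on a bootstrap resample of the
    training set and return the grade with the most votes."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_estimators):
        sample = [rng.choice(train) for _ in train]  # sample with replacement
        votes[knn_predict(sample, x, k)] += 1
    return votes.most_common(1)[0][0]
```

Here each data item would be represented by a numeric feature vector and labeled with a grade, so that the mapping from features to grades is learned from previously graded items rather than written by hand.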

10. The artificial intelligence technology-based data asset classification modeling and hierarchical protection method according to any one of claims 1-5, wherein the data security policy is formulated and issued through a presentation layer, a policy management layer and an analysis layer in step S5:

and the analysis layer carries out classification and grading on the data through data attribute modeling and data security grading.